Thursday, December 14, 2017

Mining with AMDGPU-PRO 17.40 on Linux


A 17.40 beta was released on October 16, with a final release following on October 30th.  There have been some issues with corrupt versions of the final release, but I think they are resolved now.  I encountered lots of problems with this release, which was much of the motivation for making this post.

Until earlier this year, the AMDGPU-PRO drivers were targeted at the new Polaris cards, and support for even relatively recent Tonga was lacking.  Because of this, I was using the fglrx drivers for Tonga and Pitcairn cards.  The primary reason for upgrading now is for large page support, which improves performance on algorithms that use a large amount (2GB or more) of memory.  With the promise of better performance, and since fglrx is no longer being maintained, I decided to upgrade.

I've been using AMDGPU-PRO with kernel 4.10.5 for my Rx 470 cards, so I decided to use the same kernel.  I can't say there are any problems with using a newer kernel like 4.10.17 or even 4.14.5, so they might work just as well.  I left the on-board video enabled (i915), so I would not have to be connecting and disconnecting video cables when testing the GPUs.  After installing Ubuntu 16.04.3, I updated the kernel and rebooted.  For installing the AMDGPU-PRO drivers, I used the px option (amdgpu-pro-install --px), as it is supposed to support mixed iGPU/dGPU use.

My normal procedure for bringing up a multi-GPU machine is to start with a single GPU in the 16x motherboard slot, as this avoids potential issues with flaky risers.  Even with just one R9 380 card in the 16x slot, I was having problems with powerplay.  When it is working, pp_dpm_sclk will show the current clock rate with an asterisk, but this was not happening.  After two days of troubleshooting, I concluded there is a bug with powerplay and some motherboards when using the 16x slot.  When using only the 1x slots, powerplay works fine.
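As a rough illustration of the check described above, this Python sketch reads a pp_dpm_sclk listing and reports which level carries the asterisk.  The sysfs path in the comment and the sample clock values are assumptions for illustration, not readings from my rig.

```python
import re

# Assumed location on a typical amdgpu system (the card index may differ):
#   /sys/class/drm/card0/device/pp_dpm_sclk
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?\s*$", re.IGNORECASE)

def active_sclk(listing):
    """Return (level, mhz) for the line marked with '*', or None if no line is marked."""
    for line in listing.splitlines():
        match = LEVEL_RE.match(line)
        if match and match.group(3):
            return int(match.group(1)), int(match.group(2))
    return None

# Hypothetical listings: one with powerplay working, one without an active level.
working = "0: 300Mhz\n1: 608Mhz *\n2: 910Mhz\n"
broken = "0: 300Mhz\n1: 608Mhz\n2: 910Mhz\n"

print(active_sclk(working))
print(active_sclk(broken))
```

A result of None corresponds to the symptom I saw with the card in the 16x slot: the levels are listed, but none is marked as current.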

Since I wasn't able to use the 16x motherboard slot, testing card and riser combinations was more difficult.  Normally when I have a problem with a card and riser, I'll move the card to the 16x slot.  If the problems go away, I'll mark the riser as likely defective.  Mining algorithms like ethash use little bandwidth between the CPU and GPU, so there is no performance loss to using 1x risers.  Even the slowest PCIe 1.1 transfer rate is sufficient for mining.  Using "lspci -vv",  I could see the link speed was 5.0GT/s (LnkSta:), which is PCIe gen2 speed.  Reducing the speed to gen1 would mean lower quality risers could be used without encountering errors.

My first thought was to try to set the PCIe speed in the motherboard BIOS.  Setting gen1 in the chipset options made no difference, so perhaps it is only the speed used during boot-up before the OS takes over control of the PCIe bus.  Next, using "modinfo amdgpu", I noticed some module options related to PCIe.  Adding "amdgpu.pcie_gen2=0" had no effect.  Apparently the module no longer supports that option.  I could not find any documentation for the "pcie_gen_cap" option, but luckily the open-source amdgpu module supports the same module parameter.  By looking at amd_pcie.h in the kernel source code, I determined "0x10001" will limit the link to gen1.  I added "pcie_gen_cap=0x10001" to /etc/default/grub, ran update-grub, and rebooted.  With lspci I was able to see that all the GPUs were running at 2.5GT/s.
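Checking the link speed by eye gets tedious with several cards, so here is a minimal Python sketch that pulls the speed out of an "lspci -vv" LnkSta line and maps it to a PCIe generation.  The sample lines are hypothetical and trimmed; real lspci output has more fields, and the exact spelling of the speed ("5GT/s" versus "5.0GT/s") varies between lspci versions, which is why both are handled.

```python
import re

GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3}  # GT/s -> PCIe generation

def link_gen(lspci_text):
    """Return the PCIe generation from the first LnkSta line, or None if absent/unknown."""
    match = re.search(r"LnkSta:\s*Speed\s+([\d.]+)GT/s", lspci_text)
    if not match:
        return None
    speed = match.group(1)
    # Normalise "5.0" to "5" so both spellings map to gen2.
    if "." in speed:
        speed = speed.rstrip("0").rstrip(".")
    return GEN_BY_SPEED.get(speed)

# Hypothetical, shortened LnkSta lines for illustration only.
before = "LnkSta: Speed 5.0GT/s, Width x1"
after = "LnkSta: Speed 2.5GT/s, Width x1"

print(link_gen(before), link_gen(after))
```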

For clock control and monitoring, I've previously written about ROC-smi.
====================    ROCm System Management Interface    ====================
================================================================================
 GPU  DID    Temp     AvgPwr   SCLK     MCLK     Fan      Perf    OverDrive  ECC
  3   6938   66.0c    100.172W 858Mhz   1550Mhz  44.71%   manual    0%       N/A
  1   6939   64.0c    112.21W  846Mhz   1550Mhz  42.75%   manual    0%       N/A
  4   6939   62.0c    118.135W 839Mhz   1500Mhz  47.84%   manual    0%       N/A
  2   6939   77.0c    123.78W  839Mhz   1550Mhz  64.71%   manual    0%       N/A
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
  0   0402   N/A      N/A      N/A      N/A      None%              N/A      N/A
================================================================================
====================           End of ROCm SMI Log          ====================

I also use Kristy's utility to set specific clock rates:
ohgodatool -i 1 --mem-state 3 --mem-clock 1550

Unfortunately ethminer-nr doesn't work with this setup.  I suspect the new driver doesn't support some old OpenCL option, so the fix should be relatively simple, once I make the time to debug it.

Wednesday, December 6, 2017

Powering GPU mining rigs


Since I started mining ethereum almost two years ago, I have found that power distribution is important not just for equipment safety, but also for system stability.  When I started mining I thought my rigs should be fine as long as I used a robust server PSU to power the GPUs, with heavy 16 or 18AWG cables.  After frying one motherboard and more than a couple ATX PSUs, I've learned a lot of careful design and testing is required.

Using Dell, IBM, or HP server power supplies for mining rigs is not a new idea, so I won't go into too much detail about them.  I do recommend making an interlock connector so the server PSU turns on at the same time as the motherboard.  I also recommend only connecting the server PSU to power the GPU PCIe power connectors, as they are isolated from the 12V supply for the motherboard.  If you try to power ribbon risers, the 12V from the ATX and server PSUs will be interconnected and can lead to feedback problems.  Server PSUs are very robust and unlikely to be harmed, but I have killed a cheap 450W ATX PSU this way.  If you use USB risers, they are isolated from the motherboard's 12V supply, and therefore can be safely powered from the server PSU.

In the photo above, you might notice the grounding wire connecting all the cards, which then connects to a server PSU.  I recently added this to the rig after measuring higher current flowing through two of the ground wires connected to the 6-pin PCIe power plugs.  As I mentioned in my post about GPU PCIe power connections, there are only two ground pins, with the third ground wire being connected to the sense pin.  With two ground pins and three power pins, the ground wires carry 50% more current than the 12V wires.  Although the ground wires weren't heating up from the extra current, the connector was.  Adding the ground bypass wire reduced the connector temperature to a reasonable level.
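The 50% figure follows from simple current sharing, which can be sketched in Python.  The 12A total is a hypothetical load chosen to keep the numbers round, and the sketch assumes the current divides evenly between the wires on each side.

```python
from fractions import Fraction

def per_wire_current(total_amps, wires):
    """Current per wire, assuming the load is shared evenly between the wires."""
    return Fraction(total_amps) / wires

total = 12  # hypothetical current drawn through one 6-pin connector, in amps
hot = per_wire_current(total, 3)     # three 12V pins
ground = per_wire_current(total, 2)  # only two real ground pins
ratio = ground / hot

print(f"12V wire: {hot}A, ground wire: {ground}A, ratio: {ratio}")
```

The ratio is 3/2 regardless of the total current chosen.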

For ATX PSUs, I've used a few of the EVGA 500B, and do not recommend them.  While even my cheap old 300W power supplies use 18AWG wire for the hard drive power connectors, the SATA and molex power cables on the 500B are only 20AWG.  Powering more than one or two risers with a 20AWG cable is a recipe for trouble.  I burned the 12V hard drive power wire on two 500B supplies before I realized this.  I recently purchased a Rosewill 500W 80plus gold PSU that was on sale at Newegg, and it is much better than the EVGA 500B.  The Rosewill uses 18AWG wire in the hard drive cables, and it also has a 12V sense wire in the ATX power connector.  This allows it to compensate for the voltage drop in the cable from the PSU to the motherboard.  The sense wire is the thinner yellow wire in the photo below.

Speaking of voltage drop, I recommend checking the voltage at the PCIe power connector to ensure it is close to 12V.  Most of my cards do not have a back plate, so I can use a multi-meter to measure at the 12V pins of the power connector where they are soldered to the GPU PCB.  I also recommend checking the temperature of power connectors since good quality low-resistance connectors are just as important as heavy gauge wires.  Warm connectors are OK, but if they are so hot that you can't hold your fingers to them, that's a problem.

My last recommendation is for people in North America (and some other places) where 120V AC power is the norm.  Wire up the outlets for your mining rigs for 240V instead of 120V.  Power supplies are slightly more efficient at 240V, and will draw half as much current compared to 120V.  Lower current draw means less line loss going to the power supply and therefore less heat generated in power cords and plugs.  Properly designed AC power cables and plugs should never overheat below 10-15 Amps; however, I have seen melted and burned connectors at barely over 10A of steady current draw.
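To put numbers on the current draw, here is a small Python sketch.  The 1200W rig is a hypothetical example, and the calculation ignores power factor and the small efficiency difference between the two voltages.

```python
def ac_current(watts, volts):
    """Approximate AC current draw, ignoring power factor."""
    return watts / volts

rig_watts = 1200  # hypothetical wall power for one rig

amps_120 = ac_current(rig_watts, 120)
amps_240 = ac_current(rig_watts, 240)

print(f"120V: {amps_120:.1f}A, 240V: {amps_240:.1f}A")
```

With these assumed numbers, the rig sits right at the 10A level where I have seen connectors burn on 120V, while on 240V it draws half of that.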


Friday, June 23, 2017

Server PSU interlock


On my multi-GPU rigs, I use server PSUs like the Dell N750P to provide the 12V power to the PCI-E connectors.  These PSUs do not have power switches, so initially I would just pull the power cord out when I wanted to power them down.  After experimenting with the PSU control pins, I realized they have an active low "power on" pin.  Instead of using a jumper to connect it to ground, I decided to use an electronic switch to power the server PSU when the motherboard powers up.

The switch I used is a common, cheap model 817 optocoupler (pdf datasheet).  When current flows from pin 1 to 2, the optocoupler is turned on, creating a short from pin 4 to pin 3.  For my small circuit shown above, pin 4 is connected to the PS_ON signal, and pin 3 is connected to ground on the server PSU.  Pin 1 is connected to 12V (from the 4-pin 3.5" floppy drive power connector), and pin 2 is connected to ground.  On the back of the board is a 1K current-limiting resistor in series with the red LED which is a power on indicator.
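For anyone sizing the resistor differently, this Python sketch estimates the input current.  It assumes the resistor, the red LED, and the optocoupler's input LED are all in series across the 12V supply, and the forward voltages are typical values I am assuming rather than measurements, so check the datasheet for your parts.

```python
SUPPLY_V = 12.0
RESISTOR_OHMS = 1000.0
LED_VF = 1.8   # assumed forward voltage of a red indicator LED
OPTO_VF = 1.2  # assumed forward voltage of the optocoupler input LED

def series_current_ma(supply_v, resistor_ohms, *forward_drops):
    """Current in mA through a resistor in series with one or more diode drops."""
    return (supply_v - sum(forward_drops)) / resistor_ohms * 1000.0

current = series_current_ma(SUPPLY_V, RESISTOR_OHMS, LED_VF, OPTO_VF)
print(f"about {current:.1f} mA")
```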

I also made an even simpler interlock using only an optocoupler with the pins straightened and 0.1" header pins:
I connect pins 1 and 2 to the motherboard's power LED pins, which would normally light up an LED when the motherboard powers up.  The motherboard already has a current-limiting resistor for the power LED, which typically limits the current to around 10mA.

Friday, May 12, 2017

Dummy plugs for headless GPU rigs


I've read about people claiming they needed to plug a monitor (or dummy plug) into one GPU card or else they couldn't use the card.  I had never encountered any problems with either fglrx or AMDGPU-Pro drivers until recently.  I moved a 4GB R9 380 card from an Ubuntu 14.04/fglrx rig to an Ubuntu 16.04/AMDGPU-Pro rig.  The remaining cards are 2GB R7 370 cards, and I started getting memory allocation errors for the primary card.  After checking with "ethminer --list-devices", I noticed the first card had about half the maximum memory allocation limit of the others:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

I have an old VGA LCD monitor that I connected using a HDMI-VGA adapter.  After connecting the monitor, nearly the full amount became available:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1969225728
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1968177152
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

I also found the monitor doesn't have to be plugged in, just the HDMI-VGA adapter.  While there might be a way to configure fglrx so that the full memory is available without the adapter, I'm more interested in learning more about AMDGPU-Pro.
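To make the difference easier to see, this Python sketch computes the ratio of CL_DEVICE_MAX_MEM_ALLOC_SIZE to CL_DEVICE_GLOBAL_MEM_SIZE for each device.  The numbers are copied from the first listing above (without the adapter); the function name is just something I made up for the example.

```python
import re

def alloc_ratios(listing):
    """Return MAX_MEM_ALLOC_SIZE / GLOBAL_MEM_SIZE for each device in the listing."""
    sizes = re.findall(r"CL_DEVICE_GLOBAL_MEM_SIZE:\s*(\d+)", listing)
    allocs = re.findall(r"CL_DEVICE_MAX_MEM_ALLOC_SIZE:\s*(\d+)", listing)
    return [int(a) / int(g) for g, a in zip(sizes, allocs)]

# Values from the listing taken before the HDMI-VGA adapter was connected.
before = """
[0] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
[1] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432
"""

ratios = alloc_ratios(before)
for dev, ratio in enumerate(ratios):
    print(f"[{dev}] {ratio:.0%} of global memory can be allocated in one buffer")
```

Device 0 comes out at roughly half, while device 1 is close to 90%, which is the gap that caused the allocation errors.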

Wednesday, May 10, 2017

GDDR5 memory timing details



In my Advanced Tonga BIOS editing post, I discussed some basic memory timing information, but did not get into the details.  GDDR5 memory is much more complex than the asynchronous DRAM of 20 years ago.  There are many sources of information on SDRAM, while GDDR information is harder to come by.  Although a thorough description of GDDR5 can be found in the spec published by JEDEC, neither nVIDIA nor AMD share information on how their memory controllers are programmed with memory timing information.  By analyzing the AMD video driver source, and with help from people contributing to a discussion on bitcointalk, I have come to understand most of the workings of AMD BIOS timing straps.

When a modern (R9 series and Rx series) AMD GPU card boots up, memory timing information (straps) is copied from the BIOS to registers in the memory controller.  Some timing information such as refresh frequency is not dependent on the memory speed and therefore is not contained in the memory strap table, but much of the important timing information is.  The memory controller registers are 32 bits wide, and so the 48-byte memory straps map to 12 different memory controller registers.  The shift masks in the Linux driver source are therefore non-functional, and can only be taken as hints as to the meaning of the individual bits.  Due to an apparently bureaucratic process for releasing open-source code, AMD engineers are generally reluctant to update such code.

Jumping right to the code, here's a C structure definition for the Rx memory straps:
SEQ_WR_CTL_D1_FORMAT SEQ_WR_CTL_D1;
SEQ_WR_CTL_2_FORMAT SEQ_WR_CTL_2;
SEQ_PMG_TIMING_FORMAT SEQ_PMG_TIMING;
SEQ_RAS_TIMING_FORMAT SEQ_RAS_TIMING;
SEQ_CAS_TIMING_FORMAT SEQ_CAS_TIMING;
SEQ_MISC_TIMING_FORMAT SEQ_MISC_TIMING;
SEQ_MISC_TIMING2_FORMAT SEQ_MISC_TIMING2;
uint32_t SEQ_MISC1;
uint32_t SEQ_MISC3;
uint32_t SEQ_MISC8;
ARB_DRAM_TIMING_FORMAT ARB_DRAM_TIMING;
ARB_DRAM_TIMING2_FORMAT ARB_DRAM_TIMING2;

Looking at the RAS timing, it consists of 6 fields: RCDW, RCDWA, RCDR, RCDRA, RRD, and RC.  The full field definitions can be found in my fork of Kristy-Leigh's code.  Many of the "pad" fields are likely the high bits of the preceding field that are not currently used.  I tested a couple pad fields already (MISC RP_RDA & RP), confirming that the pad bits were actually the high bits of the fields.


For GDDR5, some timing values have both Long and Short versions that apply for access within a bank group or to different bank groups.  The RRD field of RAS timing is likely RRDL, because the values typically seen for this field are 5 and 6.  If RRDS was 5, this would mean at most one page could be opened every five cycles, limiting 32-byte random read performance to 2/5 or 40% of the maximum interface speed.  From my work with Ethereum mining, I know that RRDS can be no more than 4.  In addition, performance tests with RRD timing reduced to 5 from 6 are congruent with it being RRDL.  The actual value of RRDS used by the memory controller does not seem to be contained in the timing strap.  The default 1750Mhz strap for Samsung K4G4 memory has a value of 10 for FAW, which can be no more than 4 * RRDS.  Therefore RRDS is most likely less than 4, and possibly as low as 2.
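The 2/5 figure can be generalized with a short Python sketch.  It assumes each 32-byte access keeps the interface busy for 2 cycles, which is what the 40% number above implies, and that every access has to open a new page.

```python
from fractions import Fraction

BURST_CYCLES = 2  # assumed cycles the interface is busy per 32-byte access

def random_read_utilization(rrd_cycles):
    """Fraction of peak bandwidth when each access waits rrd_cycles for a page open."""
    return min(Fraction(1), Fraction(BURST_CYCLES, rrd_cycles))

for rrd in (2, 4, 5, 6):
    print(f"RRD={rrd}: {random_read_utilization(rrd)} of peak")
```

Under these assumptions an RRDS of 4 caps random reads at half of peak, and only an RRDS of 2 allows the interface to be kept fully busy.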

To simplify the process of modifying memory straps for improved performance, I wrote strapmod.  I also wrote a cgi wrapper for the program, which you can run from my server http://45.62.227.192/cgi-bin/strapmod.  For example, this is the output with the 1750Mhz strap for Samsung K4G4 memory:
Rx strap detected
Old, new RRD: 6 , 5
Old, new FAW: A , 0
Old, new 32AW: 7 , 0
Old, new ACTRD: 19 , 0x10
777000000000000022CC1C0010626C49D0571016B50BD509004AE700140514207A8900A003000000191131399D2C3617
777000000000000022CC1C0010625C49D0571016B50BD50900400700140514207A8900A003000000101131399D2C3617
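To see which registers strapmod touched, the two straps above can be split into twelve 32-bit words and compared.  This Python sketch assumes the words are stored little-endian and follow the field order of the structure shown earlier; the decode_strap name is my own for this example.

```python
import struct

# Register order taken from the structure definition earlier in this post.
REGISTERS = [
    "SEQ_WR_CTL_D1", "SEQ_WR_CTL_2", "SEQ_PMG_TIMING", "SEQ_RAS_TIMING",
    "SEQ_CAS_TIMING", "SEQ_MISC_TIMING", "SEQ_MISC_TIMING2", "SEQ_MISC1",
    "SEQ_MISC3", "SEQ_MISC8", "ARB_DRAM_TIMING", "ARB_DRAM_TIMING2",
]

def decode_strap(hex_strap):
    """Split a 48-byte strap (96 hex digits) into twelve little-endian 32-bit words."""
    raw = bytes.fromhex(hex_strap)
    if len(raw) != 48:
        raise ValueError(f"expected 48 bytes, got {len(raw)}")
    return dict(zip(REGISTERS, struct.unpack("<12I", raw)))

OLD = ("777000000000000022CC1C0010626C49D0571016B50BD509"
       "004AE700140514207A8900A003000000191131399D2C3617")
NEW = ("777000000000000022CC1C0010625C49D0571016B50BD509"
       "00400700140514207A8900A003000000101131399D2C3617")

old_regs = decode_strap(OLD)
new_regs = decode_strap(NEW)
changed = [name for name in REGISTERS if old_regs[name] != new_regs[name]]

for name in changed:
    print(f"{name}: 0x{old_regs[name]:08X} -> 0x{new_regs[name]:08X}")
```

Only three registers differ, which lines up with the RRD, FAW/32AW, and ACTRD changes reported in the strapmod output.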

Saturday, March 25, 2017

AMDGPU-Pro 16.60 on Ubuntu kernel 4.10.5 with ROCM-smi


Although AMDGPU-Pro 16.40 with kernel 4.8 has been working fine for me, I decided to try 16.60 with kernel 4.10.  After my problems with 16.60 on 4.8, I read a few reports claiming it works well with kernel 4.10.

I started with a fresh Ubuntu desktop 16.04.2 install, and then installed 4.10.5 from the Ubuntu ppa.  Although the process is not very complicated, I wrote a small script which downloads the files and installs them.  After rebooting, I downloaded and installed the AMDGPU-Pro 16.60 drivers according to the instructions.  Finally, I installed ROC-smi, a utility which simplifies clock control using the sysfs interface.  To test the install, run "rocm-smi -a" which will show all info for any amdgpu cards installed.

Unfortunately, the new drivers no longer work with my ethminer fork, but sgminer-gm 5.5.5 works as well as it did with 4.8/16.40.  On GCN3 and newer cards like Tonga and Polaris, the optimal core clock for mining ETH is often between 55% and 56% of the memory clock.  On my Sapphire Rx470 I have the memory overclocked to 2100Mhz, so dpm 6 at 1169Mhz is a perfect fit:
./rocm-smi -d 0 --setsclk 6

Once sgminer was running for a couple minutes, the speed settled at about 29.1Mh/s.  Note that the clock setting is only temporary for the next opencl program to run.  Just run the rocm-smi command each time.
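The 55% to 56% rule of thumb is easy to turn into a quick check.  This Python sketch computes the target core clock window for a given memory clock; the function name is hypothetical, and the rule itself is only my observation, not a guarantee for every card.

```python
def eth_core_clock_range(mem_mhz):
    """Return (low, high) core clock in MHz at 55% and 56% of the memory clock."""
    # Integer math avoids floating-point rounding on the boundaries.
    return mem_mhz * 55 // 100, mem_mhz * 56 // 100

low, high = eth_core_clock_range(2100)
dpm6_mhz = 1169  # dpm 6 on my Sapphire Rx470

print(f"target window: {low}-{high}Mhz, dpm 6 fits: {low <= dpm6_mhz <= high}")
```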

Update 2017-04-08

4.10.9 was uploaded to the Ubuntu ppa today, so I would recommend it instead of 4.10.5.

Tuesday, March 14, 2017

Riser Recycling


If you build multi-GPU servers, you'll likely encounter flaky or bad risers.  I've had a bad riser where I could see a burned trace on the PCB, and I've had flaky risers that appeared to be caused by poor soldering of the ribbon cable.  While the problem risers may not work with a GPU, chances are the power connectors are still good.  The riser shown above has a 6-pin PCI-e and a 4-pin molex connector, both of which I tested for continuity with a multi-meter.  With some fresh flux I was able to desolder the ribbon cable, so I could re-use the riser as a PCI-e to molex power adapter.  If you are wondering what I would use it for, look at the photo below.

Heat has caused the yellow 12V line to turn brown.  The cable was plugged into the motherboard's supplemental PCI-e power which is used when more than two GPUs are plugged in.  Each GPU will usually draw between 50 and 75 watts over the PCI-e bus, which is pushing the 18AWG (or even 20AWG on some power supplies) cable well beyond its recommended rating.  By plugging the next molex connector in the chain into the riser, and by providing power to the 6-pin connector on the same riser, current will flow into the motherboard molex connector from both directions.

With the current through the brown wire cut in half, the power dissipated (and therefore the heat generated) is reduced by 75%, since P = I^2 * R.
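The 75% figure can be verified with a few lines of Python.  The current and resistance values are arbitrary placeholders, since both cancel out of the ratio.

```python
from fractions import Fraction

def dissipation(current, resistance):
    """Power dissipated in a wire: P = I^2 * R."""
    return current ** 2 * resistance

wire_ohms = Fraction(1)  # placeholder resistance; cancels in the ratio
amps = Fraction(6)       # placeholder current before the mod

before = dissipation(amps, wire_ohms)
after = dissipation(amps / 2, wire_ohms)
reduction = 1 - after / before

print(f"heat reduced by {reduction} ({float(reduction):.0%})")
```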

Supplemental mod

Bitcointalk user BChydro questioned the current-carrying ability of the riser PCB, which turns out to be rather poor for the 12V trace.  The solder mask over the 12V trace was starting to turn brown after only a couple days of use, and a thermal image shows the trace getting hot.

To solve the problem I added an 18AWG jumper wire between the 12V pins: