Zotac 1060K - fallen off the bus

Hi

We are running a C++ program on Ubuntu 18.04 LTS.

The app runs fine on a Zotac 1060; however, when we run it on a Zotac 1060K it appears to freeze the display. We cannot switch sessions (e.g. Ctrl+Alt+F1 does not work).

The GPU "falls off the bus", causing the freeze. This usually happens after a couple of hours, but sometimes it takes up to 16-20 hours.

Our application is a 32-bit application and is CPU intensive, using about 1/4 of the 4 cores (100% out of 400%).

The OS is 64-bit, but we also built a 32-bit OS image and it shows the same issue.

The error in syslog says "GPU fallen off the bus".
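For anyone wanting to timestamp each occurrence, here is a minimal Python sketch that pulls the matching lines out of a log. The function names and the regex are my own, and the pattern assumes the message reads "GPU fallen off the bus" or "GPU has fallen off the bus"; adjust it if your log wording differs.

```python
import re

# Assumed wording of the message; "has" is optional so both variants match.
BUS_DROP = re.compile(r"GPU (?:has )?fallen off the bus", re.IGNORECASE)


def find_bus_drops(lines):
    """Return (line_number, text) for each line reporting the GPU dropping off the bus."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if BUS_DROP.search(line)
    ]


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as log:
        return find_bus_drops(log)
```

Calling `scan_file("/var/log/syslog")` (the default Ubuntu location, possibly requiring root) returns the matching lines, which can then be compared against the application's own logs to see what was running just before each drop.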

We have used the same image on both boxes; however, only the 1060K freezes.

This happens on multiple 1060Ks but not on 1060s. It is not restricted to one box.

Among the many things we have tried in order to determine the root cause:

* updated Nvidia drivers
* updated motherboard BIOS
* disabled turbo mode (via bios and via linux)
* downgraded the video card bios on the 1060K to match vbios on the 1060
* upgraded the linux kernel (from 4.15 to 5.1)
* used a high-spec power supply
* disabled the internal ethernet port and used a usb to ethernet adapter
* set the internal fans to 100% to reduce any overheating (logging temp in syslog and it's not going over 70%)
* swapped the memory from a 1060 to 1060K
* set the system to use 1 core in bios (rather than all 4)
* disabled usb ports
* increased the number of linux open files (though our reporting indicates this is not an issue)
* disabled power management (pcie_aspm=off)
* run the application with a lower priority via nice
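For reference, the temperature logging mentioned in the list can be extended to power draw and performance state with a small Python sketch like the one below. It assumes `nvidia-smi` is on the PATH and that the query fields `timestamp`, `temperature.gpu`, `power.draw` and `pstate` are supported by the installed driver; the function names are hypothetical.

```python
import subprocess

# Query fields are assumed to be supported by the installed driver.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu,power.draw,pstate",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Turn one CSV line from the query above into a dict."""
    timestamp, temp, power, pstate = [field.strip() for field in line.split(",")]
    try:
        power_w = float(power)
    except ValueError:
        # Some boards report power draw as unsupported; keep going without it.
        power_w = None
    return {"timestamp": timestamp, "temp_c": int(temp), "power_w": power_w, "pstate": pstate}


def read_sample():
    """Run nvidia-smi once and parse the first GPU's line."""
    result = subprocess.run(QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return parse_sample(result.stdout.splitlines()[0])
```

Logging `read_sample()` every few seconds gives a record of the last known temperature, power and P-state before a freeze, which may help confirm or rule out a power-state transition as the trigger.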

We can no longer source any Zotac 1060s and need to use Zotac 1060Ks.

Any assistance provided would be greatly appreciated.

Thanks in advance
Seek2019
 
Often, when one sees that error, the initial assumption/issue is heating and power related, but given that you've thoroughly examined that, the only other thing I can vaguely recall that may have something to do with it is driver persistence:

https://docs.nvidia.com/deploy/driver-persistence/index.html

Odd that it's only happening on the EN1060K units and not the EN1060s; there's barely any difference between them.
 
Although it doesn't work on my Titan X, it may be worth seeing if NVIDIA's power monitoring tool works on your machines:

https://www.nvidia.com/en-gb/geforc...er-and-performance-benchmarking-app-download/

I wouldn't expect the EN1060Ks to be using any more power than the previous model, but it's worth seeing if the Ks are doing something odd (even though there's no reason why they should in a CPU-heavy process).

Edit: Just noticed that you'd said that you have tried using the daemon "off" - if I remember correctly, actually having it "on" resolved some issues of the GPU coming off the bus.
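If it helps, here is a rough Python sketch for checking whether persistence mode is actually on. It assumes `nvidia-smi` is on the PATH and that the `persistence_mode` query field reports "Enabled" or "Disabled" for each GPU; the function names are made up for illustration.

```python
import subprocess


def persistence_enabled(smi_output):
    """True only if there is at least one GPU line and every line reads 'Enabled'."""
    states = [line.strip() for line in smi_output.splitlines() if line.strip()]
    return bool(states) and all(state == "Enabled" for state in states)


def query_persistence():
    """Ask nvidia-smi for the persistence state of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return persistence_enabled(result.stdout)
```

If `query_persistence()` returns False, `sudo nvidia-smi -pm 1` switches persistence mode on until the next reboot; the page linked above describes the `nvidia-persistenced` daemon as the longer-term way to keep it on.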
 