VOGONS


First post, by Wild-E

Hi!

I had some weird instability problems with my Super Socket 7 board, which took quite some time to fix (TL;DR at end).

The box sat unused for a year because of time constraints and because it was a bit unstable in some situations. I finally had the time to look into it, and want to share my findings / experience. There is nothing special here, just some "gotchas" you might come across. They are simple in hindsight, but were quite difficult to pinpoint and made me waste time and do some stupid things.

Setup: Epox MVP3G2 and an AMD K6-2 (upgraded to a K6-2+ recently, which didn't really make a difference for these stability issues; I have three different processors to test with). DOS 7.1 and Windows 98 SE installed.

The beginning: the computer had weird problems. It might hang at the BIOS screen (more on this later) and sometimes did weird things within Windows, as if RAM was corrupted or something - driver installations would fail, the computer became unstable after installing chipset drivers, etc. I presumed the MB was faulty (all three processors gave identical symptoms).

I noticed the computer was stable if I only used DOS and ran games there, but that might just be because DOS doesn't use all of the RAM (I thought)... then I noticed it was more stable if I changed the heatsink to a lighter one (see this post in another thread). So, while troubleshooting all the problems above, I was convinced the MB or the socket was damaged, and dismantled the socket (breaking some of the plastic in the process). Turns out that the issues coinciding with the cooler change was just luck / a red herring!

I tried to install Linux on it, since that would make maintenance, file transfers, backups and whatnot easier - and I couldn't get Windows 98 SE to be stable anyway (though that might be related to poor chipset drivers, and I was not planning on gaming on Windows at this time, maybe later). So I thought Linux was worth a shot. The computer was stable in a chroot environment but ... after booting into the actual Linux installation, Linux would not recognize the root partition and there were ATA errors in the log (which I had to read over a serial console with a null-modem cable).

During all this I had to reboot a lot, and noticed that the BIOS hangs only happened in the chipset features screen (this was an important observation - it never hung in other menus!). I thought that was weird, since all the computer is doing is showing a menu and it shouldn't be polling anything special ... except that it actually was polling something special, since that screen shows the fan speeds and voltages. Then I remembered that years ago I was messing around inside the case while it was powered on and accidentally pulled a fan from its MB header. After that, all but one of the headers stopped working (as in: no voltage at all when a fan was plugged in)! I was still using the one that seemingly worked, but noticed it always reported 0 RPM in this screen before the hang... so I started putting things together in my head: could a faulty HW monitoring chip be getting input that screws up some values, and since the values are way off / unexpected, the BIOS code hangs? Turns out I was correct: after unplugging the fan from the MB and powering it directly from the PSU (no more RPM signal going to the chip), the BIOS setup screen has not hung once (and I've been there many, many times). EDIT: the hang, when it used to happen, was very easy to reproduce and consistent once I realized it was related to that menu. All it took was 2-10 seconds in the screen with the fan RPMs - possibly just until it updated the values.
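To illustrate the kind of failure I suspect: many monitoring chips of that era (LM78-style parts) report fan speed as an 8-bit tach count, and the RPM is a clock constant divided by that count. I don't know which chip this board uses or what the BIOS code really does, so the constant, the default divisor and the function name below are assumptions. This is only a rough Python sketch of why a garbage reading has to be guarded before the division.

```python
from typing import Optional

# Tach clock constant used by LM78-style chips (assumed here, not confirmed for this board)
TACH_CLOCK = 1_350_000

def fan_rpm(count: int, divisor: int = 2) -> Optional[int]:
    """Convert an 8-bit tachometer count to RPM, rejecting bogus readings."""
    if not 0 <= count <= 255 or divisor not in (1, 2, 4, 8):
        return None  # register value outside what the hardware can produce
    if count == 0:
        return None  # a naive conversion would divide by zero here
    if count == 255:
        return 0     # counter saturated: fan stopped or too slow to measure
    return TACH_CLOCK // (count * divisor)
```

With these assumptions, `fan_rpm(150)` gives 4500 RPM, while a stuck or floating tach input ends up in one of the guarded branches instead of crashing the caller. Code that skips those guards could easily misbehave on a damaged chip.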

I also thought about the ATA errors and the trouble Linux was having recognizing block devices (I had also ended up with a corrupted MBR a few times, which might be related to plop, which I needed for booting from USB) ... and realised that a chrooted live Linux environment might not use PATA at full speed. Neither does DOS, nor Windows 9x unless the chipset drivers are installed. Then I realised I was using a 40-conductor IDE cable, which I had earlier swapped for another one (again: 40-conductor) to rule out a faulty cable. Both cable types use the same 40-pin connector; the 80-conductor one adds a ground wire between the signal wires. I had only been presuming this board doesn't need 80-conductor cables, since I thought the faster transfer modes came later (there were far more recent and faster MBs using IDE, for Pentium III and whatever) ... boy, was I surprised that the BIOS actually does have all the UDMA settings and whatnot! I changed to an 80-conductor cable (the next step in the plan would have been to disable all UDMA settings in the BIOS). But after changing the cable, things just started to work: Linux kernels detect the block devices properly without any problems!
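If you want to check for this on your own box, the kernel log shows which transfer mode each drive ended up in and whether the link was throwing errors. Below is a rough Python sketch that summarizes such a log. The sample lines are made up for illustration (modeled on the usual libata message format, not copied from my machine), and the keyword list and function name are my own assumptions.

```python
import re
from collections import Counter

# Matches libata lines such as "ata1.00: configured for UDMA/33"
MODE_RE = re.compile(r"(ata\d+\.\d+): configured for (\S+)")
# Loose match for lines that usually signal link trouble (assumed keyword list)
ERROR_RE = re.compile(
    r"(ata\d+)(?:\.\d+)?: .*(BadCRC|ICRC|failed command|exception|limiting speed)",
    re.IGNORECASE,
)

def summarize_ata_log(text):
    """Return (last configured mode per device, error-line count per port)."""
    modes = {}
    errors = Counter()
    for line in text.splitlines():
        m = MODE_RE.search(line)
        if m:
            modes[m.group(1)] = m.group(2)  # the last configured mode wins
            continue
        e = ERROR_RE.search(line)
        if e:
            errors[e.group(1)] += 1
    return modes, dict(errors)

# Illustrative sample only, not a real log from this machine
SAMPLE = """\
ata1.00: configured for UDMA/66
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: failed command: READ DMA
ata1.00: limiting speed to UDMA/33:PIO4
ata1.00: configured for UDMA/33
"""

if __name__ == "__main__":
    print(summarize_ata_log(SAMPLE))
```

A drive that starts in a fast UDMA mode, logs errors and then gets dropped to a slower mode is a strong hint to look at the cable first. On a live system you could feed the function the output of `dmesg` instead of the sample text.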

Also, I've determined that one of the RAM sticks - but only one - might be faulty or dirty. I'm not sure which one; I might need to swap them around to reproduce the intermittent POST failures I've been having. But Linux is stable (as in: compiling for over 72 hours straight without a hiccup, although compiling might not use all the RAM). It could also be dirty contacts. Either way, many of the problems I was having in Windows were probably caused by the IDE cable (with the faulty HW monitoring chip and its BIOS setup hangs confusing my thinking even more). I will make sure later on with Memtest and by testing the modules individually, unless there are no more stability problems.

So most of the problems (hanging in the BIOS screen, weird instability) seem to have been caused by 1) the faulty HW monitoring chip and 2) a 40-conductor IDE cable with UDMA enabled. It is certainly possible that the intermittent problems were ... just intermittent (perhaps the 40-conductor cable works better in certain positions, or my heatsink change just coincided with doing less or more I/O-heavy tasks, or with not using the accelerated modes ...). The only unexplained thing is some weird no-POST situations I've had, but they seem to be gone for now.

TL;DR:
While troubleshooting intermittent problems with an old Super Socket 7 board, I dismantled the MB CPU socket to examine / clean it because I thought it was dirty / damaged, and broke it a little in the process (it still works, since only some plastic was broken).
It turns out there was nothing wrong with the socket; the problems were caused by:

  • a faulty HW monitoring chip
  • 40-conductor IDE cables instead of 80-conductor cables - always use 80-conductor (or disable the faster transfer modes)
  • dirty DIMM contacts and/or one faulty memory module

EDIT: fixed some typos and cleared up possibly confusing / ambiguous parts