Bad RAM or bad CPU + random freezing


  1. #1
    Join Date
    Aug 2011
    Posts
    33

    Bad RAM or bad CPU + random freezing

    Hey everyone,

    I have two issues I believe are unrelated, but I will hopefully get them resolved in this thread.

    My first issue is one that is more of a lack of skill. I'm Googling around and looking through the logs I have and I don't see anything that is helping me find an answer to my issue.

    I have a system with 2x Opteron 2435 and 8x 1GB of DDR2-400 registered RAM. I am getting a LOT of ECC correction messages. My first thought was to take all the memory out and try the system with each DIMM individually. I did that and thought I had found 2 culprit DIMMs. I don't think that is the case anymore. Here is what I noticed.

    When I boot my system everything usually goes fine; I have seen an ECC correction message during boot only once. Once the machine is up, I start X and browse the web for a bit. The errors usually start within the first 5 minutes. For each DIMM, I spent an hour or more browsing the net. I found that with 2 of the DIMMs, I get ECC messages within 5 minutes, and about 10 minutes after the first message there are about 50 corrections. The other 6 DIMMs seem fine.

    So I put the 6 DIMMs in and I go about my merry way. I boot the system and within 5 minutes, again, ECC correction messages. So I tried 1 CPU and then I tested the 6 good DIMMs again. They all tested fine. So I tested them in pairs. That seemed to be fine. Once I go beyond 2 DIMMs it seems that ECC messages like to pop up.
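
    Rather than inferring the faulty part from DIMM swaps alone, the per-row corrected-error counters that the kernel's EDAC subsystem exposes in sysfs can be compared before and after a test session. The following is a minimal sketch, assuming an EDAC driver for this platform (amd64_edac on these Opterons) is loaded and that the legacy csrow layout is present under /sys/devices/system/edac/mc. The function names are made up for illustration, and how a csrow maps to a physical slot depends on the board.

```python
from pathlib import Path

# Standard sysfs location of the EDAC memory-controller counters.
# Assumption: an EDAC driver is loaded, otherwise this directory is empty or absent.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {(controller, csrow): corrected_error_count} read from sysfs."""
    counts = {}
    for ce_file in sorted(Path(root).glob("mc*/csrow*/ce_count")):
        csrow = ce_file.parent.name
        controller = ce_file.parent.parent.name
        try:
            counts[(controller, csrow)] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip rows that cannot be read or parsed.
            continue
    return counts


def diff_counts(before, after):
    """Return only the rows whose corrected-error count grew between snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }


if __name__ == "__main__":
    # Print the current counters; run once before and once after a test session.
    for (controller, csrow), count in read_ce_counts().items():
        print(f"{controller}/{csrow}: {count} corrected errors")
```

    If the counts grow on the same csrow no matter which DIMM is seated there, that points away from the DIMM and toward the slot, board or CPU.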

    Here is 5 minutes of errors. The first error happens about 0.7 seconds past the 5-minute mark of uptime, which is just a coincidence. You will notice that sometimes it's only one message and sometimes it's many messages.

    Code:
    [  300.704037] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  300.704044] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  300.704048] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  300.704052] [Hardware Error]: Machine check events logged
    [  450.704032] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  450.704040] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  450.704044] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  450.704047] [Hardware Error]: Machine check events logged
    [  525.704033] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  525.704040] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  525.704045] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  525.704048] [Hardware Error]: Machine check events logged
    [  563.204023] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  563.204031] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  563.204035] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  581.954032] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  581.954040] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  581.954044] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  591.329039] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  591.329047] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  591.329051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  591.329056] [Hardware Error]: Machine check events logged
    [  596.016008] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  596.016016] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  596.016021] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  596.016024] [Hardware Error]: Machine check events logged
    [  598.359023] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  598.359032] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  598.359036] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  599.530020] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  599.530028] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  599.530033] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: RES
    [  600.115020] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.115031] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.115036] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.407041] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.407065] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.407085] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.553030] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.553037] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.553041] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.626036] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.626044] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.626048] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.663010] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.663030] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.663036] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.682002] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.682005] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.682013] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.692002] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.692005] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.692007] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.702006] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.702021] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.702029] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.712041] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.712049] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.712054] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.722044] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.722052] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.722056] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.732028] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.732035] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.732039] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: RES
    [  600.742025] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.742033] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.742037] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.752046] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.752052] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.752056] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.762514] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.762522] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.762526] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.772038] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.772046] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.772051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.782034] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.782041] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.782045] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.792048] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.792055] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.792059] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.802024] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.802032] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.802036] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.813021] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.813029] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.813034] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.823020] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.823026] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.823029] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.833008] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.833022] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.833026] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.843040] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.843050] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.843055] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: RES
    [  600.853016] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.853024] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.853029] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: RES
    [  600.863039] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.863046] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.863051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.883023] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.883043] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.883048] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    [  600.923004] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [  600.923024] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [  600.923027] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC

    How can I test the whole system to see if it's indeed bad RAM, the motherboard, or a CPU? I will be running memtest86 on this tonight, but other than that I'm out of ideas.
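
    To compare runs with different DIMM and CPU combinations, it may help to reduce a log like the one above to a count of corrected errors per minute and per northbridge node. This is a rough sketch that only assumes the line format shown in the log; the function name and the 60-second bucket size are arbitrary choices for illustration.

```python
import re
from collections import Counter

# Patterns taken from the message format in the log above.
EVENT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\[Hardware Error\]: MC4_STATUS: Corrected error"
)
NODE_RE = re.compile(r"\[Hardware Error\]: Northbridge Error, node (\d+):")


def summarize(lines, bucket_seconds=60):
    """Count corrected-error events per time bucket and per northbridge node."""
    per_bucket = Counter()
    per_node = Counter()
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            # Round the kernel timestamp down to the start of its bucket.
            start = int(float(event.group(1)) // bucket_seconds) * bucket_seconds
            per_bucket[start] += 1
            continue
        node = NODE_RE.search(line)
        if node:
            per_node[int(node.group(1))] += 1
    return per_bucket, per_node


if __name__ == "__main__":
    # Two lines copied from the log above, used as a small demonstration.
    sample = [
        "[  300.704037] [Hardware Error]: MC4_STATUS: Corrected error, "
        "other errors lost: yes, CPU context corrupt: no, CECC Error",
        "[  300.704044] [Hardware Error]: Northbridge Error, node 0: "
        "DRAM ECC error detected on the NB.",
    ]
    buckets, nodes = summarize(sample)
    for start in sorted(buckets):
        print(f"from {start}s: {buckets[start]} corrected events")
    for node_id in sorted(nodes):
        print(f"node {node_id}: {nodes[node_id]} events")
```

    Passing the lines of a saved dmesg capture to summarize() gives one row per minute of uptime. Since "other errors lost: yes" appears on every event, the real number of corrections is likely higher than what gets logged.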

    OS: Slackware 13.37, nearly fresh install. Installed GConf, ORBit2 and any other dependencies required for Google Chrome. I installed them from SlackBuilds.org downloads.
    Motherboard: New SuperMicro H8DAE-2
    Processor: 2 x New Opteron 2435 (hex-core Istanbul)
    Memory: 8x used 1GB DIMMs with HP stickers, different actual brands. PC2-3200R-333 - Some are labeled CL3, some are not; they all have the same HP part numbers.
    Graphics: New XFX ATI HD 6750 1GB w/ closed driver
    Hard Drive: OCZ Petrol 128GB SSD
    Power Supply: New Enermax NAXN 750W Modular Power Supply.
    Keyboard/Mouse: New Logitech USB keyboard and mouse

    My second issue is random freezing. I was experiencing this on another system: http://forums.justlinux.com/showthre...eezing-no-logs
    I now have a completely new system, except for the hard drive. EVERYTHING is new. Old system is as described:

    OS: Slackware 13.37 fresh install. I had lost a hard drive and this issue started about a week after getting the new one.
    Motherboard: Asus A8V-Deluxe
    Processor: AMD Athlon 64 3#00+ Not exactly sure. I'm sure I could check the basement if someone wanted me to.
    Memory: 4x 1GB DDR 400 random brands
    Graphics: nVidia Geforce E6200 w/ closed driver
    Hard Drive: OCZ Petrol 128GB SSD
    Power Supply: Thermaltake 420W
    Keyboard/Mouse: Future Shop brand Dynex keyboard and Logitech mouse.

    Only two things remain the same. The Operating System, and the Hard Drive.

    I would love to get to the bottom of the hard drive issue, but the memory diagnostics would be of more interest since those messages are more annoying...although random freezing seems to be quite annoying as well.

    [EDIT]
    Random freezing still occurred while I only had 2 DIMMs and no ECC messages.
    Last edited by WrinkledCheese; 09-16-2012 at 11:17 PM.

  2. #2
    Join Date
    Aug 2011
    Posts
    33
    Just a note, I've noticed that if I have the Opteron 2210 in my system I don't get these errors but if I have either of my 2435 CPUs in my system I get these errors.

    Bad CPU?

  3. #3
    Join Date
    Aug 2011
    Posts
    33
    SuperMicro H8DAE-2 supports Opteron 2400s, but only with Revision 2.01a. I have 2.01.

    Quote Originally Posted by SuperMicro support
    I just got confirm with the engineer. The motherboard revision is not new enough to support Opteron 2400. That’s why you have errors with 2400 but not with 2200.
    Last edited by WrinkledCheese; 01-17-2013 at 06:08 PM.

  4. #4
    Join Date
    Jan 2013
    Posts
    3
    Hi,

    I'm no expert in this area except by experience.

    As you may have noticed by my nick, you would be right in assuming that I live in the boonies. We experience very high lightning activity.

    I have lost at least three loaded computers due to storms, and my hunch is that part of your system has been fried (maybe not by lightning though). It sounds as if that CPU is a goner if the other CPU does not act up.

    You might want to google the CPU and the memory combo and see if anything shows up, then add your OS as a kicker.

    Back to lightning: a lot of my woes did not show up until the winter. Then I'd experience what you've experienced, and on my last beauty, it would just turn itself off randomly. (Gotta watch the modem/router connections -- they can zap a machine rather quickly and thoroughly.)

    Btw, I've learned my lesson - everything gets unplugged immediately after every session.

    Hope this helps a bit,
    booniesboy

  5. #5
    Join Date
    Aug 2011
    Posts
    33
    This issue has recently been "resolved".

    SuperMicro is a bunch of ***-hats.

    1) Their website doesn't contain the pertinent revision information: only the LAST revision of the H8DAE-2 motherboard supports the Opteron 2400 series. It does, however, say it supports hex-core Opteron 2000 series CPUs (which are only available in the 2400 series). Revision 2.01a supports 2400s; I have 2.01.

    2) They won't tell me what component change is required and they want me to set up an out-of-warranty RMA. They want to charge me $45 to diagnose the issue and they won't tell me how much it will cost unless I send them the board first.

    3) They're not sure if it's a BIOS issue or a component change...well some of their support techs don't know.

    Unsatisfied customer that will never buy another SuperMicro motherboard ever again.

    P.S. Wouldn't a surge protector with LAN surge protector built in resolve your issue with lightning? $35 is a small price to pay to save so much headache.
    Last edited by WrinkledCheese; 02-20-2013 at 09:58 PM.

  6. #6
    Join Date
    Aug 2013
    Location
    Manila
    Posts
    1
    Hello, I've recently been following this thread after I ordered an SM H8DAE-2 board with dual-core Opterons and some memory.

    I think what I got is an old revision, so apparently six-cores would either not work at all, or would POST but not be stable. When you say "resolved" in your previous post, does this mean you made six-core Opterons work on an older revision with just a BIOS update?

    Hope to hear from you!
