Troubleshooting ECC errors


  1. #1
    Join Date
    Aug 2011
    Posts
    33

    Troubleshooting ECC errors

    Hello everyone,

    I put together a new build a few months ago, and ever since I got it running (long story) I have been getting ECC errors. I have replaced all the cheap components, and I want to check my remaining suspects before going out and replacing the expensive ones.

    The build:
    AMD Opteron 2435 x2
    SuperMicro H8DAE-2
    XFX AMD Radeon HD 6750
    DDR2-400 RAM - 12 DIMMs, various (see below)
    Enermax NAXN 750AWT
    OCZ Petrol SSD
    Slackware64 13.37

    The errors - I get thousands of these 3-line errors, with 0 UECs:
    Code:
    [47175.704033] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
    [47175.704046] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
    [47175.704051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
    The only things that change between errors are as follows:

    "Northbridge Error, node 0"
    alternates with
    "Northbridge Error, node 1".
    The switch sometimes happens immediately and sometimes minutes later.

    "Participating Processor: SRC"
    is sometimes
    "Participating Processor: RES";
    this seems to be random.
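
To check whether the node and "Participating Processor" values really are random, the records can be tallied instead of eyeballed. The following is a minimal sketch, not a definitive tool: the regular expressions are written against the three lines quoted above, and the names `SAMPLE` and `tally` are made up for illustration. A real run would feed it the full kernel log rather than the single sample record.

```python
import re
from collections import Counter

# One record copied from the kernel log quoted above. A real run would read
# the whole log (this sample is only here to keep the sketch self-contained).
SAMPLE = """\
[47175.704033] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
[47175.704046] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
[47175.704051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
"""

NODE_RE = re.compile(r"Northbridge Error, node (\d+)")
PP_RE = re.compile(r"Participating Processor: (\w+)")


def tally(log_text):
    """Count errors per node and per participating-processor value."""
    nodes, procs = Counter(), Counter()
    for line in log_text.splitlines():
        node_match = NODE_RE.search(line)
        if node_match:
            nodes[int(node_match.group(1))] += 1
        pp_match = PP_RE.search(line)
        if pp_match:
            procs[pp_match.group(1)] += 1
    return nodes, procs


nodes, procs = tally(SAMPLE)
print("errors per node:", dict(nodes))
print("participating processor:", dict(procs))
```

A strongly lopsided count per node would point at the DIMMs or socket behind that node, while an even split would suggest something shared by both.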

    The RAM:
    8x HP 1GiB - Cheap eBay RAM
    PC2-3200R-333-10
    4x Hynix HYMP351R72AMP4-E3 4GiB - SuperMicro tested RAM list
    PC2-3200R-333-12

    What I have done:
    Replaced the cheap eBay RAM with SuperMicro-recommended RAM.
    Changed BIOS settings for scrubbing CPU cache.

    What I haven't done yet but plan to do (I will update this thread with the results):
    Test DIMM by DIMM on each CPU in a 1-CPU configuration, testing each DIMM twice, once on each CPU. I already did this with the HP RAM on 1 CPU, and none seemed bad until I added more than 2 DIMMs.
    Disable ECC and run memtest86 overnight.

    I'm looking for other ECC diagnostic tests, as I have already done the "try each DIMM" test on one CPU with the HP RAM.

    I only recently installed the second CPU, so the system ran in a 1-CPU configuration for a few months.

    System Temps:
    Code:
    sensors
    w83793-i2c-1-2c
    Adapter: SMBus nForce2 adapter at 2e00
    VcoreA:      +1.22 V  (min =  +1.08 V, max =  +1.62 V)   
    VcoreB:      +1.24 V  (min =  +1.08 V, max =  +1.62 V)   
    in2:         +1.09 V  (min =  +1.08 V, max =  +1.33 V)   
    in3:         +0.86 V  (min =  +0.00 V, max =  +4.08 V)   
    in4:         +0.85 V  (min =  +0.00 V, max =  +4.08 V)   
    in5:         +1.81 V  (min =  +1.62 V, max =  +1.98 V)   
    in6:         +1.82 V  (min =  +1.62 V, max =  +1.98 V)   
    +5V:         +5.19 V  (min =  +4.64 V, max =  +5.65 V)   
    5VSB:        +5.09 V  (min =  +4.64 V, max =  +5.65 V)   
    Vbat:        +3.06 V  (min =  +2.96 V, max =  +3.63 V)   
    fan1:          0 RPM  (min =  712 RPM)  ALARM
    fan2:          0 RPM  (min =  712 RPM)  ALARM
    fan3:          0 RPM  (min =  712 RPM)  ALARM
    fan4:          0 RPM  (min =  712 RPM)  ALARM
    fan5:          0 RPM  (min =  712 RPM)  ALARM
    fan6:          0 RPM  (min =  712 RPM)  ALARM
    fan7:       2005 RPM  (min =  712 RPM)
    fan8:       1867 RPM  (min =  712 RPM)
    temp1:       +13.0°C  (high = +65.0°C, hyst = +60.0°C)  sensor = thermal diode
    temp2:       +13.0°C  (high = +65.0°C, hyst = +60.0°C)  sensor = thermal diode
    beep_enable:disabled
    
    w83627hf-isa-0290
    Adapter: ISA adapter
    in0:         +1.49 V  (min =  +1.34 V, max =  +1.65 V)   
    in1:         +1.39 V  (min =  +1.25 V, max =  +1.54 V)   
    in2:         +3.39 V  (min =  +2.96 V, max =  +3.62 V)   
    in3:         +3.06 V  (min =  +4.08 V, max =  +2.03 V)   ALARM
    in4:         +3.18 V  (min =  +2.83 V, max =  +3.47 V)   
    in5:         +0.59 V  (min =  +0.42 V, max =  +0.88 V)   
    in6:         +0.75 V  (min =  +4.06 V, max =  +2.93 V)   ALARM
    in7:         +3.31 V  (min =  +2.98 V, max =  +3.63 V)   
    in8:         +3.09 V  (min =  +2.96 V, max =  +3.62 V)   
    fan1:          0 RPM  (min = 3040 RPM, div = 2)  ALARM
    fan2:          0 RPM  (min =    0 RPM, div = 2)
    fan3:          0 RPM  (min = 11842 RPM, div = 2)  ALARM
    temp1:       +42.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
    temp2:       +35.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
    temp3:       +32.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
    cpu0_vid:   +1.550 V
    beep_enable:enabled
    Attachment: mcelog.txt
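
Two of the voltage ALARM flags in the w83627hf output (in3 and in6) sit on lines where the configured minimum is higher than the maximum. The sketch below flags such lines. It assumes the line layout shown in the `sensors` output above, and `inverted_limits` is a hypothetical helper name. Inverted limits would suggest the alarm comes from the sensor configuration rather than from a real voltage fault, but that reading is an assumption.

```python
import re

# Matches lines such as:
#   in3:         +3.06 V  (min =  +4.08 V, max =  +2.03 V)   ALARM
LINE_RE = re.compile(
    r"^\s*(\S+):\s+\+?([\d.]+) V\s+\(min =\s+\+?([\d.]+) V, max =\s+\+?([\d.]+) V\)"
)

# Three lines copied from the w83627hf section of the output above.
SAMPLE = """\
in2:         +3.39 V  (min =  +2.96 V, max =  +3.62 V)
in3:         +3.06 V  (min =  +4.08 V, max =  +2.03 V)   ALARM
in6:         +0.75 V  (min =  +4.06 V, max =  +2.93 V)   ALARM
"""


def inverted_limits(text):
    """Return the names of voltage inputs whose min limit exceeds the max."""
    flagged = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a voltage line (fans, temperatures, headers)
        name = match.group(1)
        low, high = float(match.group(3)), float(match.group(4))
        if low > high:
            flagged.append(name)
    return flagged


print(inverted_limits(SAMPLE))
```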
    Last edited by WrinkledCheese; 11-27-2012 at 10:53 PM. Reason: Added MCELOG

  2. #2
    I added the mcelog output to the original post. I noticed a few things, but I'm having trouble finding information on exactly what everything means. I'm looking for the AMD x86_64 programmer's guide mentioned in the mcelog man page.

    These are the two lines I found in common among all the errors:
    CPU 6 4 northbridge
    CPU 0 4 northbridge
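
One way to read these two lines is as "CPU number, bank number, bank name". The sketch below splits them on that assumption and maps each CPU number to a socket. The Opteron 2435 is a six-core part, but the socket-by-socket numbering of logical CPUs (0-5, then 6-11) is an assumption about this system, and both function names are made up for illustration.

```python
CORES_PER_SOCKET = 6  # the Opteron 2435 has six cores


def parse_mcelog_header(line):
    """Split a 'CPU <cpu> <bank> <bank name>' line into its three fields."""
    parts = line.split(maxsplit=3)
    if len(parts) != 4 or parts[0] != "CPU":
        raise ValueError(f"unexpected line: {line!r}")
    return int(parts[1]), int(parts[2]), parts[3]


def socket_of(cpu):
    # Assumption: logical CPUs are numbered socket by socket (0-5, 6-11).
    return cpu // CORES_PER_SOCKET


for line in ("CPU 6 4 northbridge", "CPU 0 4 northbridge"):
    cpu, bank, name = parse_mcelog_header(line)
    print(f"CPU {cpu}: bank {bank} ({name}), socket {socket_of(cpu)}")
```

Under those assumptions, CPU 0 and CPU 6 are the first core of each socket. That would line up with the "node 0" and "node 1" messages in the kernel log, and bank 4 with the MC4_STATUS line.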

  3. #3
    Another thing I've noticed is that the first error always occurs at the 5-minute mark, exactly 300 seconds after boot. It seems odd to me that a hardware error would follow such a regular timed pattern.


    The output below comes from /var/log/messages through /var/log/messages.4. There are more reboots in the past two days because I have been trying to solve this issue since I got my new 4x 4GiB Hynix RAM (SuperMicro-recommended).
    Code:
    cat /var/log/messages* | egrep "\[\s\s300\.|\[\s\s450\."
    Nov 26 21:38:21 roxbox kernel: [  300.704048] [Hardware Error]: Machine check events logged
    Nov 26 21:40:51 roxbox kernel: [  450.704036] [Hardware Error]: Machine check events logged
    Nov 27 02:23:41 roxbox kernel: [  300.704071] [Hardware Error]: Machine check events logged
    Nov 27 02:26:11 roxbox kernel: [  450.704071] [Hardware Error]: Machine check events logged
    Nov 27 02:56:15 roxbox kernel: [  300.704075] [Hardware Error]: Machine check events logged
    Nov 27 02:58:45 roxbox kernel: [  450.704076] [Hardware Error]: Machine check events logged
    Nov 27 06:25:40 roxbox kernel: [  300.686101] [Hardware Error]: Machine check events logged
    Nov 27 06:25:40 roxbox kernel: [  300.705056] [Hardware Error]: Machine check events logged
    Nov 27 06:28:10 roxbox kernel: [  450.686077] [Hardware Error]: Machine check events logged
    Nov 27 06:28:10 roxbox kernel: [  450.705073] [Hardware Error]: Machine check events logged
    Nov 27 06:50:03 roxbox kernel: [  300.686078] [Hardware Error]: Machine check events logged
    Nov 27 06:50:03 roxbox kernel: [  300.704055] [Hardware Error]: Machine check events logged
    Nov 27 06:52:33 roxbox kernel: [  450.686069] [Hardware Error]: Machine check events logged
    Nov 27 06:52:33 roxbox kernel: [  450.704041] [Hardware Error]: Machine check events logged
    Nov 27 23:02:59 roxbox kernel: [  300.686069] [Hardware Error]: Machine check events logged
    Nov 27 23:02:59 roxbox kernel: [  300.704057] [Hardware Error]: Machine check events logged
    Nov 27 23:05:29 roxbox kernel: [  450.686071] [Hardware Error]: Machine check events logged
    Nov 27 23:05:29 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov 27 23:38:10 roxbox kernel: [  300.686074] [Hardware Error]: Machine check events logged
    Nov 27 23:38:10 roxbox kernel: [  300.704074] [Hardware Error]: Machine check events logged
    Nov 27 23:40:40 roxbox kernel: [  450.686090] [Hardware Error]: Machine check events logged
    Nov 27 23:40:40 roxbox kernel: [  450.705067] [Hardware Error]: Machine check events logged
    Nov 28 01:07:20 roxbox kernel: [  300.704429] [Hardware Error]: Machine check events logged
    Nov 28 01:09:50 roxbox kernel: [  450.704419] [Hardware Error]: Machine check events logged
    Nov 28 02:08:28 roxbox kernel: [  300.686066] [Hardware Error]: Machine check events logged
    Nov 28 02:08:28 roxbox kernel: [  300.704068] [Hardware Error]: Machine check events logged
    Nov 28 02:10:58 roxbox kernel: [  450.686080] [Hardware Error]: Machine check events logged
    Nov 28 02:10:58 roxbox kernel: [  450.704074] [Hardware Error]: Machine check events logged
    Nov 28 02:56:17 roxbox kernel: [  300.704066] [Hardware Error]: Machine check events logged
    Nov 28 02:58:47 roxbox kernel: [  450.704058] [Hardware Error]: Machine check events logged
    Nov 21 21:16:17 roxbox kernel: [  300.704038] [Hardware Error]: Machine check events logged
    Nov 21 21:18:47 roxbox kernel: [  450.704023] [Hardware Error]: Machine check events logged
    Nov 15 10:23:50 roxbox kernel: [  300.704035] [Hardware Error]: Machine check events logged
    Nov 15 10:26:20 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov  2 02:35:43 roxbox kernel: [  300.704025] [Hardware Error]: Machine check events logged
    Nov  2 02:38:13 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov  3 02:22:22 roxbox kernel: [  300.704026] [Hardware Error]: Machine check events logged
    Nov  3 02:24:52 roxbox kernel: [  450.704036] [Hardware Error]: Machine check events logged
    Code:
    cat /var/log/messages* | egrep "\[Hardware\sError\]"
    Nov 26 21:38:21 roxbox kernel: [  300.704048] [Hardware Error]: Machine check events logged
    Nov 26 21:40:51 roxbox kernel: [  450.704036] [Hardware Error]: Machine check events logged
    Nov 26 21:42:06 roxbox kernel: [  525.704048] [Hardware Error]: Machine check events logged
    Nov 26 21:43:11 roxbox kernel: [  591.329054] [Hardware Error]: Machine check events logged
    Nov 26 21:43:16 roxbox kernel: [  596.016047] [Hardware Error]: Machine check events logged
    Nov 27 02:23:41 roxbox kernel: [  300.704071] [Hardware Error]: Machine check events logged
    Nov 27 02:26:11 roxbox kernel: [  450.704071] [Hardware Error]: Machine check events logged
    Nov 27 02:56:15 roxbox kernel: [  300.704075] [Hardware Error]: Machine check events logged
    Nov 27 02:58:45 roxbox kernel: [  450.704076] [Hardware Error]: Machine check events logged
    Nov 27 03:00:00 roxbox kernel: [  525.704080] [Hardware Error]: Machine check events logged
    Nov 27 03:01:06 roxbox kernel: [  591.329090] [Hardware Error]: Machine check events logged
    Nov 27 03:01:11 roxbox kernel: [  596.016080] [Hardware Error]: Machine check events logged
    Nov 27 06:25:40 roxbox kernel: [  300.686101] [Hardware Error]: Machine check events logged
    Nov 27 06:25:40 roxbox kernel: [  300.705056] [Hardware Error]: Machine check events logged
    Nov 27 06:28:10 roxbox kernel: [  450.686077] [Hardware Error]: Machine check events logged
    Nov 27 06:28:10 roxbox kernel: [  450.705073] [Hardware Error]: Machine check events logged
    Nov 27 06:29:25 roxbox kernel: [  525.686083] [Hardware Error]: Machine check events logged
    Nov 27 06:29:25 roxbox kernel: [  525.705066] [Hardware Error]: Machine check events logged
    Nov 27 06:30:31 roxbox kernel: [  591.311074] [Hardware Error]: Machine check events logged
    Nov 27 06:30:31 roxbox kernel: [  591.330070] [Hardware Error]: Machine check events logged
    Nov 27 06:50:03 roxbox kernel: [  300.686078] [Hardware Error]: Machine check events logged
    Nov 27 06:50:03 roxbox kernel: [  300.704055] [Hardware Error]: Machine check events logged
    Nov 27 06:52:33 roxbox kernel: [  450.686069] [Hardware Error]: Machine check events logged
    Nov 27 06:52:33 roxbox kernel: [  450.704041] [Hardware Error]: Machine check events logged
    Nov 27 06:53:48 roxbox kernel: [  525.686077] [Hardware Error]: Machine check events logged
    Nov 27 06:54:25 roxbox kernel: [  563.186064] [Hardware Error]: Machine check events logged
    Nov 27 23:02:59 roxbox kernel: [  300.686069] [Hardware Error]: Machine check events logged
    Nov 27 23:02:59 roxbox kernel: [  300.704057] [Hardware Error]: Machine check events logged
    Nov 27 23:05:29 roxbox kernel: [  450.686071] [Hardware Error]: Machine check events logged
    Nov 27 23:05:29 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov 27 23:06:44 roxbox kernel: [  525.686098] [Hardware Error]: Machine check events logged
    Nov 27 23:06:44 roxbox kernel: [  525.704054] [Hardware Error]: Machine check events logged
    Nov 27 23:07:50 roxbox kernel: [  591.311065] [Hardware Error]: Machine check events logged
    Nov 27 23:07:50 roxbox kernel: [  591.329053] [Hardware Error]: Machine check events logged
    Nov 27 23:38:10 roxbox kernel: [  300.686074] [Hardware Error]: Machine check events logged
    Nov 27 23:38:10 roxbox kernel: [  300.704074] [Hardware Error]: Machine check events logged
    Nov 27 23:40:40 roxbox kernel: [  450.686090] [Hardware Error]: Machine check events logged
    Nov 27 23:40:40 roxbox kernel: [  450.705067] [Hardware Error]: Machine check events logged
    Nov 27 23:41:55 roxbox kernel: [  525.686072] [Hardware Error]: Machine check events logged
    Nov 27 23:41:55 roxbox kernel: [  525.705074] [Hardware Error]: Machine check events logged
    Nov 27 23:43:00 roxbox kernel: [  591.311082] [Hardware Error]: Machine check events logged
    Nov 27 23:43:00 roxbox kernel: [  591.330070] [Hardware Error]: Machine check events logged
    Nov 27 23:48:37 roxbox kernel: [  928.351072] [Hardware Error]: Machine check events logged
    Nov 27 23:48:37 roxbox kernel: [  928.352082] [Hardware Error]: Machine check events logged
    Nov 27 23:49:59 roxbox kernel: [ 1010.271060] [Hardware Error]: Machine check events logged
    Nov 27 23:49:59 roxbox kernel: [ 1010.272082] [Hardware Error]: Machine check events logged
    Nov 27 23:51:01 roxbox kernel: [ 1071.711061] [Hardware Error]: Machine check events logged
    Nov 27 23:51:01 roxbox kernel: [ 1071.712059] [Hardware Error]: Machine check events logged
    Nov 27 23:54:05 roxbox kernel: [ 1256.033059] [Hardware Error]: Machine check events logged
    Nov 27 23:54:05 roxbox kernel: [ 1256.043081] [Hardware Error]: Machine check events logged
    Nov 27 23:55:07 roxbox kernel: [ 1317.473083] [Hardware Error]: Machine check events logged
    Nov 27 23:55:07 roxbox kernel: [ 1317.484078] [Hardware Error]: Machine check events logged
    Nov 28 01:07:20 roxbox kernel: [  300.704429] [Hardware Error]: Machine check events logged
    Nov 28 01:09:50 roxbox kernel: [  450.704419] [Hardware Error]: Machine check events logged
    Nov 28 01:11:05 roxbox kernel: [  525.704418] [Hardware Error]: Machine check events logged
    Nov 28 01:12:11 roxbox kernel: [  591.329419] [Hardware Error]: Machine check events logged
    Nov 28 01:12:16 roxbox kernel: [  596.016406] [Hardware Error]: Machine check events logged
    Nov 28 02:08:28 roxbox kernel: [  300.686066] [Hardware Error]: Machine check events logged
    Nov 28 02:08:28 roxbox kernel: [  300.704068] [Hardware Error]: Machine check events logged
    Nov 28 02:10:58 roxbox kernel: [  450.686080] [Hardware Error]: Machine check events logged
    Nov 28 02:10:58 roxbox kernel: [  450.704074] [Hardware Error]: Machine check events logged
    Nov 28 02:12:13 roxbox kernel: [  525.686076] [Hardware Error]: Machine check events logged
    Nov 28 02:12:13 roxbox kernel: [  525.705058] [Hardware Error]: Machine check events logged
    Nov 28 02:13:19 roxbox kernel: [  591.311073] [Hardware Error]: Machine check events logged
    Nov 28 02:13:19 roxbox kernel: [  591.331058] [Hardware Error]: Machine check events logged
    Nov 28 02:56:17 roxbox kernel: [  300.704066] [Hardware Error]: Machine check events logged
    Nov 28 02:58:47 roxbox kernel: [  450.704058] [Hardware Error]: Machine check events logged
    Nov 28 03:00:02 roxbox kernel: [  525.704062] [Hardware Error]: Machine check events logged
    Nov 28 03:01:07 roxbox kernel: [  591.329060] [Hardware Error]: Machine check events logged
    Nov 28 03:01:12 roxbox kernel: [  596.016059] [Hardware Error]: Machine check events logged
    Nov 21 21:16:17 roxbox kernel: [  300.704038] [Hardware Error]: Machine check events logged
    Nov 21 21:18:47 roxbox kernel: [  450.704023] [Hardware Error]: Machine check events logged
    Nov 21 21:20:02 roxbox kernel: [  525.704036] [Hardware Error]: Machine check events logged
    Nov 21 21:21:08 roxbox kernel: [  591.329055] [Hardware Error]: Machine check events logged
    Nov 21 21:21:12 roxbox kernel: [  596.016048] [Hardware Error]: Machine check events logged
    Nov 14 14:29:41 roxbox kernel: [ 2400.704046] [Hardware Error]: Machine check events logged
    Nov 14 14:32:11 roxbox kernel: [ 2550.704037] [Hardware Error]: Machine check events logged
    Nov 14 14:33:26 roxbox kernel: [ 2625.704048] [Hardware Error]: Machine check events logged
    Nov 14 14:34:31 roxbox kernel: [ 2691.329047] [Hardware Error]: Machine check events logged
    Nov 14 14:34:36 roxbox kernel: [ 2696.016050] [Hardware Error]: Machine check events logged
    Nov 15 10:23:50 roxbox kernel: [  300.704035] [Hardware Error]: Machine check events logged
    Nov 15 10:26:20 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov 15 10:27:35 roxbox kernel: [  525.704035] [Hardware Error]: Machine check events logged
    Nov 15 10:28:40 roxbox kernel: [  591.329047] [Hardware Error]: Machine check events logged
    Nov 15 10:28:45 roxbox kernel: [  596.016048] [Hardware Error]: Machine check events logged
    Nov  2 02:35:43 roxbox kernel: [  300.704025] [Hardware Error]: Machine check events logged
    Nov  2 02:38:13 roxbox kernel: [  450.704054] [Hardware Error]: Machine check events logged
    Nov  2 02:39:28 roxbox kernel: [  525.704023] [Hardware Error]: Machine check events logged
    Nov  2 02:40:33 roxbox kernel: [  591.329058] [Hardware Error]: Machine check events logged
    Nov  2 02:40:38 roxbox kernel: [  596.016029] [Hardware Error]: Machine check events logged
    Nov  3 02:22:22 roxbox kernel: [  300.704026] [Hardware Error]: Machine check events logged
    Nov  3 02:24:52 roxbox kernel: [  450.704036] [Hardware Error]: Machine check events logged
    Nov  3 02:26:07 roxbox kernel: [  525.704047] [Hardware Error]: Machine check events logged
    Nov  3 02:27:13 roxbox kernel: [  591.329037] [Hardware Error]: Machine check events logged
    Nov  3 02:27:17 roxbox kernel: [  596.016049] [Hardware Error]: Machine check events logged
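
The timestamps within one boot are not just regular; the gaps between them shrink by halves. The sketch below takes the five stamps from the Nov 26 boot above and checks them against a schedule that starts at 300 s and halves its interval at every step. The names `halving_schedule` and `fits_schedule` are made up for illustration. A fit would be consistent with the kernel polling for corrected errors on a timer and reporting what it finds, rather than with the hardware failing on a schedule. That interpretation is an assumption and is not confirmed anywhere in this thread.

```python
# Uptime stamps (seconds) from the Nov 26 boot in the log above.
OBSERVED = [300.704048, 450.704036, 525.704048, 591.329054, 596.016047]


def halving_schedule(first, steps):
    """Times produced by an interval that starts at `first` and halves each step."""
    t, gap, out = first, first, []
    for _ in range(steps):
        out.append(t)
        gap /= 2
        t += gap
    return out


def fits_schedule(observed, first=300.0, steps=7, tolerance=0.01):
    """True if every observed stamp, minus a constant offset, is on the schedule."""
    offset = observed[0] - first  # constant start-up offset, about 0.704 s here
    predicted = halving_schedule(first, steps)
    return all(
        any(abs(stamp - offset - p) < tolerance for p in predicted)
        for stamp in observed
    )


gaps = [round(b - a, 3) for a, b in zip(OBSERVED, OBSERVED[1:])]
print("gaps between messages:", gaps)
print("predicted times:", halving_schedule(300.0, 7))
print("all stamps on the halving schedule:", fits_schedule(OBSERVED))
```

Not every predicted time shows up in every boot. The 562.5 s step, for example, appears only once above, as 563.186 in the Nov 27 06:54 entry.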
    Last edited by WrinkledCheese; 11-28-2012 at 02:00 AM.
