Hello everyone,
I put together a new build a few months ago, and ever since I got it running (long story) I have been getting ECC errors. I have replaced all the cheap components, and I want to check my suspects before replacing the expensive ones.
The build:
AMD Opteron 2435 x2
SuperMicro H8DAE-2
XFX AMD Radeon HD 6750
DDR2-400 RAM - 12 DIMMs, various (see below)
Enermax NAXN 750AWT
OCZ Petrol SSD
Slackware64 13.37
The errors - I get thousands of these three-line errors, with 0 UECs (uncorrected errors):
Code:
[47175.704033] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
[47175.704046] [Hardware Error]: Northbridge Error, node 0: DRAM ECC error detected on the NB.
[47175.704051] [Hardware Error]: Transaction: RD (MEM), no timeout, Cache Level: L3/GEN, Participating Processor: SRC
The only things that change between errors are as follows:
"Northbridge Error, node 0" alternates with "Northbridge Error, node 1". The switch is sometimes immediate and sometimes minutes later.
"Participating Processor: SRC" is sometimes "Participating Processor: RES"; this seems to be random.
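To put numbers on the pattern described above, the errors could be tallied by node and participating processor. This is a minimal sketch, assuming the kernel log has been saved to a text file in exactly the format quoted above; the function name `tally_nb_errors` is hypothetical.

```python
import re
from collections import Counter

# Patterns match the "Northbridge Error" and "Transaction" lines quoted above.
NODE_RE = re.compile(r"Northbridge Error, node (\d+)")
PROC_RE = re.compile(r"Participating Processor: (\w+)")


def tally_nb_errors(lines):
    """Count (node, participating processor) pairs in kernel log lines.

    Assumption: each "Northbridge Error" line is followed by its own
    "Participating Processor" line before the next record begins.
    """
    counts = Counter()
    node = None
    for line in lines:
        m = NODE_RE.search(line)
        if m:
            node = int(m.group(1))  # remember the node for this record
            continue
        m = PROC_RE.search(line)
        if m and node is not None:
            counts[(node, m.group(1))] += 1
            node = None  # record complete; wait for the next node line
    return counts
```

Calling `tally_nb_errors(open("dmesg.txt"))` (file name is only an example) returns a `Counter` keyed by `(node, "SRC")` or `(node, "RES")`. A heavy skew toward one node would point at the DIMMs or memory controller on that CPU rather than at something shared.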
The RAM:
8x HP 1GiB - Cheap eBay RAM
PC2-3200R-333-10
4x Hynix HYMP351R72AMP4-E3 4GiB - SuperMicro tested RAM list
PC2-3200R-333-12
What I have done:
Replaced cheap eBay RAM with SuperMicro-suggested RAM.
Changed BIOS settings for scrubbing CPU cache.
What I haven't done yet but plan to do (I will update this thread with the results):
Test DIMM by DIMM in a 1-CPU configuration, testing each DIMM twice, once on each CPU. I already did this with the HP RAM on one CPU, and none seemed bad until I added more than 2 DIMMs.
Disable ECC and run memtest86 overnight
I'm looking for other ECC diagnostic tests, as I have already done the "try each DIMM" test on one CPU with the HP RAM.
I only recently installed the second CPU, so the system ran in a 1-CPU configuration for a few months.
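One further diagnostic that might help is reading the per-chip-select corrected-error counters that the kernel's EDAC subsystem exposes in sysfs, since those could narrow the errors to particular DIMM slots. A minimal sketch, assuming an EDAC driver for this platform is loaded (I am guessing `amd64_edac`) and that it exposes the `csrow` layout under `/sys/devices/system/edac/mc`; the function name `read_ce_counts` is hypothetical.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow): corrected_error_count} read from EDAC sysfs.

    Assumption: counters live at <edac_root>/mcN/csrowM/ce_count.
    Returns an empty dict if no EDAC driver is loaded.
    """
    counts = {}
    pattern = os.path.join(edac_root, "mc*", "csrow*", "ce_count")
    for path in sorted(glob.glob(pattern)):
        csrow_dir = os.path.dirname(path)
        mc = os.path.basename(os.path.dirname(csrow_dir))  # e.g. "mc0"
        csrow = os.path.basename(csrow_dir)                # e.g. "csrow2"
        with open(path) as fh:
            counts[(mc, csrow)] = int(fh.read().strip())
    return counts
```

Printing `read_ce_counts()` before and after a load test would show which memory controller and chip-select row the corrected errors accumulate on. How a csrow maps to a physical DIMM slot depends on the board, so that part would still need the manual or some trial and error.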
System Temps:
Code:
sensors
w83793-i2c-1-2c
Adapter: SMBus nForce2 adapter at 2e00
VcoreA: +1.22 V (min = +1.08 V, max = +1.62 V)
VcoreB: +1.24 V (min = +1.08 V, max = +1.62 V)
in2: +1.09 V (min = +1.08 V, max = +1.33 V)
in3: +0.86 V (min = +0.00 V, max = +4.08 V)
in4: +0.85 V (min = +0.00 V, max = +4.08 V)
in5: +1.81 V (min = +1.62 V, max = +1.98 V)
in6: +1.82 V (min = +1.62 V, max = +1.98 V)
+5V: +5.19 V (min = +4.64 V, max = +5.65 V)
5VSB: +5.09 V (min = +4.64 V, max = +5.65 V)
Vbat: +3.06 V (min = +2.96 V, max = +3.63 V)
fan1: 0 RPM (min = 712 RPM) ALARM
fan2: 0 RPM (min = 712 RPM) ALARM
fan3: 0 RPM (min = 712 RPM) ALARM
fan4: 0 RPM (min = 712 RPM) ALARM
fan5: 0 RPM (min = 712 RPM) ALARM
fan6: 0 RPM (min = 712 RPM) ALARM
fan7: 2005 RPM (min = 712 RPM)
fan8: 1867 RPM (min = 712 RPM)
temp1: +13.0°C (high = +65.0°C, hyst = +60.0°C) sensor = thermal diode
temp2: +13.0°C (high = +65.0°C, hyst = +60.0°C) sensor = thermal diode
beep_enable:disabled
w83627hf-isa-0290
Adapter: ISA adapter
in0: +1.49 V (min = +1.34 V, max = +1.65 V)
in1: +1.39 V (min = +1.25 V, max = +1.54 V)
in2: +3.39 V (min = +2.96 V, max = +3.62 V)
in3: +3.06 V (min = +4.08 V, max = +2.03 V) ALARM
in4: +3.18 V (min = +2.83 V, max = +3.47 V)
in5: +0.59 V (min = +0.42 V, max = +0.88 V)
in6: +0.75 V (min = +4.06 V, max = +2.93 V) ALARM
in7: +3.31 V (min = +2.98 V, max = +3.63 V)
in8: +3.09 V (min = +2.96 V, max = +3.62 V)
fan1: 0 RPM (min = 3040 RPM, div = 2) ALARM
fan2: 0 RPM (min = 0 RPM, div = 2)
fan3: 0 RPM (min = 11842 RPM, div = 2) ALARM
temp1: +42.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp2: +35.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp3: +32.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
cpu0_vid: +1.550 V
beep_enable:enabled
mcelog.txt