I’ve got a really weird problem.
Two years ago I bought a set of 4x 16 GB DDR4-3200 modules, single sided, and put them in a Ryzen 5600 desktop computer, which I almost never turned off. It worked without issue.
This weekend I wanted to dust off the PC, so I took all the components out, replaced the thermal paste and so on.
Turned on the PC again and it apparently worked without issues, until after a while Linux got pissed about running out of memory. Out of memory? With 64 GB of RAM? I checked with dmidecode -t memory and saw that one channel was reporting completely empty.
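For anyone who wants to repeat that check without reading the whole dump by eye, here is a minimal Python sketch that lists which slots report a module. It assumes the usual `dmidecode -t memory` layout: blank-line-separated records, a "Memory Device" header, and "Locator" / "Size" fields, with "No Module Installed" for an empty slot. The function names are made up for illustration, and the exact field text may vary between dmidecode versions.

```python
import re
import subprocess


def parse_memory_devices(dmidecode_text):
    """Return a list of (locator, size) tuples, one per 'Memory Device' record."""
    devices = []
    # dmidecode separates records with a blank line.
    for record in re.split(r"\n\s*\n", dmidecode_text):
        lines = [line.strip() for line in record.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the array header and other record types
        fields = {}
        for line in lines:
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        devices.append((fields.get("Locator", "?"), fields.get("Size", "?")))
    return devices


def empty_slots(devices):
    """Locators of slots that report no module (assumed marker string)."""
    return [loc for loc, size in devices if size == "No Module Installed"]


def read_dmidecode():
    """Run dmidecode and return its text output (normally needs root)."""
    result = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run as root, `empty_slots(parse_memory_devices(read_dmidecode()))` should name the slots the firmware considers unpopulated, which makes it easy to compare before and after reseating.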
Shut down the PC, reseated the modules in the second channel, rebooted, saw 64 GB. One hour later, kernel panic. Rebooted into memtest86+, error in memory. What? Removed one module, error. Removed two modules, no error. Swapped the modules, no error. What??
Placed the two modules that are passing the test in another computer, error. Put back in the original computer, pass test. AAAAAAAAAAAAAAA
Now I’ve downclocked from 3200 to 2400 and everything seems to be working fine.
What could it be? Have I been cursed?
After a few reinsertions, do the slots degrade to the point that they can’t sustain 3200 anymore?
Maybe the contacts were damaged on reinsert? Not just degrading / wearing down, but physically damaged
… dust off the PC …
It’s not at all out of the question that some filth got into your connector(s). Hit them with a mess of canned air and try again?
It might be. After all, I took out all the components and then dusted the case with compressed air (I didn’t let the fans spin).
That’s the most common thing, happened to me multiple times. Even a very small amount of dust in the slot can cause issues like that.
I don’t think that you will see a difference in performance. :)
SO-DIMM and DIMM sockets have a somewhat limited rated durability of just 25 mating cycles. link
I never reached that limit. And I’m not sure if this is related to your case.
Wow I didn’t imagine that the connector was so fragile
Wow, I had no idea. Thanks for the link
I wonder what that 25 number actually means. It’s 25 across multiple slot types so I’m guessing it’s less a measured value and more a quality control number based on their most fragile product.
Probably something like a sample is cycled 25 times and if less than X% still test as being in spec they know something is wrong with the current batch, but again that’s mostly a guess and the actual durability experienced by the end user would vary significantly depending on what the acceptable failure rate is.
I think so too. Most likely most of the sockets will survive more than 25 cycles. Maybe it’s a specified minimum durability which is guaranteed for nearly all sockets.
Inspect the channels for debris. Hit the RAM contacts and slot with contact cleaner (don’t get any on your skin).
RAM is easily damaged by static discharge. Were you wearing a ground strap, and did you take care not to let the memory modules touch any ungrounded surfaces while you were handling them?
Static damage can often appear as marginal or intermittent failures, probably more often than complete failure.
No, I manhandled them and put them on a random shelf. I was under the impression that modern electronics are designed to withstand that kind of light abuse. I saw an ElectroBOOM video where he tries and fails to fry RAM with electrostatic discharge.
Newer components are if anything more vulnerable to ESD because they have more delicate construction.
Placed the two modules that are passing the test in another computer, error
So you put the RAM you thought was good in another motherboard and it failed memtest? I’d interpret that to mean one of 3 things:
A) the problem is in one of those modules you switched
B) separate problems occurred on both motherboards either due to unrelated issues or the memory being seated incorrectly (this is really unlucky)
C) there’s a problem with the modules you switched and an unrelated problem either in the other modules or in your primary motherboard (you poor bastard)
Did you take note of where in memory memtest was finding errors? If it wasn’t in the same general area between runs then it’s more likely to be a motherboard issue.
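If you do write the failing addresses down, a rough way to compare runs is to bucket them into coarse regions and see which regions show up every time. This is only a sketch of that heuristic in Python: the function names and the 256 MiB region size are arbitrary choices for illustration, and with dual-channel interleaving a physical address range does not map cleanly onto a single module, so treat the result as a hint at best.

```python
REGION_SIZE = 256 * 1024 * 1024  # 256 MiB buckets, an arbitrary choice


def failing_regions(addresses, region_size=REGION_SIZE):
    """Map failing physical addresses to the set of region indexes they fall in."""
    return {addr // region_size for addr in addresses}


def regions_common_to_all_runs(runs, region_size=REGION_SIZE):
    """Return the region indexes that contain at least one error in every run."""
    region_sets = [failing_regions(run, region_size) for run in runs]
    if not region_sets:
        return set()
    return set.intersection(*region_sets)
```

A non-empty result means the errors keep landing in the same general area; an empty one means they move around between runs.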
On the X370 Ryzen motherboard the test always failed at test #5 and it appeared to be shifted bytes (expected FEFEFEFE got 00FEFEFEFE)
On a lowest-end H series Intel motherboard it just beeps and won’t even boot in dual channel. In single channel it boots and passes the test. The Intel motherboard has those shitty RAM slots where there’s a clip on only one side and the other side is fixed (to save 1¢ I guess), so it’s a bit difficult to ensure proper contact
I’ve encountered oxidisation of the contacts before. You can try rubbing them with an ordinary eraser.
The gold plating on the contacts does degrade.
You put new thermal paste on things? Did you remove the CPU as well? You could have damaged some pins there too.
The delay before the failure sounds like it could be caused by the components expanding with heat.
Take it apart and look at the pins of the RAM, the RAM slots, and the CPU (if you removed that) for any damage.
I put the new thermal stuff only on the CPU, specifically that new Honeywell material. It’s a bit smaller than the CPU: I ordered 3x3 cm after measuring a Core i3 that I had on hand, while the Ryzen has a bigger IHS and would fit better with 4x4 cm.
I’m thinking maybe I tightened the cooler too much, but it’s the OEM one, so it shouldn’t allow overtightening because it has the stoppers on the threads… unless the Honeywell pad is too thick for that