
Guidelines for Thorough Stability Testing

This is exactly what happened with my modules: failing at one particular spot even at stock. This happened only for one set of modules; the other set worked OK, but with some instability in Windows. I suspected bad RAM, but when tested on another board, it ran memtest for over 11 hours with 0 errors. The only thing I can conclude is that the DIMM slot is bad.
 
Yeah man, thanks. Is there anything you can tell me about the fact that it returns the error at the same "spot" in the RAM? Should I ask the memtest folks about that? I guess I should get the newest version then. As for vdimm, I'm using the mobo max of 2.85 V and the RAM is at about 208 MHz, which should be alright; I think the max it can reach at 2.85 V before you really need more is like 215-220 MHz, so I don't think that's it. What I will do, however, is let it run the new memtest all day when I leave for work shortly, and like I said, I wonder if it's something to do with my RAM itself that makes the test return errors at the same spot.

You could very well have a bad memory chip on one of the sticks :-/.

I would try removing both sticks and testing them one at a time, to isolate which stick has the problem. You could also try rotating the sticks between slots individually, to rule out a bad slot as a potential source of error. Some motherboards are known to have issues with certain RAM slots; your MSI Neo2 is notorious for similar issues :-/.

If one of the sticks doesn't run stably at stock speeds, you should RMA it as soon as possible - OCZ is a superb company, and has absolutely excellent customer support services for issues such as this. :)
 
Hmm, OK. When I left it running this morning with the new version of Memtest86+ (I was using a version I downloaded from the OCZ website before), it was almost done with test 7 and no errors had popped up yet. When I get home later (it will have been running for about 12 hours by then), I'll see if that same spot throws errors, or any at all.

Thanks for the comments. It would suck to have a bad DIMM slot, but at the same time it's already passed multiple 24-hour runs of Prime95, and of course no crashes in anything at all. So yeah, I'll let y'all know what's up when I get home.
 
OK, got home and no errors at all. It completed about 14 hours and about 50 passes, with no errors or fails. I guess the older version of memtest I was using was bad/corrupt.

So, yay. Thanks, all!
 
felinusz said:
Because 3DMark will almost always lock within the first ten test run loops, if it's going to lock at all, 4 hours of looped testing is more than sufficient for your machine. After four hours of looped testing, 3DMark has done all it really can, and further looping isn't going to catch any instability.

Here's an interesting case. My rig is memtest86 stable for 24+ hours (actually, 24 hours 11 minutes before I hit ESC) with 0 errors, and Prime stable for 14 hours (I wanted to use the machine, so I ended the test).

With my rig (see sig), I ran 3DMark for about 5 hours looping and it didn't crash. I was tired and I went to sleep, and when I woke up about 4 hours later, I was staring at the "3DMark MFC Application has encountered a problem and needs to close" screen and my box was hard locked. So between hours 5 and 9, something in my box gave up the ghost and it crashed.

Any thoughts?
 
Interesting :). It could potentially be environmental - i.e. your ambient temperature rose enough to compromise your machine's stability - but that's unlikely given that your machine handled both Prime and memtest for lengthy time periods.

Of course, the instability could also be processor related; it sounds nit-picky, but 14 hours of Prime isn't enough (in my experience/opinion) for peace of mind.

Keep in mind that it isn't an impossibility for errors to occur in, say, the 10th hour of looped testing with 3DMark; a 'potential' for instability always exists, and the best we can do is to be as thorough as possible :).

To resolve the problem and find the source, I'd start by running Prime for an extended time period, and then go from there.
 
Mate,

Could Windoze have issues as well? We seem to be discounting the fact that perfectly stable rigs may look unstable because of Windoze. I have yet to see someone characterize a failure along the lines of the OS.
 
Super Nade, that is quite an excellent point dude, thanks for bringing it up :)!

Personally, I feel that the Windows OS is a particularly sketchy piece of software. I only use it so that I can game and bench. This is my personal opinion, based on my personal experience. I'm not trying to trash-talk Microsoft or their products here, and am in no way trying to impose my opinion on others :).

However, since stability issues caused by our operating systems are so hard to properly diagnose, I prefer to avoid blaming the Windows OS for instability, or even mentioning it in the same sentence as "overclocking" for that matter. It quickly becomes a catch-all, similar to the "it's all because of a bad Athlon64 memory controller" excuse.

I have personally experienced instability as a direct and traceable result of poor OS integrity in the past (as-in, a software patch or update alleviated the problem), and I was quite furious to discover (after much time spent trying to find the source of the problem) that my OS was solely to blame.

All I can say and recommend is that we keep Windows as up-to-date and patched as possible. Our reliance is upon the folks over at Microsoft to constantly improve the integrity of their product. Awfully dodgy... isn't it? :)

So, yes, software could well be the cause of this problem, or any problem for that matter. That said, it is quite likely that the problem is fixable through hardware adjustment. As an overclocker, I view the OS simply as something that we need to work with, and work around, when it comes to stability and overclocking, unfortunately.
 
Well, after more testing I have found some more information. Running Prime with small FFTs, it happily Primed for 12 hours. I was bored and decided to test further, and switched it to large FFTs, and it went POW after 3 minutes. :confused:

Confused by this, I went back into my BIOS, and ran the Vcore up a bit. Back into Windows, Prime for 34 minutes and then WHOP! So I figured, I seem to have found something interesting here. Vcore up some more, back to Prime, and right now it's been running for just under 5 hours and thus far no KABLOOIE!

So, obviously small FFTs and large FFTs are tests that both need to be run, if you are like me and hate Blend. Blend can eat me, I've already memtested the rig for 24+ hours and I'm trying to test the CPU, so could you please stop paging the entire OS to disk already Prime? ;)

So, I guess I have learned something today. Small FFTs is great for just testing the FPU. Large FFTs is important for testing the whole CPU and how it interacts with the rest of your system. In the case of the A64, I would imagine that Large FFTs bring the HyperTransport controller into play, whereas with Small FFTs it's idle. Either way, it seems that for my box Small FFTs was insufficient to find instability; I needed Large FFTs. I hope this information is helpful to someone. :)
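By the way, the core idea behind this kind of testing is simple: run a deterministic computation with a known answer over and over, and flag the first run that comes back different. Here's a toy Python sketch of that principle (purely illustrative, not Prime95's actual algorithm or code):

```python
# Toy stability check in the spirit of Prime95's torture test:
# repeat a deterministic floating-point workload and flag any run
# whose result deviates from the first. On stable hardware, every
# iteration produces a bit-identical result.

def fp_workload(n=50_000):
    """Deterministic FP-heavy loop; same answer every time on stable hardware."""
    acc = 0.0
    x = 1.000001
    for i in range(1, n):
        acc += x / i
        x *= 1.0000001
    return acc

def stress(iterations=20):
    reference = fp_workload()
    for i in range(iterations):
        if fp_workload() != reference:
            # Prime95 would report "hardware failure detected" here
            return f"FAIL at iteration {i}"
    return "PASS"

if __name__ == "__main__":
    print(stress())
```

A real torture test differs mainly in scale: much bigger working sets (that's what the FFT size controls) and hours of runtime instead of seconds.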
 
felinusz:
You make an excellent point by bringing up the "catch all" situation. However, it would be nice if we could figure out a clear-cut way to separate hardware and software issues. Following your and gnufsh's lead on this matter, I'd like to propose a few simple tests.

  • The majority of software errors are random in nature and require a particular trigger to occur. By downclocking the system, one could first verify whether it is indeed a software error and, upon further analysis, isolate the trigger. Once the trigger is isolated, we can continue testing minus one variable. A case in point would be data corruption when memory settings are too tight: the software failure persists even after the hardware issue has been resolved, and one could spend endless hours tinkering with the hardware settings without realizing it's an OS problem.

  • As gnufsh said, we could test our hardware settings on another OS. Since I have two HDDs, I am going to use one as my OC test drive, with both Linux and XP on it: XP to verify that my main HDD (with its own copy of XP) works if the test HDD works, and Linux to zero in on XP's inherent OS problems.

I hope you can add a Linux stress testing section to this guide.

gnufsh,
Could you please list the stuff you use to run stability tests in Linux?

Mr.Cornell
Thanks for the tip, mate! How about setting the size of your FFT equal to the size of the L1 + L2 cache? That should isolate the CPU, right?
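Back-of-the-envelope for that idea (this assumes Prime95's FFT sizes count 8-byte double-precision elements, and uses example Athlon 64 cache sizes; swap in your own chip's numbers):

```python
# Rough sizing: how large an FFT fits entirely in cache?
# Cache sizes below are examples (Athlon 64: 64 KB L1 data + 512 KB L2);
# substitute your own CPU's figures.
L1_DATA_KB = 64
L2_KB = 512
BYTES_PER_ELEMENT = 8   # one double-precision float per FFT element

cache_bytes = (L1_DATA_KB + L2_KB) * 1024
elements = cache_bytes // BYTES_PER_ELEMENT

print(f"a ~{elements // 1024}K-element FFT roughly fills L1+L2")
# -> a ~72K-element FFT roughly fills L1+L2
```

So on those example numbers, anything much bigger than a 72K FFT spills out of cache and starts exercising the memory subsystem rather than just the CPU core.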

:beer:

S-N
 
From Linux I generally run prime95 and cpuburn. I finish off with a kernel compile. If all goes well, I call it stable cpu-wise. I run memtest86 as well, but that runs on its own without an OS. That fairly well covers cpu and memory testing. I haven't found a good stability test for graphics on Linux yet.
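If it helps, here's roughly how such a sequence can be chained from a script. The command strings are placeholders (adjust them to however you actually invoke mprime, cpuburn, or your kernel build on your system):

```python
import subprocess

# Run each stress command in sequence and collect its exit status.
# A nonzero return code (or a crash/lockup before the script finishes)
# marks that stage as a failure.
def run_suite(commands):
    results = {}
    for name, cmd in commands:
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results[name] = proc.returncode
    return results

if __name__ == "__main__":
    # Placeholder commands -- substitute your real invocations here.
    suite = [
        ("prime95", "echo 'mprime torture test would run here'"),
        ("cpuburn", "echo 'burnK7 / burnP6 would run here'"),
        ("kernel",  "echo 'kernel compile would run here'"),
    ]
    for name, rc in run_suite(suite).items():
        print(f"{name}: {'ok' if rc == 0 else f'FAILED ({rc})'}")
```

Of course, a hard lock kills the script along with everything else, so the exit-code report mostly helps with the softer failures (a worker thread stopping, a compile segfaulting).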
 
Mr. Cornell

So, obviously small FFTs and large FFTs are tests that both need to be run, if you are like me and hate Blend. Blend can eat me, I've already memtested the rig for 24+ hours and I'm trying to test the CPU, so could you please stop paging the entire OS to disk already Prime?


I put the Blend test in the Guidelines because it is stressful on the entire system, in addition to being CPU-oriented. A Blended Prime95 test run for an adequate time period is quite thorough as a measure of processor stability, in my experience. To make things as thorough as possible, "part-specific system-tests" such as Blended Prime95 and the CPU-intensive 3DMark2001SE (two tests which are "part-specific" to your processor and video respectively, yet also stress large portions of the system to a considerable degree) are recommended in the guidelines :).

However, I have added a small note about Large FFT testing; your experience is a valuable one :).

GUIDELINE EDIT

For CPU-specific testing, a Large FFT Prime95 test is an alternative to the more "system-stress" oriented Blend test that runs by default. The choice is ultimately up to the end user; the Blend test is recommended in these guidelines because of its qualities as both a processor and system stress test.


Super Nade

You make an excellent point by bringing up the "catch all" situation. However, it would be nice if we could figure out a clear-cut way to separate hardware and software issues. Following your and gnufsh's lead on this matter, I'd like to propose a few simple tests.

* The majority of software errors are random in nature and require a particular trigger to occur. By downclocking the system, one could first verify whether it is indeed a software error and, upon further analysis, isolate the trigger. Once the trigger is isolated, we can continue testing minus one variable. A case in point would be data corruption when memory settings are too tight: the software failure persists even after the hardware issue has been resolved, and one could spend endless hours tinkering with the hardware settings without realizing it's an OS problem.

* As gnufsh said, we could test our hardware settings on another OS. Since I have two HDDs, I am going to use one as my OC test drive, with both Linux and XP on it: XP to verify that my main HDD (with its own copy of XP) works if the test HDD works, and Linux to zero in on XP's inherent OS problems.


I like the idea of cross-OS testing quite a bit, but I have some logical and factual problems with it:

~ "The Big Three" are quite thorough as a measure of system stability when used properly and for a long enough time period. When used properly, these three programs remove hardware instability as a potential source of error - we've already established system integrity - leaving us with software issues that can be troubleshot on a case-by-case basis. As you mention, techniques such as setting everything to stock speeds, and running the machine through the "trigger-point" several times are quite effective as a means to determine what went wrong with the program. Hardware specifics (such as memory timings) that are causing instability will be netted by our "Big Three".

~ Say we have a game; let's say it's "Chronicles of Riddick: Escape from Butcher Bay" (alright, so this anecdote actually happened to me ;)). It's a Windows game, and in our anecdote it crashes at the same point (whenever one saves the current game) on some specific maps, over and over. By setting the entire system to stock speeds and running through the crash-point (or "trigger") repeatedly, we can remove hardware as a source of error: if the errors persist even at stock speeds, we know it's the software; if the errors stop at stock speeds, we know it's the hardware, or a conflict between our overclock and the software. Our stability testing should be thorough enough to catch any hardware issues that would cause a crash in our game in the first place... cross-OS testing won't help us in any way here, as our game is made for Windows specifically. Troubleshooting occurs after stability testing, and in this case the software was to blame, not the hardware.

~ Cross-OS testing is of arguable merit when we use the exact same software in Windows as we would in Linux; some would say that running a Prime test twice across OSes is flogging a dead horse. Running Prime95 in Linux is not likely to catch any instability that won't be caught within Windows Prime. Most of the software problems we will run across are not 'checkable' across platforms.

~ The integrity of the stress testing software that we use is pretty darn good :). Prime95, memtest86, and 3DMark2001SE should all operate flawlessly as a means of stress testing, even though the OS that two of the programs use is not perfect. Software/OS problems rarely (if ever) exhibit themselves in our stress testers, they're pretty solid pieces of software - most software problems occur within games. Stress testing in Linux won't do us much good as a result... our stress testers work fine in Windows, and our games aren't made for Linux.

~ Most of us need our machines to be Windows stable, so we test exclusively within Windows, or independent of Windows (memtest86). I'm of the opinion that we need to work around our OS when it comes to overclocking and stability. The stress testing outlined in the guidelines is thorough enough to catch any hardware integrity flaws (whether they be as obvious as an unstable CPU overclock, or as subtle as a bad Drive Strength memory timing) that could possibly cause issue.


All that finally brings me here - our goal with these guidelines is to catch instability, not necessarily isolate the source :-/.

felinusz

One thing this guide is completely devoid of is solutions to instability. This is intentional, as lots of material exists that will aid anyone running an unstable machine in getting it up to par.

That's a whole other ballpark, and a very complex one. What with the complexity of our systems, with so much that can go wrong, it would take a long long time to cover it all.


Please let me know what you think, mate; I'm interested in your opinion and take on this. These guidelines need to evolve as our hardware does, and your input is appreciated and respected (it also makes for some most excellent discussion!) :)
 
felinusz said:
CONCLUSION

I really, really hope that this little guide has been helpful.

If you’ve read through the entire thing, and found it educational, then it has succeeded. If you read only the beginning, and learned a little bit about the dangers of instability, then it has succeeded.

The big goal here is for everyone to be aware of the potential for stability problems, and know how to test for them properly. This is an unrealistic goal, but every person who learns something about stability, and proper, thorough, stress testing, is one less potential victim of stability-related disaster.


I’m thinking about adding a little bit of information on the philosophies of stability, the different schools of thought on the issue, and different courses of action people take. I’m also thinking of making a little list of some of the ‘lesser’ stress testing programs, and where to get them – Some of them are very worthwhile, if inferior substitutes to the ‘Big Three’. I’ve left out really specific stress testing, such as that designed to stress test your hard-drive. The ‘Big Three’ do quite a thorough job of testing an entire system if used properly, and together, and I don’t find it pertinent to clutter an extremely long, but basically simple guide like this with arguably unnecessary information.

One thing this guide is completely devoid of is solutions to instability. This is intentional, as lots of material exists that will aid anyone running an unstable machine in getting it up to par.

If I have missed anything important, worded something really poorly, or not gone as in-depth as I should have on some issue, please PM me.

If anyone has any questions or comments please do take the time to post, or drop me a PM.


I use all sorts of stability testers and benchmark programs too, but I think you shouldn't take it that seriously! E.g.: I just like pushing my hardware to the edge. I sometimes test my system to roughly see if it's stable, but I don't really have that important stuff on the HD, and the important things are all backed up on a 10 GB HD lying on the shelf. Okay, sometimes Unreal or other games have errors, but that isn't terrible. I just underclock a few MHz before playing and I don't have any errors!

To come to a point: I'd rather have some errors and lots of OC fun than let stability testers run all day and night!
 
Fair enough, and thanks for your input :). As I mentioned, everyone who overclocks has a different outlook on stability, a different 'philosophy' on the issue if you will - and that's totally cool :)


The clockspeeds I run when benchmarking are far from stable by the standard of stability outlined in this guide. The clockspeeds I run when pushing hardware for fun are way far from stable :).

However, the clockspeeds that I run day-to-day, for gaming and other work, are thoroughly stability tested for my own peace of mind. Yes, running Prime95 and memtest86 for 24 hours each is a hassle. But, speaking from my personal stability philosophy, I'd rather go through that hassle than lose some important data, or have my game crash after four hours of playing, right before the next save point ;)

The title of this thread was recently changed at my request, to "Guidelines for Thorough Stability Testing". The stability testing techniques outlined in this thread aren't the law writ in stone - far from it - they are only guidelines that anyone can choose to adapt to their own unique purposes, or follow to the letter if they want to.

I will say, that these guidelines are pretty thorough :)
 
I am overclocking this machine right now and am having some problems with stability:
2400+
1*512 ch-5 3200
NF7

I had the FSB at 180 and Prime ran fine (not for a long time, just an hour; this was just preliminary). It seemed fairly stable, so I made a big hop to 200. Now Prime is running very slowly: the tests are completing at a much slower pace than before. They were finishing about one per minute; now they seem to take about 15 minutes each. I know that I am not overheating. Is this a sign of instability?
Edit: forgot to say, but before I wrote this I did one loop of Memtest86+ and got no errors.
Edit 2: Again I forgot to say, when I closed Prime it said it had run for 25 minutes, when in fact it was much closer to an hour. I made sure to set the priority to 10, so I am really lost about this.
 
It's a strange problem :-/. The same thing happened to me with my old NF7-S after a driver change.

You might try re-installing Prime at a 200 MHz FSB, and see if that fixes it. Did you change anything else in your BIOS before this started, or did you only adjust the FSB?
 
I adjusted the timings from 2-3-3-8 to 2-2-2-7, then I set them back. Both times there were no errors, but in about 30 minutes it had only done 3 of the loops (you know what I'm talking about?). I was not able to set the password because I did not see the option, but I did set the priority to 10.
 