
[Ret Sticky]Overclocking sndbx for A64 939 systems with Winchester, Opteron dual core

I keep mine off because I have massive stability problems. If all things were perfect, CnQ should work smoothly if you use the max multi allowed. This begs the question of how CnQ would work with fully unlocked CPUs like the FX series.

The reason I ask is because effective implementation of CnQ depends on the motherboard maker. The technology is supported by the CPU, but the MoBo maker is the one who needs to implement it correctly.
 
As shown in the following post, testing with memtest86 found that the CPU and memory topped out at a little over 3 GHz on air. At that time, the CPU was determined to be the limiting factor: raising the CPU frequency by even a small amount required more CPU voltage, while relaxing the memory timings and increasing the memory voltage did not help.
How to determine an upper limit for CPU and memory

To follow up on the above test, this post shows how to further determine the max memory frequency.

Winchester 3000+ 0447 CBBHD UPCW
G. Skill PC4400 LE 2x256 MB
DFI LP UT Nforce4 Ultra-D Rev A02 01/25/05 bios
XP-90
6600 GT at 600/1200 (core/memory)

The CPU multiplier was lowered from 9 to 8 so that the HTT and memory frequency could be raised as high as possible.

HTT and memory were found to top out around 350 MHz, at which point the CPU was at 350 x 8 = 2.80 GHz. The CPU was not the limiting factor, as it can boot at around 3 GHz.

Later, HTT was able to run higher, at 360 MHz, using a CPU_multiplier of 8 and a memory_HTT_ratio of 5:6, which is equivalent to a CPU_memory_divider of 10. So the memory, not the HTT, was determined to be the limit, at 350 MHz 3-5-5-10 1T.

HTT = 350 MHz
CPU_multiplier = x8
CPU_frequency = 2800 MHz 1.55 V
memory_HTT_ratio = 1:1
memory = 350 MHz 3-5-5-10 1T 2.8 V

lp_ultra-d_winnie3000_cbbhd_350x8_mem_350_superpi_sandra.JPG
 
HTT = 360 MHz
CPU_multiplier = x8
CPU_frequency = 2880 MHz
memory_HTT_ratio = 5:6 (memory_frequency = CPU_frequency / 10)
memory = 288 MHz 2.5-3-3-7 1T 2.8 V
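
A 5:6 ratio at 360 MHz HTT might be expected to give 300 MHz, yet the memory runs at 288 MHz. A minimal sketch of the arithmetic follows, assuming the memory clock is derived from the CPU clock through an integer divider equal to the CPU multiplier divided by the memory:HTT ratio, rounded up. That rounding rule and the function name are assumptions for illustration; they reproduce the numbers quoted in these posts.

```python
from fractions import Fraction
from math import ceil

def a64_memory_clock(htt_mhz, cpu_multiplier, ratio_num, ratio_den):
    """Return (cpu_mhz, divider, memory_mhz) under the integer-divider assumption."""
    cpu_mhz = htt_mhz * cpu_multiplier
    # Assumed rule: divider = CPU multiplier / (memory:HTT ratio), rounded up.
    divider = ceil(Fraction(cpu_multiplier * ratio_den, ratio_num))
    return cpu_mhz, divider, cpu_mhz / divider

# Settings taken from the posts: (HTT, multiplier, ratio numerator, ratio denominator)
for htt, mult, num, den in [(350, 8, 1, 1), (360, 8, 5, 6), (318, 9, 9, 10)]:
    cpu, div, mem = a64_memory_clock(htt, mult, num, den)
    print(f"HTT {htt} x{mult}, ratio {num}:{den} -> "
          f"CPU {cpu} MHz, divider {div}, memory {mem:.1f} MHz")
```

With 8 / (5/6) = 9.6 rounded up to 10, the 360 x 8 case gives 2880 / 10 = 288 MHz, which is why the ratio is described as equivalent to a CPU_memory_divider of 10.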

lp_ultra-d_winnie3000_cbbhd_360x8_cpu_2880_mem_288_superpi_sandra_everest.JPG


lp_ultra-d_winnie3000_cbbhd_360x8_cpu_2880_mem_288_superpi_sandra_everest2.JPG
 
HTT = 318 MHz
CPU_multiplier = x9
CPU_frequency = 2862 MHz
memory_HTT_ratio = 9:10 (memory_frequency = CPU_frequency / 10)
memory = 286 MHz 2.5-3-3-7 1T 2.8 V
6600 GT at 600/1200 MHz (core/memory)

lp_ultra-d_winnie3000_cbbhd_318x9_mem_286_superpi_sandra_3dmark01.JPG



memory = 286 MHz 2.5-3-3-7 1T 2.8 V

lp_ultra-d_winnie3000_cbbhd_318x9_mem_286_superpi_sandra_3dmark01_24k.JPG
 
HTT = 317 MHz
CPU_multiplier = x9
CPU_frequency = 2853 MHz
memory_HTT_ratio = 9:10 (memory_frequency = CPU_frequency / 10)
memory = 285 MHz 2.5-3-3-7 1T 2.8 V
6600 GT at 600/1200 MHz (core/memory)

lp_ultra-d_winnie3000_cbbhd_317x9_mem_285_superpi_sandra_3dmark01_24k.JPG
 
HTT = 318 MHz
CPU_multiplier = x9
CPU_frequency = 2862 MHz
memory_HTT_ratio = 1:1
memory = 318 MHz 2.5-4-4-8 1T 2.8 V

lp_ultra-d_winnie3000_cbbhd_318x9_mem_318_superpi_sandra_3dmark01.JPG
 
02/21/05
CPU = 2.85 GHz (317 x 9)
memory = 285 MHz (ratio 9:10, CPU/10) 2.5-3-3-7 1T 2.8 V
(can also run memory = 317 MHz (ratio 1:1) 2.5-4-4-8 1T)
6600 GT overclocked at 600/1200 (core/memory)

3DMark01 and SuperPI 32M

lp_ultra-d_winnie3000_cbbhd_317x9_mem_285_superpi32M_sandra_3dmark01.JPG



3DMark01 and SuperPI 1M 30 sec

lp_ultra-d_winnie3000_cbbhd_317x9_mem_285_superpi1M_sandra_3dmark01_2.JPG
 
hi_its_ryan said:
that 6600gt doesnt seem to beat my 9800pro by much. maybe because its a dx8 benchmark?

The 6600 GTs always post weak 3DMark 2001 scores, but their real-world performance is still noticeably better than your 9800 Pro.

deception
 
Will make some 3DMark03, 3DMark05 runs using the 6600 GT when I get a chance (not sure when).

Optimize the CPU and memory first, especially the tradeoff between memory frequency (raw bandwidth) and latency.
 
The G. Skill 4400 LE scales very well with the DFI NF4 board. Both can operate over a wide range of HTT, memory bus frequency, and timings while keeping the CPU frequency constant. As a result, the tradeoff between memory frequency (raw bandwidth) and memory latency can be experimented with.

The following memory operating points have been tested:

200 MHz 2.0-2-2-5 1T 2.8 V
220 MHz 2.0-3-2-5 1T 2.8 V
240 MHz 2.0-3-3-6 1T 3.0 V
260 MHz 2.5-3-3-6 1T 3.0 V
285 MHz 2.5-3-3-7 1T 2.8 V
317 MHz 2.5-4-4-8 1T 2.8 V
350 MHz 3.0-5-5-10 1T 2.9 V
 
CPU = 317 x 9 = 2.85 GHz, 1.55 V
Memory frequency was varied using different CPU_memory_divider = 9, 10, 11, 12, 13, 14
6600 GT at 600/1200 MHz (core/memory)

memory 204 MHz 2.0-2-2-5 1T SuperPI_1M 32 sec 3DMark01 23766
memory 219 MHz 2.0-3-2-5 1T SuperPI_1M 32 sec 3DMark01 23716
memory 238 MHz 2.0-3-3-6 1T SuperPI_1M 32 sec 3DMark01 23794
memory 259 MHz 2.0-3-3-6 1T SuperPI_1M 31 sec 3DMark01 23879
memory 285 MHz 2.5-3-3-7 1T SuperPI_1M 30 sec 3DMark01 23846
memory 317 MHz 2.5-4-4-8 1T SuperPI_1M 31 sec 3DMark01 24365
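
The six memory frequencies in the list above follow directly from dividing the fixed CPU clock by each CPU_memory_divider. A short sketch of that arithmetic (variable and function names are illustrative only):

```python
HTT_MHZ = 317
CPU_MULTIPLIER = 9
CPU_MHZ = HTT_MHZ * CPU_MULTIPLIER  # 2853 MHz, held constant for every run

def memory_mhz(divider):
    """Memory clock for a given CPU_memory_divider at the fixed CPU clock."""
    return CPU_MHZ / divider

# Dividers 9 through 14, as used in the test runs.
divider_table = {d: round(memory_mhz(d)) for d in range(9, 15)}

for divider, mhz in divider_table.items():
    print(f"divider {divider:2d} -> memory {mhz} MHz")
```

Because the CPU stays at 2853 MHz in every run, differences in the scores can be attributed to memory frequency and timings rather than CPU speed.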


memory 204 MHz 2.0-2-2-5 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_204_superpi_sandra_3dmark01.JPG


memory 219 MHz 2.0-3-2-5 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_219_superpi_sandra_3dmark01.JPG


memory 238 MHz 2.0-3-3-6 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_238_superpi_sandra_3dmark01.JPG


memory 259 MHz 2.0-3-3-6 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_259_superpi_sandra_3dmark01.JPG


memory 285 MHz 2.5-3-3-7 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_285_superpi1M_sandra_3dmark01_2.JPG


memory 317 MHz 2.5-4-4-8 1T

lp_ultra-d_winnie3000_cbbhd_317x9_mem_317_superpi_sandra_3dmark01_24k.JPG
 
(THIS POST IS REVISED, PLEASE READ THE REVISED VERSION IN THE NEXT POST).

Memory frequency and latency tradeoff

Latency

Latency is a measure of the time needed to complete a certain operation. For synchronous devices such as the CPU and memory, each operation is measured in terms of a number of clock cycles. An operation can be a large one such as a Read or Write, or one of the smaller internal steps of a Read or Write.

DRAM memory modules that are driven by a clock are called synchronous DRAM (SDRAM). With this in mind, memory operations can be described in terms of number of cycles instead of time. When we move from DDR400 to DDR500, or a CPU from 2 GHz to 3 GHz, each operation takes proportionally less time due to the faster silicon process, but the number of cycles needed for an operation remains the same, and the interrelationship between operations in terms of cycles remains the same, unless there is a change in architecture or timing.

DRAM is organized in rows and columns of storage bits. The intersection of a row and a column is a bit of data. To access data, the address is decoded into row and column addresses. First, the row corresponding to the decoded row address is accessed and all the bits in that row are sensed (and held in the sense amplifiers during the Read operation). Then the corresponding columns are accessed and the data is output. In many cases, multiple columns located on the same row are accessed and output as a sequence of data to save row access overhead. This is called the burst mode of operation, commonly used for large blocks/pages of data (this is where CAS latency comes in).

The Read operation is used here to illustrate the concept. After the memory controller issues a Read command and address, a DRAM read operation goes like this:
1. The DRAM module decodes the address into a row address and a column address.
2. It activates the word-line (row address), and the sense amplifiers detect and store all the row data.
3a. It activates the column (column address) and outputs the data.
3b. In the case of multiple column access (as discussed earlier), a sequence of column data is output.
4. It restores the data back to the DRAM cells and precharges for the next operation.

The tRCD latency is the time or number of cycles to perform step 2. Typically, tRCD takes 2 or 3 cycles to complete.
The CAS latency (tCAS) is the time or number of cycles to perform step 3a. For step 3b (when N column data are output), the total latency = N x tCAS. Typically, CAS latency takes 2 to 3 cycles to complete.
The tRP latency is the time or number of cycles to perform step 4. Typically, tRP takes 2 or 3 cycles to complete.

So the number of cycles of a DRAM Read operation is tRCD + N x tCAS + tRP, where typically N = 1, 2, 4, or 8 (or even more), depending on the number of column accesses per memory access.

For 1 column access, number of cycles for Read operation = tRCD + tCAS + tRP
For 4 column access, number of cycles for Read operation = tRCD + 4 tCAS + tRP
For 8 column access, number of cycles for Read operation = tRCD + 8 tCAS + tRP
etc, etc.

Number of cycles for first data output from address decode = tRCD + tCAS
Number of cycles for data output in multiple column access,
Number of cycles for second data output = tRCD + 2 tCAS
Number of cycles for third data output = tRCD + 3 tCAS
Number of cycles for fourth data output = tRCD + 4 tCAS
etc.


Typically, the number of column accesses is software dependent. Listed here is the typical case of a burst length of 4 (4 column accesses).

The number of cycles for each of the following timings:
2.0-2-2-5 average_time = 2 + 4 x 2.0 + 2 = 12 cycles
2.0-3-2-5 average_time = 3 + 4 x 2.0 + 2 = 13 cycles
2.0-3-3-6 average_time = 3 + 4 x 2.0 + 3 = 14 cycles
2.5-3-3-6 average_time = 3 + 4 x 2.5 + 3 = 16 cycles
2.5-3-3-7 average_time = 3 + 4 x 2.5 + 3 = 16 cycles
2.5-4-3-8 average_time = 4 + 4 x 2.5 + 3 = 17 cycles
2.5-4-4-8 average_time = 4 + 4 x 2.5 + 4 = 18 cycles
3.0-5-5-x average_time = 5 + 4 x 3.0 + 5 = 22 cycles

Similarly, the average_time in cycles for other burst lengths can be calculated.


E.g. consider the popular memory modules such as BH-5/UTT (2-2-2-5 1T) and TCCD (2.5-3-3-7 1T).

Burst length of 1, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 1 x 2.0 + 2 = 6 cycles
2.5-3-3-7 average_time = 3 + 1 x 2.5 + 3 = 8.5 cycles
Difference in cycles = 2.5 (6 vs 8.5)
% of frequency to break-even = 8.5 / 6 = 141.7% (a 41.7% increase)

Burst length of 2, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 2 x 2.0 + 2 = 8 cycles
2.5-3-3-7 average_time = 3 + 2 x 2.5 + 3 = 11 cycles
Difference in cycles = 3 (8 vs 11)
% of frequency to break-even = 11 / 8 = 137.5% (a 37.5% increase)

Burst length of 4, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 4 x 2.0 + 2 = 12 cycles
2.5-3-3-7 average_time = 3 + 4 x 2.5 + 3 = 16 cycles
Difference in cycles = 4 (12 vs 16)
% of frequency to break-even = 16 / 12 = 133.3% (a 33.3% increase)

Burst length of 8, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 8 x 2.0 + 2 = 20 cycles
2.5-3-3-7 average_time = 3 + 8 x 2.5 + 3 = 26 cycles
Difference in cycles = 6 (20 vs 26)
% of frequency to break-even = 26 / 20 = 130.0% (a 30.0% increase)

For a burst length of 4, for memory intensive application, in order for 2.5-3-3-7 to gain back the longer latency of 4 cycles (12 vs 16 cycles), the frequency has to be increased based on the 16:12 ratio, or 33% increase.

This number represents the upper bound on the memory frequency increase needed, for 100% memory read access. For real applications, the frequency increase needed to break even with the low latency would be less, depending on the intensity of memory access.
 
Thanks to _damien_ who pointed out the distinction between tCAS and tDQ burst cycle, the latency estimation is revised as below.


Memory frequency and latency tradeoff (revision)

Latency is a measure of the time needed to complete a certain operation. For synchronous devices such as the CPU and memory, each operation is measured in terms of a number of clock cycles. An operation can be a large one such as a Read or Write, or one of the smaller internal steps of a Read or Write.

DRAM memory modules that are driven by a clock are called synchronous DRAM (SDRAM). With this in mind, memory operations can be described in terms of number of cycles instead of time. When we move from DDR400 to DDR500, or a CPU from 2 GHz to 3 GHz, each operation takes proportionally less time due to the faster silicon process, but the number of cycles needed for an operation remains the same, and the interrelationship between operations in terms of cycles remains the same, unless there is a change in architecture or timing.

DRAM is organized in rows and columns of storage bits. The intersection of a row and a column is a bit of data. To access data, the address is decoded into row and column addresses. First, the row corresponding to the decoded row address is accessed and all the bits in that row are sensed (and held in the sense amplifiers during the Read operation). Then the corresponding columns are accessed and the data is output. In many cases, multiple columns located on the same row are accessed and output as a sequence of data to save row access overhead. This is called the burst mode of operation, commonly used for large blocks/pages of data.

The Read operation is used here to illustrate the concept. After the memory controller issues a Read command and address, a DRAM read operation goes like this:
1. The DRAM module decodes the address into a row address and a column address.
2. It activates the word-line (row address), and the sense amplifiers detect and store all the row data.
3a. It activates the column (column address) and outputs the data.
3b. In the case of multiple column access (as discussed earlier), a sequence of column data is output.
4. It restores the data back to the DRAM cells and precharges for the next operation.

The tRCD latency is the time or number of cycles to perform step 2. Typically, tRCD takes 2 or 3 cycles to complete.
The CAS latency (tCAS) is the time or number of cycles to perform step 3a. Typically, CAS latency takes 2 to 3 cycles to complete.
For step 3b, multiple column data are output as N bursts of DQ data. The DQ burst cycle time (tDQ) is half of the memory clock cycle (tCLK) per output data burst (for DDR1): tDQ = tCLK / 2.
The total latency for steps 3a and 3b = tCAS + N tDQ.
The tRP latency is the time or number of cycles to perform step 4. Typically, tRP takes 2 or 3 cycles to complete.

So the number of cycles of a DRAM Read operation, from activation to the last data out, is tRCD + tCAS + N tDQ, where typically N = 1, 2, 4, or 8 (or even more), depending on the number of column accesses per memory access.

For 1 column access, number of cycles for Read operation = tRCD + tCAS + 1/2
For 2 column access, number of cycles for Read operation = tRCD + tCAS + 1
For 4 column access, number of cycles for Read operation = tRCD + tCAS + 2
For 8 column access, number of cycles for Read operation = tRCD + tCAS + 4
etc.

Number of cycles for first data output from activation command (ACT) = tRCD + tCAS
Number of cycles for data output in multiple column access,
Number of cycles for second data output = tRCD + tCAS + 1/2
Number of cycles for third data output = tRCD + tCAS + 1
Number of cycles for fourth data output = tRCD + tCAS + 1 1/2
etc.


Typically, the number of column accesses is software dependent. Listed here is the typical case of a burst length of 4 (4 column accesses).

The number of cycles for each of the following timings:
2.0-2-2-5 average_time = 2 + 2.0 + 4 x 1/2 = 6 cycles
2.0-3-2-5 average_time = 3 + 2.0 + 4 x 1/2 = 7 cycles
2.0-3-3-6 average_time = 3 + 2.0 + 4 x 1/2 = 7 cycles
2.5-3-3-6 average_time = 3 + 2.5 + 4 x 1/2 = 7.5 cycles
2.5-3-3-7 average_time = 3 + 2.5 + 4 x 1/2 = 7.5 cycles
2.5-4-3-8 average_time = 4 + 2.5 + 4 x 1/2 = 8.5 cycles
2.5-4-4-8 average_time = 4 + 2.5 + 4 x 1/2 = 8.5 cycles
3.0-5-5-x average_time = 5 + 3.0 + 4 x 1/2 = 10 cycles

Similarly, the average_time in cycles for other burst lengths can be calculated.
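
For other timings or burst lengths, the revised estimate tRCD + tCAS + N x 1/2 can be sketched as a small helper. The function names and the "CAS-tRCD-tRP-tRAS" string parsing are illustrative assumptions; only the CAS and tRCD fields enter this estimate.

```python
def parse_timing(timing):
    """Split a 'CAS-tRCD-tRP-tRAS' string; return CAS and tRCD as floats."""
    cas, trcd, _trp, _tras = timing.split("-")
    return float(cas), float(trcd)

def read_cycles(timing, burst_length=4):
    """Cycles from ACT to the last data out: tRCD + tCAS + N * tDQ (tDQ = 1/2 cycle, DDR1)."""
    cas, trcd = parse_timing(timing)
    return trcd + cas + burst_length * 0.5

for timing in ["2.0-2-2-5", "2.0-3-2-5", "2.5-3-3-7", "2.5-4-4-8", "3.0-5-5-x"]:
    cycles = [read_cycles(timing, n) for n in (1, 2, 4, 8)]
    print(timing, cycles)
```

Each printed row lists the cycle counts for burst lengths 1, 2, 4 and 8, so the burst-length-4 column can be checked against the list above.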


E.g. consider the popular memory modules such as BH-5/UTT (2-2-2-5 1T) and TCCD (2.5-3-3-7 1T).

Burst length of 1, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 2.0 + 1/2 = 4.5 cycles
2.5-3-3-7 average_time = 3 + 2.5 + 1/2 = 6 cycles
Difference in cycles = 1.5 (4.5 vs 6)
% of frequency to break-even = 6 / 4.5 = 133.3%

Burst length of 2, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 2.0 + 2 x 1/2 = 5 cycles
2.5-3-3-7 average_time = 3 + 2.5 + 2 x 1/2 = 6.5 cycles
Difference in cycles = 1.5 (5 vs 6.5)
% of frequency to break-even = 6.5 / 5 = 130.0%

Burst length of 4, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 2.0 + 4 x 1/2 = 6 cycles
2.5-3-3-7 average_time = 3 + 2.5 + 4 x 1/2 = 7.5 cycles
Difference in cycles = 1.5 (6 vs 7.5)
% of frequency to break-even = 7.5 / 6 = 125.0%

Burst length of 8, between 2.0-2-2-5 and 2.5-3-3-7,
2.0-2-2-5 average_time = 2 + 2.0 + 8 x 1/2 = 8 cycles
2.5-3-3-7 average_time = 3 + 2.5 + 8 x 1/2 = 9.5 cycles
Difference in cycles = 1.5 (8 vs 9.5)
% of frequency to break-even = 9.5 / 8 = 118.75%

For a burst length of 4, for memory intensive application, in order for 2.5-3-3-7 to gain back the longer latency of 1.5 cycles (6 vs 7.5 cycles), the frequency has to be increased based on the 7.5:6 ratio, or 25% increase.
E.g. 250 MHz of BH5/UTT at 2-2-2-5 equates to 312.5 MHz of TCCD at 2.5-3-3-7, for the same latency from ACT to DQ out for a burst length of 4.

For a burst length of 8, for memory intensive application, in order for 2.5-3-3-7 to gain back the longer latency of 1.5 cycles (8 vs 9.5 cycles), the frequency has to be increased based on the 9.5:8 ratio, or 18.75% increase.
E.g. 250 MHz of BH5/UTT at 2-2-2-5 equates to 296.9 MHz of TCCD at 2.5-3-3-7, for the same latency from ACT to DQ out for a burst length of 8.
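
The break-even idea is easier to see in absolute time: a cycle at a higher clock is shorter, so more cycles can still take the same number of nanoseconds. A minimal sketch using the cycle counts above (function names are illustrative):

```python
def latency_ns(cycles, mem_mhz):
    """Convert a cycle count at a given memory clock to nanoseconds."""
    return cycles * 1000.0 / mem_mhz

def break_even_mhz(base_mhz, base_cycles, other_cycles):
    """Clock the slower timing needs so that its latency in ns matches the base setup."""
    return base_mhz * other_cycles / base_cycles

# (burst length, cycles at 2-2-2-5, cycles at 2.5-3-3-7), from the tables above
for burst, bh5_cycles, tccd_cycles in [(4, 6.0, 7.5), (8, 8.0, 9.5)]:
    tccd_mhz = break_even_mhz(250.0, bh5_cycles, tccd_cycles)
    print(f"burst {burst}: BH-5 250 MHz = {latency_ns(bh5_cycles, 250.0):.1f} ns, "
          f"TCCD {tccd_mhz:.1f} MHz = {latency_ns(tccd_cycles, tccd_mhz):.1f} ns")
```

For a burst length of 4, both setups come out at 24 ns from ACT to the last data out; for a burst length of 8, both come out at 32 ns.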


This number represents the upper bound on the memory frequency increase needed, for 100% memory read access. For real applications, the frequency increase needed to break even with the low latency would be less, depending on the intensity of memory access.
 
How much frequency increase is needed to break-even with low latency

CPU = 317 x 9 = 2.85 GHz, 1.55 V
G. Skill 4400 LE TCCD memory
Memory frequency was varied using different CPU_memory_divider = 9, 10, 11, 12, 13, 14
6600 GT at 600/1200 MHz (core/memory)

1. 204 MHz 2.0-2-2-5 1T SuperPI_1M 32 sec 3DMark01 23766
2. 219 MHz 2.0-3-2-5 1T SuperPI_1M 32 sec 3DMark01 23716
3. 238 MHz 2.0-3-3-6 1T SuperPI_1M 32 sec 3DMark01 23794
4. 259 MHz 2.0-3-3-6 1T SuperPI_1M 31 sec 3DMark01 23879
5. 285 MHz 2.5-3-3-7 1T SuperPI_1M 30 sec 3DMark01 23846
6. 317 MHz 2.5-4-4-8 1T SuperPI_1M 31 sec 3DMark01 24365

More runs and tighter control are needed to improve accuracy and reduce the spread of these numbers, since the 3DMark01 score can vary greatly between runs.


Preliminary result interpretation:

From tests 1, 2,
1 cycle (from tRCD) less in latency is about 7% memory frequency.

From tests 2, 3,
1 cycle (from tRP) less in latency is about 8.7% memory frequency.

From tests 4, 5,
0.5 cycle less in tCAS latency is about 10% memory frequency.
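
The percentages above are the memory frequency ratios between the paired tests. A short sketch of that arithmetic, using the frequencies from the numbered list (the dictionary and function names are illustrative only):

```python
# Memory frequency (MHz) for each numbered test, copied from the list above.
TEST_MHZ = {1: 204, 2: 219, 3: 238, 4: 259, 5: 285, 6: 317}

def pct_increase(test_a, test_b):
    """Percent memory frequency increase going from test_a to test_b."""
    return (TEST_MHZ[test_b] / TEST_MHZ[test_a] - 1.0) * 100.0

# Timing parameter relaxed in each step, and the pair of tests compared.
steps = {"tRCD": (1, 2), "tRP": (2, 3), "tCAS": (4, 5)}
gains = {name: pct_increase(a, b) for name, (a, b) in steps.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% higher memory frequency")
print(f"simple sum: {sum(gains.values()):.1f}%")
```

The simple sum is printed only for comparison with the overall break-even figure; the posts do not state whether the steps should be added or compounded.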


So as a preliminary result, between 2-2-2-5 1T (such as BH-5/UTT at 220-250+ MHz) and 2.5-3-3-7 1T (such as TCCD PC4400 at 280-300+ MHz), the latter would require about 25% higher frequency to break even with the former low-latency setup in memory performance.

Combined with the 18.8-33.3% for memory reads with burst lengths of 8 down to 1, and the typical 25% from the analytical estimate based on counting read access cycles (see link below), it is fair to say that memory at 2.5-3-3-7 1T would need a 25-30% higher bus frequency to break even with memory at 2-2-2-5 1T in memory-intensive applications.

So if BH-5/UTT is able to run at 250 MHz 2-2-2-5 1T at 3.3+ V, TCCD 4400 such as G. Skill LE or TCCD 4800 has to run at around 300 - 310 MHz 2.5-3-3-7 1T at 2.8 V to break even, and in many cases this is doable on some Nforce4 motherboards.

Besides the performance comparison, here are some pros and cons of BH-5/UTT vs TCCD.
- The TCCD modules, which require less voltage, lessen the concern about chip reliability at the high 3.3+ V, especially the medium to long term impact (if any) of such a voltage level on the CPU's memory controller interface (Vmemref).
- The TCCD modules offer a wider range of memory frequency and timing for tweaking, from 200 - 300+ MHz, CAS 2/2.5/3 (if the motherboard allows).
- On the other hand, around 250 MHz for 2-2-2-5 1T memory modules is more easily achievable in many setups for top performance than 300+ MHz for 2.5-3-3-7 1T memory.


Memory frequency and latency tradeoff
 
DAMN dude! This is definitely some nice research!
Props to hitechjb1!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:clap:
:attn:
yU R teh RoXXorZ!!!!11one
hehe;)
 
Excellent post Hitech, I especially like post 198.

Do you plan to take the time to perform multiple runs to get more accurate numbers? Or are you focusing on another aspect now?
 