
A64 CPUs, chipsets, motherboards

topslop1 said:
So 939 motherboards and 939 A64's will be available around the start of June? If that's true then maybe I should wait a month before getting an A64 solution. However, will the 939's want DDR2? Or will they work with both things?

939 works with existing unbuffered, non-ECC DDR 400/500 memory modules, which is a plus.

DDR2 is too far out at this point (planned for 2H of 05) for those who want an A64 system.
 
How would an FX-53 at 2.4 GHz (or a 939 CPU) perform compared to a Barton at 2.5 GHz?

This is based on some raw benchmark numbers taken from:
Ref: "AMD’s latest bleeding-edge 64bit processor" http://bit-tech.net/review/309/1.

The raw benchmark data was put into a spreadsheet and some averages were computed; a detailed breakdown is included in the attached table.

This is meant as a more objective, rather than subjective, evaluation of CPU and system performance to guide our upgrade decisions. Some may say benchmarks can be biased, but if we know what to look for and break down what each benchmark actually tests, the result is more objective and helps a lot.


Details are:

A64 FX-53, socket 940, 130 nm SOI, stepping CG (part AT), rated 2.4 GHz, 1.5 V, 89W, 57.4 A, 70 C (case max), 1MB L2, dual channel ECC
w/ Asus SK8N nForce 3 motherboard
w/ PC3200 2 x 512 MB registered ECC memory

compared to
- Barton 2500+ at 1.83 GHz (11x166), w/ NF2-S rev 2.0
- Barton 2500+ at 2.5 GHz (12.5x200) (overclock), w/ NF2-S rev 2.0
- P4 3.2E (Prescott w/ HT enable) dual channel, DFI LanParty 865PE
all w/ 2 x 512 MB Corsair 3200 LL memory
Sparkle FX5900 XT 128MB video card
1 WD800 80GB ATA133 Hard Drive (not sure this is ATA133 or ATA100)


Interpretation of results:

1. An FX-53 at 2.4 GHz outperforms a Barton at 2.5 GHz, and a Prescott 3.2E at 3.2 GHz w/ HT enabled, by a few % to 10% in most of the individual tests. Tests that involve memory bandwidth put the FX-53 way ahead of the Barton at 2.5 GHz. The Barton does have a slight edge (2-4%) over the FX-53 in 4 of the 23 tests, and is ahead by 13% in 1 test.

2. Putting equal weight on every test (excluding one SSE2 floating point which Barton does not support), the result is
- FX-53 at 2.4 GHz performs 19.3 % better than a Barton at 2.5 GHz
- FX-53 at 2.4 GHz performs 7.3 % better than a Prescott at 3.2 GHz

3. The three memory intensive tests in which the FX-53 has huge scores, namely the PC Mark 2002 memory test and the two Sandra memory tests, were then taken out and the averages recomputed. The result is
- FX-53 at 2.4 GHz performs 8.9 % better than a Barton at 2.5 GHz
- FX-53 at 2.4 GHz performs 7.3 % better than a Prescott at 3.2 GHz

4. Barring human errors in handling and interpreting the original data, I think the above result is an objective evaluation of the new A64 FX-53.

5. The analysis shown is for an A64 FX CPU. A 939 CPU with a 1 MB L2 cache would perform the same as an A64 FX running at the same CPU frequency, memory bus and HT frequency, since both have the same memory bandwidth (dual channel). 939 has either 512 KB or 1 MB L2, whereas FX has 1 MB L2. For non-memory intensive applications, a 754 CPU with a 1 MB L2 would perform close to an FX.
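
The equal-weight averaging in points 2 and 3 can be sketched roughly as follows. This is only an illustration: the test names and scores are made-up placeholders (not the bit-tech numbers), the function name is hypothetical, and it assumes higher scores are better, so time-based tests would need to be inverted first.

```python
def equal_weight_gain(scores_a, scores_b, exclude=()):
    """Average percentage advantage of system A over system B.

    Every test contributes its own A/B ratio with equal weight.
    Tests named in `exclude`, or missing for system B, are skipped.
    Assumes a higher score means better performance.
    """
    ratios = [
        scores_a[name] / scores_b[name]
        for name in scores_a
        if name in scores_b and name not in exclude
    ]
    if not ratios:
        raise ValueError("no common tests to compare")
    return (sum(ratios) / len(ratios) - 1.0) * 100.0


# Placeholder scores only (NOT taken from the review).
cpu_a = {"memory_test": 200.0, "cpu_test": 110.0, "game_test": 105.0}
cpu_b = {"memory_test": 100.0, "cpu_test": 100.0, "game_test": 100.0}

all_tests = equal_weight_gain(cpu_a, cpu_b)
no_memory = equal_weight_gain(cpu_a, cpu_b, exclude=("memory_test",))
print(f"all tests: {all_tests:.1f}%   without memory test: {no_memory:.1f}%")
```

Dropping the memory test from the placeholder data pulls the average down sharply, which is the same effect seen when going from point 2 to point 3.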



FX-53_benchmark.JPG
 
Benchmark analysis of some A64, Barton, P4

Raw benchmark data from
http://www.techreport.com/reviews/2004q1/athlon64-fx53/index.x?pg=1

The raw benchmark data was put into a spreadsheet and some averages were computed; a detailed breakdown is included in the attached table.

Summary of results:

Roughly speaking, putting equal weight on each of the SET of benchmarks (a mixture of very different programs), and
with the various CPU's running the same system bus (200 MHz), same memory and settings (except the FX-53 w/ registered ECC), same video card, ....

- A64 FX-53 performs about 32% better than a Barton, both running under the same conditions of CPU (2.4 GHz), HTT (200 MHz), ....
- A64 3000+, 3200+ running at 2.0 GHz perform about 8% and 10% respectively better than a Barton at 2.4 GHz.
- A64 FX-53 at 2.4 GHz also outperforms a P4 3.2E at 3.2 GHz by about 15%, and outperforms a P4 3.2 EE at 3.2 GHz by about 10%.

By breaking down the benchmark, roughly speaking

- A64 FX-53 at 2.4 GHz performs better than a Barton at 2.4 GHz by about
.... 60-100% on memory and cache bandwidth
.... 5 - 20% on CPU intensive programs
.... 30% on many games
.... 40 - 70+% on some encoding and rendering programs
....


A64_barton_P4_benchmark.JPG
 
Man what a post. :clap: I'll toss my two cents in here. I made the mistake of constantly upgrading my socket A platform, and it has just become too expensive for me to repeat. I'm waiting for a good overclocking board that has PCI Express. That way I can just upgrade everything at once.
 
Note on Absolute Max Voltage rating

The following is from AMD A64 data sheet 31410, 31411, 31412 (for 754, 939, 940) published at the time for 130 nm.

A64_max_abs_voltage.JPG


From AMD A64 data sheet 31410, 31411, 31412

Stresses greater than those listed in Table 14 may cause permanent damage to the device and motherboard. Systems using this device must be designed to ensure that these parameters are not violated. Violation of these ratings will void the product warranty. Exposure to absolute maximum rating conditions for extended periods may affect device reliability.

Table 14. Absolute Maximum Ratings
Characteristic: Range
Storage temperature: –55 C to 85 C
VLDT supply voltage relative to VSS: –0.3 V to 1.5 V
VDD supply voltage relative to VSS: –0.3 V to 1.65 V
VTT supply voltage relative to VSS: –0.3 V to 1.65 V
VDDIO supply voltage relative to VSS: –1 V to 2.9 V
VDDA supply voltage relative to VSS: –0.3 V to 3.0 V
MEMVREF input voltage relative to VSS: –1 V to 2.9 V
Input voltage relative to VSS for HyperTransport™ technology interface: –0.3 V to 1.5 V
Differential input voltage for HyperTransport™ technology interface: –1.5 V to 1.5 V
Input voltage relative to VSS for DDR SDRAM memory interface and Miscellaneous pins: –1 V to 2.9 V

Refer to AMD Athlon™ 64 Processor Power and Thermal Data Sheet, order# 30430, for maximum case temperature specifications.

These are the numbers for the 130 nm A64 CPU's.

VDD is the CPU voltage (Vcore), 1.65 V abs. max.
This max voltage of only 1.65 V is much lower than the 2.05-2.20 V
of the various Tbred B/Barton, so be careful.


VLDT is voltage for HT I/O ring, 1.5 V abs. max.
VTT is regulator voltage for side A and side B of the die, 1.65 V abs. max.
VDDIO is voltage for DDR SDRAM I/O ring, 2.9 V abs. max.
VDDA is voltage for PLL (phase locked loop), 3.0 V abs. max.
MEMVREF is DRAM Interface Voltage Reference, 2.9 V abs. max.
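
A small sketch of how the Table 14 limits above could be used to check a planned voltage. The rail values are copied from the table; the helper name `headroom` and the three sample Vcore settings are only illustrative assumptions.

```python
# Absolute maximum ratings in volts for the 130 nm A64, copied from Table 14.
ABS_MAX_130NM = {
    "VLDT": (-0.3, 1.5),
    "VDD": (-0.3, 1.65),
    "VTT": (-0.3, 1.65),
    "VDDIO": (-1.0, 2.9),
    "VDDA": (-0.3, 3.0),
    "MEMVREF": (-1.0, 2.9),
}


def headroom(rail, volts, table=ABS_MAX_130NM):
    """Return the volts left below the absolute maximum (negative means over)."""
    low, high = table[rail]
    if volts < low:
        raise ValueError(f"{rail}={volts} V is below the minimum of {low} V")
    return high - volts


# Sample Vcore settings chosen for illustration only.
for vcore in (1.50, 1.60, 1.70):
    margin = headroom("VDD", vcore)
    status = "within" if margin >= 0 else "EXCEEDS"
    print(f"VDD {vcore:.2f} V: {status} abs. max (margin {margin:+.2f} V)")
```

Staying inside the absolute maximum only means the data sheet limit is not violated; it says nothing about heat or long-term reliability at that voltage.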


For 90 nm A64 CPU's, the numbers have not yet been found in the AMD tech docs. As a general rule for technology with smaller feature size, the max voltages for 90 nm should be lower than those of the 130 nm.

Since the rated voltage for 90 nm 939 is 1.4 V compared to 1.5 V for the 130 nm (0.1 V lower),
and the max absolute for 130 nm is 1.65 V for VDD,
I would say 1.55 V for VDD should be OK for the 90 nm 939.

IMO, I would not go over 1.60 V on air for extended usage. In practice, it would be difficult to achieve a stable overclock much above 1.60 V on air due to the excessive heat generated from leakage current.

MOS scaling, voltage, power and leakage current
 
How to read A64 part number for 940, 754, 939

The following pics are from AMD A64 data sheet 30430

A64 Desktop

A64_partnumber.JPG


Table 1 (last 2 characters)
AP = 00000F48h, ClawHammer 754, rev. SH7 C0, 130 nm SOI
AR = 00000F4Ah, ClawHammer 754, rev. SH7 CG, 130 nm SOI
AX = 00000FC0h, NewCastle 754, rev. DH7 CG, 130 nm SOI
AW = 00000FF0h, NewCastle 939, rev. DH7 CG, 130 nm SOI
AS = 00000F7Ah, SledgeHammer 939, rev. SH7 CG, 130 nm SOI
AK = 00000F58h, SledgeHammer 940, rev. SH7 C0, 130 nm SOI
AT = 00000F5Ah, SledgeHammer 940, rev. SH7 CG, 130 nm SOI
BI = 00010FF0h, Winchester 939, rev. DH8 D0, 90 nm SOI
BP = 00020FF0h, Venice 939, rev. DH8 E3, 90 nm SOI DSL
BN = 00020F71h, San Diego 939, rev. SH8 E4, 90 nm SOI DSL
BW = 00020FF2h, Venice 939, rev. DH8 E6, 90 nm SOI DSL
CD = 00020F72h, Toledo 939, rev. JH8 E6, 90 nm SOI DSL
BV = 00020FB1h, Manchester 939, rev. BH8 E4, 90 nm SOI DSL
CS = 00020F32h, Windsor AM2 940, rev. JH F2, 90 nm SOI DSL (2x1 MB L2)
CU = 00020FB2h, Windsor AM2 940, rev. BH F2, 90 nm SOI DSL (2x512 KB L2)


Table 2 (3rd character from last)
4 - 512 KB
5 - 1 MB
6 - 2 MB

Table 3 (4th character from last)
A - variable
I - 63 C
P - 70 C
K - 65 C (?)

Table 4 (5th character from Last)
A - variable
D - 1.35/1.40 V
E - 1.50 V
I - 1.40 V

Table 5 (6th character from last)
A - 754 pin lidded OuPGA
D - 939 pin lidded OuPGA
I - AM2 940 pin

Table 6 Model Number (example, refer to "model and specification" link below for detailed listing)
2800+ 1800 MHz L2 512 KB, 1.5V (754)
3000+ 2000 MHz L2 512 KB, 1.5V (754)
3200+ 2200 MHz L2 512 KB, 1.5V (754)
3400+ 2400 MHz L2 512 KB, 1.5V (754)
3200+ 2000 MHz L2 1 MB, 1.5V (754)
3400+ 2200 MHz L2 1 MB, 1.5V (754)

3500+ 2200 MHz L2 512 KB, 1.5V (939)
3800+ 2400 MHz L2 512 KB, 1.5V (939)

...

First 3 characters
ADA - desktop
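
The desktop tables above can be turned into a small lookup sketch. The dictionaries below are transcribed from Tables 1 to 5 (Table 1 only partly, to keep it short); the function name `decode_opn` and the sample part number are used purely to illustrate the character positions, and anything not in the tables is reported as "unknown".

```python
# Partial transcription of Table 1 (last 2 characters).
REVISION = {
    "AP": "ClawHammer 754, rev. C0",
    "AR": "ClawHammer 754, rev. CG",
    "AX": "NewCastle 754, rev. CG",
    "AW": "NewCastle 939, rev. CG",
    "AS": "SledgeHammer 939, rev. CG",
    "BI": "Winchester 939, rev. D0",
}
L2_CACHE = {"4": "512 KB", "5": "1 MB", "6": "2 MB"}            # Table 2
MAX_TEMP = {"A": "variable", "I": "63 C", "P": "70 C"}          # Table 3
VOLTAGE = {"A": "variable", "D": "1.35/1.40 V",
           "E": "1.50 V", "I": "1.40 V"}                        # Table 4
PACKAGE = {"A": "754 pin lidded OuPGA",
           "D": "939 pin lidded OuPGA",
           "I": "AM2 940 pin"}                                  # Table 5
FAMILY = {"ADA": "desktop"}                                     # first 3 characters


def decode_opn(opn):
    """Split a part number into fields, counting positions from the end."""
    opn = opn.strip().upper()
    if len(opn) < 10:
        raise ValueError("part number is too short to decode")
    return {
        "family": FAMILY.get(opn[:3], "unknown"),
        "model": opn[3:-6],
        "package": PACKAGE.get(opn[-6], "unknown"),
        "voltage": VOLTAGE.get(opn[-5], "unknown"),
        "max_temp": MAX_TEMP.get(opn[-4], "unknown"),
        "l2_cache": L2_CACHE.get(opn[-3], "unknown"),
        "revision": REVISION.get(opn[-2:], "unknown"),
    }


# Sample string used only to show the positions.
for field, value in decode_opn("ADA3200AEP5AP").items():
    print(f"{field:9s} {value}")
```

Reading from the end keeps the decoder independent of how many characters the model number takes up.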


A64 Mobile (in addition to above)

A64_mobile_partnumber.JPG


Table 14
AP - Rev C0 Package Drawing B1
AR - Rev CG Package Drawing B1
AX - Rev CG Package Drawing B2

Table 15
4 - 512 KB
5 - 1 MB

Table 16
I - 63 C
P - 70 C
X - 95 C

Table 17
E - 1.50 V
I - 1.40 V
Q - 1.20 V

Table 18
B - 754 pin lidless OuPGA

Table 19 Model Number (example, refer to "model and specification" link below for detailed listing)
2700+ 1600 MHz L2 512 KB, 1.2V (754)
2800+ 1800 MHz L2 512 KB, 1.2V (754)
2800+ 1600 MHz L2 1 MB, 1.4V (754)
3000+ 1800 MHz L2 1 MB, 1.4V, 1.5V DTR (754)
3200+ 2000 MHz L2 1 MB, 1.4V, 1.5V DTR (754)
3400+ 2200 MHz L2 1 MB, 1.5V DTR (754)

Table 20 First 3 letters
AMA - Mobile A64 DTR
AMN - Mobile A64 62 W
AMD - Mobile A64 35 W


A64 FX (in addition to above)

A64_FX_partnumber.JPG


Table 8
AK - Rev C0 (940)
AT - Rev CG (940)
AS - Rev CG (939)

Table
5 - 1 MB

Table 9
I - 63 C
P - 70 C

Table 10
E - 1.50 V

Table 11
C - FX 940 pin lidded OuPGA
D - FX 939 pin lidded OuPGA

Table 12 Model Number (example, refer to "model and specification" link below for detailed listing)
FX51 2200 MHz L2 1 MB, 1.5V (940)
FX53 2400 MHz L2 1 MB, 1.5V (940/939)


940, 754, 939 CPU models and specifications (post 5)

Revisions and steppings (under construction) (post 6)

How to identify the physical core of an A64 (post 86)
 
A64 939 Thermal/Power Specification

The following data is from AMD A64 data sheet 30430

A64_939_AW_spec.JPG


A64_90nm_939_BI_spec.JPG


UNDER CONSTRUCTION
 
A64 754 Thermal/Power Specification

The following data is from AMD A64 data sheet 30430

OPN ending with AP (old C0 rev.) not included here, refer to AMD tech doc. for such.

A64_754_AP_spec.JPG


A64_754_AR_spec.JPG


A64_754_AR_spec2.JPG


A64_754_AX_spec.JPG
 
Mobile A64 754 Thermal/Power Specification

The following data is from AMD A64 data sheet 30430

OPN ending with AP (old C0 rev.) not included here, refer to AMD tech doc. for such.

Mobile DTR (1.5 V)

A64_mobile_AR_spec.JPG


A64_mobile_AR_spec3.JPG


Mobile 1.4 V

A64_mobile_AR_spec2.JPG


Mobile 1.2 V

A64_mobile_AX_spec.JPG
 
A64 FX 940/939 Thermal/Power Specification

The following data is from AMD A64 data sheet 30430

A64 FX 940

A64_FX_940_spec.JPG


A64 FX 939

A64_FX_939_spec.JPG


A64_FX-55_939_spec.JPG
 
This 939 platform memory bandwidth, as estimated from some test data (so result is preliminary), is impressive. Its efficiency is around 86-90%, which is 15-20% (to be confirmed with more 939 test data) better than the P4 QDR dual channel counterpart.

Its effective bandwidth (not max), running at the same memory bus speed, is about 15-20% higher than that of P4 QDR dual channel and 81-89% higher than that of 754 platform or nforce2 dual channel.

Estimation and importance of 939 platform memory bandwidth

A major difference between the AMD 754 and 939 platforms is the memory bus, i.e. the 64-bit memory bus for 754 vs the 128-bit memory bus for 939. Here are some estimates (since 939 is not commonly available yet) to see the potential impact on memory bandwidth performance.

I think there is a significant advantage from the 939 128-bit memory bus and on-chip dual channel controller. It is very different from the nforce2 dual channel, which has only a few % memory bandwidth improvement over single channel, as shown below.

memory_bandwidth_efficiency = effective_memory_bandwidth / max_memory_bandwidth

1. In the P4 arena, the dual channel QDR efficiency is around 75% with 64-bit memory bus
max_memory_bandwidth = FSB x 4 x 8 = 32 FSB
effective_memory_bandwidth = FSB x 4 x 8 MB/s x 0.75 ~ 24 FSB

2. XP nforce2 single channel efficiency ~ 85-90%
max_memory_bandwidth = FSB x 2 x 8 = 16 FSB
effective bandwidth = 0.875 x 2 x 8 x FSB ~ 14 FSB

3. XP nforce2 dual channel efficiency ~ 90 - 95% of the FSB bandwidth (or 45-48% of the dual channel memory bandwidth, depending on how it is counted)
max_memory_bandwidth = FSB x 2 x 8 x 2 = 32 FSB
max_FSB_bandwidth = FSB x 2 x 8 = 16 FSB (FSB limits dual channel memory bandwidth)
effective bandwidth = 0.925 x 2 x 8 x FSB ~ 14.8 FSB

4. 754 hardware has been around for a while, and we have seen its memory bandwidth efficiency being around 95%.

For 754 platform, memory bandwidth efficiency ~ 95%
max_memory_bandwidth_754 = 2 x 8 x memory_bus_frequency = 16 memory_bus_frequency
effective bandwidth = 0.95 x 2 x 8 x memory_bus_frequency = 15.2 memory_bus_frequency

E.g. from Maxvla's system screenshot (http://www.maxvla.com/host/komusa4200b.jpg), a 754 memory benchmark (integer buffered iSSE2) shows the 754 memory efficiency being around 4574/4800 = 95%.
At 300 MHz, the max bandwidth would be 4800 MB/s for single channel, and 9600 MB/s for 128-bit bus (theoretical max).

5. For the 939 128-bit memory bus, there is a good possibility that it could be higher than 75% (the P4 QDR number) due to its direct 128-bit memory bus:
- max_memory_bandwidth_939 = 2 x 16 x memory_bus_frequency = 32 memory_bus_frequency
- At 80%, 300 MHz, the effective bandwidth would be 7680 MB/s
- At 90%, 300 MHz, the effective bandwidth would be 8640 MB/s
- At 95%, 300 MHz, the effective bandwidth would be 9120 MB/s
(ECC is not required in 939).

I think the 128-bit memory bus could be more efficient than the 64-bit QDR, and I hope it is close to the single channel number ~ 85-90%. This will be confirmed when actual 939 hardware comes out. (Will see)
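
The max and effective bandwidth arithmetic in points 1 to 5 can be reproduced with a short sketch. The function names are hypothetical; the bus widths, the 300 MHz example, the efficiencies and the 4574 MB/s reading are the ones quoted above.

```python
def max_bandwidth_mb_s(bus_mhz, bus_bits, transfers_per_clock=2):
    """Theoretical peak in MB/s: clock x transfers per clock x bytes per transfer."""
    return bus_mhz * transfers_per_clock * (bus_bits // 8)


def effective_bandwidth_mb_s(bus_mhz, bus_bits, efficiency, transfers_per_clock=2):
    """Peak bandwidth scaled by an efficiency between 0 and 1."""
    return max_bandwidth_mb_s(bus_mhz, bus_bits, transfers_per_clock) * efficiency


single = max_bandwidth_mb_s(300, 64)     # 754 style 64-bit DDR bus
dual = max_bandwidth_mb_s(300, 128)      # 939 style 128-bit DDR bus
p4_qdr = max_bandwidth_mb_s(200, 64, 4)  # quad pumped 64-bit bus at 200 MHz

print(f"64-bit  @ 300 MHz: {single} MB/s max")
print(f"128-bit @ 300 MHz: {dual} MB/s max")
for eff in (0.80, 0.90, 0.95):
    print(f"  at {eff:.0%}: {effective_bandwidth_mb_s(300, 128, eff):.0f} MB/s")
print(f"754 screenshot efficiency: {4574 / single:.1%}")
```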


- Ref result 1
I have seen numbers on memory bandwidth for an FX51 and an A64, assuming they run the same bus frequency. This is an example,
FX51 - 5315 MB/s (dual channel 128-bit bus)
A64 - 2954 MB/s (64-bit bus)

bandwidth_128_bus / bandwidth_64_bus = 5315 / 2954 = 1.8

So assuming the 64_bus has 95% efficiency, then the

128_bus_efficiency would be 95% * 1.8 / 2 = 86%.

- Ref result 2
http://www.ocworkbench.com/ocwbcgi/ultimatebb.cgi?ubb=get_topic;f=29;t=000711

A64-FX (939) DDR400 dual channel - 5763.5 MB/s
A64-FX (939) DDR400 dual channel disabled - 3101 MB/s
Not clear in the result about the memory bus speed, let's assume it is 200 MHz for the math.

The 939 dual channel 128-bit memory efficiency = 5763.5 / 6400 = 90% !!!

Improvement over dual channel disabled = 5763.5 / 3101 = 1.86 (or 86%) (impressive bandwidth improvement)
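
The two reference results can be rechecked with a few lines. This is only a sketch using the figures quoted above, with the same assumptions as the text: 95% efficiency for the 64-bit bus in Ref result 1, and a 200 MHz memory bus in Ref result 2.

```python
# Ref result 1: FX51 (128-bit) vs A64 (64-bit) at the same bus frequency.
ratio_1 = 5315 / 2954
eff_64_assumed = 0.95                       # assumption taken from the text
eff_128_est = eff_64_assumed * ratio_1 / 2  # a 128-bit bus has twice the peak

# Ref result 2: 939 with dual channel enabled vs disabled, 200 MHz assumed.
peak_dual = 200 * 2 * 16                    # MHz x DDR x 16 bytes = 6400 MB/s
eff_2 = 5763.5 / peak_dual
gain_2 = 5763.5 / 3101

print(f"Ref 1: ratio {ratio_1:.2f}, estimated 128-bit efficiency {eff_128_est:.1%}")
print(f"Ref 2: efficiency {eff_2:.1%}, gain over single channel {gain_2:.2f}x")
```

Without rounding the ratio to 1.8 first, the first estimate comes out slightly below the 86% quoted, which is well within the uncertainty of these preliminary numbers.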


Summary (preliminary numbers, may vary as more 939 test results become available):

- If further confirmed by more 939 hardware, this 86 - 90% number on bandwidth efficiency for 939 128-bit is 15 - 20% higher than the 75% QDR of P4 (64-bit).

- At 86-90% efficiency, the effective bandwidth for the 939 128-bit memory bus would be 81 - 89% higher than that of a 754 64-bit memory bus, with assumed 95% memory efficiency.

This higher bandwidth in 939 would have significant impact on memory intensive applications such as video and image streaming, applications using spatially structured data as in scientific computation, ..., as well as 3Dmark01.


PS:

For video, image streaming, data needs to be refreshed constantly from the main memory (L3) to the on chip L2 via the memory bus (same as FSB in P4 and XP) as size of video data >> L2 size at any given time. So the high P4 dual channel memory bandwidth delivers an advantage. For the upcoming 939, I think it would even be better due to its 128-bit memory bus (w/ dual channel controller).

Let BW stand for effective memory bandwidth (not max),
DC stand for dual channel memory controller,
SC stand for single channel memory controller,
for the same bus speed (system bus, memory bus)

BW_939 > BW_P4_DC > BW_754 > BW_XP_DC > BW_XP_SC

at a ratio estimated respectively about

86-90 : 75 : 48 : 47 : 44

or

BW_939 = 27.5 - 28.8 bus (to be confirmed when 939 available)
BW_P4_DC = 24 FSB
BW_754 = 15.2 bus
BW_XP_DC = 14.8 FSB
BW_XP_SC = 14 FSB

Multiplying the corresponding number by the FSB in MHz gives the memory bandwidth in MB/s.
E.g. FSB = 200 MHz, mem_fsb_ratio 1:1, BW_P4_DC = 24 x 200 = 4800 MB/s
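
As a sketch, the per-bus factors listed above can be put in a table and multiplied out for any bus speed. The dictionary and function names are hypothetical; the 939 entry is kept as a low/high pair because it is still an estimate.

```python
# Effective bandwidth in MB/s per MHz of bus clock, from the list above.
BW_PER_MHZ = {
    "939": (27.5, 28.8),      # preliminary range
    "P4_DC": (24.0, 24.0),
    "754": (15.2, 15.2),
    "XP_DC": (14.8, 14.8),
    "XP_SC": (14.0, 14.0),
}


def effective_mb_s(platform, bus_mhz):
    """Return the (low, high) effective bandwidth in MB/s at the given bus speed."""
    low, high = BW_PER_MHZ[platform]
    return low * bus_mhz, high * bus_mhz


for name in BW_PER_MHZ:
    low, high = effective_mb_s(name, 200)
    if low == high:
        print(f"{name:6s} {low:.0f} MB/s")
    else:
        print(f"{name:6s} {low:.0f} - {high:.0f} MB/s")
```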
 
Differences between XP FSB and the A64 buses (separate memory bus and HyperTransport bus)

For XP, memory data, video card data, PCI data (hard disk, optical drives, networking, ...), serial links (USB, FireWire, ...), slower peripherals (keyboard, mouse, ...), everything goes through the FSB to/from the CPU.

Using nominal 200 MHz FSB running in DDR, with 64-bit data path, the
max bandwidth is 200 x 8 x 2 = 3200 MB/s = 3.2 GB/s

The traffic that is crucial to system performance, such as memory data, video card data and hard disk data (file I/O, paging), has to compete with everything else on the FSB, resulting in bottlenecks and system bus conflicts.


For the A64 CPU, the memory traffic and the traffic for the rest of the devices mentioned above are separated at the CPU rather than at the external chipset. This is the key difference in system bus architecture between the old XP and the new A64, and has an important advantage of system performance for the A64.

- Memory is communicating directly via a separate memory bus to the processor's on-chip north bridge/dual channel memory controller with 128-bit data path (for 939/940) and on-chip north bridge/single channel memory controller with 64-bit data path (for 754).
In another post, the effective bandwidth for the 128-bit dual channel is estimated at around 90% of max bandwidth, which is higher than the 75% number of P4 dual channel QDR.

- The rest of the subsystems such as video, hard drives (IDE, SATA, RAID), optical drives, networking, serial links, multi-CPU communication (for multi-processor boards), ..., communicate to/from the CPU via the HyperTransport bus to the external chipset and various bridges downstream.


HyperTransport is for point to point connecting the CPU to peripheral subsystems such as networking, storage, serial links, chip to chip communication, I/O, ....

(HyperTransport bus does exist in nforce2 chipset, linking NB and SB.)

It is based on packet switching with a packet size of multiple of 32 bit (4 byte), with a max packet size of 64 bytes. HyperTransport allows for bi-directional transfer.

Data width can be configured in 2, 4, 8, 16, 32 bit.

Currently, its specification is 200-800 MHz with DDR, hence
max bit rate = 800 x 2 = 1600 Mb/s (per bit of link width)

Since it is packet switching, the switching rate is usually referred to as number of transfer per sec (T/s). So the
maximum transfer rate = 1600 MT/s

At maximum 32-bit transfer, the
max bandwidth for 32-bit = 32 x 800 x 2 / 8 = 6400 MB/s = 6.4 GB/s
(peripheral bandwidth already higher than the XP FSB)

Since transfer is allowed for bi-direction, for 32-bit transfer, the
max throughput for 32-bit = 12.8 GB/s.

Compared with 33 MHz/32-bit PCI at 133 MB/s, this is 48X. Compared with the 1 GB/s of PCI-express, it is 12X.

The max bandwidth (for peripheral communication) of 6.4 GB/s is comparable to that offered by the current system bus (FSB), which is used for memory, video and peripherals alike. At 200 MHz, the max bandwidth is 6.4 GB/s for the dual channel quad pumped P4, and 3.2 GB/s for the DDR bus of AMD.
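
The bus figures above all come from the same formula: clock x transfers per clock x width in bytes. A minimal sketch, with a hypothetical helper name and the clock and width values quoted in this post:

```python
def bus_peak_mb_s(clock_mhz, width_bits, transfers_per_clock):
    """Peak one-way bandwidth in MB/s."""
    return clock_mhz * transfers_per_clock * width_bits / 8


ht_one_way = bus_peak_mb_s(800, 32, 2)   # HyperTransport, 32-bit, DDR
ht_both_ways = 2 * ht_one_way            # transfers run in both directions
xp_fsb = bus_peak_mb_s(200, 64, 2)       # XP FSB, 64-bit, DDR
p4_fsb = bus_peak_mb_s(200, 64, 4)       # P4 FSB, 64-bit, quad pumped
pci = bus_peak_mb_s(33.33, 32, 1)        # 33 MHz / 32-bit PCI

print(f"HT 32-bit one way:   {ht_one_way:.0f} MB/s")
print(f"HT 32-bit both ways: {ht_both_ways:.0f} MB/s")
print(f"XP FSB:              {xp_fsb:.0f} MB/s")
print(f"P4 FSB:              {p4_fsb:.0f} MB/s")
print(f"PCI:                 {pci:.0f} MB/s")
print(f"HT one way vs PCI:   {ht_one_way / pci:.0f}X")
```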


Summary:

Due to the separation of memory bus and HyperTransport (system) bus for all other devices in A64,
- the effective latency between the CPU (after L2 miss) and the memory (L3) is reduced
- the effective bandwidth of the A64 memory bus (128-bit in 939) to/from the CPU is alone higher than the effective P4 memory (and system) bandwidth, and twice that of XP
- the max bandwidth of the HyperTransport bus (for all other devices) to/from the CPU is alone comparable to P4 system bus, twice that of XP

The max combined bandwidth of memory bus (in 939) and HyperTransport in an A64 system is more than twice the system bus (FSB) of a P4 system and four times the system bus (FSB) of an XP system.
 
The L2 cache size from XP to A64 has grown from 256 KB to 1 MB, and we will eventually see 2 MB in A64.

Palomino, Tbred A, Tbred B - 256 KB
Barton - 512 KB
A64 - 512 KB, 1 MB, eventually 2 MB

Some remarks on cache latency, cache size, memory latency and memory bandwidth (for A64's)

The tradeoff and equivalence of cache latency, cache size and memory bandwidth have been studied to a great extent, .... Their average and statistical relationship and impact on performance have been well analyzed, ....

- Cache latency is the time or number of cycles to wait for getting the first data from the cache after a read command is issued.

- Cache size, the size of cache in bytes (for data/instruction), relates the probability (hit ratio) that data can be found in a cache. The larger the cache, the higher the chance to find the data in the cache. If data is not in L2 (cache miss), then data has to be looked for in the next stage cache or main memory (aka L3). Memory can be treated as the next level cache (L3) for L2.

- Memory latency is the time or number of cycles (many more cycles compared to L2 latency) to wait for getting first data from the main memory after a read command is issued.

- Memory bandwidth is the number of bits per sec (usually in MB/s) transfer to/from the memory after a transfer has started (i.e. after the latency).

Studies have shown that doubling the size of L2 translates into a few % (typically around 5%) of average system performance, over a wide range of programs (some benefit more and some less). This concurs with what we have been saying about the XP w/ 256KB L2 and the 512KB L2 Barton.

Studies have also shown that there is equivalence between cache size and cache latency, i.e. using larger memory/cache to tradeoff memory/cache with larger latency.

939 has twice the max memory bandwidth of a 754 due to the 128-bit memory bus in 939. This is in addition to the dual channel memory controller in 939. Applications that work on constantly changing, large, and well structured spatial data, as in scientific computations, video encoding/decoding, image processing, ..., would benefit directly from the 128-bit memory bus of 939 (vs the 64-bit of 754), since data needs to be refreshed constantly from the main memory (L3) to the on chip L2 via the memory bus when the size of the data >> L2 size at any given time.

Interesting question: how would the following A64 perform running the same CPU frequency, memory bus frequency and HT bus frequency
A - A64 1 MB L2, 128-bit memory bus (939/940)
B - A64 1 MB L2, 64-bit memory bus (754)
C - A64 512 KB L2, 128-bit memory bus (939)
D - A64 512 KB L2, 64-bit memory bus (754)

A is better than B or C.
B or C is better than D.
Between B and C, it depends on applications. For memory intensive applications, C has an advantage.
(My choice would be C (512 KB L2 939) at first to save money on the CPU, then upgrade later to A (1 MB L2 939) when CPU yields mature and prices come down.)


Recap on 939 memory latency, memory bus bandwidth and system bus bandwidth:

For the A64 CPU, the memory traffic and the traffic for the rest of the devices (video, IDE, SATA, serial links, ...) are separated at the CPU rather than at the chipset (NB). As a result,

- The average memory latency between the CPU (after L2 miss) and the memory (L3) is reduced.

- The effective bandwidth of the A64 memory bus (128-bit in 939) to/from the CPU is alone higher than the effective P4 memory (and system) bandwidth (estimated about 15-20% higher), and almost twice that of XP and also that of 754 (estimated 81-89% higher).
(See earlier post on memory bandwidth.)

- The max combined system bus bandwidth of memory bus (in 939) and HyperTransport in an A64 system is more than twice the system bus (FSB) of a P4 system and four times the system bus (FSB) of an XP system.
(See earlier post on system bus bandwidth.)
 
Cache and CPU performance

There are two processors A and B both running at 2.5 GHz, i.e. 2,500,000,000 clock cycles per sec. A basic CPU operation requires one clock cycle.

One processor A has a larger L2 cache, say 512 KB. Another processor B has a smaller L2 cache, say 256 KB.

L1 and L2 cache are for storing frequently used data for the CPU, temporarily until new data has to be swapped in from, and old data has to be swapped out to main memory. The processors can read from and write to the cache with very few clock cycles (cache latency).

Main memory (aka L3 in PC) can store a much larger amount of data (e.g. 1 GB main memory would be 2000 times a 512 KB L2). Reading/writing the main memory requires many more CPU cycles, say 30 - 80 times as many.

Hard drive (aka L4 in PC) can store even more data, ..., basically the universe of the data in your system, but it takes even more time, and it occurs during paging when data is not found in main memory in a computer system.

L1 cache, L2 cache, main memory (L3), hard disk (L4) form the so called memory hierarchy.

The larger the cache, the higher the chance (probability) of finding data there. Analysis shows that when the cache size is above a certain size for a given CPU architecture, CPI and cache latency, the probability will level off. Typically, the probability is around 85 - 95% for L2 ranging from 256 KB to 512 KB or even 1 MB.

The time to read/write data to the main memory typically requires many many more CPU cycles (see earlier number). So if the CPU needs data that is not in the cache (called cache miss), it would have to wait until the data arrives in the cache again from the main memory (many more cycles later than if it is found in the cache).

Even if both CPU A and B are running at the same frequency of 2.5 GHz, CPU A will finish a given job sooner than CPU B since the probability for CPU A to find data in the cache is higher than that of CPU B. CPU A has fewer cache misses than CPU B.
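
A very simplified way to see this is the expected cost of one memory access: hit rate times cache latency plus miss rate times memory latency. The sketch below is a textbook-style single-level model, not a measurement; the latencies and the two hit rates are assumed values picked from within the ranges mentioned above.

```python
def average_access_cycles(hit_rate, cache_cycles, memory_cycles):
    """Expected cycles per access with one cache level in front of main memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Assumed values for illustration only.
CACHE_CYCLES = 10      # latency of a cache hit
MEMORY_CYCLES = 500    # 50 times the cache latency, inside the 30 - 80x range

cpu_a = average_access_cycles(0.92, CACHE_CYCLES, MEMORY_CYCLES)  # larger L2
cpu_b = average_access_cycles(0.90, CACHE_CYCLES, MEMORY_CYCLES)  # smaller L2

print(f"CPU A: {cpu_a:.1f} cycles per access")
print(f"CPU B: {cpu_b:.1f} cycles per access")
```

Even a 2 point difference in hit rate changes the average cost noticeably because each miss is so expensive. The effect on whole-program speed is smaller, since only part of the run time is spent waiting on memory.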

Analysis has shown that, by doubling the L2 cache size, the overall performance would be improved by 0 - 10%+ over a wide range of applications, some more and some less, averaged typically by say 5%.

That is why we usually say a Barton (512 KB L2) performs 5% better than a 1700+ (256 KB L2) running at the same frequency, or that the 1700+ has to run 125 MHz faster (5% of 2.5 GHz) to break even with a Barton at 2.5 GHz. A few months ago, a Tbred B DLT3C 1700+/1800+ overclocked about 100 MHz better than a desktop Barton, so they were about tied. But recently the mobile Barton overclocks equally well, and in many cases even higher than the 1700+/1800+, so the mobile Barton is the better choice for performance (apart from the price difference).

For A64 CPU, from the PR rating of the A64 754 CPU:

2800+ 1800 MHz L2 512 KB
3000+ 2000 MHz L2 512 KB
3200+ 2000 MHz L2 1 MB
3200+ 2200 MHz L2 512 KB
3400+ 2200 MHz L2 1 MB

we can see that going from 512 KB to 1 MB L2 size, the CPU rating stays the same while running 200 MHz slower at around 2 GHz level. In other words, the difference between 512 KB and 1 MB L2 is equivalent to about 10% CPU performance (rating).
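
The same reasoning can be written as a small sketch that looks for ratings sold with both cache sizes and compares their clocks. The rows are copied from the 754 list above; the function name is hypothetical.

```python
# (PR rating, clock in MHz, L2 size in KB), copied from the 754 list above.
MODELS = [
    (2800, 1800, 512),
    (3000, 2000, 512),
    (3200, 2000, 1024),
    (3200, 2200, 512),
    (3400, 2200, 1024),
]


def cache_clock_equivalence(models):
    """For each rating offered with both L2 sizes, return clock(512 KB) / clock(1 MB)."""
    by_rating = {}
    for rating, mhz, l2_kb in models:
        by_rating.setdefault(rating, {})[l2_kb] = mhz
    return {
        rating: sizes[512] / sizes[1024]
        for rating, sizes in by_rating.items()
        if 512 in sizes and 1024 in sizes
    }


for rating, ratio in cache_clock_equivalence(MODELS).items():
    print(f"{rating}+: the 512 KB part needs {ratio - 1:.0%} more clock than the 1 MB part")
```

Only the 3200+ appears with both cache sizes in this short list, so the 10% figure rests on that single pair.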


What happens to programs running in CPU with smaller and bigger L2 cache (page 17)

Some remarks on cache latency, cache size, memory latency and memory bandwidth (for A64's) (page 19)
 
Comparing (preliminary) performance of A64 754, A64 939 and Barton

Based on the benchmark analysis in
http://www.ocforums.com/showthread.php?s=&postid=2751763#post2751763

hitechjb1 said:
...
2. Putting equal weight on every test (excluding one SSE2 floating point which Barton does not support), the result is
- FX-53 at 2.4 GHz performs 19.3 % better than a Barton at 2.5 GHz
- FX-53 at 2.4 GHz performs 7.3 % better than a Prescott at 3.2 GHz

3. Three memory intensive tests that FX-53 has huge score were then taken away and recomputed, namely the PC Mark 2002 memory and the two Sandra memory test, the result is
- FX-53 at 2.4 GHz performs 8.9 % better than a Barton at 2.5 GHz
- FX-53 at 2.4 GHz performs 7.3 % better than a Prescott at 3.2 GHz
...

The following CPU's running at same frequency are compared
- A64 939 1 MB L2 (2x memory bus)
- A64 939 512 KB L2 (2x memory bus)
- A64 754 1 MB L2
- A64 754 512 KB L2
- Barton

To go further for evaluating the effect of L2 cache size:
Based on AMD PR on its CPU:
2800+ 1800 MHz L2 512 KB
3000+ 2000 MHz L2 512 KB
3200+ 2000 MHz L2 1 MB
3200+ 2200 MHz L2 512 KB
3400+ 2200 MHz L2 1 MB

we can see that going from 512 KB to 1 MB L2 size, the CPU rating stays the same while running 200 MHz slower at around 2 GHz level. In other words, the difference between 512 KB and 1 MB L2 is equivalent to about 10% CPU performance (rating).

So based on the above set of benchmarks, scaled the 2.4 GHz A64 FX-53 to 2.5 GHz, we have for CPU's running at the same frequency

- For comparing CPU intensive programs, on the average (with equal weight)
1 GHz of 939 A64 with 1 MB L2
~ 1 GHz of 754 A64 with 1 MB L2
= 1.10 GHz of 939/754 A64 with 512 KB L2
= 1.13 GHz of Barton

Adding memory bandwidth gives 19.3%-8.9%=10.4% advantage to 939.
- For both CPU intensive and memory intensive programs, on the average (with equal weight)
1 GHz of 939 A64 with 1 MB L2
= 1.10 GHz of 939 A64 with 512 KB L2
= 1.10 GHz of 754 A64 with 1 MB L2
= 1.1 x 1.1 = 1.21 GHz of 754 A64 with 512 KB L2
= 1.13 x 1.1 = 1.24 GHz of Barton
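
The chain of equivalences above can be retraced step by step. This sketch only uses figures already quoted in this post and rounds at the same places, so it is a check of the arithmetic rather than new data; the variable names are illustrative.

```python
# Figures quoted from the benchmark summary above.
FX_VS_BARTON_CPU_ONLY = 1.089   # FX-53 @ 2.4 GHz vs Barton @ 2.5 GHz, memory tests removed
FX_VS_BARTON_ALL = 1.193        # same comparison over all tests
CACHE_FACTOR = 1.10             # 1 MB vs 512 KB L2, from the PR ratings

# Memory bandwidth advantage: 19.3 - 8.9 = 10.4 points, rounded to 10% as in the post.
MEMORY_FACTOR = 1 + round(FX_VS_BARTON_ALL - FX_VS_BARTON_CPU_ONLY, 1)

# Scale the FX-53 from 2.4 GHz to the Barton's 2.5 GHz, then round as in the post.
barton_factor = round(FX_VS_BARTON_CPU_ONLY * 2.5 / 2.4, 2)

cpu_intensive = {
    "939/754 A64, 1 MB L2": 1.00,
    "939/754 A64, 512 KB L2": CACHE_FACTOR,
    "Barton": barton_factor,
}
all_programs = {
    "939 A64, 1 MB L2": 1.00,
    "939 A64, 512 KB L2": CACHE_FACTOR,
    "754 A64, 1 MB L2": MEMORY_FACTOR,
    "754 A64, 512 KB L2": CACHE_FACTOR * MEMORY_FACTOR,
    "Barton": barton_factor * MEMORY_FACTOR,
}

for label, table in (("CPU intensive", cpu_intensive), ("All programs", all_programs)):
    print(label)
    for name, ghz in table.items():
        print(f"  1 GHz of 939 A64 1 MB L2 ~ {ghz:.2f} GHz of {name}")
```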
 
Corollary from last post:

Among the Barton, A64 754 with 512 KB/1 MB L2, A64 939 with 512/1 MB L2, running at the same frequency of CPU, FSB (e.g. 200 MHz nom), HT bus (e.g. 800 MHz nom), memory bus (e.g. 200 MHz nom)

On the average among CPU intensive programs,
A64 939 with 1 MB L2 performs
- about the same as an A64 754 with 1 MB L2
- about 10% better than an A64 939/754 with 512 KB L2
- about 13% better than a Barton.

On the average among all programs (counting CPU intensive and memory intensive programs),
A64 939 with 1 MB L2 performs
- about 10% better than an A64 939 with 512 KB L2
- about 10% better than an A64 754 with 1 MB L2
- about 21% better than an A64 754 with 512 KB L2
- about 24% better than a Barton.
 
Benchmark analysis of some A64, Barton, P4

Based on the benchmark analysis in
http://www.ocforums.com/showthread.php?s=&postid=2751785#post2751785


Summary of results:

Roughly speaking, putting equal weight on each of the SET of benchmarks (a mixture of very different programs), and
with the various CPU's running at same FSB (200 MHz) base line frequency, same memory and setting (except the FX-53 w/ ECC), same video card, ....

- A64 FX-53 performs about 32% better than a Barton, both CPU at 2.4 GHz, FSB 200 MHz
- A64 3200+, 3000+ running at 2.0 GHz perform about 10% and 8% respectively better than a Barton at 2.4 GHz.
- A64 FX-53 at 2.4 GHz also outperforms a P4 3.2E at 3.2 GHz by about 15%, and outperforms a P4 3.2 EE at 3.2 GHz by about 10%.

By breaking down the benchmark,

1. A64 FX-53 at 2.4 GHz performs better than a Barton at 2.4 GHz by about
- 60 - 100% on memory and cache bandwidth
- 5 - 20% on CPU intensive programs
- 30% on many games
- 40 - 70+% on some encoding and rendering programs

2. A64 FX-53 at 2.4 GHz performs better than an A64 3200+ (1 MB L2, single channel) at 2.0 GHz by about
- 29 - 91% on memory and cache bandwidth (29 - 91% as expected for SC)
- 20% on CPU intensive programs (become insignificant when scaled to 2.4 GHz)
- 12 - 20% on many games (become insignificant when scaled to 2.4 GHz)
- 20% on some encoding and rendering programs (become insignificant when scaled to 2.4 GHz)

Except for memory bandwidth and programs that can benefit from it, a 754 A64 with 1 MB L2 about ties the performance of an A64 FX running at the same CPU frequency.

3. A64 FX-53 at 2.4 GHz performs better than an A64 3000+ (512 KB L2, single channel) at 2.0 GHz by about
- 29 - 91% on memory and cache bandwidth (29 - 91% as expected for SC)
- 22 - 28% on CPU intensive programs (become 2 - 8% when scaled to 2.4 GHz)
- 15 - 29% on many games (become 3 - 9% when scaled to 2.4 GHz)
- 22% on some encoding and rendering programs (become 2% when scaled to 2.4 GHz)

Besides the lower memory bandwidth expected of the single channel 754, a 754 with the smaller 512 KB L2 cache performs 2-9% worse over a range of programs than one w/ 1 MB L2.
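
The "when scaled to 2.4 GHz" figures above look like the measured gain minus the 20% clock difference (2.4 vs 2.0 GHz). As a rough sketch in Python, here is that subtraction next to a ratio-based normalization. Both assume performance scales linearly with CPU clock, and the function names are hypothetical, just for illustration.

```python
def residual_gain_subtract(measured_gain_pct, fast_ghz=2.4, slow_ghz=2.0):
    """Shortcut that appears to be used in the post: subtract the clock advantage."""
    clock_gain_pct = (fast_ghz / slow_ghz - 1.0) * 100.0
    return measured_gain_pct - clock_gain_pct


def residual_gain_ratio(measured_gain_pct, fast_ghz=2.4, slow_ghz=2.0):
    """Alternative: divide the measured speedup by the clock ratio."""
    speedup = 1.0 + measured_gain_pct / 100.0
    return (speedup / (fast_ghz / slow_ghz) - 1.0) * 100.0


if __name__ == "__main__":
    # Measured FX-53 (2.4 GHz) gains over the A64 3000+ (2.0 GHz), in percent
    for gain in (22.0, 28.0):
        print(f"{gain:.0f}% measured: "
              f"subtract -> {residual_gain_subtract(gain):.1f}%, "
              f"ratio -> {residual_gain_ratio(gain):.1f}%")
```

For the 22 - 28% CPU intensive range, subtraction gives the 2 - 8% quoted above, while the ratio method gives slightly less (about 1.7 - 6.7%). Either way the conclusion is the same: the cache size difference is worth a few percent at equal clock.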


An A64 939 with 1 MB L2 should perform the same as an A64 FX.
 
I'd just like to say, wow, what an excellent series of posts. :clap: :clap:

I think that one question that we need to ask, as the 1MB Clawhammers are being phased out, is, how will the 512 KB DC-enabled, socket 939 Newcastles compare to them overall? Judging from this it appears that there is a considerable performance increase given by the extra cache, especially in gaming, and it also shows up quite dramatically in SuperPI: 3 seconds. Dual channel, however, adds a whole new arena of performance. We could expect to see an additional 90% in raw memory bandwidth. But which one will matter more to most of us: cache, or memory bandwidth?

Here is a thread with some results of an FX-53; socket 939 Sledgehammer. The difference in comparison to the 940 Sledgehammer doesn't look to be too significant to me. Moreover, in every benchmark run there, the socket-754 Clawhammer manages to hold its own.
 
For a price performance NEW build, an A64 754 + Nforce3 250 GB is becoming a better choice than a Nforce2 + mobile Barton, as of May 04. This is the reasoning.


These links show some benchmark analysis of an A64 FX-53, A64 3200+, A64 3000+, a Barton and some P4's.

http://www.ocforums.com/showthread.php?s=&postid=2762781#post2762781
http://www.ocforums.com/showthread.php?s=&postid=2766934#post2766934

Roughly speaking, when clocked to the same frequencies of CPU, FSB, HT, memory, IMO
the top line FX-53 (which 939 would resemble) is better than a Barton by about 24-32% (averaged over a range of programs; see the detailed breakdown in the links).

An A64 754 with 1 MB L2 would be close to the FX-53 except for memory bandwidth and memory intensive programs. For memory intensive applications, the 939/940 would have an edge on performance over the 754 (with same L2 size and running same frequencies) ranging from 20-80%, as seen from those benchmarks.

An A64 754 with 512 KB L2 would be 2-10% worse than an A64 754 w/ 1 MB L2.
....


A 754 512 KB L2 A64 + 250 GB motherboard can be had for around $100 more than a Nforce2 + mobile Barton when building a NEW system, and brings the A64 technologies + 15-25% average gain over a Barton (at same frequencies).

A 754 1 MB L2 A64 + 250 GB motherboard can be had for around $150 more than a Nforce2 + mobile Barton when building a NEW system, and brings the A64 technologies + 20-30% average gain over a Barton (at same frequencies).

....


As of May 04, here assuming limiting the choice to 754 (939 not out yet):

There have been reviews of some 754 motherboards with Nforce3 250 GB (with GB); hopefully they will be on the street very soon.

The MSI K8N Neo is one that has received good reviews and it has x5 multiplier for the HT bus.
Other good motherboards may be coming soon, if you can wait a little longer.

I think for good price performance:
- a CPU, e.g.
a 754 3000+ 512 KB L2 with CG rev, NewCastle core (OPN AX) with x10 multiplier, or
a 754 mobile 3000+ 1 MB L2, with CG rev, ClawHammer core (OPN AR) with x9 multiplier
- a 754 motherboard w/ 250 GB with 5X HT setting, such as the one above or better ones
- SLK-948 (or wait for the heat pipe version)
- 2 x 512 MB memory (DDR500 or overclock equivalent to run 250 MHz+ ASYNC)
(IMO, this combo is coming to a point where it is a better choice than a Nforce2 w/ mobile Barton for a NEW build, even considering price).

There are also 754 mobile, 754 mobile DTR, 754 desktop 3200+ 1 MB L2 version, with x10 multiplier.

Target setup:
CPU x9, target to 2.25 - 2.7 GHz
Memory at 250 - 300 MHz, ASYNC, effective BW = 3800 - 4560 MB/s
HT at x3, at least 750 - 900 MHz w/ DDR
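
The target numbers above can be reproduced with a small Python sketch. The assumptions are mine, not from a datasheet: peak single channel DDR bandwidth is taken as memory clock x 2 transfers x 8 bytes (64-bit bus), and "effective BW" is taken as 95% of peak, which is the factor that matches the 3800 - 4560 MB/s figures. The function and parameter names are hypothetical.

```python
def target_setup(ref_mhz, mem_mhz, cpu_mult=9, ht_mult=3, mem_efficiency=0.95):
    """Return (cpu_ghz, ht_mhz, peak_mb_s, effective_mb_s) for a single channel 754.

    ref_mhz        -- reference (FSB/HTT) clock in MHz
    mem_mhz        -- memory clock in MHz (can differ from ref_mhz when ASYNC)
    mem_efficiency -- assumed fraction of peak bandwidth actually achieved
    """
    cpu_ghz = ref_mhz * cpu_mult / 1000.0
    ht_mhz = ref_mhz * ht_mult
    # DDR: 2 transfers per clock; 64-bit single channel bus = 8 bytes per transfer
    peak_mb_s = mem_mhz * 2 * 8
    effective_mb_s = peak_mb_s * mem_efficiency
    return cpu_ghz, ht_mhz, peak_mb_s, effective_mb_s


if __name__ == "__main__":
    for clock in (250, 300):
        cpu, ht, peak, eff = target_setup(ref_mhz=clock, mem_mhz=clock)
        print(f"ref {clock} MHz: CPU {cpu:.2f} GHz, HT {ht} MHz, "
              f"memory peak {peak} MB/s, effective ~{eff:.0f} MB/s")
```

Changing cpu_mult to 10 shows what the x10 parts would need for the same reference clock range.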

....


Of course, this does not rule out spending more for the larger cache version or the 939, but that is beyond the scope of the current discussion for price performance NEW build.
 
Update new models 2700+, 2800+ for A64 mobile (page 1).

Latest AMD tech doc 30430 rev. 3.21 May 04
1.20V
35W
FC0h
512 KB L2
OPN ends with AX
which points to being a NewCastle.
 