
A64 CPUs, chipsets, motherboards

Wow, that is a lot of text. Reading and digesting the entire contents of this thread would take a bit over a day.
 
Going from 130 nm to 90 nm, the dimensions of the transistors and metal wires in a chip are shrunk; this is called MOS scaling. MOS scaling is described in more detail below.

MOS scaling, voltage, power and leakage current

In the last three decades, MOS (metal oxide semiconductor) technology has gone through so-called classical scaling: each generation shrinks the dimensions of transistors and wires, and reduces voltage, so as to maintain a constant electric field in the transistors.

Let a be the scaling factor, a > 1.
Typically, for each generation shrink, a ~ 1.4
device density = a^2 = 1.96 ~ 2 (to double device density)
E.g. going from 180 nm to 130 nm, a = 180 / 130 = 1.385
E.g. going from 130 nm to 90 nm, a = 130 / 90 = 1.444

transistor channel length = 1 / a (decrease)
transistor oxide thickness = 1 / a (decrease)
transistor width = 1 / a (decrease)
wiring width = 1 / a (decrease)
voltage = 1 / a (decrease)
doping = a (increase)

Consequence:
electric field = 1 (constant)
transistor area = 1 / a^2 (decrease)
density = a^2 (increase)
capacitance = 1 / a (decrease)
delay = 1 / a (decrease)
power / circuit = 1 / a^2 (decrease)
power density = 1 (constant)

Recently, as voltage has been scaled down to 1 - 1.x V, there has been deviation from classical scaling: the voltage is scaled down less aggressively, to overcome the intrinsic threshold voltage of the transistors and maintain speed. This is called generalized MOS scaling.

voltage is scaled according to k / a (scaled down less aggressively)
doping is scaled according to k a
where k > 1.

As a result, chip dimensions are still scaled as before to get density increase.

electric field = k (increase) <-- deviates from classical scaling
capacitance = 1 / a (decrease)
delay = 1 / a (decrease)
power / circuit = k^2 / a^2
power density = k^2 (increase) !!!! <-- deviates from classical scaling
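
Here is a quick Python sketch of these scaling relations (an illustration only, not from any datasheet); setting k = 1 recovers classical scaling, while k > 1 gives generalized scaling:

Code:
# MOS scaling relations per generation shrink.
# a = linear shrink factor (a > 1, typically ~1.4)
# k = voltage relaxation factor (k = 1: classical, k > 1: generalized)
def mos_scaling(a, k=1.0):
    return {
        "channel_length":    1 / a,        # decreases
        "oxide_thickness":   1 / a,        # decreases
        "voltage":           k / a,        # scaled down less aggressively if k > 1
        "electric_field":    k,            # constant only for k = 1
        "device_density":    a ** 2,       # ~2x per generation for a ~ 1.4
        "capacitance":       1 / a,        # decreases
        "delay":             1 / a,        # decreases (faster switching)
        "power_per_circuit": k ** 2 / a ** 2,
        "power_density":     k ** 2,       # constant for k = 1, increases for k > 1
    }

a = 130 / 90                      # 130 nm -> 90 nm shrink, a ~ 1.444
print(mos_scaling(a))             # classical scaling
print(mos_scaling(a, k=1.1))      # generalized: power density up by k^2 ~ 21%

The last line shows why generalized scaling runs hot: even a modest k = 1.1 raises power density by about 21% per generation instead of holding it constant.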

Further, from 90 nm onward, the static leakage current component (not computation related) is increasing at a much faster pace than the active current component (computation and clock related), and leakage current is going to surpass active current in some future generation. Due to the increasing leakage current, power dissipation limits clocking the processors at a pace similar to that of past MOS scaling. The pace of clock rate increase will begin to level off with each future generation, until some drastic solution can be found....


Voltage reduction

What it says is that over the last few decades, as transistors have shrunk to ~70% of their linear dimensions each generation (so that more complex, function-richer processors and chips can be built), the voltage has also had to be reduced by about the same amount to meet the scaling requirement, i.e. to maintain constant electric field and power density for reliability and heat reasons.

Recently there has been some deviation from the above (classical) scaling, reducing voltage by a lesser amount to gain speed, but the side effects are higher leakage (more power dissipation) and a higher electric field in the transistors (reduced reliability?).

When going from 180 nm to 130 nm to 90 nm, the max nominal voltage has to be reduced accordingly, i.e. processors with smaller features hit the voltage wall at a lower level than their predecessors with larger features.


Power estimation

For 90 nm and beyond, static leakage power can no longer be neglected. It contributes a bigger and bigger percentage of the total power at 90 nm and in future generations. As an illustration (numbers quoted for illustration only, not exact), leakage power accounts for 10-15% of total power at 130 nm, while leakage power (including sub-threshold leakage and gate leakage) accounts for 25-35% of total power at 90 nm, and is expected to get even higher at 65 nm and beyond.

In the past, say 180 nm, power ~ active_power = A V^2 f

i.e. power is roughly proportional to V^2 and f, where A is a constant. Since static leakage power is relatively small (say less than 5-10%), one can estimate the power of an overclocked processor by

power(V, f) = power(V_stock, f_stock) (V / V_stock)^2 (f / f_stock)

When static leakage power can no longer be neglected: static leakage power is non-computational and not clock related, so it is proportional to V^2 only. That means one has to estimate both the active power component and the static leakage power component, and only the active power component scales with clock frequency. The static leakage power is dissipated regardless of the processor clock.

For a better power representation,

power = active_power + static_leakage_power
power = A V^2 f + B V^2

or in general
power = A V^2 f + B V^2 + C

where C accounts for the power components that are not dependent on V (V being the processor's core voltage VDD for computation). E.g. for the A64, the component C contains the power drawn from the other voltage sources, PLL (VDDA), memory controller interface (MEMVREF), I/O (VDDIO), HT I/O (VLDT), which do not depend on VDD or the processor clock frequency f.
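
To make this concrete, here is a small Python sketch of the power model above. The coefficients A, B, C below are made-up illustration values (they would have to be fitted from actual measurements for a given CPU), not AMD numbers:

Code:
def cpu_power(V, f, A, B, C=0.0):
    """power = A V^2 f (active) + B V^2 (static leakage) + C (V-independent rails)
    V = core voltage VDD in volts, f = clock in MHz"""
    return A * V**2 * f + B * V**2 + C

# Illustration-only coefficients, chosen so a 1.4 V / 1800 MHz core dissipates
# ~26 W active + ~9 W leakage, plus ~6 W from other rails:
A, B, C = 0.0074, 4.5, 6.0

stock = cpu_power(1.40, 1800, A, B, C)   # ~ 41 W
oc    = cpu_power(1.55, 2600, A, B, C)   # ~ 63 W
print(f"stock ~ {stock:.0f} W, overclocked ~ {oc:.0f} W")

# The old 180 nm style rule of thumb scales everything by (V/V0)^2 * (f/f0);
# it over-scales the leakage and C terms, which do not follow f:
print(f"rule of thumb ~ {stock * (1.55 / 1.40)**2 * (2600 / 1800):.0f} W")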

This link gives an example of a more detailed power estimation for an overclocked Winchester, taking into account active current, leakage current, and I/O current, and the corresponding power estimates.

PSU rating estimate for some 939 CPU and system


About active current and power, leakage current and power

Active current is the current component doing logic operations, switching the transistors ON and OFF and charging and discharging the billions of internal capacitors.
Active current is proportional to supply voltage and clock frequency.
Active current times supply voltage gives the active power.

Leakage current is the current component that is not computation related; it just leaks through the transistors from VDD to ground.
Leakage current is proportional to supply voltage (without clock frequency).
Leakage current times supply voltage gives the static leakage power (or static power or leakage power).

For 90 nm and beyond, when the voltage is raised beyond a certain point, the non-computational leakage current and power can increase drastically. The heat generated would then prevent further voltage increases aimed at sustaining higher clocks.


This post describes what active current and leakage current are.
How does leakage current slow down future generations of chips

hitechjb1 said:
More about leakage current and leakage power

In a silicon chip, the lowest part is the silicon substrate, on which 10-100 million transistors are deposited (current technology). Above the transistors are hundreds of millions of wire segments in the form of a multilayer grid. The metal wires are for getting power from the outside, bringing signals in and out of the chip, and passing signals around the chip to the transistors.

The bulk of the silicon substrate is connected and typically grounded. Such a silicon structure is usually called bulk silicon. This is what silicon chips in the past, down to 130 nm, are like. Currents also leak through the transistors to the substrate.

From 90 nm down (and for some 130 nm chips, which are SOI), most silicon chips have the silicon body insulated from the substrate, hence the name silicon on insulator (SOI). So the leakage currents through the transistors to the substrate are significantly reduced. This is the good part.

BUT the bad news is that the main part of the leakage current, in both bulk silicon and SOI, is the internal leakage current through the 10-100 million transistors. Transistors come in p- and n-type. Inside a chip, between the power supply (VDD) and ground, there are tens of millions of transistor paths, made up of some p- and some n-type transistors, and leakage current constantly flows through those paths. This is called leakage current, or OFF current (since ideally the paths should be off). So the leakage power can be written as V^2 / R, where V is the voltage (typically VDD) and R represents the equivalent resistance of all those leakage paths. In older generations of silicon, these leakage paths and currents were relatively small and had not been an issue.

As transistors get smaller and smaller (90, 65, 45 nm) and the transistor gate oxide thinner and thinner, these leakage currents get larger and larger (relative to the normal active current used for switching). And as described in the last post, the "wasteful" leakage current will become larger than the "useful" active current unless something can be done. So the power for computation relative to the leakage power is getting smaller each generation, and the frequency gain per generation will level off.


Gate leakage tunneling current

In the so-called 90 nm and 65 nm nodes (the numbers refer to the channel length of the transistors), the gate oxide thickness (the vertical dimension) of the transistors is already down to 1 nm (= 10 Angstrom) or just slightly above, which is on the order of a few to a dozen atoms.

Leakage current through the gate oxide insulation in transistors due to the quantum tunneling effect has already been observed and can be quantified. In 90 nm, gate leakage due to tunneling can contribute about 5-10% of the total processor power.
 
Prerequisite:
PSU rating estimate for some 939 CPU and system


Measured power and power estimation (average- vs worst-case)

From the many posts and links, we have observed that there is a significant difference between actual measured power and power estimates based on specifications. Here is an attempt to explain the reasons behind this.

Each electronic component, such as a CPU, GPU, HD, video card, or memory chip/module, is specified by its manufacturer with certain power, current, frequency, voltage, temperature, ... numbers so that one can design and choose components to build a system. These numbers are usually specified for the worst-case scenario.

Actual measurements of power and current for a component (CPU, HD, GPU, video card, ...) or a system generally reflect a realistic, average-case situation. The actual numbers depend on how the components and system are set up and used in real life; there is no single specification number that fits all cases.

Example 1: For a system with 4 hard drives, each with a (worst-case) power specification of 15 W, the 15 W can be further broken down into power numbers for spinup, read, write, seek, and standby. The worst-case power of the 4 hard drives would be 60 W. But not all hard drives spin up or read/write at the same time in a system, and the actual activity depends on how the drives are configured (RAID, server, data backup, system drive, paging, ...). So it is not uncommon to see actual power measured at around 50% of the worst-case number.

Example 2: Manufacturers may put higher numbers into the specification. The discrepancy may be due to process improvement, or to the conditions wherein the CPU operates not being as severe and exhaustive as the manufacturer's stress test conditions.
e.g.
Winchester 3000+/3200+/3500+ CPU power
AMD TDP specification = 67 W (rated)
Power measurement = 35 W (52%, based on some reported measurement)

actual_power_component < worst-case_power_component (specification)

Example 3: A CPU and video card, worst-case power estimation is

worst-case_power_CPU_video_card = worst-case_power_CPU + worst-case_power_video_card

actual_power_CPU_video_card < worst-case_power_CPU_video_card

In real usage, a CPU and a video card (even in gamers' systems) may not be fully stressed at the same time (e.g. CPU-intensive programs and Doom3 are seldom run together). Further, due to system task management, the various units of a CPU are not all under full stress at the same time as the memory controller and video card.

Example 4: In another system discussed, with a 3200+ Winchester at 2.6 GHz 1.55 V, a 6800 Ultra video card, 4 hard drives, 2 optical drives, ..., adding up the power specifications of the individual components gives an estimate of about 396 W. An actual measurement (with some overclocking adjustment) showed about 241 W, about 61% of the estimate.

This can be explained by these differences between actual measurements and the worst-case specifications of individual components and systems:
- not all the possible situations have been tested and measured,
- not all the components in a system are under exhaustive worst-case stress at the same time,
- manufacturer specifications add design margin to the raw numbers.

In general,

actual_power_component < power_spec_component (worst-case)

actual_power_system < power_spec_component_1 + power_spec_component_2 + …

The difference between the worst-case power specification/estimation and the actual measurements reported can be quite significant. Actual numbers are typically 50% - 80% of the estimate, as seen.

One should not design a system based only on average-case measurement results, as that may not provide enough design margin to handle all stressed conditions exhaustively.


So I think we should use, in concert, both the power estimation based on worst-case specifications and actual power measurements based on real-life, application-specific conditions.

The power estimation based on worst-case specifications will always be higher; it gives the worst-case scenario for the individual components and the overall system, and provides a design margin and safety net for the system design. It should be used as a worst-case guideline, in conjunction with real measurements.

The actual measurement reflects a realistic view of the system's (power) behavior under given setup and application conditions, and can keep the power estimation "honest". On the other hand, measurements may not be able to explore some of the worst-case situations a system can get into.


E.g. to illustrate that the actual power usage is a function of how the hard drives are configured and used:

One HD:
total_actual_power_HD ~ 80% of total_worst_case_power_HD

Two HD in RAID, both HD are working together:
total_actual_power_HD ~ 80% of total_worst_case_power_HD

Two HD not running in RAID, less power since one HD would be idle often:
total_actual_power_HD ~ 65% of total_worst_case_power_HD

Four HD in busy file server, all HD are for data access:
total_actual_power_HD ~ 75% of total_worst_case_power_HD

Four HD in a desktop system, two or three HD would be idle often:
total_actual_power_HD ~ 50% of total_worst_case_power_HD

Numbers are for illustration only. As illustrated, the actual power can vary significantly depending on the actual situation.
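
As a quick sketch, this kind of duty-factor estimate can be written in a few lines of Python (the worst-case ratings and duty factors here are hypothetical, like the percentages above):

Code:
# Estimated actual vs worst-case power for a set of hard drives,
# each derated by a hypothetical utilization (duty) factor.
drives = [
    ("system drive", 15.0, 0.80),
    ("data drive 1", 15.0, 0.50),
    ("data drive 2", 15.0, 0.50),
    ("backup drive", 15.0, 0.35),
]

worst  = sum(w for _, w, _ in drives)
actual = sum(w * duty for _, w, duty in drives)
print(f"worst case {worst:.0f} W, estimated actual {actual:.1f} W "
      f"({100 * actual / worst:.0f}% of worst case)")
# -> worst case 60 W, estimated actual 32.2 W (54% of worst case)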
 
anything on ati's bullhead chipset? i believe it's the RS480...i've seen deception & felinusz post in a thread on it at the xs forums...
 
ATI Radeon XPRESS 200 PCI-express chipset for A64 Platforms

The XPRESS 200P, codenamed RX480, is the north bridge supporting a discrete PCI-e video card. The XPRESS 200G, codenamed RS480, is the north bridge that integrates DX9 graphics. Each of the RX480 and RS480 combines with the SB400 south bridge.

The RX480/RS480 supports both single and dual channel memory, one PCI-e x16 slot for graphics, and up to 4 PCI-e x1 slots. The north and south bridges communicate via 2 PCI-e lanes. Communication with the CPU is via a 1 GHz HyperTransport bus.

Rumor is that the upcoming RX482 and RS482 chipsets will support dual video cards in two PCI-e x8 slots, without even needing external bridges.

http://www.ati.com/products/radeonxpress200/index.html
http://www.ati.com/products/radeonxpress200/specs.html

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2269

http://www.legitreviews.com/article.php?aid=117
 
As of Feb 01, 2005, DFI Nforce4 SLI-DR and DFI UT Nforce4 Ultra-D are becoming available in retail stock.

A price-performance PCI-e system (as of Feb 01, 05):

DFI UT Nforce4 Ultra-D $160 or DFI Nforce4 SLI-DR $210
Winchester 3000+ / 3200+ $150 - 200
XP-90 $40
2 x 256 MB or 2 x 512 MB TCCD based memory modules, e.g. G. Skill PC4400 LE (280 - 300+ MHz, 2.5-3-3-7 1T 2.8 V), BH-5/UTT based modules (250-260+ MHz 2-2-2-x 1T 3.3+ V) , $130 - 260
If the system can run above 300 MHz HTT, TCCD is preferred for its high frequency.
6600 GT or 6800 GT or X800 XL, depending on budget and level of gaming needs, $200 - 400
Fortron Blue Storm 500W AX500-A or Antec Neo Power 480W or OCZ PowerStream 520W (try reusing PSU first) $90 - 120

Total = $770 - 1230 (depends on 512/1024 MB of memory and video card)

The Nforce4 Ultra-D can potentially run dual Nvidia video cards in x16/x2 DXG mode.


Anandtech just put up a review "DFI nForce4: SLI and Ultra for Mad Overclockers"
http://www.anandtech.com/mb/showdoc.aspx?i=2337

Nforce4 chipsets with PCI-e, SLI features


DFI LanParty UT NForce4 Ultra-D, LanParty Nforce4 SLI-DR

Both the SLI-DR version and the Ultra-D version use the exact same PCB layout.
Both versions have 2 PCI-e x16 slots, 1 PCI-e x4 slot, 1 PCI-e x1 slot.
Both versions have 2 regular PCI slots.

HTT 200 to 450+ MHz
Vcore can be adjusted to 2.1 V (1.55 V startup + %),
Vdimm to 3.2 V or 4 V (using 5V jumper),
Vchipset to 1.8 V,
V_LDT to 1.5 V.
Memory maxclock = Auto, 100, 120, 133, 140, 150, 166, 180, 200

The SLI version has 2 PCI-e x16 slots for dual video cards, each running at x8 bandwidth.
The Ultra-D version has 2 PCI-e x16 slots for dual video cards running at x16 and x2 bandwidth respectively.
In dual video card mode, it has been reported that the Ultra-D DXG mode gives about 10% less video performance than SLI mode, but costs about $50 less.

The SLI-DR version has 4 SATA2 channels (RAID 0, 1) + 4 SATA channels (RAID 0, 1, 5; from the Sil3114 controller).
The Ultra-D version has 4 SATA2 channels (RAID 0, 1).

10 USB, 2 IEEE 1394A (firewire), dual gigabit ethernet
Karajan Audio Module based on Realtek ALC850 8-Channel codec

For the detailed specification, refer to the following DFI links.

UT Nforce4 Ultra-D
http://www.dfi.com.tw/Product/xx_pr....jsp?PRODUCT_ID=3471&CATEGORY_TYPE=LP&SITE=NA
Dual Xpress Graphics (DXG) mode in DFI UT Nforce4 Ultra-D:
http://www.dfi.com.tw/Press/press_h...&TITLE_ID=4890&LINKED_URL=arch344.jsp&SITE=NA

Nforce4 SLI-DR
http://www.dfi.com.tw/Product/xx_pr....jsp?PRODUCT_ID=3449&CATEGORY_TYPE=LP&SITE=NA


A64 Nforce4 939 Motherboards
 
Memory bandwidth and efficiency in terms of CPU frequency, memory frequency, CPU_memory_divider, CPU_multiplier, memory_HTT_ratio

1. For the same CPU frequency and memory frequency and timing, the memory bandwidth is about the same, independent of the memory_HTT_ratio or max memclock setting.

2. The memory bandwidth efficiency varies with the CPU_memory_divider,
where CPU_memory_divider = CPU_frequency / memory_frequency

E.g. based on some measurements using a Winchester 3000+, DFI LP UT Nforce4-D, G. Skill 4400 LE
memory = CPU / 8, memory_bandwidth_efficiency = 73%
memory = CPU / 9, memory_bandwidth_efficiency = 81%
memory = CPU / 10, memory_bandwidth_efficiency = 86%
memory = CPU / 11, memory_bandwidth_efficiency = 90%
memory = CPU / 12, memory_bandwidth_efficiency = 93%
memory = CPU / 13, memory_bandwidth_efficiency = 93.5%
memory = CPU / 14, memory_bandwidth_efficiency = 94%
etc

The higher the CPU_memory_divider, i.e. the higher the ratio of CPU_frequency to memory_frequency, the higher the memory_bandwidth_efficiency.


Since CPU_memory_divider depends on CPU_multiplier and memory_HTT_ratio and is given by

CPU_memory_divider = ceiling(CPU_multiplier / memory_HTT_ratio)

so it looks as if memory_bandwidth_efficiency depends on CPU_multiplier. That is only true when the memory_HTT_ratio is kept constant.

Indeed, at the default memory_HTT_ratio of 1:1, the memory_bandwidth_efficiency of a 3500+ at x11 is higher than that of a 3200+ at x10, which is higher than that of a 3000+ at x9.

But for a CPU with a lower multiplier (e.g. 3000+), one can lower the memory_HTT_ratio to achieve the same CPU_memory_divider as a CPU with a higher multiplier (e.g. 3200+, 3500+).
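
These relations are easy to play with in Python; a quick sketch (with memory_HTT_ratio expressed as a fraction, e.g. 166/200 for the "166" max memclock setting):

Code:
import math

def cpu_memory_divider(cpu_multiplier, memory_htt_ratio):
    # Integer divider: the memory controller rounds up
    return math.ceil(cpu_multiplier / memory_htt_ratio)

def memory_frequency(htt_mhz, cpu_multiplier, memory_htt_ratio):
    cpu_mhz = htt_mhz * cpu_multiplier
    return cpu_mhz / cpu_memory_divider(cpu_multiplier, memory_htt_ratio)

print(cpu_memory_divider(9, 1.0))          # 3000+ at 1:1 -> divider 9
print(cpu_memory_divider(9, 166 / 200))    # lower ratio -> divider 11
print(round(memory_frequency(300, 9, 166 / 200)))  # 2700 MHz CPU / 11 -> ~245 MHz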

hitechjb1 said:
Possible explanation:

When the CPU is clocked faster (consuming data faster), the memory controller is not fast enough to keep pace and provide enough data I/O to the L2 cache, which runs in sync with the processor clock; this results in more cache wait states relative to the memory controller and in turn lower efficiency (the actual bandwidth is still higher, but efficiency, i.e. bandwidth per memory clock, is reduced).

Will the revision E0 correct/improve this?
 
WHOA! I'm a first time overclocker trying to learn a little bit about what to do and how to prepare to overclock my shiny new A64 3500+ (939) on my Epox 9NDA3+ mobo... this is all so confusing. Can someone quote me and then paste what I want to do in their response? If you could do that for me, not only would I be less confused, but very much appreciative!

Thanks!
 
Venice and San Diego (April 2005)

Desktop A64 939 (90 nm SOI DSL)
3000+: ADA3000DAA4BP 1.35/1.4V (DH E3 rev, 00020FF0h) <- Venice, 512 KB L2, 1.8 GHz, x9, 67 W
3200+: ADA3200DAA4BP 1.35/1.4V (DH E3 rev, 00020FF0h) <- Venice, 512 KB L2, 2.0 GHz, x10, 67 W
3500+: ADA3500DAA4BP 1.35/1.4V (DH E3 rev, 00020FF0h) <- Venice, 512 KB L2, 2.2 GHz, x11, 67 W
3800+: ADA3800DAA4BP 1.35/1.4V (DH E3 rev, 00020FF0h) <- Venice, 512 KB L2, 2.4 GHz, x12, 89 W

Desktop A64 939 (90 nm SOI DSL)
4000+: ADA4000DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) <- SanDiego, 1 MB L2, 2.4 GHz, x12, 89 W

First stepping code:
Venice CBBLE 0504


A64 940, 754, 939 CPU Models, OPN code, rating (post 5)

About Rev E and SSE3 instructions

Some links about latest silicon technology, Silicon on Insulator (SOI), Strained Silicon (SS), Dual Stress Liner (DSL)

Some overclocking scenarios for 939 Winchester/Venice/San Diego


http://www.xbitlabs.com/news/cpu/display/20050310212502.html
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Rev_change_1_UK.pdf
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Rev_change_2_UK.pdf
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Rev_change_3_UK.pdf
 
Major differences between Venice (E3)/San Diego (E4) and Winchester (D0)/NewCastle (CG)/ClawHammer (CG):

1. Addition of SSE3 instructions which accelerate a number of different types of computation, including video encoding, scientific computing, and software graphics vertex shaders.
About Rev E and SSE3 instructions

2. Faster transistors, by using a new form of strained silicon (on insulator) process called Dual Stress Liner, with up to 24% faster transistor speed (quoted from the IBM/AMD announcement of Dec 13, 2004). Hence there is good potential for higher clocks as process and yield improve. (Note that a 24% transistor speed increase may not translate directly and immediately into the same improvement in CPU clock; it is rather an upside potential.)

hitechjb1 said:
In regular silicon, atoms are spaced apart with certain distance determined by the silicon lattice.

In strained silicon, silicon is deposited onto a substrate (such as silicon germanium) whose atoms are spaced farther apart in the lattice than in a regular silicon lattice. Since atoms tend to align with one another, the top silicon atoms are stretched, or strained, to align with the atoms underneath in the stretched lattice.

In strained silicon (lattice), electrons flow with less resistance, up to 70% faster, which in turn can lead to 35% faster chips without scaling down the size of transistors (numbers quoted from IBM).

Strained Silicon (SS) can be built on top of Silicon on Insulator (SOI), the two are not mutually exclusive. Intel, IBM, AMD, ... are building 90 nm chips using both SS and SOI in various ways. IBM called it SSDOI (Strained Silicon Directly on Insulator).

Conventional strained silicon on insulator is "singly stressed" only. Dual stress liner refers to applying both "stretched" and "compressed" stress, on the NFET and PFET respectively, to achieve further speed improvement.
Some links about latest silicon technology, Silicon on Insulator (SOI), Strained Silicon (SS), Dual Stress Liner (DSL)

3. Venice/San Diego has an improved memory controller which supports up to 4 double-sided memory modules running at DDR400 (not sure about 1T yet), compared to DDR333 for Winchester, and also has better compatibility with various memory modules.

4. More metal layers in the E3/E4 revision chips.

...


A review from xbitlabs:
AMD Athlon 64 3800+ CPU: E3 Processor Core aka Venice at the Door

The performance of Venice compared to Winchester and NewCastle can be summarized in this graph from Xbitlabs (from the above article).

It is assumed that all processors are tested at the 3500+ rated frequency of 2.2 GHz. So according to the xbitlabs measurements, Venice is roughly 0.5 - 2% better than Winchester (say 1% on average) and 1 - 7% better than NewCastle (say 2-3% on average), clock for clock.

xbitlabs_venice_result.JPG




Low PR 90 nm 939 Winchester (Sept 2004)
Venice and San Diego (April 2005)
Some overclocking scenarios for 939 Winchester/Venice/San Diego
 
Overclocking frequency and voltage of various A64

- The 90 nm SOI processors (e.g. SanDiego/Venice/Winchester) can be clocked to higher frequencies at lower voltages than the 130 nm processors (NewCastle/ClawHammer).
- The 90 nm SOI with DSL strained silicon (e.g. SanDiego/Venice) improves frequencies further, by 200+ MHz on average, compared to 90 nm SOI without DSL (e.g. Winchester), and by at least about 200 MHz on average compared to Winchester, NewCastle, and ClawHammer.

Based on preliminary data collected in
A64 Overclocking Result Collection

A64_avg_freq_vs_voltage_2.JPG
 
Do "best" CPU voltage and frequency exist for overclocking

Like body height and weight in the male or female population, ..., there is NO SINGLE number for voltage and frequency that specifies the best achievable overclock of a CPU core for a given type of cooling. Each particular CPU behaves differently to some extent. For a large sample, the best way to model the overclocking behaviour in terms of voltage and frequency, IMO, is the so-called "normal distribution".

Based on some sample results from members here and on the net for the overclocking of various A64 cores in an uncontrolled environment, I calculated the mean and standard deviation for each of them; they are plotted as shown.

A64_avg_freq_vs_voltage_2.JPG


E.g. for Venice, I estimated

overclock_frequency_average = 2781 MHz
overclock_frequency_standard_deviation = 141 MHz
overclock_voltage_average = 1.548 V
overclock_voltage_standard_deviation = 0.102 V

In other words,
- about 68.2% of Venice would be within 1 standard deviation,
i.e. 1.548 V +- 0.102 V for voltage, 2781 MHz +- 141 MHz for frequency
- about 95.4% of Venice would be within 2 standard deviations,
i.e. 1.548 V +- 0.204 V for voltage, 2781 MHz +- 282 MHz for frequency
- about 99.7% of Venice would be within 3 standard deviations,
i.e. 1.548 V +- 0.306 V for voltage, 2781 MHz +- 423 MHz for frequency
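
Under this normal-distribution assumption, one can also estimate the fraction of chips expected to reach a given clock; a small Python sketch using the normal CDF (via math.erf):

Code:
import math

def normal_cdf(x, mean, std):
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

mean_mhz, std_mhz = 2781, 141      # Venice estimates above

# Expected fraction of Venice cores reaching at least 2.9 GHz:
p = 1 - normal_cdf(2900, mean_mhz, std_mhz)
print(f"P(overclock >= 2900 MHz) ~ {100 * p:.0f}%")   # ~ 20%

for n in (1, 2, 3):                # the 1/2/3 sigma windows listed above
    print(f"{n} sigma: {mean_mhz - n * std_mhz} - {mean_mhz + n * std_mhz} MHz")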

Don't try to overclock the CPU into the 2-to-3 standard deviation regime (about 4% of the population statistically) without paying attention to CPU stability; voltage may become too high in that regime. Stability should be considered the overriding factor, ahead of voltage and frequency.


Also in the statistics for various A64 cores, it is clear that:

- The 90 nm SOI processors can be clocked at higher frequencies while running at lower voltages, compared to that of the 130 nm SOI.

- The 90 nm SOI with DSL strained silicon further improves the frequencies (by 200+ MHz on the average) compared to 90 nm SOI (non DSL).
 
Preliminary: Added OPN for dual cores and rumored 939 E6 revision.


Desktop A64 X2 939 (90 nm SOI DSL) Toledo
4400+: ADA4400DAA6CD 1.35/1.4V (JH E6 rev, 00020F32h) Toledo, 2x 1 MB L2, 2.2 GHz, x11, 110 W
4800+: ADA4800DAA6CD 1.35/1.4V (JH E6 rev, 00020F32h) Toledo, 2x 1 MB L2, 2.4 GHz, x12, 110 W

Desktop A64 X2 939 (90 nm SOI DSL) Manchester
3800+: ADA3800DAA5BV 1.35/1.4V (BH E4 rev, 00020FB1h) Manchester, 2x 512 KB L2, 2.0 GHz, x10, 89 W
4200+: ADA4200DAA5BV 1.35/1.4V (BH E4 rev, 00020FB1h) Manchester, 2x 512 KB L2, 2.2 GHz, x11, 89 W
4600+: ADA4600DAA5BV 1.35/1.4V (BH E4 rev, 00020FB1h) Manchester, 2x 512 KB L2, 2.4 GHz, x12, 89 W

Rev E6
3000+: ADA3000DAA4BW 1.35/1.4V (DH E6 rev, 00020FF2h) Venice, 512 KB L2, 1.8 GHz, x9, 67 W
3200+: ADA3200DAA4BW 1.35/1.4V (DH E6 rev, 00020FF2h) Venice, 512 KB L2, 2.0 GHz, x10, 67 W
3500+: ADA3500DAA4BW 1.35/1.4V (DH E6 rev, 00020FF2h) Venice, 512 KB L2, 2.2 GHz, x11, 67 W
3800+: ADA3800DAA4BW 1.35/1.4V (DH E6 rev, 00020FF2h) Venice, 512 KB L2, 2.4 GHz, x12, 89 W


A64 940, 754, 939 CPU Models, OPN code, rating (post 5)

Revisions and steppings (under construction) (post 6)


http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf
 
Added information and updated.

Very preliminary dual core performance analysis

Let's do some estimation of what a dual core X2 can do to speed up the run time of some (benchmark) programs compared to an A64 single core.
The estimate is based on picking numbers from screenshots and attempting a simple and quick calculation. The results are not fully confirmed by well controlled tests, but should give a quick estimate to within say +-5%.

As seen from some results, some dual core X2s reach 2.8+ GHz on air and 3.1+ GHz on phase. So the dual core X2 performs as well as, if not better than, Venice/SanDiego for single threaded applications.

Even for gaming, an X2 dual core shows much higher fps (2x+) in Doom3 + multitasking than an A64 single core (according to the article below).

For applications that are parallelizable and dual-core ready, the dual core shines. E.g. PCMark 04 or 05 type applications benefit a lot from dual core with two caches (even w/o HT).

This article shows some PCmark 04 numbers for X2 and Pentium D:
http://www.overclockers.com.au/article.php?id=384519&P=1
It says some PCmark 04/05 tests showed the numbers dropped using dual core with HT enabled (Pentium D 840 EE, 4 virtual cores, 2 caches) compared to dual core without HT (Pentium D 840, 2 cores, 2 caches).

Pentium 4 3.73 EE 2MB L2 at 3730 MHz (w/ HT) - 6289
Pentium D 840 EE 2x1MB L2 at 3200 MHz (w/ HT) - 6186
Pentium D 840 2x1MB L2 at 3200 MHz (w/o HT) - 6702
Pentium D 840 2x1MB L2 at 3840 MHz (w/o HT) - 7887
A64 Venice 3500+ 512KB L2 at 2200 MHz - 4295
A64 X2 4400+ 2x1MB L2 at 2400 MHz - 7146

Sucka's X2 4400+ 2x1MB L2 at 2855 MHz - 8670
Sucka's X2 4400+ 2x1MB L2 at 3180 MHz - 9518
(http://www.ocforums.com/showthread.php?t=397130)


PCMark04

PCMark04 performance per clock (PPC) analysis

Pentium 4 3.73 EE 2MB L2 (w/HT), PPC = 6289 / 3730 = 1.686
Pentium D 840 EE 2x1MB L2 (w/HT), PPC = 6186 / 3200 = 1.933
Pentium D 840 2x1MB L2 (w/o HT) PPC = 6702 / 3200 = 2.094
Pentium D 840 2x1MB L2 (w/o HT), PPC = 7887 / 3840 = 2.054
Pentium D EE 2x1MB w/HT vs Pentium 4 EE 2MB w/HT: 1.933 / 1.686 = 1.147 (14.7% speedup)
Pentium D 2x1MB wo/HT vs Pentium 4 EE 2MB w/HT: 2.094 / 1.686 = 1.242 (24.2% speedup)

X2 4400+, PPC = 8670 / 2855 = 3.036
X2 4400+, PPC = 9518 / 3180 = 2.993
Venice 3500+, PPC = 4295 / 2200 = 1.952
X2 4400+ vs Venice: 3.028 / 1.952 = 1.55 (55% speedup)
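
The PPC arithmetic above is just score divided by clock; for reference, a few lines of Python reproducing it from the quoted scores:

Code:
# PCMark04 performance per clock (PPC) = score / clock (MHz)
scores = {                      # (score, MHz), as quoted above
    "P4 3.73 EE w/HT":      (6289, 3730),
    "Pentium D 840 wo/HT":  (6702, 3200),
    "Venice 3500+":         (4295, 2200),
    "X2 4400+ @ 2855":      (8670, 2855),
}
ppc = {name: score / mhz for name, (score, mhz) in scores.items()}
for name, p in ppc.items():
    print(f"{name}: PPC = {p:.3f}")

print(f"X2 vs Venice: {ppc['X2 4400+ @ 2855'] / ppc['Venice 3500+']:.2f}x")
print(f"D 840 vs P4 EE: {ppc['Pentium D 840 wo/HT'] / ppc['P4 3.73 EE w/HT']:.2f}x")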

Preliminary (to be further confirmed): the X2 dual core seems to have a (much) higher speedup than the Pentium D over the corresponding single cores: 55% for X2 vs 14.7% and 24.2% for Pentium D. A possible explanation is that the single core Pentium 4 already has multithreading (HT), so there is less left for a second core to add.

These numbers have not been adjusted for the impact of memory frequency and timing.


Sandra CPU benchmark

Sandra SR1 2005 was used; will redo the calculation with version SR2 2005, which includes the A64 X2 and the latest A64 single cores.

OS WinXP 32-bit

From Sucka's X2 numbers,
CPU X2 4400+ 2888.7 MHz (262.6 x 11)
Dhrystone 26805 MIPS => IPC_Dhrystone = 9.279
Whetstone 9153 / 11847 MFLOPS => IPC_Whetstone = 3.168 / 4.101

Compared to single core A64 (should use SanDiego core or FX-57 with 1MB L2 for better comparison)
- Winchester 3000+ 512KB L2 at 2.944 GHz
(http://www.ocforums.com/showthread.php?t=364223)
Dhrystone MIPS = 13564
Whetstone FLOPS = 4622 / 5975
IPC_Dhrystone = 13564 / 2944 = 4.61
IPC_Whetstone = 1.57 / 2.03
- Opteron 152 1MB L2 at 2.6 GHz, 11573 Dhrystone MIPS
IPC_Dhrystone = 4.45

X2 dual core speedup:
Dhrystone_speedup = 9.28 / 4.61 = 2.01 (100% speedup)
Whetstone_speedup = 3.168 / 1.57 = 2.02 (100% speedup)
Whetstone_speedup (SSE2) = 4.101 / 2.03 = 2.02 (100% speedup)


Cinebench 2003

From runs by windwithme (on A64)
http://www.ocforums.com/showpost.php?p=3819118
"Single core 2.7G 1MB is 68.9 seconds
Dual core2.7G 1MB is 36.8 seconds"

So speedup = 68.9/36.8 = 1.78 (78% speedup)

www.cinebench.com said:
CINEBENCH is the free benchmarking tool for Windows and Mac OS based on the powerful 3D software CINEMA 4D. The tool is set to deliver accurate benchmarks by testing not only a computer's raw processing speed but also all other areas that affect system performance such as OpenGL, multithreading, multiprocessors and Intel's new HT Technology.


Dual core (post 48)

How to compare processors of different architectures and frequencies
 
From overclocking results collected from Venice and SanDiego, the average frequencies for them are about

Venice: average 2750 MHz, standard deviation 140 MHz
SanDiego: average 2850 MHz, standard deviation 160 MHz

A64 Overclocking Result Collection

Since the dual core X2 Toledo is derived from SanDiego, and the X2 Manchester from Venice, for a given dual core X2 the overclocking frequency would be the smaller of the overclocking frequencies of its two cores, i.e.

fmax_dual_core = min(fmax_core1, fmax_core2)

where fmax_dual_core is the max overclocking frequency of the dual core,
fmax_core1 is the max overclocking frequency of the first core, and
fmax_core2 is that of the second core.

E.g. fmax_core1 = 2770 MHz, fmax_core2 = 2660 MHz, so
fmax_dual_core = min(2770, 2660) = 2660 MHz

So the max overclocking expectation for a dual core tends to be lower than for a single core, and should be expected to be lower.

The question is how much lower in frequency (MHz) we should expect a dual core to be, given the known statistics of the single cores from which the dual core is derived.

This is indeed an interesting question, both academically and practically. It can also be applied to quad-core, octo-core, ....


For answer see next post.
 
Dual core overclocking estimation from single core statistics

From overclocking results collected for Venice and SanDiego, the overclocking frequency and voltage distributions can be approximated by a normal distribution, with average and standard deviation of frequency about:

Venice: average 2750 MHz, standard deviation 140 MHz
SanDiego: average 2850 MHz, standard deviation 160 MHz

E.g.
- about 68.2% of Venice would be within 1 standard deviation,
i.e. 2750 MHz +- 140 MHz for frequency
- about 95.4% of Venice would be within 2 standard deviations,
i.e. 2750 MHz +- 280 MHz for frequency
- about 99.7% of Venice would be within 3 standard deviations,
i.e. 2750 MHz +- 420 MHz for frequency
Ref: A64 Overclocking Result Collection

Since the dual core X2 Toledo is derived from SanDiego, and the X2 Manchester from Venice, for a given dual core X2 the overclocking frequency would be the smaller of the overclocking frequencies of its two cores, i.e.

fmax_dual_core = min(fmax_core1, fmax_core2)

where fmax_dual_core is the max overclocking frequency of the dual core,
fmax_core1 is the max overclocking frequency of the first core, and
fmax_core2 is that of the second core.

Assume the overclocking frequency of a single core (a random variable) follows a normal distribution with a certain mean and standard deviation. Since a dual core is made up of two single cores, taking the min of the two random variables gives a composite overclocking frequency distribution that can be derived, with a new mean and standard deviation.

It can be shown through statistics that the mean for a dual core is intrinsically lower than that of the single core, and the mean for a quad core intrinsically lower than that of a dual core, and so on. The standard deviation also gets smaller as the number of cores in a chip increases.

The estimates for Manchester, Toledo, and a potential quad core (based on SanDiego) are shown in the following distributions. In the estimate, the two single cores are assumed to be statistically independent. If needed, and given some data on the statistical correlation among cores, that could be added to the derivation.

In the estimation, it is assumed that the multi-cores are derived from single cores with similar manufacturing process statistics. It is also assumed that the setup used to run the multi-cores and single cores (cooling, integrated heat spreader (IHS), memory and controller, motherboard, PSU) is the same and does not limit the CPU overclocking.
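
Here is a minimal Monte Carlo sketch of this min-of-normals estimate in Python (cores assumed independent and normal, as stated); with the single-core numbers above it reproduces the dual and quad core means and standard deviations quoted below:

Code:
import random

def multi_core_oc_stats(mean, std, n_cores, trials=200_000):
    """fmax of an n-core chip = min over its cores, with each core's fmax
    drawn independently from Normal(mean, std)."""
    samples = [min(random.gauss(mean, std) for _ in range(n_cores))
               for _ in range(trials)]
    m = sum(samples) / trials
    s = (sum((x - m) ** 2 for x in samples) / trials) ** 0.5
    return round(m), round(s)

random.seed(0)
print(multi_core_oc_stats(2750, 140, 2))   # Venice -> Manchester, ~ (2671, 116)
print(multi_core_oc_stats(2850, 160, 2))   # SanDiego -> Toledo, ~ (2759, 133)
print(multi_core_oc_stats(2850, 160, 4))   # hypothetical quad, ~ (2686, 112)

The intrinsic mean reductions quoted below (roughly 80-90 MHz per doubling of cores) fall directly out of this simulation.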


Venice and Manchester

mean_Venice = 2750 MHz
stddev_Venice = 140 MHz
mean_Manchester = 2671 MHz
stddev_Manchester = 116 MHz

So from the estimate, there is an intrinsic frequency reduction (in the mean) of 79 MHz (2.9%) going from Venice to Manchester.

dualcore_manchester_oc.JPG



SanDiego and Toledo and quad core (potential)

mean_SanDiego = 2850 MHz
stddev_SanDiego = 160 MHz
mean_Toledo = 2759 MHz
stddev_Toledo = 133 MHz
mean_quadcore = 2686 MHz
stddev_quadcore = 112 MHz

So from the estimate, there is an intrinsic frequency reduction (in the mean) of 91 MHz (3.2%) going from SanDiego to Toledo, and an intrinsic frequency reduction (in the mean) of 73 MHz (2.6%) from Toledo to quadcore (made from SanDiego assumed).

dualcore_toledo_quadcore_oc.JPG
 
Update:
Added Opteron 1xx socket 939 CPU. They are SanDiego and Toledo based cores.

Opteron A64 939 (90 nm SOI DSL)

- The 144, 146, 148, 150, 152, 165, 170, 175 socket 939 Opterons are for non-SMP.
- The 939 Opterons should be able to work with non-ECC unbuffered memory modules, ECC and buffered memory modules.
- The corresponding 244, 246, 248, 250, 252, 265, 270, 275 and 844, 846, 848, 850, 852, 865, 870, 875 socket 940 Opterons are the 2-way and 4- to 8-way SMP versions respectively.
- With coherent HT links in 2xx and 8xx, the 2xx's are validated for 2-way SMP, and the 8xx's are validated for 4-8 way SMP.

144: OSA144DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) SanDiego, 1 MB L2, 1.8 GHz, x9, 67 W
146: OSA146DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) SanDiego, 1 MB L2, 2.0 GHz, x10, 67 W
148: OSA148DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) SanDiego, 1 MB L2, 2.2 GHz, x11, 85(?) W
150: OSA150DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) SanDiego, 1 MB L2, 2.4 GHz, x12, 85(?) W
152: OSA152DAA5BN 1.35/1.4V (SH E4 rev, 00020F71h) SanDiego, 1 MB L2, 2.6 GHz, x13, 104(?) W

165: OSA165DAA6CD 1.35/1.4V (JH E6 rev, 00020F32h) Toledo (Denmark), 2x1 MB L2, 1.8 GHz, x9, 110 W
170: OSA170DAA6CD 1.35/1.4V (JH E6 rev, 00020F32h) Toledo (Denmark), 2x1 MB L2, 2.0 GHz, x10, 110 W
175: OSA175DAA6CD 1.35/1.4V (JH E6 rev, 00020F32h) Toledo (Denmark), 2x1 MB L2, 2.2 GHz, x11, 110 W


A64 940, 754, 939 CPU Models, OPN code, rating (post 5)

Understanding AMD Opteron™ Processor Model Numbers (from amd)

The Best-Value Dual-Core AMD Processor: Opteron 165 CPU Review (from xbitlabs)


Next is an interesting thread about some potentially cost-effective CPUs for single core and dual core systems.
Low End A64 bargains to be had for the adventurous!

hitechjb1 said:
OC Detective said:
...

Edit confirmation of the good news.
here is the AMD link referring to this new cpu on socket 939 and specifically how well they run on a Sun Ultra 20
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~100324,00.html
(Beer Hunter note there is NO doubling of price!)
and here is the Sun spec which confirms you can use non ECC unbuffered
http://www.sun.com/desktop/workstation/ultra20/specifications.jsp#Processor

Perhaps this is merely a function of a bios update on socket 939s to allow both ECC AND non ECC unbuffered.

...

With your above two links from AMD and SUN, and this link
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf

I would tend to agree that it is correct to expect that the Opteron 100 can work with non-ECC unbuffered memory on a 939 board (non-SMP), and that the 100 series is equivalent to an SH-E6 (00020F71h), which is the 1 MB L2 A64 SanDiego and the SanDiego-based FX.


Further, as a corollary,

AMD (Aug 02) said:
The new AMD Opteron 100 Series processors with ECC unbuffered memory support offer a compelling price/performance ratio, beginning with Model 144, priced at $125 in 1,000-unit quantities, and scaling to Model 152, priced at $799 in 1,000-unit quantities. Dual-Core AMD Opteron 100 Series processors with ECC unbuffered memory support, ranging from Model 165 at $417 to Model 175 at $530 for 1,000-unit quantities, are expected to be available within 30 days.
Ref: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~100324,00.html

I would say the dual core Opterons 165 (rated 1.8 GHz), 170 (rated 2.0 GHz) and 175 (rated 2.2 GHz) would be equivalent to the JH-E6 (00020F32h), which is basically the 2 x 1 MB L2 A64 X2 Toledo.

The lowest rated Opteron 165 (1.8 GHz), priced at $417 per 1000-unit quantities, ... having 2 x 1 MB L2, would work with non-ECC unbuffered memory on a 939 motherboard (non-SMP). It may be what many (price/performance minded) people are waiting for.

Don't know what the single-unit price will be, but that is close enough to the current 2.0 GHz 3800+ Manchester X2 with 2 x 512 KB L2, priced at around $350.
 