Re: Adjusting PC Hyperthreading for Spice Simulation



On Sun, 01 Feb 2009 22:20:08 -0800, JosephKK wrote:

Let's see, even about a 10 to 1 clock speed difference cannot translate
into over a 100 to 1 time cost.

Clock speed alone tells you nothing. How many clocks is the worst-case
latency, assuming an existing burst is in progress on a different row?


If you want to include clocks to complete things, the typically higher
CPU clocks per instruction (about 7 to 12 for 90% of the instruction stream
on x86, ignoring pipelining) compared to clocks per memory access
(typically 3 to 5 without burst, 5 to 11 with a burst in progress) still
come out against you.

If so, worst case would be 11 memory clocks, with 10 CPU clocks per
memory clock, 3 instructions (or 3 cycles' worth of instructions) per CPU
cycle = 330 cycles.
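
To spell out that arithmetic, here is a rough sketch in Python. The function
and parameter names are made up for illustration; the inputs are just the
figures quoted above (11 memory clocks, 10 CPU clocks per memory clock, up
to 3 instructions per CPU clock), and real hardware will not stall this
neatly.

```python
def worst_case_stall(mem_clocks, cpu_per_mem_clock, issue_width):
    """Return (CPU cycles stalled, instruction slots lost) for one access.

    All three inputs are assumptions taken from the discussion, not
    measurements.
    """
    cpu_cycles = mem_clocks * cpu_per_mem_clock   # memory clocks -> CPU clocks
    lost_slots = cpu_cycles * issue_width         # superscalar issue slots idle
    return cpu_cycles, lost_slots


cycles, slots = worst_case_stall(mem_clocks=11, cpu_per_mem_clock=10, issue_width=3)
print(cycles, slots)  # 110 330
```

So the 330 figure is 110 stalled CPU cycles, each of which could have
retired up to 3 instructions.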

Where in Finnegans fictional fantasies did you get this weird arithmetic?
Where did the 3 instructions come from?

It's called "superscalar"; I thought that you understood this concept.
PentiumPro and later can execute multiple instructions concurrently,
commencing and completing up to 3 instructions per clock cycle.

Where do you get 10 CPU clocks per memory clock?

1100MHz CPU with 100MHz FSB, 1400MHz CPU with 133MHz FSB. Odd that you
didn't take issue with it a few messages back.

Neither one is supported by the facts. Look again at
the references:

This will set up some time line referents to work with:
http://www.dewassoc.com/performance/memory/how_to_id_pc133.htm

Taking 1999 as a useful base year, let's look at processors:
http://www.pdfdownload.org/pdf2html/pdf2html.php?url=
http%3A%2F%2Fwww.connellybarnes.com%2Fdocuments%2Fcpu_speed.pdf&images=yes

The clock ratios you claim just are not there.

Nor is anything else there, AFAICT. The first one doesn't appear to
mention CPU speeds. The second one just has a banner and "no file", but
using the original (non-pdfdownload.org) URL gives a PDF which charts CPU
speed against year, with no mention of FSB speeds.

Here are some actual references:

http://processorfinder.intel.com/details.aspx?sSpec=SL5XL

CPU Speed: 1.40 GHz
Bus Speed: 133 MHz
Bus/Core Ratio: 10.5

http://processorfinder.intel.com/details.aspx?sSpec=SL4BR

CPU Speed: 1 GHz
Bus Speed: 100 MHz
Bus/Core Ratio: 10
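
As a sanity check, the ratios fall straight out of the nominal clock
figures. A minimal sketch (the helper name is hypothetical, and the bus
speeds are the rounded nominal values from the listings, so the first
result only matches after rounding):

```python
def core_to_bus_ratio(cpu_mhz, bus_mhz):
    """CPU core clocks per front-side-bus clock, from nominal MHz figures."""
    return cpu_mhz / bus_mhz


# (sSpec, nominal CPU MHz, nominal bus MHz) as quoted in the listings above
parts = [("SL5XL", 1400, 133), ("SL4BR", 1000, 100)]

for spec, cpu_mhz, bus_mhz in parts:
    print(spec, round(core_to_bus_ratio(cpu_mhz, bus_mhz), 1))
```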

Or is there some reason why that cannot happen? Remember, we're talking
worst case, not average case (average case is a cache hit). And
worst-case isn't always some obscure theoretical concept. It's not hard
to write code which is memory-bound (so there will usually be a burst in
progress) and has poor cache coherence (so cache misses are common), and
an instruction fetch will typically be for a different row than a data
fetch.

While it is possible to write pathological code in assembler, higher level
languages will generally prevent it. It may be possible to brute force
"C" in this way, but it will be readily recognizable as pathological.

That's not even remotely true. Any code which performs simple calculations
on large amounts of data is inherently memory bound (i.e. there is
always an outstanding transfer).

The most obvious case of code with poor cache coherence is OO code where
an abstract base class has many subclasses.

For a concrete example, a 3D game engine will typically have abstract
"brush" and "actor" classes, the first representing immutable
terrain (walls, floors), the second representing dynamic entities
(enemies, weapons, ordnance, other mutable objects, ...).

Updating the game state involves iterating over a set of actors, but the
code executed for each one depends upon the final class (updating a zombie
is quite different from updating a bullet). You can realistically end up
calling over 100 distinct update methods for a single frame.
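
A minimal sketch of that update loop, in Python only for brevity. The
class names (Actor, Zombie, Bullet) and the speeds are invented for
illustration, and Python itself will not show the cache effect. The point
is the shape: which update() body runs is only known per object, so in a
compiled engine the instruction fetch hops between many unrelated methods.

```python
from abc import ABC, abstractmethod


class Actor(ABC):
    """Abstract base class; each subclass supplies its own update code."""

    @abstractmethod
    def update(self, dt):
        ...


class Zombie(Actor):
    def __init__(self):
        self.x = 0.0

    def update(self, dt):
        self.x += 1.0 * dt          # hypothetical shambling speed


class Bullet(Actor):
    def __init__(self):
        self.x = 0.0
        self.alive = True

    def update(self, dt):
        self.x += 300.0 * dt        # hypothetical muzzle velocity
        if self.x > 100.0:          # hypothetical maximum range
            self.alive = False


def update_all(actors, dt):
    """Update every actor; return how many distinct update methods ran."""
    called = set()
    for actor in actors:
        actor.update(dt)            # dispatch depends on the final class
        called.add(type(actor).update)
    return len(called)
```

With two subclasses this reports 2 distinct methods; the case described
above is the same loop with 100 or more.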

Rendering is similar, although there are fewer distinct methods (the
number is continually increasing with the use of specialised shaders,
procedural textures etc.) and more data (you have to render both terrain
and actors, whereas terrain doesn't need updating).
