Re: Adjusting PC Hyperthreading for Spice Simulation



On Tue, 27 Jan 2009 18:59:11 -0800, JosephKK <quiettechblue@xxxxxxxxx>
wrote:

On Tue, 27 Jan 2009 07:20:58 -0600, krw <krw@xxxxxxxxxxxxx> wrote:

In article <852tn4h3nak1itkdiodosnn6gd24hb36vs@xxxxxxx>,
quiettechblue@xxxxxxxxx says...
On Mon, 26 Jan 2009 08:10:53 -0600, krw <krw@xxxxxxxxxxxxx> wrote:

In article <o7bqn49n08t3cor6nj2gtc29g311e4j0qe@xxxxxxx>,
quiettechblue@xxxxxxxxx says...
On Sun, 25 Jan 2009 13:29:39 -0600, krw <krw@xxxxxxxxxxxxxxxxx> wrote:

On Sun, 25 Jan 2009 10:59:31 -0800, JosephKK <quiettechblue@xxxxxxxxx>
wrote:

On Sun, 25 Jan 2009 00:06:57 +0000, Nobody <nobody@xxxxxxxxxxx> wrote:

On Sat, 24 Jan 2009 19:39:32 +0000, Nobody wrote:

In other words... you get 1 billion operations per second (or whatever).
Hyperthreaded CPUs just give the appearance of two CPUs so that if a
particular thread is waiting on, e.g., a memory read from DRAM (this
can take hundreds of cycles)

Memory access taking hundreds of cycles? Hell, not even a dozen.

It depends how fast your RAM is. At one point (I guess around 5 years
ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
speed has been consistently increasing faster than CPU speed for the
last few years.

To remove a possible source of confusion: cycle "costs" take into account
the fact that each core can execute multiple instructions concurrently
(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
delay in which a sequence of instructions totalling 100 cycles could be
executed, not 100 times the CPU clock period.
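
As a rough sketch of the arithmetic behind figures like these: a memory
latency in nanoseconds converts to CPU clock cycles by multiplying by the
clock frequency in GHz. The function names and the example numbers (100 ns,
3 GHz, 3-wide issue) are illustrative assumptions, not measurements from
any particular machine.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that elapse while one memory access is outstanding.

    One cycle lasts 1 / clock_ghz nanoseconds, so the number of cycles
    is latency_ns * clock_ghz.
    """
    return latency_ns * clock_ghz


def lost_issue_slots(latency_ns: float, clock_ghz: float, issue_width: int) -> float:
    """Upper bound on instructions a superscalar core could have issued
    during the stall, assuming (hypothetically) it could otherwise issue
    issue_width instructions every cycle.
    """
    return stall_cycles(latency_ns, clock_ghz) * issue_width


if __name__ == "__main__":
    # Assumed values: 100 ns DRAM access, 3 GHz core, 3-wide issue.
    print(stall_cycles(100, 3.0))
    print(lost_issue_slots(100, 3.0, 3))
```

On these assumed numbers a single miss to DRAM costs hundreds of cycles,
which is the window a second hardware thread can fill.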

So you have heard of pipeline bubbling. The pipelines are not that
deep, about 7 stages max due to complexity increases.

Depends on the processor. The G5 and P4 were significantly deeper
than that (more like 20 stages). The entire pipe is flushed on a
mispredicted branch or context switch. If the target isn't in the
cache it has to be reloaded from main memory.

Not so on mispredicted branches. Moreover, speculative execution of
both sides almost eliminates the issue. Also, the pipeline may have had
that much total depth, but less than 3% of instructions (and much less
than 1% of execution) need all of the stages, mostly things like pusha
and popa, which move multiple registers onto and off of the stack.

If the branch target misses the cache and a new DRAM page has to be
opened, yes it does. Branches don't do PUSHA/POPA. Memory access
is still 100x CPU clock.

I am amazed at how badly you misread this.

Perhaps. I'll look at this again later.

<snip>

The only place you get killed is on cache writeback block outs; those
do have lags of 100 ns or more before the new data can be read (but
that does not apply to instruction caches).

Huh? I'm missing your point here. Cache castouts aren't in the
performance path.


The issue is dirty cache page write back (data segments) in order to
load a new page.

Page? The LINE isn't cast out until the read of the new line is
complete (and the memory bus idle). The castout isn't in the
critical performance path. The read is.

Improbable. If the cache line/page selected to be replaced is "dirty"
it must be written back before reading the new data or the changes
will be lost.

Nope. It's written to the store queue where it can be written back to
memory at the processor's leisure.

Thus the dirty line/page overhead is present in some
fraction (usually small) of cache read attempts with misses. Balancing
allocated write bandwidth with read performance and cache miss rate is
the design issue.

Write bandwidth has nothing to do with it, except in the pathological
case where you miss on every fetch.
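
The two positions above can be compared with a toy average-access-time
model. This is only a sketch under stated assumptions: the function names
and all numbers (hit time, miss rate, read and write latencies in cycles,
dirty fraction) are hypothetical, and the buffered case assumes the store
queue never fills, which is exactly what fails in the pathological case.

```python
def amat_write_before_read(hit, miss_rate, read, write, dirty_frac):
    """Average access time if a dirty victim line must be written back
    to memory BEFORE the new line can be read."""
    miss_penalty = read + dirty_frac * write
    return hit + miss_rate * miss_penalty


def amat_store_queue(hit, miss_rate, read):
    """Average access time if the dirty victim is parked in a store queue
    and drained later, so only the read is on the critical path.
    Assumes the queue always has a free entry."""
    return hit + miss_rate * read


if __name__ == "__main__":
    # Assumed values, in CPU cycles: 2-cycle hit, 5% miss rate,
    # 200-cycle read, 200-cycle write, 25% of victims dirty.
    blocking = amat_write_before_read(2, 0.05, 200, 200, 0.25)
    buffered = amat_store_queue(2, 0.05, 200)
    print(blocking, buffered)
```

With a store queue the write latency and dirty fraction drop out of the
formula entirely, which is the sense in which write bandwidth does not
matter until the queue backs up.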