Re: Adjusting PC Hyperthreading for Spice Simulation
- From: JosephKK <quiettechblue@xxxxxxxxx>
- Date: Mon, 26 Jan 2009 20:53:07 -0800
On Mon, 26 Jan 2009 21:52:43 +0000, Nobody <nobody@xxxxxxxxxxx> wrote:
On Sun, 25 Jan 2009 22:18:19 -0800, JosephKK wrote:
It depends how fast your RAM is. At one point (I guess around 5 years
ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
speed has been consistently increasing faster than CPU speed for the
last few years.
To remove a possible source of confusion: cycle "costs" take into account
the fact that each core can execute multiple instructions concurrently
(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
delay in which a sequence of instructions totalling 100 cycles could be
executed, not 100 times the CPU clock period.
So you have heard of pipeline bubbling. The pipelines are not that
deep, about 7 stages max due to complexity increases.
Current and recent processors (about 5 years for x86, more for SPARC
and others) support speculative execution and out of order execution
to reduce this problem.
Indeed. But while that mitigates data cache misses, it doesn't do anything
for a code cache miss.
That does not seem to follow.
Especially on x86, where instructions are variable length and so can't
even be decoded if prior instructions are missing, let alone executed.
Kindly explain how you get past the previous instruction to begin
decoding the current instruction without decoding the previous
instruction.
Well, that's my point.
If the CPU tries to execute the next instruction, but the data isn't
immediately available, it can just move on to the following instructions.
For modern devices and even several older ones (Pentium, SPARC,
PA-RISC, i860, i960, S/360 and S/370, DEC VAX, DEC Alpha, and more)
you do not have a point. Out of order execution requires decoding of
each instruction to determine data dependencies. If the next
instruction (or the next several) has no data dependencies, continuing
with execution does no harm. It will not work for a 6502 or an 8080.
OTOH, if it tries to execute the next instruction, but the *instruction*
isn't available, it's stuck. It can't do anything until the instruction is
available.
If the CPU can normally execute several instructions per clock (regardless
of whether their operands are immediately available), a code cache miss
means that it has to wait for the existing transfer to complete, then
start a new transfer and wait for that to start producing instructions,
meaning that it's going to end up many instructions behind where it would
have been compared to a cache hit.
Do you think for some reason that there is only one block of
instructions mapped in the cache? After the first time through a loop
both sequences get mapped if they are not already.
Even on a RISC architecture, it makes such techniques much less efficient.
If the missing instructions include a branch which will usually be taken,
speculatively executing subsequent instructions is a waste of cycles.
What? If you speculatively execute both results of a conditional branch
the pipeline stays full and you just drop the unused results in the
bit bucket. The result: conditional branches without significant
pipeline bubbles.
Taking both branches is inefficient if one is far more likely than the
other; that's the purpose of branch prediction. E.g. with a loop, the
branch which executes the next iteration is usually far more common than
the branch which exits the loop, so you're better off just speculatively
executing the next iteration than hedging your bets.
Only partially true. More aggressive cache preloading will have both
sequences available for when the loop gets near termination.
Similarly for an instruction which uses the contents of a register which
is modified by a missing prior instruction.
Ah, you misunderstand the methods and practice of out of order
execution. See the Tomasulo algorithm, and register scoreboarding
(which requires "phantom registers").
Only works if you know what the previous instruction is.
If you have:
mov r3,[something]
add r0,r1,r2
but [something] isn't available, you can still commence the add
immediately.
OTOH, if you have:
mov r2,[something]
add r0,r1,r2
the addition can be commenced but can't proceed until r2 holds a concrete
value.
Correct as far as it goes. But what if the next instruction is div
r4,r5,r6; it can be executed because it has no unsatisfied data
dependencies. See how it works?
And if you have:
<missing instruction>
add r0,r1,r2
you can't even commence the add with an abstract value for r2, because you
have no idea where the value will eventually come from. r2 may even hold a
concrete value, but you don't know that until you've seen the missing
instruction.
This case cannot occur except at boot time when there are exactly 0
previously executed instructions.
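
The three cases argued over above (an independent load, a dependent load, and a prior instruction that has not been fetched) can be put into a toy model. This is only a sketch: the names Insn, MISSING and can_issue are made up for illustration, every earlier instruction in the window is assumed to still be in flight, and write-after-write and write-after-read hazards (which register renaming deals with) are ignored.

```python
from collections import namedtuple

# Hypothetical three-operand instruction: dst is written, srcs are read.
Insn = namedtuple("Insn", "op dst srcs")

# Stands for an instruction that has not arrived from memory yet.
MISSING = None


def can_issue(window, i):
    """Return True if window[i] could start executing right away.

    Toy assumption: every earlier instruction in the window is still
    waiting for its result, so its destination register is not ready.
    """
    earlier = window[:i]
    # An unfetched instruction might write any register, so nothing
    # behind it can be analysed for dependencies.
    if any(insn is MISSING for insn in earlier):
        return False
    in_flight = {insn.dst for insn in earlier}
    return in_flight.isdisjoint(window[i].srcs)


add = Insn("add", "r0", ("r1", "r2"))
div = Insn("div", "r4", ("r5", "r6"))

independent = [Insn("mov", "r3", ()), add]     # mov r3,[something]
dependent = [Insn("mov", "r2", ()), add, div]  # mov r2,[something]
unknown = [MISSING, add]                       # <missing instruction>

print(can_issue(independent, 1))  # add does not read r3
print(can_issue(dependent, 1))    # add reads r2, which is in flight
print(can_issue(dependent, 2))    # div reads only r5 and r6
print(can_issue(unknown, 1))      # cannot tell what the add depends on
```

In this model the add behind an independent load and the div behind a dependent load can both go ahead, while anything queued behind an unfetched instruction cannot be scheduled at all.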
Out-of-order and speculative execution avoid CPU stalls due to data
cache misses, but they either don't work or are significantly less
efficient for a code cache miss.
Au contraire. Because there are still instruction(s) in execution it
reduces the size of the pipeline bubble.
If anything, it is the reliance upon caching and instruction re-ordering
which makes code cache misses such a performance killer, as they mitigated
the problems with slow RAM to such an extent that there was little
incentive to increase RAM speed.
That is non-factual and barely coherent.
Which part of it is unclear?
Code cache misses are not the killer you pretend, because the
pipeline and TLB start the memory access well in advance of the actual
need for the code to be present. Moreover, the miss causes a burst read,
which has a better transfer rate and hugely reduces latency for the
next few instructions.
Although such techniques worked well for "classical" procedural code, they
often worked rather less well for e.g. object-oriented code making heavy
use of virtual functions, or interpreted languages where a substantial
portion of the interpreter can be required for even the simplest functions.
In terms of good instruction cache locality, interpreted languages win
hugely.
That depends heavily upon the language complexity. For a simple
language like BASIC, with few types and few primitive operations, you'll
probably get good locality.
For a complex language with many variants on the basic types, something as
simple as adding a list of numbers can end up calling a dozen different
addition functions (int+float, int+double, arbitrary-precision-int+double,
...), and it may have to go through several steps just to determine the
correct function for each value (e.g. Python will first check whether the
LHS has an __add__ method, if not then whether the RHS has an __radd__
(reverse add) method, then if either side has type-cast methods, ...).
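
A rough illustration of that lookup chain, for anyone who wants to see it run. The Metres class and the total() helper are hypothetical and exist only for this sketch; the __add__ / __radd__ / NotImplemented protocol itself is standard Python. Summing a mixed list makes the interpreter pick a different addition routine on almost every step.

```python
class Metres:
    """Hypothetical numeric wrapper, used only for illustration."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Tried first when a Metres is the left-hand operand.
        if isinstance(other, Metres):
            return Metres(self.value + other.value)
        if isinstance(other, (int, float)):
            return Metres(self.value + other)
        return NotImplemented  # let the interpreter try the other operand

    def __radd__(self, other):
        # Tried when the left-hand operand's __add__ gave up.
        if isinstance(other, (int, float)):
            return Metres(other + self.value)
        return NotImplemented


def total(values):
    # Every "+" here is resolved at run time from the operand types.
    result = 0
    for v in values:
        result = result + v
    return result


mixed = [1, 2.5, Metres(3), 4, Metres(0.5)]
print(total(mixed).value)
```

The five additions go through int+int, int+float, float+Metres (which falls back to Metres.__radd__), Metres+int and Metres+Metres, so a five-element sum touches several distinct code paths.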
Now you are grasping at hypothetical straws.
They also tend to win on data locality as well.
Interpreted languages are more likely to have values dynamically allocated
and referenced through pointers. E.g. if the value is a 3-tuple, you get a
pointer to the tuple which holds 3 pointers to the individual values,
which themselves may contain additional levels of indirection, and the
various pointers are pointing all over the heap.
Compared to C, where most values are either stored at a small offset from
the frame pointer, or are one level of indirection away (i.e. a struct
pointer stored at a small offset from the frame pointer).
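
To make the layout difference concrete, here is a small sketch that puts a C-style struct (built with the standard ctypes module) next to a Python 3-tuple of floats. Vec3 and boxed_size are names invented for this example, and the byte counts for the tuple are CPython-specific, so treat the numbers as indicative only.

```python
import ctypes
import sys


class Vec3(ctypes.Structure):
    # Laid out like the C struct { double x, y, z; }: three values
    # stored back to back in one block of memory.
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double)]


def boxed_size(t):
    # The tuple holds only references; each element is a separate heap
    # object with its own header, possibly far away from the tuple.
    return sys.getsizeof(t) + sum(sys.getsizeof(x) for x in t)


point = (1.5, 2.5, 3.5)
print("struct bytes:", ctypes.sizeof(Vec3))
print("field offsets:", Vec3.x.offset, Vec3.y.offset, Vec3.z.offset)
print("tuple plus boxed floats, bytes:", boxed_size(point))
```

The struct's fields sit at fixed small offsets from a single base address, which is the C case described above; the tuple needs one pointer hop per element before any value can be read.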
Think about
how they really work. I haven't seen many interpreted OO languages
for some reason, maybe there the virtual functions do cause problems.
JavaScript, Python and C# are all interpreted languages with a strong OO
bias.
I can't speak to implementation issues for JavaScript or C#, but Python
is extremely dynamic.
Everything is an object. Retrieving a member value from an object involves
first checking whether the object has __getattr__ or __getattribute__
methods; if it does, the method is called with the name of the field to
retrieve the value. This is also done for methods, which are just members
which happen to be functions.
None of this can be done a priori due to dynamic typing. Functions don't
require their argument values to belong to a specific class, just that
they contain the members which the function uses. E.g. a function which
expects a file argument might only care that the object has a read()
method (which has to be retrieved by name each time).
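
A minimal sketch of the file example, assuming a hypothetical Traced wrapper and slurp() function (neither comes from any library). The wrapper records each attribute that is fetched by name, so you can see that read is looked up again on every pass through the loop. Note that __getattr__ is only called when the normal lookup fails, which is why the wrapper's own attributes do not show up in the log.

```python
import io


class Traced:
    """Hypothetical wrapper that logs attributes fetched by name."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.lookups = []

    def __getattr__(self, name):
        # Reached only for names not found on the wrapper itself;
        # forward them to the wrapped object and record the lookup.
        self.lookups.append(name)
        return getattr(self._wrapped, name)


def slurp(f, chunk=4):
    # Accepts anything with a read(n) method; no particular class needed.
    parts = []
    while True:
        data = f.read(chunk)  # "read" is resolved by name each time
        if not data:
            break
        parts.append(data)
    return "".join(parts)


src = Traced(io.StringIO("hello world"))
print(slurp(src))
print(src.lookups)
```

Nothing in slurp() says its argument is a file. The only requirement is that a lookup of "read" succeeds at the moment of the call, which is the reason the interpreter cannot resolve it ahead of time.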
From what I know of JavaScript, it isn't much different. Its primitive
operations are more primitive, but being prototype-based rather than
class-based means that you still can't optimise based upon the expected
type of an object, as the code has to work with any object providing the
correct interface, with no knowledge of its underlying implementation.
You are not making your case here. Go more for deep down details to
support your case. Without the deeper facts you are not all that
credible. Try to emulate Larry Wahl.