Re: Larkin, Power BASIC cannot be THAT good:



On Thu, 28 May 2009 15:46:52 +0100, Martin Brown
<|||newspam|||@nezumi.demon.co.uk> wrote:

John Larkin wrote:
On Fri, 22 May 2009 11:52:40 +0100, Martin Brown
<|||newspam|||@nezumi.demon.co.uk> wrote:

John Larkin wrote:
On Thu, 21 May 2009 16:15:19 +0100, Martin Brown
<|||newspam|||@nezumi.demon.co.uk> wrote:

John Larkin wrote:
On Thu, 21 May 2009 09:03:56 +0100, Martin Brown
<|||newspam|||@nezumi.demon.co.uk> wrote:

Although it might waste cache space having matched size operands in
separate arrays might still be faster - SIMD vector instructions are
better for that case. SSE 128bit registers can do 4 parallel 32 bit adds
moving 16 bytes at a time. You would need to test it.
It's interesting that people seem to be converging to around 0.22
seconds on various machines. I guess we all buy the same DRAM chips.
It is dominated by memory speed under sustained sequential access.
Has anyone tried it on a box with DDR3 ram?

My down-count loop, in Basic, is hitting about 0.207 or so.
It would be interesting to see the code it has generated. I see only a
tiny difference at all between counting up and counting down. There is a
slight gain in doing the opposite of your initialisation code since then
the cache will be preloaded with useful data first time around.

The difference is that a branch-not-zero is faster than an immediate
compare.

A typical spurious Larkin just so story. Arguing from ignorance.

Branch prediction and speculative execution on these CPUs is so good
that they only mispredict on the final termination before loop exit.

In the stone age that was true. But on the latest CPUs with
sophisticated branch prediction, out of order and speculative execution
the time to compute the immediate compare is invisible. This is
particularly true in this case where the loop doesn't contain enough
other computational work to absorb the memory access time.

Nice in theory, wrong in practise. Try it.

I did. Theory and practice agree just fine here. The differences between
CPUs though even in the Core 2 family is startling even on the
relatively small sample I have immediately to hand.

There is absolutely no measurable difference between a count up or count
down assembler loop on a Core2 Quad, Core 2 Duo or P4. There is enough
slack with the faster CPUs to hide 3 immediate operand instructions
without altering the loop timing at all (just detectable on a P4 3GHz).

Fastest on the Core 2 Q6600 was the modern array style indexing
mov eax, [edi+4*ecx] etc at 2.159 +/- 0.008
This is typical of modern compiler output for x86.

Runner up was the cache aware algorithm at 2.190 +/- 0.007
Obvious simple pointer based loop gave 2.233 +/- 0.007
Cutting out word aligned 16 bit reads gave 2.215 +/- 0.008
Loop unrolling slowed it down to 2.250
SIMD was slowest of all at 2.288 (very disappointing)

The big surprise was the portable Mobile Core 2 Duo T5200 @ 1.6GHz
SIMD was fastest at 2.356 +/- 0.020
Array style indexing next at 2.420 +/- 0.010
Pointer loop using addition 2.450 +/- 0.012
Pointer loop using LOOP 3.225 +/- 0.026 !!!!

*Big* surprise using LOOP instruction on this older Core2 CPU slowed
things down by a massive 0.8s. I repeated it several times as I didn't
believe it at first. But it is a solid result.

On older and weaker CPUs the SIMD and cache aware code did better.

The only way to speed it up is to either vectorise it with SIMD SSE
instructions or add judicious prefetch cache hints to the loop code.
Branch target alignment to 32 bytes might also help.

Most of these tricks have limited utility on the latest quad cores, but
they do help on earlier P4 models.
Any time you programmers need help understanding how computers work,
feel free to ask an engineer.
You have shot yourself in the foot with this cheap jibe. It is now very
clear for all to see that you do not have the first clue about how
things execute on modern Intel CPUs. BASIC hacker mentality.

Try it. I went from 0.225 seconds to 0.207 by counting down.

You have no idea what you are talking about. It is some coincidental
change in the generated code (possibly use of LOOP instruction to count
down).

If you knew how to examine the generated code you would have done so by
now. But instead you keep harping on about the magical properties of
Power Basic.

I grant you its code generator must be fairly good - but the timing
pattern on the latest CPUs is extremely flat. In other words any half
way reasonable loop construct executes much faster than the memory
subsystem can supply the data to work on. Evidence is that it is the
write back to main memory that is causing all the problems.

The code generated is all that matters going up or down memory makes no
difference at all. There is more than enough slack in the loop timing.
It is utterly dominated by memory access delays.

Regards,
Martin Brown

Thank you for taking the time to do the tests that prove what has
already been discussed.
.



Relevant Pages

  • Re: Larkin, Power BASIC cannot be THAT good:
    ... separate arrays might still be faster - SIMD vector instructions are better for that case. ... Branch prediction and speculative execution on these CPUs is so good that they only mispredict on the final termination before loop exit. ... The differences between CPUs though even in the Core 2 family is startling even on the relatively small sample I have immediately to hand. ... There is enough slack with the faster CPUs to hide 3 immediate operand instructions without altering the loop timing at all. ...
    (sci.electronics.design)
  • Re: Larkin, Power BASIC cannot be THAT good:
    ... separate arrays might still be faster - SIMD vector instructions are ... It is dominated by memory speed under sustained sequential access. ... Branch prediction and speculative execution on these CPUs is so good ... that they only mispredict on the final termination before loop exit. ...
    (sci.electronics.design)
  • Re: Implement memcmp function with help from gccs inline asm
    ... >> number of iterations by duplicating the loop body. ... > due to how the C compiler interprets unsigned vs signed integers. ... test ecx, ecx ... My version of this code has 2 fewer instructions in the loop body. ...
    (comp.lang.asm.x86)
  • Re: LOOP - Why so slow?
    ... of complicated ones with multiple side effects, so "loop" wasn't ... loop instructions that branch forward). ... This is no big deal for a compiler. ... BS that Intel made up. ...
    (comp.lang.asm.x86)
  • Re: regression introduced by - timers: fix itimer/many thread hang
    ... Patches exist that implement a dynamically growable percpu pool ... doing a for_each_possible_cpuloop on every tick on all cpus isn't ... While I agree that the linear loop is sub-optimal, ... sane number of threads, but on a very large machine (1024 CPUs ...
    (Linux-Kernel)