Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple

From: John S. Dyson (toor_at_iquest.net)
Date: 07/21/04


Date: Wed, 21 Jul 2004 23:43:48 +0000 (UTC)

In article <cdmop4$set$1@newsg4.svr.pol.co.uk>,
        "John Jardine" <john@jjdesigns.fsnet.co.uk> writes:
>
> John S. Dyson <toor@iquest.net> wrote in message
> news:cdhcd1$29kr$1@news.iquest.net...
>> Giving you a coherent idea of the speed of the 3328MHz Pentium4 (other
>> processors of that ilk, like the Athlon are similar), here are some
>> speed measures (there is false precision in the numbers):
>>
>> A simple (empty) loop of 20 iterations, executed 500million times
>> takes about 7.5seconds. (10 billion iterations.)
>>
>> A simple loop that adds 20 integers, executed 500million times
>> takes about 10.6seconds. (10 billion iterations)
>>
>> A simple loop that adds two groups of 20 integers, executed 500million
>> times takes about 13.4seconds. (20 billion additions)
>>
>> A partially unrolled loop (two groups of 5 integers long), adding
>> the same numbers as the two groups of 20 integers, executed 500million
>> times takes about 7seconds.
>> (About 2.8 billion int additions per second.)
>>
>> So, the simple empty loop takes about .75 nanoseconds per iteration.
>> Each iteration of the loop with 1 addition takes about 1.06 nanoseconds.
>> (About 1 billion int additions per second.)
>>
>> Each iteration of the loop with 2 additions takes about 1.34 nanoseconds.
>> (about 1.4 billion int additions per second.)
>>
>> The additional overhead of adding one more array element (the
>> time difference between the loop with one addition and the loop with
>> two additions) is .34nanoseconds. The additional addition consists
>> of a register indirect memory reference and an update (increment)
>> of the register containing the address.
>>
>> So, the addition rate from a cache resident array of integers is
>> about 3 billion integer additions per second. This INCLUDES the update of
>> the memory address (the leal 4(%esi),%esi instruction.) This addition
>> rate is the maximum; realistically, about 1.4 billion additions per
>> second appears achievable in a non-unrolled loop (including loop overhead.)
>>
>> Unrolling can make a very significant improvement in performance.
>>
>> For floating point, I seem to get 800 million/sec 32 bit SSE floating
>> point adds for the non-unrolled loop, and 833 million/sec for unrolled
>> SSE adds per second. (I didn't bother carefully optimizing these.)
>>
>> In general, the new P4 type processors can dispatch anywhere from 4
>> instructions per clock cycle down to one instruction every two clock
>> cycles (for normal, non multiply/divide instructions.) For NORMAL
>> instruction streams, where the data/program is cache resident, 1
>> instruction per clock cycle is plausible. With lots of floating
>> point, lots of arrays that aren't resident in cache, or lots of
>> multiplies and divides, the instruction rate is much slower.
>>
>> Amazingly, the P4 can peak at 4 add/sub instructions every clock cycle (at
>> 3200MHz), but cannot sustain that for long. The instruction stream
>> is highly pipelined, so that an instruction that is executed at a rate
>> of 1 per clock cycle might not make the result available for 5-10
>> clock cycles. ANY upset in the branch prediction (or interrupt)
>> will tend to greatly damage the performance.
>>
>> The P4 makes a fairly fast DSP machine!!! I have an application
>> that does 448168 512 point FFTs in 10 seconds... It does some other
>> things than just the FFTs, but 22usec for 512 point floating point
>> FFT seems pretty reasonable (given the fact that the program does
>> lots of other things also.)
>>
>> John
>
> That's bloody staggering!
>
> If only they'd stretch (just a teensy weensy couple of inches) a couple of
> those address and data tracks out to a connector on the PC case, I'd be as
> happy as a pig in s**t.
> regards
>
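The per-iteration figures in the quoted benchmark follow from the
totals by simple division. Here is a minimal sketch of that arithmetic
in Python; the variable and function names are mine, and the only
inputs are the run times, loop counts and clock speed quoted above.
The quoted totals give roughly 0.28 ns for the extra addition and
about 1.49 billion additions per second for the two-addition loop,
slightly different from the .34 ns and 1.4 billion stated, presumably
part of the false precision already mentioned.

```python
# Figures quoted in the post: each test runs a 20-step loop 500 million times.
RUNS = 500_000_000
STEPS = 20
CLOCK_HZ = 3328e6  # 3328MHz Pentium4, as quoted

iterations = RUNS * STEPS  # 10 billion loop iterations per test


def ns_per_iteration(total_seconds):
    """Average time of one loop iteration, in nanoseconds."""
    return total_seconds / iterations * 1e9


empty_ns = ns_per_iteration(7.5)     # empty loop
one_add_ns = ns_per_iteration(10.6)  # one addition per iteration
two_add_ns = ns_per_iteration(13.4)  # two additions per iteration

# Additions per second, loop overhead included.
one_add_rate = 1e9 / one_add_ns
two_add_rate = 2e9 / two_add_ns
unrolled_rate = 2 * iterations / 7.0  # 20 billion additions in about 7 s

# Marginal cost of the second addition, from the quoted totals.
extra_add_ns = two_add_ns - one_add_ns

# Clock cycles spent on one empty loop iteration.
empty_cycles = empty_ns * 1e-9 * CLOCK_HZ

# Time per 512 point FFT: 448168 transforms in 10 seconds.
fft_usec = 10.0 / 448168 * 1e6

print(f"empty loop:    {empty_ns:.2f} ns ({empty_cycles:.2f} cycles)")
print(f"one addition:  {one_add_ns:.2f} ns, {one_add_rate / 1e9:.2f} G adds/s")
print(f"two additions: {two_add_ns:.2f} ns, {two_add_rate / 1e9:.2f} G adds/s")
print(f"unrolled:      {unrolled_rate / 1e9:.2f} G adds/s")
print(f"extra add:     {extra_add_ns:.2f} ns")
print(f"512 point FFT: {fft_usec:.1f} usec")
```
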
I'd like for there to be some good A/D interfaces that don't cost
an arm and a leg. For audio, I have a USB I/O device, and that
works okay for that purpose. However, it would be neat to have a
LOW COST A/D running at 4msps or better, at 12 bits or better (maybe
14 bits), to do receiver prototypes. (Of course, faster would be
better, like
20msps, but it seems like 4msps would be a good cost tradeoff, and
normal audio-type receivers would be very practical with plenty of
room.) To avoid high performance devices, I would be happy with
a 455kHz (or thereabouts) IF. For some applications, that would
be overkill, and perhaps a waste of CPU, so even a 50kHz IF could
be useful. For self-directed learning, I'd like to be able to
develop prototypes on the convenient PC platform (using a free
Unix), and avoid the use of a (usually slower) DSP.
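
As a rough sanity check on those rates, here is a small Python sketch
of the budget involved. The 10kHz signal bandwidth and the storage of
each 12 bit sample in 2 bytes are assumptions of mine for
illustration, not figures from the thread; the clock speed is the
3328MHz quoted earlier, and the function name is hypothetical.

```python
CLOCK_HZ = 3328e6   # CPU clock quoted earlier in the thread
SAMPLE_RATE = 4e6   # the 4msps converter discussed above
BITS = 12           # resolution discussed above
BANDWIDTH_HZ = 10e3  # ASSUMPTION: signal bandwidth around the IF


def min_sample_rate(if_hz, bandwidth_hz):
    """Lowest sample rate that keeps the whole IF passband below
    half the sample rate (plain, non-undersampled conversion)."""
    return 2 * (if_hz + bandwidth_hz / 2)


# CPU cycles available per incoming sample at 4msps.
cycles_per_sample = CLOCK_HZ / SAMPLE_RATE

# ASSUMPTION: each sample is stored in whole bytes (12 bits -> 2 bytes).
bytes_per_sample = (BITS + 7) // 8
bytes_per_second = SAMPLE_RATE * bytes_per_sample

for if_hz in (455e3, 50e3):
    needed = min_sample_rate(if_hz, BANDWIDTH_HZ)
    print(f"IF {if_hz / 1e3:.0f}kHz: needs at least {needed / 1e3:.0f}ksps, "
          f"4msps is {SAMPLE_RATE / needed:.1f}x that")

print(f"cycles per sample at 4msps: {cycles_per_sample:.0f}")
print(f"input data rate: {bytes_per_second / 1e6:.0f} MB/s")
```

Under those assumptions a 455kHz IF fits comfortably inside a 4msps
converter, and a 50kHz IF would leave far more CPU time per useful
sample if the converter were run at a correspondingly lower rate.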

It is still amazing that the current top of the line PCs can
have a CPU speed that is 1000X or even more that of a VAX11/780.
Under a CPU simulator, a simulated DEC10 runs much faster than
the highest performance version of the original hardware.

John


