Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple
From: John S. Dyson (toor_at_iquest.net)
Date: 07/19/04
- Next message: Vince Bafetti: "Re: Carbon microphone adapted to condenser mic input?"
- Previous message: John Larkin: "Re: Photodiode wich is fast enough to detect +50Mhz analog (sinus) signal??"
- In reply to: John Larkin: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Next in thread: John Jardine: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Reply: John Jardine: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Messages sorted by: [ date ] [ thread ]
Date: Mon, 19 Jul 2004 20:53:54 +0000 (UTC)
In article <a69of056iepu0uiqrfad2bsnvr9nm1npj9@4ax.com>,
John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> writes:
> On Mon, 19 Jul 2004 19:31:06 +0100, "John Jardine"
> <john@jjdesigns.fsnet.co.uk> wrote:
>
>>
>>John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> wrote in message
>>news:oqvnf0lhisia9kkajpaptq103d3283025j@4ax.com...
>>> On Mon, 19 Jul 2004 17:24:34 +0100, "John Jardine"
>>> <john@jjdesigns.fsnet.co.uk> wrote:
>>> >rate.
>>> PowerBasic returns the good-old-days of clean, fast programming and
>>> direct hardware access. I've run useful megahertz-rate FOR loops on a
>>> PC using PB. Direct hardware access is allowed under Win9x, and the
>>> Totalio utility opens up all the i/o ports under 2K/XP. We even have a
>>> utility that locates PCI cards and drags them down into the real
>>> memory space where PB can flog them.
>>>
>>> I wrote a nice little parallel-port serial-DAC bit-banger in PB, and
>>> my customer felt the need to rewrite it in VB.NET. It went from 2
>>> files (.bas, .exe) to about 25, ran about 0.05 as fast, needed some
>>> extra dll's to access the hardware, and never quite worked right.
>>>
>>> The PowerBasic people are weird, though, but not as weird as
>>> Microsoft.
>>
>>(I've got a screen 12, 'vectored text' programme somewhere on their site)
>>>
>>> John
>>
>>Customers?, pah!, who'd have 'em :-) I genuinely try to encourage 'em to
>>make small wanted changes to the kit. Y'know, take a bit of control for
>>themselves, save themselves a bit of cash, even shoehorn me out of the loop.
>>Doesn't happen. They'll religiously play the game. Have come to realise that
>>(quite sensibly) having paid for a specialist supplier they've no interest
>>in doing any of it for themselves.
>>The old '486 I use for hardware test was used to run an impedance analyser
>>design. DUT Phase shifts measured by a PB machine code loop counter. The
>>loops were running at an amazing 10MHz! (Yet again final precision massively
>>limited by the ISA bus)
>>Wonder what the new PCs are capable of running at?.
>>regards
>>john
>>
>
> My klunky old 700 MHz Dell runs an empty long-integer FOR loop in 35
> ns, about 28 MHz. Adding a long-integer increment inside the loop adds
> only 3 ns, which may be the consequence of CPU pipelining or
> something. A function/sub call adds about 150 ns of overhead, which is
> why I prefer GOTO programming.
>
> ISAbus INP takes 1.4 usec!
>
To give you a coherent idea of the speed of the 3328MHz Pentium 4 (other
processors of that ilk, like the Athlon, are similar), here are some
speed measurements (there is false precision in the numbers):

A simple (empty) loop of 20 iterations, executed 500 million times,
takes about 7.5 seconds. (10 billion iterations.)

A simple loop that adds 20 integers, executed 500 million times,
takes about 10.6 seconds. (10 billion iterations.)

A simple loop that adds two groups of 20 integers, executed 500 million
times, takes about 13.4 seconds. (20 billion additions.)

A partially unrolled loop (two groups of 5 integers long), adding
the same numbers as the two groups of 20 integers, executed 500 million
times, takes about 7 seconds.
(About 2.8 billion int additions per second.)

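
For anyone who wants to try something similar, here is a minimal sketch in C of what loops like these might look like. It is not the code behind the numbers in this post: the function names, the layout of the 5-way unroll, and the clock()-based timing helper are all assumptions made for illustration.

```c
#include <assert.h>
#include <time.h>

#define GROUP 20  /* elements per group, as in the post */

/* Simple loop: one addition per iteration. */
int sum_one_group(const int *a)
{
    int sum = 0;
    for (int i = 0; i < GROUP; i++)
        sum += a[i];
    return sum;
}

/* Simple loop: two additions per iteration, one from each group. */
int sum_two_groups(const int *a, const int *b)
{
    int sum = 0;
    for (int i = 0; i < GROUP; i++) {
        sum += a[i];
        sum += b[i];
    }
    return sum;
}

/* Partially unrolled (assumed layout): each pass adds 5 elements from
   each group, so the loop branch is taken 4 times instead of 20. */
int sum_two_groups_unrolled(const int *a, const int *b)
{
    int sum = 0;
    for (int i = 0; i < GROUP; i += 5) {
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3] + a[i + 4];
        sum += b[i] + b[i + 1] + b[i + 2] + b[i + 3] + b[i + 4];
    }
    return sum;
}

/* CPU seconds spent on `reps` calls of the two-group kernel.
   The volatile sink forces one store per repetition; an optimizer may
   still hoist the invariant sum, so check the generated code. */
double time_two_groups(const int *a, const int *b, long reps, int unrolled)
{
    assert(reps >= 0);
    volatile unsigned sink = 0;
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        sink += (unsigned)(unrolled ? sum_two_groups_unrolled(a, b)
                                    : sum_two_groups(a, b));
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_two_groups(a, b, 500000000L, 0) and then again with the last argument set to 1 gives two run times to compare. The compiler may unroll or hoist these loops on its own, so it is worth looking at the generated assembly before trusting any numbers.
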
So, the simple empty loop takes about 0.75 nanoseconds per iteration.
Each iteration of the loop with 1 addition takes about 1.06 nanoseconds.
(About 1 billion int additions per second.)
Each iteration of the loop with 2 additions takes about 1.34 nanoseconds.
(About 1.4 billion int additions per second.)

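
The per-iteration figures are just the total run time divided by the iteration count. A small sketch of that arithmetic in C follows; the two helper names are made up for this example.

```c
#include <assert.h>

/* Nanoseconds per loop iteration, from a total run time in seconds. */
double ns_per_iteration(double total_seconds, double iterations)
{
    assert(iterations > 0.0);
    return total_seconds / iterations * 1e9;
}

/* Billions of additions per second, from a total run time in seconds. */
double gadds_per_second(double total_seconds, double additions)
{
    assert(total_seconds > 0.0);
    return additions / total_seconds / 1e9;
}
```

With the run times quoted above, ns_per_iteration(7.5, 10e9) comes to 0.75, ns_per_iteration(10.6, 10e9) to 1.06, and ns_per_iteration(13.4, 10e9) to 1.34. For the unrolled case, gadds_per_second(7.0, 20e9) comes to roughly 2.86.
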
The additional overhead of adding one more array element (the
time difference between the loop with one addition and the loop with
two additions) is about 0.28 nanoseconds (1.34 - 1.06). The additional
addition consists of a register indirect memory reference and an update
(increment) of the register containing the address.
So, the peak addition rate from a cache resident array of integers is
about 3.5 billion integer additions per second. This INCLUDES the update of
the memory address (the leal 4(%esi),%esi instruction.) That rate is a
maximum; realistically, you get about 1.4 billion additions per second
in a non-unrolled loop (including loop overhead.)
Unrolling can make a very significant improvement in performance.

For floating point, I seem to get about 800 million 32-bit SSE floating
point adds per second for the non-unrolled loop, and 833 million per
second for the unrolled loop. (I didn't bother carefully optimizing these.)

In general, the new P4 type processors can dispatch anywhere from 4
instructions per clock cycle down to one instruction every two clock
cycles (for normal, non multiply/divide instructions.) You'll normally
see that for NORMAL instruction streams, where the data/program is cache
resident, 1 instruction per clock cycle is plausible. With lots of
floating point, lots of arrays that aren't resident in cache, or lots of
multiplies and divides, the instruction rate is much slower.

Amazingly, the P4 can peak at 4 add/sub instructions every clock cycle (at
3200MHz), but cannot sustain that for long. (The instruction stream
is highly pipelined, so that an instruction that is executed at a rate
of 1 per clock cycle might not make its result available for 5-10
clock cycles. ANY upset in the branch prediction (or an interrupt)
will tend to greatly damage the performance.)

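
One common way to see the gap between issue rate and result latency is to compare a single dependent chain of adds with several independent chains. The sketch below, in C, is an illustration only and was not measured for this post; the function names and the choice of four accumulators are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add needs the previous result, so the loop
   is paced by the latency of one add rather than by the issue rate. */
float sum_dependent(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: the four adds in each pass do not depend on one
   another, so a pipelined CPU can overlap them.
   n must be a multiple of 4. */
float sum_independent(const float *x, size_t n)
{
    assert(n % 4 == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions add the same values, but in a different order, so floating point results can differ in the last bits for general input. How much the second form gains depends on the compiler and the CPU.
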
The P4 makes a fairly fast DSP machine!!! I have an application
that does 448168 512-point FFTs in 10 seconds... It does some other
things than just the FFTs, but 22 usec for a 512-point floating point
FFT seems pretty reasonable (given the fact that the program does
lots of other things also.)
John