Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple
From: John Jardine (john_at_jjdesigns.fsnet.co.uk)
Date: 07/21/04
- Next message: Ken Smith: "Re: Fifty-six Deceits in Fahrenheit 911"
- Previous message: Ken Smith: "Re: Fifty-six Deceits in Fahrenheit 911"
- In reply to: John S. Dyson: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Next in thread: John S. Dyson: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Reply: John S. Dyson: "Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple"
- Messages sorted by: [ date ] [ thread ]
Date: Wed, 21 Jul 2004 23:09:01 +0100
John S. Dyson <toor@iquest.net> wrote in message
news:cdhcd1$29kr$1@news.iquest.net...
> In article <a69of056iepu0uiqrfad2bsnvr9nm1npj9@4ax.com>,
> John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> writes:
> > On Mon, 19 Jul 2004 19:31:06 +0100, "John Jardine"
> > <john@jjdesigns.fsnet.co.uk> wrote:
> >
> >>
> >>John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> wrote in message
> >>news:oqvnf0lhisia9kkajpaptq103d3283025j@4ax.com...
> >>> On Mon, 19 Jul 2004 17:24:34 +0100, "John Jardine"
> >>> <john@jjdesigns.fsnet.co.uk> wrote:
> >>> >rate.
> >>> PowerBasic returns the good-old-days of clean, fast programming and
> >>> direct hardware access. I've run useful megahertz-rate FOR loops on a
> >>> PC using PB. Direct hardware access is allowed under Win9x, and the
> >>> Totalio utility opens up all the i/o ports under 2K/XP. We even have a
> >>> utility that locates PCI cards and drags them down into the real
> >>> memory space where PB can flog them.
> >>>
> >>> I wrote a nice little parallel-port serial-DAC bit-banger in PB, and
> >>> my customer felt the need to rewrite it in VB.NET. It went from 2
> >>> files (.bas, .exe) to about 25, ran about 0.05 as fast, needed some
> >>> extra dll's to access the hardware, and never quite worked right.
> >>>
> >>> The PowerBasic people are weird, though, but not as weird as
> >>> Microsoft.
> >>
> >>(I've got a screen 12, 'vectored text' programme somewhere on their site)
> >>>
> >>> John
> >>
> >>Customers?, pah!, who'd have 'em :-) I genuinely try to encourage 'em to
> >>make small wanted changes to the kit. Y'know, take a bit of control for
> >>themselves, save themselves a bit of cash, even shoehorn me out of the loop.
> >>Doesn't happen. They'll religiously play the game. Have come to realise that
> >>(quite sensibly) having paid for a specialist supplier they've no interest
> >>in doing any of it for themselves.
> >>The old '486 I use for hardware test was used to run an impedance analyser
> >>design. DUT phase shifts measured by a PB machine code loop counter. The
> >>loops were running at an amazing 10MHz! (Yet again final precision massively
> >>limited by the ISA bus)
> >>Wonder what the new PCs are capable of running at?
> >>regards
> >>john
> >>
> >
> > My klunky old 700 MHz Dell runs an empty long-integer FOR loop in 35
> > ns, about 28 MHz. Adding a long-integer increment inside the loop adds
> > only 3 ns, which may be the consequence of CPU pipelining or
> > something. A function/sub call adds about 150 ns of overhead, which is
> > why I prefer GOTO programming.
> >
> > ISAbus INP takes 1.4 usec!
> >
> Giving you a coherent idea of the speed of the 3328MHz Pentium4 (other
> processors of that ilk, like the Athlon are similar), here are some
> speed measures (there is false precision in the numbers):
>
> A simple (empty) loop of 20 iterations, executed 500million times
> takes about 7.5seconds. (10 billion iterations.)
>
> A simple loop that adds 20 integers, executed 500million times
> takes about 10.6seconds. (10 billion iterations)
>
> A simple loop that adds two groups of 20 integers, executed 500million
> times takes about 13.4seconds. (20 billion additions)
>
> A partially unrolled loop (two groups of 5 integers long), adding
> the same numbers as the two groups of 20 integers, executed 500million
> times takes about 7seconds.
> (About 2.8 billion int additions per second.)
>
> So, the simple empty loop takes about .75 nanoseconds per iteration.
> Each iteration of the loop with 1 addition takes about 1.06 nanosecond
> per iteration.
> (About 1 billion int additions per second.)
>
> Each iteration of the loop with 2 additions takes about 1.34 nanoseconds.
> (about 1.4 billion int additions per second.)
>
> The additional overhead of adding one more array element (the
> time difference between the loop with one addition and the loop with
> two additions) is .34nanoseconds. The additional addition consists
> of a register indirect memory reference and an update (increment)
> of the register containing the address.
>
> So, the addition rate from a cache resident array of integers is
> about 3 billion integer additions per second. This INCLUDES the update of
> the memory address (the leal 4(%esi),%esi instruction.) This addition
> rate is the maximum; realistically, a non-unrolled loop appears to manage
> about 1.4 billion additions per second (including loop overhead.)
>
> Unrolling can make a very significant improvement in performance.
>
> For floating point, I seem to get 800 million/sec 32 bit SSE floating
> point adds for the non-unrolled loop, and 833 million/sec for the unrolled
> loop. (I didn't bother carefully optimizing these.)
>
> In general, the new P4 type processors can dispatch between 4 instructions
> per clock cycle down to one instruction every two clock cycles (for
> normal, non multiply/divide instructions.) You'll normally see that
> for NORMAL instruction streams, where the data/program is cache resident,
> 1 instruction per clock cycle is plausible. For lots of floating
> point, lots of arrays that aren't resident in cache, or lots of multiplies
> and divides, the instruction rate is much slower.
>
> Amazingly, the P4 can peak at 4 add/sub instructions every clock cycle (at
> 3200MHz), but cannot sustain that for long. The instruction stream
> is highly pipelined, so that an instruction that is executed at a rate
> of 1 per clock cycle might not make the result available for 5-10
> clock cycles. ANY upset in the branch prediction (or an interrupt)
> will tend to greatly damage the performance.
>
> The P4 makes a fairly fast DSP machine!!! I have an application
> that does 448168 512 point FFTs in 10 seconds... It does some other
> things than just the FFTs, but 22usec for 512 point floating point
> FFT seems pretty reasonable (given the fact that the program does
> lots of other things also.)
>
> John
That's bloody staggering!
If only they'd stretch (just a teensy weensy couple of inches) a couple of
those address and data tracks out to a connector on the PC case, I'd be as
happy as a pig in s**t.
regards
john
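
For anyone wanting to check the arithmetic in the quoted P4 figures, here is a minimal sketch in Python that turns the quoted wall-clock totals into per-iteration times, addition rates and clock cycles. Only the numbers come from the post (500 million runs of a 20-iteration loop, the 7.5 / 10.6 / 13.4 / 7 second totals, the 3328 MHz clock, 448168 FFTs in 10 seconds); the function and variable names are made up for illustration, and nothing here re-runs the original benchmark.

```python
# Figures quoted in the post above; all names below are illustrative only.
OUTER_RUNS = 500_000_000   # "executed 500million times"
INNER_ITERATIONS = 20      # "loop of 20 iterations"
CLOCK_HZ = 3328e6          # "3328MHz Pentium4"

# label -> (total seconds quoted, integer additions per inner iteration)
RUNS = {
    "empty loop": (7.5, 0),
    "one addition": (10.6, 1),
    "two additions": (13.4, 2),
    # The unrolled run does the same additions as "two additions".
    "two additions, unrolled": (7.0, 2),
}


def total_iterations():
    """Number of inner-loop iterations covered by one quoted total."""
    return OUTER_RUNS * INNER_ITERATIONS


def per_iteration_ns(total_seconds):
    """Average time per inner iteration, in nanoseconds."""
    return total_seconds / total_iterations() * 1e9


def additions_per_second(total_seconds, adds_per_iteration):
    """Integer additions completed per second of wall-clock time."""
    return total_iterations() * adds_per_iteration / total_seconds


def cycles_per_iteration(total_seconds):
    """Clock cycles per inner iteration at the quoted clock rate."""
    return per_iteration_ns(total_seconds) * 1e-9 * CLOCK_HZ


def fft_usec(fft_count, total_seconds):
    """Average time per FFT in microseconds (includes the program's other work)."""
    return total_seconds / fft_count * 1e6


if __name__ == "__main__":
    for label, (seconds, adds) in RUNS.items():
        print(
            f"{label:24s} {per_iteration_ns(seconds):5.2f} ns/iter  "
            f"{additions_per_second(seconds, adds) / 1e9:5.2f} G adds/s  "
            f"{cycles_per_iteration(seconds):4.2f} cycles/iter"
        )
    step = per_iteration_ns(13.4) - per_iteration_ns(10.6)
    print(f"cost of the second addition: {step:.2f} ns")
    print(f"512 point FFT: {fft_usec(448168, 10.0):.1f} usec each")
```

Worked from the quoted totals, the step from one addition to two comes out at 0.28 ns rather than the .34 ns quoted, which fits the post's own warning about false precision. The unrolled case lands in the same region as the "about 3 billion" peak figure.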