Re: Sydney-X1 FPGA Computer Challenges Commodore, Amiga and Apple

From: John Jardine (john_at_jjdesigns.fsnet.co.uk)
Date: 07/21/04


Date: Wed, 21 Jul 2004 23:09:01 +0100


John S. Dyson <toor@iquest.net> wrote in message
news:cdhcd1$29kr$1@news.iquest.net...
> In article <a69of056iepu0uiqrfad2bsnvr9nm1npj9@4ax.com>,
> John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> writes:
> > On Mon, 19 Jul 2004 19:31:06 +0100, "John Jardine"
> > <john@jjdesigns.fsnet.co.uk> wrote:
> >
> >>
> >>John Larkin <jjlarkin@highSNIPlandTHIStechPLEASEnology.com> wrote in message
> >>news:oqvnf0lhisia9kkajpaptq103d3283025j@4ax.com...
> >>> On Mon, 19 Jul 2004 17:24:34 +0100, "John Jardine"
> >>> <john@jjdesigns.fsnet.co.uk> wrote:
> >>> >rate.
> >>> PowerBasic returns the good-old-days of clean, fast programming and
> >>> direct hardware access. I've run useful megahertz-rate FOR loops on a
> >>> PC using PB. Direct hardware access is allowed under Win9x, and the
> >>> Totalio utility opens up all the i/o ports under 2K/XP. We even have a
> >>> utility that locates PCI cards and drags them down into the real
> >>> memory space where PB can flog them.
> >>>
> >>> I wrote a nice little parallel-port serial-DAC bit-banger in PB, and
> >>> my customer felt the need to rewrite it in VB.NET. It went from 2
> >>> files (.bas, .exe) to about 25, ran about 0.05 as fast, needed some
> >>> extra dll's to access the hardware, and never quite worked right.
> >>>
> >>> The PowerBasic people are weird, though, but not as weird as
> >>> Microsoft.
> >>
> >>(I've got a screen 12, 'vectored text' programme somewhere on their site)
> >>>
> >>> John
> >>
> >>Customers?, pah!, who'd have 'em :-) I genuinely try to encourage 'em to
> >>make small wanted changes to the kit. Y'know, take a bit of control for
> >>themselves, save themselves a bit of cash, even shoehorn me out of the loop.
> >>Doesn't happen. They'll religiously play the game. Have come to realise that
> >>(quite sensibly) having paid for a specialist supplier they've no interest
> >>in doing any of it for themselves.
> >>The old '486 I use for hardware test was used to run an impedance analyser
> >>design. DUT phase shifts measured by a PB machine code loop counter. The
> >>loops were running at an amazing 10MHz! (Yet again final precision massively
> >>limited by the ISA bus)
> >>Wonder what the new PCs are capable of running at?
> >>regards
> >>john
> >>
> >
> > My klunky old 700 MHz Dell runs an empty long-integer FOR loop in 35
> > ns, about 28 MHz. Adding a long-integer increment inside the loop adds
> > only 3 ns, which may be the consequence of CPU pipelining or
> > something. A function/sub call adds about 150 ns of overhead, which is
> > why I prefer GOTO programming.
> >
> > ISAbus INP takes 1.4 usec!
> >
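As a rough cross-check of the figures quoted just above, here is a minimal Python sketch that turns a per-iteration time into a loop rate. The 35 ns, 3 ns, 150 ns and 1.4 usec values come from the quoted post; the helper name `rate_mhz` is made up for illustration.

```python
# Convert a per-iteration loop time into an iteration rate.
# Timing values are the ones quoted in the post; rate_mhz is a
# hypothetical helper name used only for this sketch.

def rate_mhz(period_ns: float) -> float:
    """Iterations per second, expressed in MHz, for one pass of period_ns."""
    return 1_000.0 / period_ns

EMPTY_LOOP_NS = 35.0      # empty long-integer FOR loop
INCREMENT_NS = 3.0        # extra cost of one long-integer increment
CALL_OVERHEAD_NS = 150.0  # extra cost of a function/sub call
ISA_INP_NS = 1_400.0      # one ISA-bus INP (1.4 usec)

print(f"empty loop:     {rate_mhz(EMPTY_LOOP_NS):.1f} MHz")
print(f"with increment: {rate_mhz(EMPTY_LOOP_NS + INCREMENT_NS):.1f} MHz")
print(f"with a call:    {rate_mhz(EMPTY_LOOP_NS + CALL_OVERHEAD_NS):.1f} MHz")
print(f"ISA INP only:   {rate_mhz(ISA_INP_NS):.2f} MHz")
```

On those numbers a single call per pass would cost several times the loop itself, and one ISA read dwarfs everything else.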
> Giving you a coherent idea of the speed of the 3328MHz Pentium4 (other
> processors of that ilk, like the Athlon are similar), here are some
> speed measures (there is false precision in the numbers):
>
> A simple (empty) loop of 20 iterations, executed 500million times
> takes about 7.5seconds. (10 billion iterations.)
>
> A simple loop that adds 20 integers, executed 500million times
> takes about 10.6seconds. (10 billion iterations)
>
> A simple loop that adds two groups of 20 integers, executed 500million
> times takes about 13.4seconds. (20 billion additions)
>
> A partially unrolled loop (two groups of 5 integers long), adding
> the same numbers as the two groups of 20 integers, executed 500million
> times takes about 7seconds.
> (About 2.8 billion int additions per second.)
>
> So, the simple empty loop takes about .75 nanoseconds per iteration.
> Each iteration of the loop with 1 addition takes about 1.06 nanosecond
> per iteration.
> (About 1 billion int additions per second.)
>
> Each iteration of the loop with 2 additions takes about 1.34 nanoseconds.
> (about 1.4 billion int additions per second.)
>
> The additional overhead to add the addition of an array element (the
> time difference between the loop with one addition instead of two
> additions) is .34nanoseconds. The additional addition consists
> of a register indirect memory reference and an update (increment)
> of the register containing the address.
>
> So, the addition rate from a cache resident array of integers is
> about 3 billion integer additions per second. This INCLUDES the update of
> the memory address (the leal 4(%esi),%esi instruction.) This addition
> rate is the maximum; realistically, about 1.4 billion additions per
> second appears achievable in a non-unrolled loop (including loop overhead.)
>
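The per-iteration figures above can be re-derived from the quoted wall-clock totals. This is only a sketch of the arithmetic, assuming every run covers 20 x 500 million = 10 billion inner iterations as stated; the function and variable names are invented here.

```python
# Re-derive per-iteration times and addition rates from the quoted totals.
# Assumes 20 inner iterations executed 500 million times, as in the post.

ITERATIONS = 20 * 500_000_000  # 10 billion inner iterations

def ns_per_iteration(total_seconds: float, iterations: int = ITERATIONS) -> float:
    """Average time per inner iteration, in nanoseconds."""
    return total_seconds / iterations * 1e9

def adds_per_second(total_adds: int, total_seconds: float) -> float:
    """Integer additions completed per second."""
    return total_adds / total_seconds

empty = ns_per_iteration(7.5)      # empty loop
one_add = ns_per_iteration(10.6)   # one addition per iteration
two_adds = ns_per_iteration(13.4)  # two additions per iteration

print(f"empty loop:        {empty:.2f} ns")
print(f"one addition:      {one_add:.2f} ns")
print(f"two additions:     {two_adds:.2f} ns")
print(f"cost of 2nd add:   {two_adds - one_add:.2f} ns")
print(f"two-add loop rate: {adds_per_second(2 * ITERATIONS, 13.4) / 1e9:.2f} billion/s")
print(f"unrolled rate:     {adds_per_second(2 * ITERATIONS, 7.0) / 1e9:.2f} billion/s")
```

Worked this way, the gap between the one-addition and two-addition loops comes out nearer 0.28 ns than the .34 ns quoted, which fits the caveat about false precision; either value puts the marginal rate at roughly 3 billion additions per second.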
> Unrolling can make a very significant improvement in performance.
>
> For floating point, I seem to get 800 million/sec 32 bit SSE floating
> point adds for the non-unrolled loop, and 833 million/sec for unrolled
> SSE adds per second. (I didn't bother carefully optimizing these.)
>
> In general, the new P4 type processors can dispatch between 4 instructions
> per clock cycle down to one instruction every two clock cycles (for
> normal, non multiply/divide instructions.) You'll normally see that
> for NORMAL instruction streams, where the data/program is cache resident,
> that 1 instruction per clock cycle is plausible. For lots of floating
> point, lots of arrays that aren't resident in cache, or lots of multiplies
> and divides, the instruction rate is much slower.
>
> Amazingly, the P4 can peak at 4 add/sub instructions every clock cycle (at
> 3200MHz), but cannot sustain that for long. The instruction stream
> is highly pipelined, so that an instruction that is executed at a rate
> of 1 per clock cycle might not make the result available for 5-10
> clock cycles. ANY upset in the branch prediction (or interrupt)
> will tend to greatly damage the performance.
>
> The P4 makes a fairly fast DSP machine!!! I have an application
> that does 448168 512 point FFTs in 10 seconds... It does some other
> things than just the FFTs, but 22usec for 512 point floating point
> FFT seems pretty reasonable (given the fact that the program does
> lots of other things also.)
>
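The FFT figure can be sanity-checked the same way. A minimal sketch, assuming the 10 seconds is wall-clock time for all 448168 transforms and taking the 3328 MHz clock mentioned earlier in the quoted post; the constant names are illustrative.

```python
# Time and clock-cycle budget per 512-point FFT, from the quoted totals.
# CLOCK_HZ is the 3328 MHz figure given earlier in the quoted post.

FFT_COUNT = 448_168
ELAPSED_S = 10.0
CLOCK_HZ = 3328e6

usec_per_fft = ELAPSED_S / FFT_COUNT * 1e6
cycles_per_fft = ELAPSED_S / FFT_COUNT * CLOCK_HZ

print(f"time per FFT:   {usec_per_fft:.1f} usec")
print(f"cycles per FFT: {cycles_per_fft:,.0f}")
```

That budget includes whatever else the application does between transforms, so it is an upper bound on the FFT itself.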
> John

That's bloody staggering!.

If only they'd stretch (just a teensy weensy couple of inches) a couple of
those address and data tracks out to a connector on the PC case, I'd be as
happy as a pig in s**t.
regards
john


