Re: OT Dual core CPUs versus faster single core CPUs?



John Larkin wrote:
On Tue, 06 May 2008 19:19:29 -0700, JosephKK <quiettechblue@xxxxxxxxx>
wrote:

On Mon, 05 May 2008 10:16:49 -0700, John Larkin
<jjlarkin@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

On Sat, 03 May 2008 12:39:17 -0700, Jeff Liebermann <jeffl@xxxxxxxxxx>
wrote:

On Sat, 03 May 2008 06:50:49 -0700, John Larkin
<jjlarkin@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

What else are you going to do with 1024 CPU's on a chip?
John
Well, how about...

Error detection. Have three CPU's do essentially the same
calculations. If they agree, continue. If they disagree, take the
result from the two that agree. The overhead is minimal as the
processes are all concurrent. However, the power dissipation might be
up to 3 times higher than with a single CPU.

That only works if the three versions of the software were written independently to satisfy the same specification, preferably using different tools. It is done only in absolutely mission-critical, life-or-death software. I think parts of the space shuttle launch sequence use this approach (and sometimes the launch is cancelled because the systems disagree at a checkpoint).

Otherwise all you are ensuring is that the same software run three times gives the same answer (which might or might not be true, depending on the FP rounding rules). And you have to be very careful that the additional complexity does not itself add a new mode of failure and unreliability: a failure in the supervisor that compares the answers, for instance.
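
To make the voting step and the FP caveat concrete, here is a rough sketch in Python. The function name `majority_vote`, the tolerance value and the three "replica" results are all made up for illustration; this is not how any real flight system does it.

```python
import math
from typing import Optional, Sequence


def majority_vote(results: Sequence[float], rel_tol: float = 1e-9) -> Optional[float]:
    """Return a result that a strict majority of replicas agree on, else None.

    Agreement is judged with a relative tolerance (an assumption of this
    sketch), because bit-exact comparison can fail on FP rounding alone.
    Note that tolerance-based agreement is not transitive, so a real
    voter would need a more careful rule than this one.
    """
    needed = len(results) // 2 + 1
    for candidate in results:
        agreeing = sum(
            1 for r in results if math.isclose(r, candidate, rel_tol=rel_tol)
        )
        if agreeing >= needed:
            return candidate
    return None  # no majority: the supervisor has to flag a fault


# Two healthy replicas add the same numbers in a different order,
# and a third (hypothetical) replica has gone wrong.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
c = 0.7

print(a == b)                                  # False: rounding differs
print(majority_vote([a, b, c]))                # a and b outvote c
print(majority_vote([a, b, c], rel_tol=0.0))   # exact compare: no majority
```

The voter itself is only a dozen lines, but it is also a single point of failure, and the choice of tolerance is a design decision that can hide or invent disagreements.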

A much cheaper way to improve software reliability is to port it to another machine or even a different compiler. We just about always found something of interest every time this was done, even for code that was extremely robust and had been run on everything from a Cray down to a Z80 (the latter was done to win a bet). Static testing of software is possible, but comparatively few shops do it seriously.

The CPU's don't have bugs; the software does.

John

CPUs typically have a few bugs each, but they are seldom of major consequence. The last one I can recall that was serious egg on face was the Intel F00F bug. So before you gloat too much about hardware's seeming infallibility, I suggest you read the abstract at:

http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/4211748/4211749/04211889.pdf?tp=&isnumber=&arnumber=4211889

CPUs are much more intensively simulated, and individual key component blocks like multipliers and dividers are much more amenable to formal-methods proof of correctness and to reuse than generic freeform business software. Even so, there is a test vs performance trade-off for release.

ISTR that when Cyrix commissioned their formal specification of the x87 to produce a cleanroom clone that was pin-for-pin compatible, they found a couple of dozen minor defects in the original Intel x87 chip.

I am truly amazed that the register colouring and speculative execution in the current generation of P4 chips don't cause more problems.

So you have forgotten about the Pentium FDIV bug already?

In a sense that was as much an algorithmic/firmware error as a hardware bug (and it was exceedingly rare for it to trigger). I had one machine with the fault, provided to me by a customer to ensure that our fixed software worked OK even on the defective CPUs.

For comparison, XL2007 pre-SP1 cannot annotate log graphs correctly above 10^8 (two ticks labelled 10000000) and infamously displays 65535-eps as 100000 for certain unlucky values of eps<<<1. And without SP1 it is so slow at drawing large graphs as to be a joke.

But CPU bugs are rare and documented, and there are workarounds. In a
typical PC, in the OS and a reasonable set of apps, there will be
thousands of bugs, maybe tens of thousands, most of which are
undocumented and never fixed. Next rev will keep many of the old bugs
and add thousands more. There's just no comparison between crashes
caused by software vs hardware: I bet the ratio is ballpark 1e5:1.

I'd be more inclined to bet 1000:1, maybe 10000:1 at the outside. I have seen the odd bug and/or undocumented feature in most CPUs I have worked on: most of them unimportant, a couple show-stopping, and some of them even useful. One or two were a major security risk.

Existing hardware design methodologies work very well;
billion-transistor chips are reliable. Current software methodologies
are clearly broken: a million line program typically has thousands of
bugs.

It is worth pointing out here that all modern chips are designed using software. The big difference is that committing to large scale bulk chip fabrication is *so* horrendously expensive that it doesn't happen until the thing simulates perfectly and tests out OK in prototype hardware against aggressive whitebox testers.

Software by comparison is dirt cheap to duplicate, and first-to-market advantage is huge. The result is regrettably a "ship it and be damned" management culture. You can always issue chargeable hotfixes or service packs.

BTW, a million-line manually written program with average industry practice should typically have around 500 bugs in it. Best practice is one or two orders of magnitude better if you are prepared to pay the price (and wait longer).
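
Spelling out the arithmetic behind those figures, as a small Python sketch (the helper names are invented for illustration; the only inputs are the 500 bugs and million lines quoted above):

```python
LINES = 1_000_000      # program size from the estimate above
AVERAGE_BUGS = 500     # average-practice figure from the estimate above


def bugs_per_kloc(bugs: float, lines: int) -> float:
    """Defect density in bugs per thousand lines of code."""
    return bugs / (lines / 1000)


def improved(bugs: float, orders_of_magnitude: int) -> float:
    """Bug count after improving by the given number of powers of ten."""
    return bugs / 10 ** orders_of_magnitude


print(bugs_per_kloc(AVERAGE_BUGS, LINES))   # average practice, per KLOC
for orders in (1, 2):
    # best practice: one or two orders of magnitude better
    print(orders, improved(AVERAGE_BUGS, orders))
```

So average practice works out to half a bug per thousand lines, and best practice to somewhere between 50 and 5 bugs in the whole million lines.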

Regards,
Martin Brown