Re: Larkin, Power BASIC cannot be THAT good:
- From: Martin Brown <|||newspam|||@nezumi.demon.co.uk>
- Date: Mon, 18 May 2009 11:43:53 +0100
Jasen Betts wrote:
On 2009-05-15, John Devereux <john@xxxxxxxxxxxxxx> wrote:Jan Panteltje <pNaonStpealmtje@xxxxxxxxx> writes:
Forget my remark about -O4, runs the same with your C code as -O3, about 7 seconds on my eeePC.I suspect this is all limited by the system memory bandwidth.
It looks like it, with -O4 optimisiation on dual-threaded and single-threaded
versions both run in 1.9 seconds here.
Yes. It is bandwidth limited on the external ram because of the way it hits sequential locations once per loop. Using DDR2 ram execution is roughly 2s and older DDR ram is about 3s in execution. The absolutely dire code generated by debug mode with all optimisation disabled is only 4s. So it isn't really a test of PowerBasics code generation at all.
The real test of an optimising compiler is how well it can exploit cache aware alogrithms with multiple nested loops. A cache aware version of this vector add and accumulate (without using SIMD instructions) is about 15% faster than the simple loop on a good optimising compiler (actually that is measured on MSC I haven't checked its code generation for optimisation - too tedious). I suspect gcc might be better.
The optimisation totally disabled version compiled by MSC is:
for (i = 0; i < ARRAY_SIZE; i++) {
004010BE mov dword ptr [i],0
004010C5 jmp main+0D0h (4010D0h)
004010C7 mov ecx,dword ptr [i]
004010CA add ecx,1
004010CD mov dword ptr [i],ecx
004010D0 cmp dword ptr [i],3D09000h
004010D7 jge main+0F7h (4010F7h)
s[i] += a[i];
004010D9 mov edx,dword ptr [i]
004010DC mov eax,dword ptr [a]
004010DF movsx ecx,word ptr [eax+edx*2]
004010E3 mov edx,dword ptr [i]
004010E6 mov eax,dword ptr [s]
004010E9 add ecx,dword ptr [eax+edx*4]
004010EC mov edx,dword ptr [i]
004010EF mov eax,dword ptr [s]
004010F2 mov dword ptr [eax+edx*4],ecx
}
004010F5 jmp main+0C7h (4010C7h)
And takes 4s on P4 3GHz with DDR ram.
Optimiser turns it into something like
mov esi, a
mov edi, s
mov ecx, ARRAY_SIZE
xor edx, edx
forloop:
movsx eax, word ptr[esi+2*edx]
add dword ptr[edi+4*edx], eax
inc edx
loop forloop
And this extremely tight optimised loop code still takes 3s with DDR ram and just under 2s with DDR2 ram.
MSC has changed the way it does this. My oldest compiler generates a pointer implementation for array access whereas newer compilers exploit the additional scaled indexing features of later CPUs. The oldest compiler generates very slightly faster code for the loop (at least it does after a manual reordering of the generated instructions).
forloop:
movsx eax,word ptr[esi]
add esi, 2
add dword ptr[edi], eax
add edi, 4
loop forloop
Code snippets above subject to typos.
Be interested to see what code gcc -O4 generates.
Regards,
Martin Brown
.
- References:
- Larkin, Power BASIC cannot be THAT good:
- From: Jan Panteltje
- Re: Larkin, Power BASIC cannot be THAT good:
- From: John Larkin
- Re: Larkin, Power BASIC cannot be THAT good:
- From: Jan Panteltje
- Re: Larkin, Power BASIC cannot be THAT good:
- From: John Devereux
- Re: Larkin, Power BASIC cannot be THAT good:
- From: Jasen Betts
- Larkin, Power BASIC cannot be THAT good:
- Prev by Date: Jon Kirwan = prize jerk
- Next by Date: Re: PowerBasic rocks!
- Previous by thread: Re: Larkin, Power BASIC cannot be THAT good:
- Next by thread: Re: Larkin, Power BASIC cannot be THAT good:
- Index(es):
Relevant Pages
|