Re: Share Your Experience with 3DNow, SSE, SSE2 etc.



aruzinsky wrote:

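$L2:
movaps xmm0, [edx]          ; load one 64-byte block from the source (edx)
movaps xmm1, [edx+16]
movaps xmm2, [edx+32]
movaps xmm3, [edx+48]

PREFETCHNTA [edx+ecx]       ; prefetch the source ecx bytes ahead

movntpd [eax], xmm0         ; non-temporal stores to the destination (eax)
movntpd [eax+16], xmm1
movntpd [eax+32], xmm2
movntpd [eax+48], xmm3

add edx, 64                 ; advance both pointers by one block
add eax, 64

dec esi                     ; esi counts the remaining 64-byte blocks
jnz $L2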

Prefetching here doesn't gain much, as the RAM access pattern
is easily predictable by the hardware prefetcher.

I tried several things on my Core2.

For arrays larger than the caches:
- nontemporal writes give a 33% speedup
- prefetching makes no difference
- unroll by 4 vs not unrolled makes no difference

For arrays that fit in L2 (copying the same block 100 times):
- nontemporal writes reduce performance to 27% of the speed with regular stores
- prefetching makes no difference
- unroll by 4 vs not unrolled makes no difference
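For anyone who wants to repeat the comparison, here is a rough sketch in C with SSE2 intrinsics. It is not the code I timed. The function names (copy_cached, copy_nontemporal, time_copies) are made up for this post, and it uses the integer forms movdqa/movntdq instead of the movaps/movntpd pair in the quoted loop. Both pointers are assumed to be 16-byte aligned and the size a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical common signature so both variants can be timed the same way. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Regular aligned 16-byte stores: the destination goes through the caches. */
static void copy_cached(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 16 == 0);
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Non-temporal (streaming) stores: the destination is written with a
   hint not to keep it in the caches. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 16 == 0);
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  /* order the streaming stores before any later stores */
}

/* Run `reps` copies of the same block and return the CPU time in seconds
   as measured by clock(). */
static double time_copies(copy_fn copy, void *dst, const void *src,
                          size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        copy(dst, src, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

For the first case, call time_copies with a block much larger than the caches; for the second, use a block that fits in L2 and reps = 100. Aligned buffers can be obtained with _mm_malloc(n, 16), for example. Numbers will depend on the CPU and compiler, and an optimizing compiler may rewrite the plain loop, so check the generated assembly.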


Hendrik vdH