Re: Share Your Experience with 3DNow, SSE, SSE2 etc.
- From: aruzinsky <aruzinsky@xxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 3 Aug 2008 08:49:34 -0700 (PDT)
On Aug 3, 5:32 am, Hendrik van der Heijden <h...@xxxxxx> wrote:
aruzinsky wrote:
On Aug 2, 4:36 am, Hendrik van der Heijden <h...@xxxxxx> wrote:
My personal experience for image processing (P4, K8, Core2)
is that performance gains through prefetching are non-existent
or not worth the effort and not consistent across different systems.
I experimented with SSE and SSE2 replacements for memcopy with and
without prefetches.
Memcopy is quite different to (stream) processing. My code had
5-50 SSE operations per vector read.
I think, large (50MB+) memcopys are mostly relevant in benchmarks.
There are applications which do large memcopys, but those do not
need performance. Applications with a demand for optimum performance
will hopefully be designed/coded in a way to avoid large memcopys.
The SSE version uses movaps for both reading and
writing to memory whereas the SSE2 version uses movntpd to write to
memory. The percentage decreases in time over memcopy for very large
arrays were
Do you mean there's a speedup for small arrays, but lower/no speedup
for very large arrays?
No, my test arrays were 50e6 floats.
If I gave you the code, would you (or anyone) run it on your computers
and report the results in this thread?
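For anyone who wants to try this kind of measurement themselves, here is a minimal sketch of a timing harness in C. It is not the code offered above, which was never posted in this thread; the names time_memcpy, percent_decrease and bandwidth_gbs are made up for illustration. It times plain memcpy and reports seconds per copy. It assumes a C11 library (timespec_get) and counts each copied byte once when converting to GB/s. With 4-byte floats, 50e6 floats is 200 MB per buffer.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Percentage decrease in time of a candidate copy relative to a reference. */
double percent_decrease(double t_ref, double t_new)
{
    return 100.0 * (t_ref - t_new) / t_ref;
}

/* Bandwidth in GB/s (1 GB = 1e9 bytes), counting each copied byte once. */
double bandwidth_gbs(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e9;
}

/* Time `reps` memcpy calls over n_floats floats.
   Returns seconds per copy, or a negative value if allocation fails. */
double time_memcpy(size_t n_floats, int reps)
{
    size_t bytes = n_floats * sizeof(float);
    float *src = malloc(bytes);
    float *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    for (size_t i = 0; i < n_floats; i++)
        src[i] = (float)i;
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);   /* wall clock, not a monotonic timer */
    for (int r = 0; r < reps; r++) {
        memcpy(dst, src, bytes);
        /* Read one element back so the copy is not optimized away. */
        volatile float sink = dst[(size_t)r % n_floats];
        (void)sink;
    }
    timespec_get(&t1, TIME_UTC);

    free(src);
    free(dst);
    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return elapsed / reps;
}
```

A call such as time_memcpy(50000000, 10) would correspond to the array size mentioned above; a candidate routine can be timed the same way by replacing the memcpy call, and the two times compared with percent_decrease.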
There are already benchmarks for this available. I just played a bit with
Rightmark Memory Analyzer, and that's what I observed on my Core2 1.8GHz,
P965, 2ch DDR2-800:
There's no option to disable prefetching; one can only set the
prefetch (PF) distance to zero (range 0..4KB). One can choose between
regular and non-temporal stores and choose the array size.
SSE2 memcpy performance (GB/s)

array size              fits in L1   fits in L2   larger than L2
nontemporal store       3.6          3.6          2.0
regular store           23           7.9          1.5
reg store + prefetch    21/14        7.9          1.5
nt store + prefetch     3.6          3.6          2.1
Observations:
1. Non-temporal stores force writes to go to RAM. If the array fits
in cache (L2), this yields a severe performance degradation,
as the memcpy is bound by RAM bandwidth instead of cache bandwidth.
2. Prefetching barely makes a difference for >L2 copies and
lowers performance for L1 sized copies.
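To make the four variants in the table concrete, here is a rough sketch of what such copy loops can look like with SSE2 intrinsics. This is not Rightmark's code and not the code discussed earlier in the thread; the function names, the 64-byte cache line, and the choice of the T0 prefetch hint are all assumptions made for illustration. Both pointers must be 16-byte aligned and the size a multiple of 16. A pf_dist of 0 issues no prefetch instructions at all.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cache-line size in bytes */

/* Issue at most one software prefetch per cache line, pf_dist bytes ahead,
   and never past the end of the source buffer. */
static void maybe_prefetch(const char *s, size_t off, size_t bytes, size_t pf_dist)
{
    if (pf_dist != 0 && off % LINE == 0 && off + pf_dist < bytes)
        _mm_prefetch(s + off + pf_dist, _MM_HINT_T0);
}

/* Regular store: the written data goes through the cache hierarchy. */
void copy_regular(void *dst, const void *src, size_t bytes, size_t pf_dist)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && bytes % 16 == 0);
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t off = 0; off < bytes; off += 16) {
        maybe_prefetch(s, off, bytes, pf_dist);
        __m128i v = _mm_load_si128((const __m128i *)(s + off));
        _mm_store_si128((__m128i *)(d + off), v);
    }
}

/* Non-temporal store: the written data is streamed towards memory
   without being kept in cache. */
void copy_nontemporal(void *dst, const void *src, size_t bytes, size_t pf_dist)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && bytes % 16 == 0);
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t off = 0; off < bytes; off += 16) {
        maybe_prefetch(s, off, bytes, pf_dist);
        __m128i v = _mm_load_si128((const __m128i *)(s + off));
        _mm_stream_si128((__m128i *)(d + off), v);
    }
    _mm_sfence();  /* make the streamed stores visible before returning */
}
```

The only difference between the two loops is the store instruction, which is the distinction between the "regular store" and "nontemporal store" rows; the pf_dist argument plays the role of the PF distance setting.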
Hendrik vdH
Thank you for your input. I would feel more confident if you tested
my code, though.