Re: Share Your Experience with 3DNow, SSE, SSE2 etc.



On Aug 3, 5:32 am, Hendrik van der Heijden <h...@xxxxxx> wrote:
aruzinsky wrote:

On Aug 2, 4:36 am, Hendrik van der Heijden <h...@xxxxxx> wrote:
My personal experience with image processing (P4, K8, Core2)
is that performance gains from prefetching are non-existent
or not worth the effort, and not consistent across different systems.
I experimented with SSE and SSE2 replacements for memcopy with and
without prefetches.

Memcopy is quite different from (stream) processing. My code had
5-50 SSE operations per vector read.
I think large (50MB+) memcopies are mostly relevant in benchmarks.
There are applications which do large memcopies, but those do not
need performance. Applications that demand optimum performance
will hopefully be designed/coded in a way that avoids large memcopies.
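
To illustrate the difference, here is a minimal sketch of a stream-processing loop that does several SSE operations per vector read. It is not the image-processing code discussed here; the function name poly3_sse and the cubic polynomial are hypothetical, chosen only to show the shape of such a loop. It assumes both arrays are 16-byte aligned and n is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_store_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical example kernel: evaluates
 *   y = ((c[3]*x + c[2])*x + c[1])*x + c[0]
 * for four floats at a time (Horner's scheme).
 * Assumes dst and src are 16-byte aligned and n is a multiple of 4. */
static void poly3_sse(float *dst, const float *src, size_t n, const float c[4])
{
    const __m128 c0 = _mm_set1_ps(c[0]);
    const __m128 c1 = _mm_set1_ps(c[1]);
    const __m128 c2 = _mm_set1_ps(c[2]);
    const __m128 c3 = _mm_set1_ps(c[3]);

    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(src + i);               /* one vector read   */
        __m128 y = _mm_add_ps(_mm_mul_ps(c3, x), c2);  /* six arithmetic    */
        y = _mm_add_ps(_mm_mul_ps(y, x), c1);          /* operations on the */
        y = _mm_add_ps(_mm_mul_ps(y, x), c0);          /* loaded vector     */
        _mm_store_ps(dst + i, y);                      /* one vector write  */
    }
}
```

With six arithmetic operations per read this sits at the low end of the 5-50 range; a plain copy has none, so its speed depends on the memory system alone.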

The SSE version uses movaps for both reading and
writing to memory whereas the SSE2 version uses movntpd to write to
memory.  The percentage decreases in time over memcopy for very large
arrays were

Do you mean there's a speedup for small arrays, but lower/no speedup
for very large arrays?


No, my test arrays were 50e6 floats.


If I gave you the code, would you (or anyone) run it on your computers
and report the results in this thread?

There are already benchmarks for this available. I just played a bit with
Rightmark Memory Analyzer, and that's what I observed on my Core2 1.8GHz,
P965, 2ch DDR2-800:

There's no option to disable prefetching; one can only set the
PF distance to zero (range 0..4KB). One can choose between regular and
non-temporal stores and choose the array size.

SSE2 memcpy performance (GB/s)

array size              fits in L1   fits in L2   larger than L2
non-temporal store          3.6          3.6           2.0
regular store              23            7.9           1.5
reg. store + prefetch      21/14         7.9           1.5
nt store + prefetch         3.6          3.6           2.1

Observations:

1. Non-temporal stores force writes to go to RAM. If the array fits
    in cache (L2), this yields a severe performance degradation,
    as the memcpy is bound by RAM bandwidth instead of cache bandwidth.

2. Prefetching barely makes a difference for >L2 copies and
    lowers performance for L1 sized copies.
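
For reference, the copy variants compared above could look roughly like the following sketch. This is neither the Rightmark code nor the code offered earlier in the thread; the function name sse2_copy and its parameters are assumptions made for illustration. It assumes both pointers are 16-byte aligned and n is a multiple of 16.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <xmmintrin.h>  /* SSE:  _mm_prefetch, _mm_sfence */

/* Hypothetical SSE2 copy kernel, for illustration only.
 * nontemporal != 0 selects streaming (cache-bypassing) stores.
 * pf_dist is the software prefetch distance in bytes; 0 issues no prefetch.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
static void sse2_copy(void *dst, const void *src, size_t n,
                      int nontemporal, size_t pf_dist)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t i = 0; i < n; i += 16) {
        /* Only prefetch addresses that lie inside the source array. */
        if (pf_dist != 0 && i + pf_dist < n)
            _mm_prefetch(s + i + pf_dist, _MM_HINT_T0);

        __m128i v = _mm_load_si128((const __m128i *)(s + i));

        if (nontemporal)
            _mm_stream_si128((__m128i *)(d + i), v);  /* bypasses the cache */
        else
            _mm_store_si128((__m128i *)(d + i), v);   /* regular store      */
    }

    /* Streaming stores are weakly ordered; fence before others read dst. */
    if (nontemporal)
        _mm_sfence();
}
```

Calling it with (nontemporal, pf_dist) set to (1, 0), (0, 0), (0, 256) and (1, 256) corresponds to the four rows of the table; the distance of 256 bytes is an arbitrary pick from the 0..4KB range.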

Hendrik vdH

Thank you for your input. I would feel more confident if you tested
my code, though.