Re: Share Your Experience with 3DNow, SSE, SSE2 etc.



aruzinsky wrote:
On Aug 2, 4:36 am, Hendrik van der Heijden <h...@xxxxxx> wrote:

My personal experience with image processing (P4, K8, Core2)
is that performance gains from prefetching are nonexistent
or not worth the effort, and not consistent across different systems.

I experimented with SSE and SSE2 replacements for memcopy with and
without prefetches.

Memcopy is quite different from (stream) processing. My code had
5-50 SSE operations per vector read.
I think large (50MB+) memcopies are mostly relevant in benchmarks.
There are applications which do large memcopies, but those do not
need performance. Applications that demand optimum performance
will hopefully be designed/coded in a way that avoids large memcopies.

The SSE version uses movaps for both reading and
writing to memory whereas the SSE2 version uses movntpd to write to
memory. The percentage decreases in time over memcopy for very large
arrays were

Do you mean there's a speedup for small arrays, but lower/no speedup
for very large arrays?

If I gave you the code, would you (or anyone) run it on your computers
and report the results in this thread?

There are already benchmarks for this available. I just played a bit with
Rightmark Memory Analyzer, and that's what I observed on my Core2 1.8GHz,
P965, 2ch DDR2-800:

There's no option to disable prefetching; one can only set the
prefetch (PF) distance to zero (the range is 0..4KB). One can choose
between regular and non-temporal stores and choose the array size.
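
To make those knobs concrete, here is a rough sketch of what such a copy loop could look like with SSE2 intrinsics. This is not RightMark's code; the function name copy_sse2, the 64-byte block size and the NTA prefetch hint are my own assumptions. Note that a distance of zero only skips the software prefetch, the hardware prefetcher keeps running.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from any benchmark tool.
   Copies n bytes in 64-byte blocks using 16-byte loads and stores.
   dst and src must be 16-byte aligned and n a multiple of 64.
   pf_dist: software prefetch distance in bytes, 0 = no software prefetch.
   use_nt:  nonzero selects non-temporal stores (movntdq, the integer
            sibling of movntpd), zero selects regular stores (movdqa). */
static void copy_sse2(void *dst, const void *src, size_t n,
                      size_t pf_dist, int use_nt)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 64 == 0);

    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t i = 0; i < n; i += 64) {
        /* Prefetch ahead of the read position, but stay inside the array. */
        if (pf_dist != 0 && i + pf_dist < n)
            _mm_prefetch(s + i + pf_dist, _MM_HINT_NTA);  /* hint is an assumption */

        for (size_t j = 0; j < 64; j += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + i + j));
            if (use_nt)
                _mm_stream_si128((__m128i *)(d + i + j), v);
            else
                _mm_store_si128((__m128i *)(d + i + j), v);
        }
    }

    if (use_nt)
        _mm_sfence();  /* order the streaming stores before later stores */
}
```

Calling it as copy_sse2(dst, src, n, 0, 0) would correspond to the "regular store" row, and copy_sse2(dst, src, n, 256, 1) to "nt store + prefetch" with a 256-byte distance.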


SSE2 memcpy performance (GB/s)

array size              fit in L1    fit in L2    larger than L2
nontemporal store          3.6          3.6           2.0
regular store             23            7.9           1.5
reg store + prefetch      21/14         7.9           1.5
nt store + prefetch        3.6          3.6           2.1

Observations:

1. Non-temporal stores force writes to go to RAM. If the array fits
in cache (L2), this yields a severe performance degradation,
as the copy is bound by RAM bandwidth instead of cache bandwidth.

2. Prefetching barely makes a difference for >L2 copies and
lowers performance for L1 sized copies.
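
If one wanted to act on observation 1 in code, a copy routine could pick the store type by size. The following is only a sketch under my own assumptions: the name copy_pick_store is made up, and cache_bytes is a threshold the caller has to supply (for example the L2 size), it is not detected here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dispatcher. Copies n bytes from src to dst.
   Uses non-temporal stores only when the copy is larger than cache_bytes
   (a caller-supplied threshold) and the destination allows 16-byte
   streaming stores; otherwise falls back to plain memcpy so that
   cache-sized copies keep cache bandwidth.
   Returns 1 if the streaming path was taken, 0 otherwise. */
static int copy_pick_store(void *dst, const void *src, size_t n,
                           size_t cache_bytes)
{
    if (n <= cache_bytes || (uintptr_t)dst % 16 != 0 || n % 16 != 0) {
        memcpy(dst, src, n);
        return 0;
    }

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  /* unaligned load is fine */

    _mm_sfence();  /* order the streaming stores before later stores */
    return 1;
}
```

Going by the table, the streaming path would only pay off in the "larger than L2" column, and even there the gain is modest, so whether the extra branch is worth it is something to measure per system.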



Hendrik vdH