Re: how to achieve fast image 90 deg rotation in C

From: Martin Brown (|||newspam|||_at_nezumi.demon.co.uk)
Date: 06/15/04


Date: Tue, 15 Jun 2004 11:32:42 +0100

In message <6t2sc0hm8m2a9smih74dqaeb9l8shv94dr@4ax.com>, Severian
<severian@chlamydia-is-not-a-flower.com> writes
>On Mon, 14 Jun 2004 11:11:31 -0700, "Richard Zhu" <yzhu@algis.ca>
>wrote:
>
>>I have the following code to implement a 90 deg rot over an Image in VC,
>>the speed can't reach what is required,

>> for(y1=0;y1<n_h;y1++)
>> {
>> offset=y1*n_w;
>> for(x1=0;x1<n_w;x1++)
>> {
>> x0=y1;
>> y0=o_h-1-x1;
>> chR_rot[offset+x1]=chR1[y0*o_w+x0];
>> chG_rot[offset+x1]=chG1[y0*o_w+x0];
>> chB_rot[offset+x1]=chB1[y0*o_w+x0];
>> }
>> }
>
>I am not a cache expert, but it may be quicker to read rows and write
>columns. You're reading columns, so you will have a lot more cache
>misses.

It is worth trying both ways. They may well be different.

But it probably requires something a bit more devious than that.
1024x768 RGB transposed naively will ensure there are huge numbers of
cache misses. And unless the L2 cache is big enough you will end up with
everything being choked by making frequent direct access to main memory.

Assuming the CPU has 2 x 8k and 32 byte per line L1 cache. The transpose
may fit entirely into L1 cache if you try using slabs that are 32 rows
deep and 256 wide (or maybe 128 wide). If you can monitor cache misses
then it should be possible to tune it precisely. Some CPUs and OS's will
tell you about their cache configuration details on request.

You probably want as little code contained inside the inner loop as
possible, modern predictive branching makes loop unrolling unlikely to
help. It also matters whether or not the RGB buffers are interleaved.

If they aren't interleaved then using entirely separate loops for each
of R,G,B arrays might fit the whole image into L2 cache if you are
lucky.
>
>My program does a 1024x768 24-bit rotation in around 60-70ms on a P3
>800, so that should put you in the ballpark on your faster machine.
>
>If you have other image-oriented stuff to do, it might be worthwhile
>to investigate the Intel performance libraries, which will do things
>like this with highly-optimized MMX, SSE and SSE2 code.
>
>IPL requires 10-20ms to rotate a 1024x768 image in on my P3 800.

There is your target to beat. Or you could use IPL instead.

Regards,

-- 
Martin Brown


Relevant Pages

  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)
  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)
  • [PATCH][RFC] fast file mapping for loop
    ... are done once they hit page cache. ... loop without making it even slower than it currently is. ... * Add bio to back of pending list and wakeup thread ... * Find extent mapping this lo device block to the file block on the real ...
    (Linux-Kernel)
  • Cache size restrictions obsolete for unrolling?
    ... into the cache (i.e. the code of the loops was slightly smaller than ... execution time using a cycle-true simulator. ... unrolling factor stepwise resulting in the unrolled loop that exceeded ... I expected to get a performance decrease, i.e. the stronger the loop ...
    (comp.compilers)
  • Re: Share Your Experience with 3DNow, SSE, SSE2 etc.
    ... Cache behaviour can be rather quirky at times. ... And then try loop unrolling. ... I would only optimise at this sort of low level as a last resort. ... I would have to use 2D float arrays with all rows ...
    (sci.image.processing)