Re: That's not the Answer to Your Question



On Apr 17, 2:40 pm, vectorizor <vectori...@xxxxxxxxxxxxxx> wrote:
ooppss, I cliked the wrong button!

// img stored row-wise, i.e. img[row*width + col]

Ironically it would actually run a bit faster if you had used the
indexing method above.

C support for multi-dimensional arrays is rubbish. All the img[ ][ ]
construct does here is waste cache lines by holding an array of
pointers to the start of each line of the image. And it gets much much
worse in higher dimensionals.

// horizontal pass
for row=0:height
for col=0:width-1
img[row][col] = 0.2f*img[row][col] + 0.2f*img[row][col+1];

// vertical pass
for col=0:width
for row=0:height-1
img[row][col] = 0.2f*img[row][col] + 0.2f*img[row+1][col];

this is plain vanilla C, piece of cake to re-create it.

Slightly odd syntax but your meaning is fairly clear. But beware of
fence post errors! C arrays index [0..N-1] whereas classic FORTRAN
goes from 1,N and you seem to be using a mixture of notation here.

You are probably unwittingly accessing off the end of the array and
mixing in the first element of the next line with the last element of
the previous one. It isn't clear what type img is, but if it is byte
or word length you could consider changing 0.2f into a suitable
integer mulitply, add to round and divide by power of two with slight
loss of accuracy to rounding errors. Converting integers to floating
point and back is expensive.

The code has an avoidable multiple access to the same cache line in
the inner loop of the horizontal code which will stall the processor
unless you have a very clever optimising compiler. Intels profile
directed C compiler might just be up to it but most others will need a
hint. And it is likely that writing the output to a new destination
array in transposed order would be faster still.

Also, this is only an excerpt. Some obvious optimisation could be
perfomed, but due to storage requirements, the whole program has to be
written that way. This nevertheless clearly demonstrates the issue.

As indeed it will. It would be difficult to write the second vertical
loop in a more cache unfriendly way. C lacks convenient
multidimensional array descriptors and so you have to roll your own by
hand because the compiler can't do the book keeping for you.

Rewriting the vertical loop for cache optimality and minimum use of
external memory bandwidth gives:

(unchecked and bound to have some typos)

const BLOCK = 32; // try also 64, 128, 256 (128 probably
best on a P4)
imgsize = width*height;
for (stripe = 0; stripe < width-BLOCK; stripe+= BLOCK)
for (col = stripe; col < stripe+BLOCK-1; col++)
{
x = img[col];
for (row = col; row < imgsize; row+= width)
{
y = img[row+width];
img[row] = 0.2f*(x+y);
x = y;
}
}

// to change to another ouput array dest[imgsize] change " img[row] =
" to " *dest++ = "
// This code is given only as an example with no warrantee as to
fitness for purpose.

NB given how simple the kernel is in this case you could swap the
order of the two innermost loops probably with a speed advantage.

If you are superstitious about expressions in for loops you can
compute them outside, but I have never seen a C compiler that couldn't
make this trivial optimisation (even some basic interpretters can
manage it!). Making it work and rewriting the horizontal case is left
as an excercise to the reader.

An Intel white paper also suggests you can exploit register colouring
by deliberately unrolling integer operations in the loop but I have
never tried it myself. I can't find the URL for that paper instantly
but for more general Intel specific optimisation advice see for
example:

http://www.intel.com/software/products/compilers/clin/docs/main_cls/mergedprojects/optaps_cls/common/optaps_perf_optstrats.htm

And related links on the site. (it also contains info that you will
need to fix up the sample code I gave which does not handle the last
partial stripe if width is not an exact multiple of BLOCK)

Worth making sure the start of the img array is properly aligned with
cache size too.
You might also consider using some of the MMX and SSE extensions if
your code need not be portable.

It is worthwhile running a profiler on the final code to see where the
remaining bottlenecks are if you really need to get it as fast as
possible. The 80:20 rule applies and there is no point making random
optimisations as some have advocated.

BTW how big is your image?

Regards,
Martin Brown

.


Quantcast