Re: That's not the Answer to Your Question
- From: "Martin Brown" <|||newspam|||@nezumi.demon.co.uk>
- Date: 16 Apr 2007 01:36:11 -0700
On Apr 13, 5:12 pm, "aruzinsky" <aruzin...@xxxxxxxxxxxxxxx
cathexis.com> wrote:
The context ofyourquestionwas:
"I was suggested to try a block processing approach, whereby the image is
stored by blocks rather than by rows."
Yourallegedansweris:
Original: matrix vector multiplication
for (i=0; i<N; i++)
for (j=0; j<N; j++)
c[i] = c[i]+ a[i,j]*b[i];
After loop tiling 2*2:
for (i=0; i<N; i+=2)
for (j=0; j<N; j+=2)
for (ii=i; ii<min(i+2,N); ii++)
for (jj=j; jj<min(j+2,N); jj++)
c[ii] =c[ii]+ a[ii,jj]*b[ii];
There was no storage of the image in blocks therefore this isnotananswertoyourquestion.
You are displaying your ignorance here. And also annoyingly by
modifying the header you break the threading on some newsreaders that
don't handle followups too elegantly. It isn't the storage of the
image that is blocked it is the *access* to the stored image that is
done in blocks with the blocksize carefully chosen to minimise cache
misses.
Without bothering to check the algorithmic details above substitute K
for 2 in the above. Where typically
K ~ MIN( sqrt(PRIMARY_CACHE/TSIZE)), SECONDARY_CACHE/(TSIZE*2*N) )
perhaps with an extra factor of two thrown in depending on the
algorithm. If it needs two working copies or other intermediate
storage. The whole object of the excercise is to keep memory
references to local blocks or stripes.
The exact choice of K may need some fine tuning if optimum performance
is to be obtained. One notable IBM architecture had a quirk that
rendered 2^N FFT algorithms painfully slow unless they were tweaked.
Furthermore, this was a badanswerto anyquestionbecause the example
code is obviously poor. A typical programmer would have exchanged the
ordering of the i and j and ii and jj loops.
And that is why a typical programmer would be hopelessly wrong in this
case. The sample code looks like it is originally from Fortran style
notation where [i,j] accesses element i+j*N
Also, min(j+2,N)and
min(i+2,N) are unnecessarily recalculated for every block.
Very few optimising compilers are quite so dumb but there are better
ways to cater for odd N.
Since you have exhibited a lack of appreciation of fundamental details, I
suspect thatyourcode runs slowly for similar reasons.
The lack of appreciation of the essential factors involved in block
processing is entirely yours.
Regards,
Martin Brown
.
- Follow-Ups:
- Re: That's not the Answer to Your Question
- From: aruzinsky
- Re: That's not the Answer to Your Question
- References:
- Block processing
- From: vectorizor
- Re: Block processing
- From: vectorizor
- That's not the Answer to Your Question
- From: aruzinsky
- Block processing
- Prev by Date: A list of most useful books on Medical Image Analysis?
- Next by Date: Re: Super resolution reconstruction for medical images?
- Previous by thread: Also
- Next by thread: Re: That's not the Answer to Your Question
- Index(es):
Relevant Pages
|