Re: That's not the Answer to Your Question



On Apr 13, 5:12 pm, "aruzinsky" <aruzin...@xxxxxxxxxxxxxxx
cathexis.com> wrote:
The context ofyourquestionwas:

"I was suggested to try a block processing approach, whereby the image is
stored by blocks rather than by rows."

Yourallegedansweris:

Original: matrix vector multiplication

for (i=0; i<N; i++)
for (j=0; j<N; j++)
c[i] = c[i]+ a[i,j]*b[i];

After loop tiling 2*2:

for (i=0; i<N; i+=2)
for (j=0; j<N; j+=2)
for (ii=i; ii<min(i+2,N); ii++)
for (jj=j; jj<min(j+2,N); jj++)
c[ii] =c[ii]+ a[ii,jj]*b[ii];

There was no storage of the image in blocks therefore this isnotananswertoyourquestion.

You are displaying your ignorance here. And also annoyingly by
modifying the header you break the threading on some newsreaders that
don't handle followups too elegantly. It isn't the storage of the
image that is blocked it is the *access* to the stored image that is
done in blocks with the blocksize carefully chosen to minimise cache
misses.

Without bothering to check the algorithmic details above substitute K
for 2 in the above. Where typically

K ~ MIN( sqrt(PRIMARY_CACHE/TSIZE)), SECONDARY_CACHE/(TSIZE*2*N) )

perhaps with an extra factor of two thrown in depending on the
algorithm. If it needs two working copies or other intermediate
storage. The whole object of the excercise is to keep memory
references to local blocks or stripes.

The exact choice of K may need some fine tuning if optimum performance
is to be obtained. One notable IBM architecture had a quirk that
rendered 2^N FFT algorithms painfully slow unless they were tweaked.

Furthermore, this was a badanswerto anyquestionbecause the example
code is obviously poor. A typical programmer would have exchanged the
ordering of the i and j and ii and jj loops.

And that is why a typical programmer would be hopelessly wrong in this
case. The sample code looks like it is originally from Fortran style
notation where [i,j] accesses element i+j*N

Also, min(j+2,N)and
min(i+2,N) are unnecessarily recalculated for every block.

Very few optimising compilers are quite so dumb but there are better
ways to cater for odd N.

Since you have exhibited a lack of appreciation of fundamental details, I
suspect thatyourcode runs slowly for similar reasons.

The lack of appreciation of the essential factors involved in block
processing is entirely yours.

Regards,
Martin Brown

.



Relevant Pages

  • Re: Inventing store 1.12Gb raw video in 5Mb!
    ... contains only n kilo byte on an external storage medium. ... n time the character A, a, up to and including Z, z. ... with this calculation by means of an unique algorithm. ...
    (comp.compression)
  • Distributed storage. Security attributes and ducumentation update.
    ... I'm pleased to announce third release of the distributed storage ... * mirror algorithm has been moved from per-page to per-sector dirty ... implement optional saving of mirroring/linear information on the remote ...
    (Linux-Kernel)
  • Re: Distributed storage. Security attributes and ducumentation update.
    ... I'm pleased to announce third release of the distributed storage ... * mirror algorithm has been moved from per-page to per-sector dirty ... implement optional saving of mirroring/linear information on the remote ...
    (Linux-Kernel)
  • Distributed storage. Bug fixes.
    ... I'm pleased to announce the fourth release of the distributed storage ... implement optional saving of mirroring/linear information on the remote ... new redundancy algorithm (complex) ...
    (Linux-Kernel)
  • Distributed storage. Move away from char device ioctls.
    ... I'm pleased to announce fourth release of the distributed storage ... implement optional saving of mirroring/linear information on the remote ... +are placed into storage by appropriate algorithm, ...
    (Linux-Kernel)