The difference is caused by the same super-alignment issue from the following related questions:
- Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
- Matrix multiplication: Small difference in matrix size, large difference in timings
But the alignment only hurts here because there's one other problem with the code.
Starting from the original loop:
    for(i=1;i<SIZE-1;i++)
        for(j=1;j<SIZE-1;j++) {
            res[j][i]=0;
            for(k=-1;k<2;k++)
                for(l=-1;l<2;l++)
                    res[j][i] += img[j+l][i+k];
            res[j][i] /= 9;
        }
First notice that the two inner loops are trivial. They can be unrolled as follows:
    for(i=1;i<SIZE-1;i++) {
        for(j=1;j<SIZE-1;j++) {
            res[j][i]=0;
            res[j][i] += img[j-1][i-1];
            res[j][i] += img[j  ][i-1];
            res[j][i] += img[j+1][i-1];
            res[j][i] += img[j-1][i  ];
            res[j][i] += img[j  ][i  ];
            res[j][i] += img[j+1][i  ];
            res[j][i] += img[j-1][i+1];
            res[j][i] += img[j  ][i+1];
            res[j][i] += img[j+1][i+1];
            res[j][i] /= 9;
        }
    }
That leaves the two outer loops, which are the ones we're interested in.
Now we can see that the problem is the same as in this question: Why does the order of the loops affect performance when iterating over a 2D array?
You are iterating the matrix column-wise instead of row-wise.
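To put a number on "column-wise", here is a minimal sketch of how far apart two consecutive inner-loop accesses land in memory. The helper name `inner_step_bytes` is hypothetical, and the 4-byte element size used in the note below is an assumption (the element type isn't shown above).

```c
#include <stddef.h>

/* Byte distance between two consecutive inner-loop accesses to a
   row-major array declared as a[rows][cols] and indexed as a[j][i].
   inner_is_j != 0 models the original loop order (j innermost). */
size_t inner_step_bytes(size_t cols, size_t elem_size, int inner_is_j)
{
    if (inner_is_j)
        return cols * elem_size;  /* j+1 jumps a whole row ahead */
    return elem_size;             /* i+1 moves to the adjacent element */
}
```

With `cols = 8192` and 4-byte elements, the original order steps 32768 bytes per iteration, while the row-wise order steps 4 bytes.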
To solve this problem, you should interchange the two loops.
    for(j=1;j<SIZE-1;j++) {
        for(i=1;i<SIZE-1;i++) {
            res[j][i]=0;
            res[j][i] += img[j-1][i-1];
            res[j][i] += img[j  ][i-1];
            res[j][i] += img[j+1][i-1];
            res[j][i] += img[j-1][i  ];
            res[j][i] += img[j  ][i  ];
            res[j][i] += img[j+1][i  ];
            res[j][i] += img[j-1][i+1];
            res[j][i] += img[j  ][i+1];
            res[j][i] += img[j+1][i+1];
            res[j][i] /= 9;
        }
    }
The inner loop now walks each row in memory order. This eliminates the non-sequential access completely, so you no longer get random slow-downs at large powers of two.
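If you want to try the row-wise order outside the original program, a self-contained version might look like the sketch below. The function name `blur_row_major`, the flat `float` buffers, and the runtime size `n` are my assumptions, not the original declarations. It also sums the nine neighbours in a different order than the loops above, so floating-point results can differ in the last bits.

```c
#include <stddef.h>

/* 3x3 box blur over the interior of an n x n row-major image.
   Border pixels of res are left untouched. */
void blur_row_major(const float *img, float *res, size_t n)
{
    for (size_t j = 1; j + 1 < n; j++) {          /* rows: outer loop */
        const float *up   = img + (j - 1) * n;    /* row above */
        const float *mid  = img + j * n;          /* current row */
        const float *down = img + (j + 1) * n;    /* row below */
        for (size_t i = 1; i + 1 < n; i++) {      /* columns: inner loop */
            float sum = up[i - 1]   + up[i]   + up[i + 1]
                      + mid[i - 1]  + mid[i]  + mid[i + 1]
                      + down[i - 1] + down[i] + down[i + 1];
            res[j * n + i] = sum / 9;
        }
    }
}
```

All three row pointers advance through memory sequentially as `i` increases, which is the access pattern the interchange is meant to produce.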
Core i7 920 @ 3.5 GHz

Original code:

    8191: 1.499 seconds
    8192: 2.122 seconds
    8193: 1.582 seconds

Interchanged Outer-Loops:

    8191: 0.376 seconds
    8192: 0.357 seconds
    8193: 0.351 seconds
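The 8192 outlier in the original timings is what you'd expect if a column walk keeps landing in the same cache set. The sketch below illustrates that under assumed parameters: 64-byte lines, 64 sets (for example a 32 KB, 8-way L1 data cache), and 4-byte elements. None of these were measured here, and the helper names are hypothetical. Real hardware has several cache levels plus prefetching and TLB effects, so treat this as a rough model only.

```c
#include <stddef.h>

/* Assumed cache geometry, for illustration only. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Set index that a given byte offset maps to. */
size_t cache_set(size_t byte_offset)
{
    return (byte_offset / LINE_BYTES) % NUM_SETS;
}

/* Number of distinct sets touched when reading one element from each
   of `rows` consecutive rows (i.e. walking down a single column). */
size_t distinct_sets(size_t row_elems, size_t elem_size, size_t rows)
{
    unsigned char seen[NUM_SETS] = {0};
    size_t count = 0;
    for (size_t r = 0; r < rows; r++) {
        size_t s = cache_set(r * row_elems * elem_size);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, walking 8192 rows of a 8192-wide array touches a single set, while widths of 8191 or 8193 spread the same walk across all 64 sets. After interchanging the loops the walk is sequential, so the row width stops mattering.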