Now that the final scroller is finally running at an acceptable speed, the second most pressing issue is the Mandelbrot fractal part.
Optimizing a scroller is just a matter of finding a fast way to copy things around, but optimizing a fractal is much more complicated, due to the limitations of the 6502 processor when it comes to doing mathematics.
## Theoric Magazine
The idea of having a fractal came from an old Oric magazine called Theoric.
One of the issues contained a small BASIC program which could render arbitrary parts of the Mandelbrot set, in full color.
It was obviously terribly slow, but you could just change the parameters and let it run during the night, and on the next morning you would have your beautiful fractal on the screen ready to be saved to tape :)
Just for the sake of completeness, here is the complete original[1] source code:
The program is not very complicated:
0 PAPER 0:INK 4
10 HIRES:DOKE #306,#FFFF
11 PAPER 0:INK 3
20 FOR I=1 TO 199 STEP 2
30 CURSET 0,I,3:FILL 1,1,16+4
40 CURSET 6,I-1,3:FILL 1,1,1
60 FOR X=11 TO 237 STEP 2
70 FOR Y=0 TO 199 STEP 2
83 UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3)
90 IF N AND 1 THEN CURSET X,Y,1
91 IF N AND 2 THEN CURSET X+1,Y,1
92 IF N AND 4 THEN CURSET X,Y+1,1
93 IF N AND 8 THEN CURSET X+1,Y+1,1
130 DOKE #306,10000:END
- Line 15 defines the Mandelbrot set to compute
- Lines 20-50 set some attributes on the screen to achieve the color effects
- Line 60 is the outer loop iterating horizontally on the screen
- Line 70 is the second loop iterating vertically on the screen
- Lines 79-83 compute the actual color of the point at this location
- Lines 90-93 draw some points to achieve the dithering effect based on the point color
An important element to notice is the STEP 2 on each of the FOR loops: we are actually computing only a quarter of the total number of points, because each point is then rendered on screen as a 2 by 2 pixel block to achieve the nice colored dithering effect.
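The escape-time computation and the 2x2 dithering can be sketched in C. This is a hedged reconstruction: the iteration limit, the bailout value D3 and the exact constants of the Theoric listing are assumptions, since not all of its lines are visible above.

```c
#include <math.h>

/* A sketch of the escape-time loop the BASIC listing implements.
   max_iter and the bailout threshold d3 are assumptions. */
static int mandel_depth(double cx, double cy, int max_iter, double d3)
{
    double xn = 0.0, yn = 0.0;
    int n = max_iter;                /* N counts DOWN, as in the BASIC */
    do {
        double xt = xn * xn - yn * yn + cx;
        yn = 2.0 * xn * yn + cy;
        xn = xt;
        n--;
    } while (n != 0 && fabs(xn) + fabs(yn) <= d3);
    return n;                        /* 0 means "never escaped" */
}

/* One depth value covers a 2x2 pixel block: bits 0 to 3 of the
   remaining counter select which of the four pixels are lit, just
   like BASIC lines 90-93 do with CURSET. */
static void dither_2x2(int n, int out[2][2])
{
    out[0][0] = (n & 1) != 0;   /* (x,   y)   */
    out[0][1] = (n & 2) != 0;   /* (x+1, y)   */
    out[1][0] = (n & 4) != 0;   /* (x,   y+1) */
    out[1][1] = (n & 8) != 0;   /* (x+1, y+1) */
}
```

A point deep inside the set runs the counter all the way down to 0, so all four pixels stay dark; points near the border end up with various bit patterns, which is what produces the dithering.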
Something else to notice is that in this particular case the image is symmetrical along the horizontal axis, so a simple mirroring could have achieved a near doubling in performance... but that's definitely cheating!
Let's keep that as an option for later, if we really need it.
If you are interested in the topic, you can read the very long Wikipedia page about the Mandelbrot set (in this particular case we are particularly interested in the section called Computer drawings), and there are also many different implementations in various languages on the Rosetta Code site.
## Optimization strategies
Until we profile the code we will not actually know where most of the time is spent, but we can already notice a few things:
- The BASIC variable format uses floating-point representation; do we need that?
- What is the range of the variables? Do they fit in one byte? Can they get negative?
- There are two multiplications and two divisions in the inner loop[2]...
- The loop exit test performs absolute value computations
- Drawing 2x2-sized pixels requires additional masking when blitting them individually
The recursive subdivision is a common optimization seen in fractal rendering code; it is based on the observation that if all the points on the border of a section of the Mandelbrot set[3] have the same color (depth), then all the points inside ALSO have that same depth, and thus do not require additional calculation.
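The border trick can be sketched in C. Everything here is a hypothetical stand-in for the real demo code: depth_fn abstracts the per-point escape-time routine, and demo_depth is an artificial two-band depth function for illustration.

```c
#define W 32
#define H 32

typedef int (*depth_fn)(int x, int y);

/* Hypothetical depth function for demonstration: two vertical bands. */
static int demo_depth(int x, int y) { (void)y; return x < W / 2 ? 1 : 2; }

/* Returns 1 if all border points of the w*h rectangle at (x0,y0)
   share the same depth, storing that depth in *d. */
static int border_uniform(depth_fn f, int x0, int y0, int w, int h, int *d)
{
    int first = f(x0, y0);
    for (int x = x0; x < x0 + w; x++)
        if (f(x, y0) != first || f(x, y0 + h - 1) != first) return 0;
    for (int y = y0; y < y0 + h; y++)
        if (f(x0, y) != first || f(x0 + w - 1, y) != first) return 0;
    *d = first;
    return 1;
}

static void render_rect(depth_fn f, int out[H][W], int x0, int y0, int w, int h)
{
    int d;
    if (w <= 2 || h <= 2) {                /* too small: brute force */
        for (int y = y0; y < y0 + h; y++)
            for (int x = x0; x < x0 + w; x++)
                out[y][x] = f(x, y);
    } else if (border_uniform(f, x0, y0, w, h, &d)) {
        for (int y = y0; y < y0 + h; y++)  /* uniform border: fill */
            for (int x = x0; x < x0 + w; x++)
                out[y][x] = d;
    } else {                               /* split into four quadrants */
        int hw = w / 2, hh = h / 2;
        render_rect(f, out, x0,      y0,      hw,     hh);
        render_rect(f, out, x0 + hw, y0,      w - hw, hh);
        render_rect(f, out, x0,      y0 + hh, hw,     h - hh);
        render_rect(f, out, x0 + hw, y0 + hh, w - hw, h - hh);
    }
}
```

Note that this naive version recomputes the shared corner and border points at each recursion level, which hints at where such an implementation can waste time.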
I did the same thing as with the scroller: I modified the demo so it jumps directly to the fractal part, without playing the music.
Something worth noting is that without the music playing, the fractal finishes shortly after displaying the WRATHDESIGNS group name, skipping the WWW.C64.ORG and WWW.ORIC.ORG names.
Indeed, comparing the computation time with and without the music gives a running time of 1:30 without the music and 1:40 with it.
Not a huge difference, but worth keeping in mind when doing adjustments at the end, if we want the music to be in sync with the effects for example.
So let's see how the demo code is written, to find out where to optimize.
## Original implementation
The Mandelbrot computation is called from main.c:
The file mandel.s contains the actual assembler code:
This routine first computes the top, bottom, left and right side values, and then calls a recursive routine.
While writing this article I noticed that some of the values seem to be computed more than once: When drawing a rectangle, the corner points are shared, and in this case it looks like the point (0,0) at least is computed more than once. If this happens on each recursive level, that may end up being a significant amount of time.
Another obvious thing is that despite being computed, we do not see the left vertical column being displayed!
So maybe we are computing more pixels than we are actually able to draw on screen...
Something else to check is whether the intermediate variables use zero page locations or normal memory:
_x .dsb 1
_y .dsb 1
_w .dsb 1
_h .dsb 1
_x1 .dsb 2
_y1 .dsb 2
_xn .dsb 2
_yn .dsb 2
Everything seems alright; we are indeed using zero page for all these variables.
## Horizontal drawing
So, let's see how these subroutines are implemented:
The lda _w/pha/pla/sta _w sequence seems peculiar, but then we have to remember that the code performs recursive calls, so it makes sense to save and restore the original Width and Height values, else Bad Things Would Happen.
sta (big_buff),y ; Store pixel in buffer
The rest of the code seems reasonable: it first calls _Mandel_ComputeParameters to initialize some parameters, then iterates over each of the values along the requested _w parameter, for each calling _compute_outer and then storing the result in an intermediate buffer.
Let's see what these two subroutines look like... time to dig down the rabbit hole.
## Span initialization
What this routine does is not very complicated: it takes the vertical position (passed in the _y variable) and uses it to compute which scanline of the buffer will store the result of the computations; it also computes the X1=X-100:Y1=Y-100 part of the BASIC program, except that here 100 was replaced by 64, because we prefer powers of two when writing assembler code.
#define MANDEL_X_POS 32
#define MANDEL_Y_POS 92
clc ; Compute X1=-64+(x*2)
clc ; Compute y1=-64+(y*2)
This routine can most probably be improved, considering all the initial values are 8 bit, but since it is not part of the inner loop this would seem to be a premature optimization to me.
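Read from the "Compute X1=-64+(x*2)" comments in the fragments above, the mapping amounts to the following (a sketch; the exact scale and offset are assumptions based on those comments):

```c
/* Map a pixel coordinate to its iteration starting value: the BASIC
   used X1=X-100, the assembler replaces the offset by 64, a power
   of two, and folds in the STEP 2 scaling. */
static int span_coord(int pixel) { return 2 * pixel - 64; }
```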
## Iteration loop
This code is pretty much a straight conversion of the original BASIC code UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3).
#define MANDEL_ITER_BIS 19
#define MAX_MANDEL_ITER 512
lda _xn+1 ; Compute _axn=ABS(xn)
lda _yn+1 ; Compute _ayn=ABS(yn)
// sum axn+ayn => axn
// if axn+ayn>d3 exit loop
First it calls _compute_inner, then it checks whether we have reached MANDEL_ITER_BIS (the end of the iteration), or whether the computed _xn and _yn values get larger than MAX_MANDEL_ITER.
The reason the code is so long compared to what it actually does is that the 6502 only has 8-bit registers and no instruction to compute an absolute value, which means we have to manually check the sign of the most significant byte and, if needed, perform a subtraction from 0 to get the positive value.
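In C terms, the manual absolute value boils down to this (a sketch of the logic, not the actual routine):

```c
#include <stdint.h>

/* Test the sign bit of the most significant byte, and negate
   (subtract from 0, two's complement) when the value is negative. */
static uint16_t abs16(uint16_t v)
{
    if (v & 0x8000)                  /* MSB sign bit set: negative */
        return (uint16_t)(0 - v);    /* 0 - v gives the positive value */
    return v;
}
```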
Let's continue digging...
## Depth computation
Now we are talking.
clc ; Compute d=xn+yn
sec ; Compute b=xn-yn
lda _xn ; Compute c=xn*yn
ldx _c+1 ; Compute yn=c/d2-y1
lda _d ; Compute xn=d*b/d1-x1
This routine is both long and full of slow code (like calling mul16i twice).
It is called for every single point, up to 20 times per point, so any improvement made there will have a huge impact on the total running time.
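Following the comments in the fragments above, the inner step computes d=xn+yn, b=xn-yn and c=xn*yn, then derives xn=d*b/d1-x1 and yn=c/d2-y1. Since d*b equals xn²-yn² and c equals xn*yn, this is the classic z → z² + c iteration in fixed point, with the scale folded into the divisors. A C sketch; the values of D1 and D2 and the sign conventions are assumptions based on those comments:

```c
#include <stdint.h>

/* One inner-loop step in 16-bit fixed point. D1 and D2 fold the
   fixed-point scale (and the factor 2 of 2*xn*yn) into the divisions;
   64 and 32 are assumed values, not taken from the demo source. */
#define D1 64
#define D2 32

static void mandel_step(int16_t *xn, int16_t *yn, int16_t x1, int16_t y1)
{
    int32_t d = *xn + *yn;              /* d = xn + yn */
    int32_t b = *xn - *yn;              /* b = xn - yn */
    int32_t c = (int32_t)*xn * *yn;     /* c = xn * yn (the slow multiply) */
    *xn = (int16_t)(d * b / D1 - x1);   /* (xn^2 - yn^2) / scale - x1 */
    *yn = (int16_t)(c / D2 - y1);       /* 2 * xn * yn / scale - y1 */
}
```

The two products d*b and c are exactly the two mul16i calls the text mentions, which is why this step dominates the running time.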
## Too Smart???
So, we know that this particular implementation uses a recursive divide-and-conquer approach to speed up calculations, by not having to recompute pixels belonging to the same depth level; but do we actually know if this improved the computation time?
Sure, from a mathematical point of view we can expect better results, but in the real world it means we may have to compute any particular point of the Mandelbrot set in any order, which means we can't really use the usual optimizations: if you iterate horizontally or vertically, you can just move a pointer forward instead of having to recompute the position in the buffer.
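The pointer trick, sketched in C (buffer size and fill pattern are arbitrary): both loops fill the buffer identically, but the second avoids recomputing y * pitch + x for every pixel, which on a 6502 means a 16-bit multiply or a table lookup each time.

```c
#define PITCH 40
#define LINES 8

static void fill_indexed(unsigned char *buf)
{
    for (int y = 0; y < LINES; y++)
        for (int x = 0; x < PITCH; x++)
            buf[y * PITCH + x] = (unsigned char)(x ^ y); /* recompute offset */
}

static void fill_pointer(unsigned char *buf)
{
    unsigned char *p = buf;                  /* position kept in a pointer */
    for (int y = 0; y < LINES; y++)
        for (int x = 0; x < PITCH; x++)
            *p++ = (unsigned char)(x ^ y);   /* just step forward */
}
```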
So what about first hacking together a function that just tries to fill the screen from top to bottom, to see if there is any speed difference?
As you can see this is ultra basic code: I did not change anything else; all I did was call the horizontal drawing loop code 64 times.
jsr _Mandel_HLine ; h_line(0,0,largeur,XO,YO,PAS,buf,screen,pitch);
As expected, the performance is not awesome (1 minute and 47 seconds), especially toward the bottom part (which was also slow in the original demo), but it could have been much worse, especially considering that we call the fullscreen buffer blit after each scanline!
## Keep It Simple Stupid
There are clear benefits to using simpler code: it is generally much easier to optimize and debug.
So I continued down the KISS[4] road: I modified the blit routine to display only the scanline we just computed (instead of blitting the entire buffer), and then adjusted the computed width from 128 to 114 pixels (the actually visible part).
This made the whole computation time sink to 1 minute and 20 seconds.
Basically, what that means is that the whole convoluted recursive shebang is not actually faster on this particular set; or more exactly, the overhead of the complicated recursive code is larger than the performance boost gained by skipping some of the computations.
Another possibility is that there is a bug in the recursive code, but since that code is quite complex, I think changing it to just draw horizontal lines is probably the way to go!
I guess I'll have to dig a bit more, and who knows, maybe there will be a part III to this article.
[1] I added spaces and tabs for readability.
[2] The inner-most loop is where most of the time is spent, so always check there first: any optimisation done there is multiplied by the number of iterations made by all the outer calls, which can be very many.
[3] On the borders of the set, this is not valid for the bounding box that contains the set itself.
[4] See the title; it's a common idiom in programming, opposing complexity to simplicity.