Related tags
History
 ← Overlay Memory Dungeon Master intro → Optimizing Buggy Boy (2)
Wed 13th May 2020
Optimizing Buggy Boy Part 1 Part 2

Now that the final scroller is finally running at an acceptable speed, the second most pressing issue is the Mandelbrot fractal part.

Optimizing a scroller is just a matter of finding a fast way to copy things around, but optimizing a fractal is much more complicated due to the limitations of the 6502 processor regarding doing mathematics.

## Theoric Magazine

The idea of having a fractal came from an old Oric Magazine called Theoric.

One of the issues contained a small BASIC program which could render arbitrary parts of the Mandelbrot set, in full color.

It was obviously terribly slow, but you could just change the parameters and let it run during the night, and on the next morning you would have your beautiful fractal on the screen ready to be saved to tape :) Just for the sake of completeness here is the complete original1 source code:
`  0 PAPER 0:INK 4 10 HIRES:DOKE #306,#FFFF 11 PAPER 0:INK 3 15 PR=19:D1=64:D2=32:D3=512:D4=15 20 FOR I=1 TO 199 STEP 2 30   CURSET 0,I,3:FILL 1,1,16+4 40   CURSET 6,I-1,3:FILL 1,1,1 50 NEXT 60 FOR X=11 TO 237 STEP 2 70   FOR Y=0 TO 199 STEP 2 79     N=PR:X1=X-100:Y1=Y-100:XN=X1:YN=Y1 80     REPEAT 81       D=XN+YN:B=XN-YN:C=XN*YN 82       XN=D*B/D1-X1:YN=C/D2-Y1:N=N-1 83     UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3) 90     IF N AND 1 THEN CURSET X,Y,1 91     IF N AND 2 THEN CURSET X+1,Y,1 92     IF N AND 4 THEN CURSET X,Y+1,1 93     IF N AND 8 THEN CURSET X+1,Y+1,1120   NEXT125 NEXT130 DOKE #306,10000:END`
The program is not very complicated:
• Line 15 defines the Mandelbrot set to compute
• Lines 20-50 sets some attributes on the screen to achieve the color effects
• Line 60 is the outer loop iterating horizontally on the screen
• Line 70 is the second loop iterating vertically on the screen
• Lines 79-83 compute the actual color of the point at this location
• Lines 90-93 draw some points to achieve the dithering effect based on the point color
One doe not need a PhD to figure out that the core of the performance is spent in the REPEAT-UNTIL section2!

An important element to notice is the STEP 2 on each of the FOR loops: We are actually computing only a quarter of the total number of points, because each of the point is then rendered on screen using a 2 by 2 pixel matrix to achieve the nice colored dithering effect.

Something else to notice is that in this particular case the image is symmetrical along the horizontal axis, so a simple symmetry could have achieved a near doubling in performance... but that's definitely cheating:

Let's keep that as an option for later if we really need to.

If you are interested by the topic, you can read the very long wikipedia page about the Mandelbrot set (in this particular case we are particularly interested in the section called
Computer drawings), and there are also many different implementations in various languages on the Rosetta Code site.

## Optimization strategies

Until we profile the code we will not actually know where most of the time is spent, but we can already notice a few things:
• The BASIC variable format uses floating point representation, do we need that?
• What is the range of the variables, do they fit on one byte, do they get negative?
• There are two multiplications and two divisions in the inner loop...
• The loop test performs absolute value computations
• Drawing 2x2 sized pixels requires additional masking when blitting individually
If you have seen the demo, you would also probably have noticed that the fractal was not rendered linearly row by row or column by column, instead if appeared as some kind of recursively drawn squares:

This is a common optimization seen in fractal rendering code, it's based on the observation that if all the points on the border of a section of the Mandelbrot3 set have the same color (depth) then all the points inside ALSO have the same depth, thus do not require additional calculation.

I did the same thing as I did with the scroller, and modified the demo so it jumped directly to the fractal part, without playing the music.

Something worth noting is that without the music playing, the fractal finishes just a fraction of time after displaying the WRATHDESIGNS group name, skipping the WWW.C64.ORG and WWW.ORIC.ORG names.

Indeed, comparing the computation time with and without the music gives us a running time of 1:30 without and 1:40 with the music.

Not a huge difference, but worth keeping in mind when doing adjustments at the end if we want the music to be in sync with the effects for example.

So let see how the demo code is written so we can see where to optimize.

## Original implementation

The mandelbrot computation is called from main.c:
`	//	// Mandelbrot	//Mandelbrot:				SwitchToHires();	ClearHiresScreen();	Mandel_InitDisplay();	Mandel_DrawFractal();`
The file mandel.s contains the actual assembler code:
`_Mandel_DrawFractal	lda #0	sta _x	lda #0	sta _y	lda #128	sta _w	jsr _Mandel_HLine	lda #0	sta _x	lda #64	sta _y	lda #128	sta _w	jsr _Mandel_HLine	lda #0	sta _x	lda #0	sta _y	lda #64	sta _h	jsr _Mandel_VLine		lda #128	sta _x	lda #0	sta _y	lda #64	sta _h	jsr _Mandel_VLine	jsr _Mandle_BlitBigBuffer	lda #0	sta _VblCounter	lda #0	sta _x	lda #0	sta _y	lda #128	sta _w	lda #64	sta _h	jsr _Mandel_Divide	jsr _Mandle_BlitBigBuffer	rts`
What this routine does is to first compute the top, bottom, left and right side values, and then call a recursive routine.

While writing this article I noticed that some of the values seem to be computed more than once: When drawing a rectangle, the corner points are shared, and in this case it looks like the point (0,0) at least is computed more than once. If this happens on each recursive level, that may end up being a significant amount of time.

Another obvious thing is that despite being computed, we do not see the left vertical column being displayed!

So maybe we are actually computing more pixels than we are actually able to draw on screen...

Something else to check is if intermediate variables are using zero page locations or normal memory:
`	.zero_x			.dsb 1_y			.dsb 1_w			.dsb 1_h			.dsb 1_x1			.dsb 2_y1			.dsb 2_xn			.dsb 2_yn			.dsb 2`
Everything seems alright, we are indeed using zero page for all these variables.

## Horizontal drawing

So, let see how these subroutines are implemented:
`_Mandel_HLine	jsr _Mandel_ComputeParameters	lda _w	phaloop_x	clc	lda _x1+0	sta _xn+0	adc #2	sta _x1+0	lda _x1+1	sta _xn+1	adc #0	sta _x1+1	lda _y1+0	sta _yn+0	lda _y1+1	sta _yn+1	jsr _compute_outer	lda _n	ldy _x	sta (big_buff),y   ; Store pixel in buffer	inc _x	dec _w	bne loop_x	pla	sta _w	rts`
The lda _w/pha/pla/sta _w seem peculiar, but then we have to remember that the code is doing recursive calls, so it makes sense to store the original Width and Height values, else Bad Things Would Happen.

The rest of the code seems reasonable, first calling _Mandel_ComputeParameters to initialize some parameters, then iterate for each of the values along the requested _w parameters, for each calling _compute_outer and then storing the result in some intermediate buffer.

Let see how these two sub-routines looks like... dig down into the rabbit hole.

## Span initialization

`#define MANDEL_X_POS	32#define MANDEL_Y_POS	92_Mandel_ComputeParameters	ldy _y	lda _BufferAddrLow,y	sta big_buff+0	lda _BufferAddrHigh,y	sta big_buff+1			clc        ; Compute X1=-64+(x*2)	lda _x	rol	sta _x1+0	lda #0	rol	sta _x1+1  	sec	lda _x1+0	sbc #<MANDEL_X_POS	sta _x1+0	lda _x1+1	sbc #>MANDEL_X_POS	sta _x1+1		clc        ; Compute y1=-64+(y*2)	lda _y	rol	sta _y1+0	lda #0	rol	sta _y1+1  	sec	lda _y1+0	sbc #<MANDEL_Y_POS	sta _y1+0	lda _y1+1	sbc #>MANDEL_Y_POS	sta _y1+1	rts`
What this routine does is not very complicated, it takes the vertical position (stored passed in the _y variable) and use it to compute which scanline in the buffer will be used to store the result of the computations, as well as computing the X1=X-100:Y1=Y-100 part of the BASIC program, except here 100 was replaced by 64 because we prefer power of two values when doing assembler code.

This routine can most probably be improved considering all the initial values are 8 bit, but not being part of the inner loop this would seem to be a premature optimization to me.

## Iteration loop

`#define	MANDEL_ITER_BIS	19#define MAX_MANDEL_ITER	512_outer_exit	rts_compute_outer	//	n=pr	lda #MANDEL_ITER_BIS	sta _ncompute_outer_loop	jsr _compute_inner	dec _n		beq _outer_exit		lda _xn+1    ; Compute _axn=ABS(xn)	bmi xn_neg	sta _axn+1	lda _xn	sta _axn	jmp xn_absxn_neg	sec	lda #0	sbc _xn	sta _axn	lda #0	sbc _xn+1	sta _axn+1xn_abs	lda _yn+1   ; Compute _ayn=ABS(yn)	bmi yn_neg	sta _ayn+1	lda _yn	sta _ayn	jmp yn_absyn_neg	sec	lda #0	sbc _yn	sta _ayn	lda #0	sbc _yn+1	sta _ayn+1yn_abs	// sum axn+ayn => axn	// if axn+ayn>d3 exit loop	clc	lda _axn	adc _ayn	sta _axn	lda _axn+1	adc _ayn+1	cmp #>MAX_MANDEL_ITER	bcc compute_outer_loop	lda _axn	cmp #<MAX_MANDEL_ITER	bcc compute_outer_loopend_outer	rts`
This code is pretty much the straight conversion from the BASIC original code UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3).

First it calls _compute_inner then it checks if we have reached MANDEL_ITER_BIS (the end of iteration), or if the computed _xn and _yn values gets larger than MAX_MANDEL_ITER.

The reason why the code is so long compared to what it actually does, is that the 6502 only have 8 bit registers, and does not have any way to compute an absolute value, which means we manually need to check the sign of the most significant byte, eventually perform a subtraction from 0 to get the positive value.

Let's continue digging...

## Depth computation

`_compute_inner	clc        ; Compute d=xn+yn	lda _xn	adc _yn	sta _d	lda _xn+1	adc _yn+1	sta _d+1		sec        ; Compute b=xn-yn	lda _xn	sbc _yn	sta _b	lda _xn+1	sbc _yn+1	sta _b+1		lda _xn    ; Compute c=xn*yn	sta op1 	lda _xn+1 	sta op1+1 	lda _yn	sta op2 	lda _yn+1 	sta op2+1 	jsr mul16i 	stx _c 	sta _c+1 		ldx _c+1   ; Compute yn=c/d2-y1		ldy _c	lda _TableDiv32_low,x	and #%11111000	sta _tempfastdiv	lda _TableDiv32_low,y	and #%00000111	ora _tempfastdiv	sec	sta _c	sbc _y1	sta _yn	lda _TableDiv32_high,x	sbc _y1+1	sta _yn+1		lda _d      ; Compute xn=d*b/d1-x1	sta op1 	lda _d+1 	sta op1+1 	lda _b	sta op2 	lda _b+1 	sta op2+1 	jsr mul16i 	stx op1 	sta op1+1	ldx op1+1	ldy op1	lda _TableDiv64_low,x	and #%11111100	sta _tempfastdiv	lda _TableDiv64_low,y	and #%00000011	ora _tempfastdiv	sec	sta op1	sbc _x1	sta _xn	lda _TableDiv64_high,x	sbc _x1+1	sta _xn+1		 	rts`
Now we are talking.

This routine is both long, and full of slow code (Like calling mul16i twice).

This routine is called for every single point, up to 20 times, so any improvement done there will have a huge impact on the total running time.

## Too Smart???

So we know that this particular implementation is using a recursive divide and conquer in order to speed up calculations by not having to recompute pixels belonging to the same depth level, but do we actually know if this improved the computation time?

Sure, from a mathematical point of view we can expect better results, but in the real world it means we may have to compute any particular value of any particular point on the Mandelbrot set, which means we can't really use any of the usual optimizations:

If you iterate horizontally or vertically, you can just move a pointer forward instead of having to recompute the position in the buffer.

So what about trying first to hack around a function that just try to fill the screen from top to bottom and see if there is any speed difference?
`_Mandel_DrawFractalNaive	lda #0	sta _VblCounter	lda #0	sta _yloop	lda #0	sta _x	lda #128	sta _w	jsr _Mandel_HLine   ; h_line(0,0,largeur,XO,YO,PAS,buf,screen,pitch);	jsr _Mandle_BlitBigBuffer	inc _y	ldy _y	cpy #64	bne loop	rts`
As you can see this is ultra basic code: I did not change anything else, all I did was to call the horizontal drawing loop code 64 times.

As expected the performance is not awesome (1 minute and 47 seconds), specially toward the bottom part (which was also slow on the original demo), but it could have been much worse, specially considering that we call the fullscreen buffer blitting after each scanline!

## Keep It Simple Stupid

There are some clear benefits in using a simpler code: It is generally much easier to optimize and debug.

So I continued on the KISS4 road, and I modified the blit routine to only display the scanline we just computed (instead of blitting the entire buffer), and then adjusted the computed width from 128 to 114 pixels (the actually visible part).

This made the whole computation time sink to 1 minute and 20 seconds.

Basically what that mean is that all the convoluted recursive cha-bang is not actually faster on this particular set, or more exactly, the overhead of doing complicated recursive code is larger than the performance boost given by not doing some of the computations.

Another possibility is that there's a bug in the recursive call, but this code is quite complex, so I think changing the code to just draw horizontal lines is probably the way to go!

I guess I'll have to dig a bit more, and who knows, maybe there will be a part III to this article.

Optimizing Buggy Boy Part 1 Part 2

1. I added spaces and tabs for readability.
2. The inner-most loop is where most of the time is spent, so always check there first because any optimisation done there is multiplicated by the number of iteration made by all the outer calls, which can be very many
3. On the borders of the set, this is not valid for the bounding box that contains the set itself.
4. See the title, it's a common idiom in programming, opposing complexity to simplicity
comments powered by Disqus 