OSDK - The Oric Software Development Kit -

Wed 13th May 2020

Optimizing Buggy Boy

Theoric Magazine Optimization strategies Original implementation Horizontal drawing Span initialization Iteration loop Depth computation Too Smart???Keep It Simple Stupid

Now that the final scroller is finally running at an acceptable speed, the second most pressing issue is the Mandelbrot fractal part.

Optimizing a scroller is just a matter of finding a fast way to copy things around, but optimizing a fractal is much more complicated due to the limitations of the 6502 processor regarding doing mathematics.

Theoric Magazine

The idea of having a fractal came from an old Oric Magazine called Theoric.

One of the issues contained a small BASIC program which could render arbitrary parts of the Mandelbrot set, in full color.

It was obviously terribly slow, but you could just change the parameters and let it run during the night, and on the next morning you would have your beautiful fractal on the screen ready to be saved to tape :)

4 video modes

Just for the sake of completeness here is the complete original¹ source code:


  0 PAPER 0:INK 4
 10 HIRES:DOKE #306,#FFFF
 11 PAPER 0:INK 3

 15 PR=19:D1=64:D2=32:D3=512:D4=15

 20 FOR I=1 TO 199 STEP 2
 30   CURSET 0,I,3:FILL 1,1,16+4
 40   CURSET 6,I-1,3:FILL 1,1,1
 50 NEXT

 60 FOR X=11 TO 237 STEP 2
 70   FOR Y=0 TO 199 STEP 2
 79     N=PR:X1=X-100:Y1=Y-100:XN=X1:YN=Y1
 80     REPEAT
 81       D=XN+YN:B=XN-YN:C=XN*YN
 82       XN=D*B/D1-X1:YN=C/D2-Y1:N=N-1
 83     UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3)
 90     IF N AND 1 THEN CURSET X,Y,1
 91     IF N AND 2 THEN CURSET X+1,Y,1
 92     IF N AND 4 THEN CURSET X,Y+1,1
 93     IF N AND 8 THEN CURSET X+1,Y+1,1
120   NEXT
125 NEXT
130 DOKE #306,10000:END

The program is not very complicated:

Line 15 defines the Mandelbrot set to compute
Lines 20-50 sets some attributes on the screen to achieve the color effects
Line 60 is the outer loop iterating horizontally on the screen
Line 70 is the second loop iterating vertically on the screen
Lines 79-83 compute the actual color of the point at this location
Lines 90-93 draw some points to achieve the dithering effect based on the point color

One doe not need a PhD to figure out that the core of the performance is spent in the REPEAT-UNTIL section²!

An important element to notice is the STEP 2 on each of the FOR loops: We are actually computing only a quarter of the total number of points, because each of the point is then rendered on screen using a 2 by 2 pixel matrix to achieve the nice colored dithering effect.

Something else to notice is that in this particular case the image is symmetrical along the horizontal axis, so a simple symmetry could have achieved a near doubling in performance... but that's definitely cheating:

Let's keep that as an option for later if we really need to.

If you are interested by the topic, you can read the very long wikipedia page about the Mandelbrot set (in this particular case we are particularly interested in the section called
Computer drawings), and there are also many different implementations in various languages on the Rosetta Code site.

Optimization strategies

Until we profile the code we will not actually know where most of the time is spent, but we can already notice a few things:

The BASIC variable format uses floating point representation, do we need that?
What is the range of the variables, do they fit on one byte, do they get negative?
There are two multiplications and two divisions in the inner loop...
The loop test performs absolute value computations
Drawing 2x2 sized pixels requires additional masking when blitting individually

If you have seen the demo, you would also probably have noticed that the fractal was not rendered linearly row by row or column by column, instead if appeared as some kind of recursively drawn squares:

This is a common optimization seen in fractal rendering code, it's based on the observation that if all the points on the border of a section of the Mandelbrot³ set have the same color (depth) then all the points inside ALSO have the same depth, thus do not require additional calculation.

I did the same thing as I did with the scroller, and modified the demo so it jumped directly to the fractal part, without playing the music.

Something worth noting is that without the music playing, the fractal finishes just a fraction of time after displaying the WRATHDESIGNS group name, skipping the WWW.C64.ORG and WWW.ORIC.ORG names.

Indeed, comparing the computation time with and without the music gives us a running time of 1:30 without and 1:40 with the music.

Not a huge difference, but worth keeping in mind when doing adjustments at the end if we want the music to be in sync with the effects for example.

So let see how the demo code is written so we can see where to optimize.

Original implementation

The mandelbrot computation is called from main.c:


	//
	// Mandelbrot
	//
Mandelbrot:			
	SwitchToHires();
	ClearHiresScreen();

	Mandel_InitDisplay();
	Mandel_DrawFractal();

The file mandel.s contains the actual assembler code:


_Mandel_DrawFractal
	lda #0
	sta _x
	lda #0
	sta _y
	lda #128
	sta _w
	jsr _Mandel_HLine

	lda #0
	sta _x
	lda #64
	sta _y
	lda #128
	sta _w
	jsr _Mandel_HLine

	lda #0
	sta _x
	lda #0
	sta _y
	lda #64
	sta _h
	jsr _Mandel_VLine
	
	lda #128
	sta _x
	lda #0
	sta _y
	lda #64
	sta _h
	jsr _Mandel_VLine

	jsr _Mandle_BlitBigBuffer

	lda #0
	sta _VblCounter

	lda #0
	sta _x
	lda #0
	sta _y
	lda #128
	sta _w
	lda #64
	sta _h
	jsr _Mandel_Divide

	jsr _Mandle_BlitBigBuffer
	rts

What this routine does is to first compute the top, bottom, left and right side values, and then call a recursive routine.

While writing this article I noticed that some of the values seem to be computed more than once: When drawing a rectangle, the corner points are shared, and in this case it looks like the point (0,0) at least is computed more than once. If this happens on each recursive level, that may end up being a significant amount of time.

Another obvious thing is that despite being computed, we do not see the left vertical column being displayed!

So maybe we are actually computing more pixels than we are actually able to draw on screen...

Something else to check is if intermediate variables are using zero page locations or normal memory:


	.zero
_x			.dsb 1
_y			.dsb 1
_w			.dsb 1
_h			.dsb 1

_x1			.dsb 2
_y1			.dsb 2
_xn			.dsb 2
_yn			.dsb 2

Everything seems alright, we are indeed using zero page for all these variables.

Horizontal drawing

So, let see how these subroutines are implemented:


_Mandel_HLine
	jsr _Mandel_ComputeParameters

	lda _w
	pha

loop_x
	clc
	lda _x1+0
	sta _xn+0
	adc #2
	sta _x1+0
	lda _x1+1
	sta _xn+1
	adc #0
	sta _x1+1

	lda _y1+0
	sta _yn+0
	lda _y1+1
	sta _yn+1

	jsr _compute_outer

	lda _n
	ldy _x
	sta (big_buff),y   ; Store pixel in buffer
	inc _x

	dec _w
	bne loop_x

	pla
	sta _w

	rts

The lda _w/pha/pla/sta _w seem peculiar, but then we have to remember that the code is doing recursive calls, so it makes sense to store the original Width and Height values, else Bad Things Would Happen.

The rest of the code seems reasonable, first calling _Mandel_ComputeParameters to initialize some parameters, then iterate for each of the values along the requested _w parameters, for each calling _compute_outer and then storing the result in some intermediate buffer.

Let see how these two sub-routines looks like... dig down into the rabbit hole.

Span initialization


#define MANDEL_X_POS	32
#define MANDEL_Y_POS	92

_Mandel_ComputeParameters
	ldy _y
	lda _BufferAddrLow,y
	sta big_buff+0
	lda _BufferAddrHigh,y
	sta big_buff+1
		
	clc        ; Compute X1=-64+(x*2)
	lda _x
	rol
	sta _x1+0
	lda #0
	rol
	sta _x1+1  

	sec
	lda _x1+0
	sbc #<MANDEL_X_POS
	sta _x1+0
	lda _x1+1
	sbc #>MANDEL_X_POS
	sta _x1+1
	
	clc        ; Compute y1=-64+(y*2)
	lda _y
	rol
	sta _y1+0
	lda #0
	rol
	sta _y1+1  

	sec
	lda _y1+0
	sbc #<MANDEL_Y_POS
	sta _y1+0
	lda _y1+1
	sbc #>MANDEL_Y_POS
	sta _y1+1

	rts

What this routine does is not very complicated, it takes the vertical position (stored passed in the _y variable) and use it to compute which scanline in the buffer will be used to store the result of the computations, as well as computing the X1=X-100:Y1=Y-100 part of the BASIC program, except here 100 was replaced by 64 because we prefer power of two values when doing assembler code.

This routine can most probably be improved considering all the initial values are 8 bit, but not being part of the inner loop this would seem to be a premature optimization to me.

Iteration loop


#define	MANDEL_ITER_BIS	19
#define MAX_MANDEL_ITER	512

_outer_exit
	rts

_compute_outer
	//	n=pr
	lda #MANDEL_ITER_BIS
	sta _n

compute_outer_loop
	jsr _compute_inner

	dec _n	
	beq _outer_exit

	
	lda _xn+1    ; Compute _axn=ABS(xn)
	bmi xn_neg
	sta _axn+1
	lda _xn
	sta _axn
	jmp xn_abs
xn_neg
	sec
	lda #0
	sbc _xn
	sta _axn
	lda #0
	sbc _xn+1
	sta _axn+1
xn_abs

	lda _yn+1   ; Compute _ayn=ABS(yn)
	bmi yn_neg
	sta _ayn+1
	lda _yn
	sta _ayn
	jmp yn_abs
yn_neg
	sec
	lda #0
	sbc _yn
	sta _ayn
	lda #0
	sbc _yn+1
	sta _ayn+1
yn_abs

	// sum axn+ayn => axn
	// if axn+ayn>d3 exit loop
	clc
	lda _axn
	adc _ayn
	sta _axn
	lda _axn+1
	adc _ayn+1
	cmp #>MAX_MANDEL_ITER
	bcc compute_outer_loop

	lda _axn
	cmp #<MAX_MANDEL_ITER
	bcc compute_outer_loop

end_outer
	rts

This code is pretty much the straight conversion from the BASIC original code UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3).

First it calls _compute_inner then it checks if we have reached MANDEL_ITER_BIS (the end of iteration), or if the computed _xn and _yn values gets larger than MAX_MANDEL_ITER.

The reason why the code is so long compared to what it actually does, is that the 6502 only have 8 bit registers, and does not have any way to compute an absolute value, which means we manually need to check the sign of the most significant byte, eventually perform a subtraction from 0 to get the positive value.

Let's continue digging...

Depth computation


_compute_inner
	clc        ; Compute d=xn+yn
	lda _xn
	adc _yn
	sta _d
	lda _xn+1
	adc _yn+1
	sta _d+1
	
	sec        ; Compute b=xn-yn
	lda _xn
	sbc _yn
	sta _b
	lda _xn+1
	sbc _yn+1
	sta _b+1
	
	lda _xn    ; Compute c=xn*yn
	sta op1 
	lda _xn+1 
	sta op1+1 
	lda _yn
	sta op2 
	lda _yn+1 
	sta op2+1 
	jsr mul16i 
	stx _c 
	sta _c+1 
	
	ldx _c+1   ; Compute yn=c/d2-y1	
	ldy _c

	lda _TableDiv32_low,x
	and #%11111000
	sta _tempfastdiv

	lda _TableDiv32_low,y
	and #%00000111
	ora _tempfastdiv
	sec
	sta _c
	sbc _y1
	sta _yn
	lda _TableDiv32_high,x
	sbc _y1+1
	sta _yn+1
	
	lda _d      ; Compute xn=d*b/d1-x1
	sta op1 
	lda _d+1 
	sta op1+1 
	lda _b
	sta op2 
	lda _b+1 
	sta op2+1 
	jsr mul16i 
	stx op1 
	sta op1+1

	ldx op1+1
	ldy op1

	lda _TableDiv64_low,x
	and #%11111100
	sta _tempfastdiv

	lda _TableDiv64_low,y
	and #%00000011
	ora _tempfastdiv
	sec
	sta op1
	sbc _x1
	sta _xn
	lda _TableDiv64_high,x
	sbc _x1+1
	sta _xn+1
		 
	rts

Now we are talking.

This routine is both long, and full of slow code (Like calling mul16i twice).

This routine is called for every single point, up to 20 times, so any improvement done there will have a huge impact on the total running time.

Too Smart???

So we know that this particular implementation is using a recursive divide and conquer in order to speed up calculations by not having to recompute pixels belonging to the same depth level, but do we actually know if this improved the computation time?

Sure, from a mathematical point of view we can expect better results, but in the real world it means we may have to compute any particular value of any particular point on the Mandelbrot set, which means we can't really use any of the usual optimizations:

If you iterate horizontally or vertically, you can just move a pointer forward instead of having to recompute the position in the buffer.

So what about trying first to hack around a function that just try to fill the screen from top to bottom and see if there is any speed difference?


_Mandel_DrawFractalNaive
	lda #0
	sta _VblCounter

	lda #0
	sta _y
loop
	lda #0
	sta _x
	lda #128
	sta _w
	jsr _Mandel_HLine   ; h_line(0,0,largeur,XO,YO,PAS,buf,screen,pitch);

	jsr _Mandle_BlitBigBuffer

	inc _y
	ldy _y
	cpy #64
	bne loop

	rts

As you can see this is ultra basic code: I did not change anything else, all I did was to call the horizontal drawing loop code 64 times.

As expected the performance is not awesome (1 minute and 47 seconds), specially toward the bottom part (which was also slow on the original demo), but it could have been much worse, specially considering that we call the fullscreen buffer blitting after each scanline!

Keep It Simple Stupid

There are some clear benefits in using a simpler code: It is generally much easier to optimize and debug.

So I continued on the KISS⁴ road, and I modified the blit routine to only display the scanline we just computed (instead of blitting the entire buffer), and then adjusted the computed width from 128 to 114 pixels (the actually visible part).

This made the whole computation time sink to 1 minute and 20 seconds.

Basically what that mean is that all the convoluted recursive cha-bang is not actually faster on this particular set, or more exactly, the overhead of doing complicated recursive code is larger than the performance boost given by not doing some of the computations.

Another possibility is that there's a bug in the recursive call, but this code is quite complex, so I think changing the code to just draw horizontal lines is probably the way to go!

I guess I'll have to dig a bit more, and who knows, maybe there will be a part III to this article.

Optimizing Buggy Boy

Part 1 Part 2

1. I added spaces and tabs for readability.↩
2. The inner-most loop is where most of the time is spent, so always check there first because any optimisation done there is multiplicated by the number of iteration made by all the outer calls, which can be very many↩
3. On the borders of the set, this is not valid for the bounding box that contains the set itself.↩
4. See the title, it's a common idiom in programming, opposing complexity to simplicity↩

comments powered by Disqus