After (re)discovering the semicolon bug in Atari BASIC revision A, I thought I’d spend a bit of time trying to find out exactly why BASIC was exhibiting this behaviour. In order to do this, I had to re-learn how BASIC stores programs in memory.
Atari BASIC uses tokenization to reduce the memory footprint and increase the execution speed of programs. Tokenization replaces keyword strings (such as PRINT) with single-byte tokens (0x20 for PRINT). Ultimately, this bug is caused by the tokenization process and incorrect bounds checking.
First, let’s look at a simple PRINT statement, and how Atari BASIC tokenizes it. You may want to refer to De Re Atari, which has a good explanation of the tokenizing process as well as a token table.
0A 00       | 0A   | 0A   | 20    | 0F     | 02     | 48 | 49 | 16  |
line number | llen | slen | PRINT | string | strlen | H  | I  | eol |
Our first example PRINT statement has a 2-byte string constant “HI”, followed by the token for end-of-line. 0x20 is the token for PRINT, and 0x0F marks a string constant, which is followed by the string’s length. llen and slen are the line length and statement length.
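To make the layout concrete, here is a minimal Python sketch that rebuilds the bytes for this one statement shape (PRINT followed by a single string constant). The function name and its parameters are my own invention for illustration, and I'm assuming the characters map one-to-one onto byte values, which holds for “HI”. It is not how the BASIC tokenizer itself is written.

```python
# Token values taken from the tables in this post.
PRINT_TOKEN = 0x20   # statement token for PRINT
STRING_CONST = 0x0F  # marks a string constant; the next byte is its length
SEMICOLON = 0x15     # token for ';'
EOL = 0x16           # end-of-line token


def tokenize_print(line_no, text, trailing_semicolon=False):
    """Build the tokenized bytes for: <line_no> PRINT "<text>"[;]

    Hypothetical helper: it only handles this single statement shape.
    """
    body = bytes([PRINT_TOKEN, STRING_CONST, len(text)]) + text.encode("latin-1")
    if trailing_semicolon:
        body += bytes([SEMICOLON])
    body += bytes([EOL])
    # Line number (2 bytes) + llen + slen + statement body.
    length = 4 + len(body)
    # With a single statement on the line, llen and slen are equal.
    return line_no.to_bytes(2, "little") + bytes([length, length]) + body


def hexdump(data):
    """Format bytes the way the tables in this post show them."""
    return " ".join(f"{b:02X}" for b in data)


print(hexdump(tokenize_print(10, "HI")))
print(hexdump(tokenize_print(10, "HI", trailing_semicolon=True)))
print(hexdump(tokenize_print(10, "HI\x15")))
```

The three calls correspond to the three example lines discussed in this post: a plain string, a string followed by a semicolon, and a 3-byte string ending in Ctrl-U.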
0A 00       | 0B   | 0B   | 20    | 0F     | 02     | 48 | 49 | 15 | 16  |
line number | llen | slen | PRINT | string | strlen | H  | I  | ;  | eol |
Our second example adds a semicolon to the end of line. The normal behaviour for semicolon is to suppress the automatic carriage return. Note that the string is still 2 bytes long, followed by the token for the semicolon (0x15).
0A 00       | 0B   | 0B   | 20    | 0F     | 03     | 48 | 49 | 15 | 16  |
line number | llen | slen | PRINT | string | strlen | H  | I  | ^U | eol |
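Our third example now has a 3-byte string constant with no semicolon. The third character of the string is Ctrl-U, whose character code is 0x15, the same value as the semicolon token. The only difference between this example and the previous example is the string constant length.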
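A quick way to convince yourself of this is to compare the two lines byte by byte, then walk them while honouring the string length. The sketch below is my own illustration in Python: the `describe` helper and its lookup tables are hypothetical, and it only understands the handful of tokens that appear in this post.

```python
# Minimal token tables, limited to the values discussed in this post.
STATEMENTS = {0x20: "PRINT"}
OPERATORS = {0x12: ",", 0x15: ";", 0x16: "eol"}
STRING_CONST = 0x0F


def describe(line):
    """Label each item of a tokenized line, skipping over string contents."""
    # Byte 4 is the statement token (after line number, llen and slen).
    items = [STATEMENTS.get(line[4], hex(line[4]))]
    i = 5
    while i < len(line):
        tok = line[i]
        if tok == STRING_CONST:
            n = line[i + 1]  # length byte follows the string marker
            items.append(("string", bytes(line[i + 2:i + 2 + n])))
            i += 2 + n
        else:
            items.append(OPERATORS.get(tok, hex(tok)))
            i += 1
    return items


with_semicolon = bytes.fromhex("0A000B0B200F0248491516")
with_ctrl_u = bytes.fromhex("0A000B0B200F0348491516")

# Offsets at which the two lines differ.
differing = [i for i in range(len(with_semicolon))
             if with_semicolon[i] != with_ctrl_u[i]]
print(differing)
print(describe(with_semicolon))
print(describe(with_ctrl_u))
```

Only the string length byte differs, yet a reader that respects that length sees a semicolon token in one line and a third string character in the other.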
Now that we’ve seen how the different lines are tokenized, let’s look at the BASIC source code. We need to look at the XPRINT function, which begins at 0xB3B6.
B3B6              XPRINT
B3B6 A5C9                 LDA PTABW             ; GET TAB VALUE
B3B8 85AF                 STA SCANT             ; SCANT
B3BA A900                 LDA #0                ; SET OUT INDEX = 0
B3BC 8594                 STA COX
;
B3BE A4A8         :XPR0   LDY STINDEX           ; GET STMT DISPL
B3C0 B18A                 LDA [STMCUR],Y        ; GET TOKEN
;
B3C2 C912                 CMP #CCOM
B3C4 F053   ^B419         BEQ :XPTAB            ; BR IF TAB
B3C6 C916                 CMP #CCR
B3C8 F07C   ^B446         BEQ :XPEOL            ; BR IF EOL
B3CA C914                 CMP #CEOS
B3CC F078   ^B446         BEQ :XPEOL            ; BR IF EOL
B3CE C915                 CMP #CSC
B3D0 F06F   ^B441         BEQ :XPNULL           ; BR IF NULL
B3D2 C91C                 CMP #CPND
B3D4 F061   ^B437         BEQ :XPRIOD
;
B3D6 20E0AA               JSR EXEXPR            ; GO EVALUATE EXPRESSION
B3D9 20F2AB               JSR ARGPOP            ; POP FINAL VALUE
B3DC C6A8                 DEC STINDEX           ; DEC STINDEX
B3DE 24D2                 BIT VTYPE             ; IS THIS A STRING
B3E0 3016   ^B3F8         BMI :XPSTR            ; BR IF STRING
;
B3E2 20E6D8               JSR CVFASC            ; CONVERT TO ASCII
B3E5 A900                 LDA #0
B3E7 85F2                 STA CIX
;
B3E9 A4F2         :XPR1   LDY CIX               ; OUTPUT ASCII CHARACTERS
B3EB B1F3                 LDA [INBUFF],Y        ; FROM INBUFF
B3ED 48                   PHA                   ; UNTIL THE CHAR
B3EE E6F2                 INC CIX               ; WITH THE MSB ON
B3F0 205DB4               JSR :XPRC             ; IS FOUND
B3F3 68                   PLA
B3F4 10F3   ^B3E9         BPL :XPR1
B3F6 30C6   ^B3BE         BMI :XPR0             ; THEN GO FOR NEXT TOKEN
B3F8              :XPSTR
B3F8 209BAB               JSR GSTRAD            ; GO GET ABS STRING ARRAY
B3FB A900                 LDA #0
B3FD 85F2                 STA CIX
B3FF A5D6         :XPR2C  LDA VTYPE+EVSLEN      ; IF LEN LOW
B401 D004   ^B407         BNE :XPR2B            ; NOT ZERO BR
B403 C6D7                 DEC VTYPE+EVSLEN+1    ; DEC LEN HI
B405 30B7   ^B3BE         BMI :XPR0             ; BR IF DONE
B407 C6D6         :XPR2B  DEC VTYPE+EVSLEN      ; DEC LEN LOW
;
B409 A4F2         :XPR2   LDY CIX               ; OUTPUT STRING CHARS
B40B B1D4                 LDA [VTYPE+EVSADR],Y  ; FOR THE LENGTH
B40D E6F2                 INC CIX               ; OF THE STRING
B40F D002   ^B413         BNE :XPR2A
B411 E6D5                 INC VTYPE+EVSADR+1
B413              :XPR2A
B413 205FB4               JSR :XPRC1
B416 4CFFB3               JMP :XPR2C
;
B419              :XPTAB
B419 A494         :XPR3   LDY COX               ; DO UNTIL COX+1 <SCANT
B41B C8                   INY
B41C C4AF                 CPY SCANT
B41E 9009   ^B429         BCC :XPR4
B420 18           :XPIC3  CLC
B421 A5C9                 LDA PTABW             ; SCANT = SCANT+TAB
B423 65AF                 ADC SCANT
B425 85AF                 STA SCANT
B427 90F0   ^B419         BCC :XPR3
;
B429 A494         :XPR4   LDY COX               ; DO UNTIL COX = SCANT
B42B C4AF                 CPY SCANT
B42D B012   ^B441         BCS :XPR4A
B42F A920                 LDA #$20              ; PRINT BLANKS
B431 205DB4               JSR :XPRC
B434 4C29B4               JMP :XPR4
;
B437 2002BD       :XPRIOD JSR GIOPRM            ; GET DEVICE NO.
B43A 85B5                 STA LISTDTD           ; SET AS LIT DEVICE
B43C C6A8                 DEC STINDEX           ; DEC INDEX
B43E 4CBEB3               JMP :XPR0             ; GET NEXT TOKEN
;
B441              :XPR4A
B441 E6A8         :XPNULL INC STINDEX           ; INC STINDEX
B443 4CBEB3               JMP :XPR0
;
B446              :XPEOL
B446 A4A8         :XPEOS  LDY STINDEX           ; AT END OF PRINT
B448 88                   DEY
B449 B18A                 LDA [STMCUR],Y        ; IF PREV CHAR WAS
B44B C915                 CMP #CSC              ; SEMI COLON THEN DONE
B44D F009   ^B458         BEQ :XPRTN            ; ELSE PRINT A CR
B44F C912                 CMP #CCOM             ; OR A COMMA
B451 F005   ^B458         BEQ :XPRTN            ; THEN DONE
B453 A99B                 LDA #CR
B455 205FB4               JSR :XPRC1            ; THEN DONE
B458              :XPRTN
B458 A900                 LDA #0                ; SET PRIMARY
B45A 85B5                 STA LISTDTD           ; LIST DVC = 0
B45C 60                   RTS                   ; AND RETURN
I know that’s a lot of code, but let’s follow the bouncing ball. The first key part happens at address 0xB3C6, where we compare the current token against the eol token (0x16). If it matches, the BEQ at 0xB3C8 branches to XPEOL (0xB446). What’s the first thing we do at the end of line? We rewind one byte (DEY, decrement Y), and see if it’s the token for semicolon (0x15). If it is, we skip printing a carriage return.
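But wait a minute. We blindly rewind one byte, even if that rewind takes us inside a string constant! There’s the bug. In our third example, the byte before eol is the last character of the string (0x15), so it is mistaken for a semicolon and the carriage return is suppressed. We should not be blindly rewinding one byte; we should be checking whether we are inside or outside a string constant.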
Looking at the code, similar behaviour will happen with the value 0x12, which is the token for a comma.
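The end-of-line check is small enough to mimic in Python. In the sketch below, `buggy_suppresses_cr` copies what XPEOL does (look at the byte just before eol), while `checked_suppresses_cr` is a hypothetical alternative of my own that walks the statement and skips over string contents. I'm not claiming this is how revision C fixes it, and the walker only understands string constants, commas and semicolons on a single-statement line.

```python
STRING_CONST, COMMA, SEMICOLON, EOL = 0x0F, 0x12, 0x15, 0x16


def buggy_suppresses_cr(line, stindex):
    """Mimic XPEOL: inspect the byte just before the eol token, no questions asked."""
    return line[stindex - 1] in (SEMICOLON, COMMA)


def checked_suppresses_cr(line):
    """Hypothetical alternative: track the last real token, skipping string bytes."""
    i = 5  # first byte after line number, llen, slen and the PRINT token
    last_was_separator = False
    while line[i] != EOL:
        if line[i] == STRING_CONST:
            i += 2 + line[i + 1]  # jump over marker, length byte and contents
            last_was_separator = False
        else:
            last_was_separator = line[i] in (SEMICOLON, COMMA)
            i += 1
    return last_was_separator


lines = {
    "plain": bytes.fromhex("0A000A0A200F02484916"),
    "semicolon": bytes.fromhex("0A000B0B200F0248491516"),
    "ctrl_u": bytes.fromhex("0A000B0B200F0348491516"),
    "ctrl_r": bytes.fromhex("0A000B0B200F0348491216"),
}

for name, line in lines.items():
    eol_index = len(line) - 1  # the eol token is the last byte of each line here
    print(name, buggy_suppresses_cr(line, eol_index), checked_suppresses_cr(line))
```

The two functions agree on the plain and semicolon lines, and disagree on the strings that end in Ctrl-U (0x15) or Ctrl-R (0x12), which is exactly where the carriage return goes missing.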
This bug has been fixed in revision C BASIC, but I’m not aware of commented source code being available for revision C.