Anatomy of a crash pt.2



If we look back to the point at which the stack stepped into the danger zone, we see that it was caused by the 60Hz EXEC VBLANK interrupt kicking in:

   6645 84F0 0000 0002 1EEA 0002 0310 168F S---I---  XORR R1,R0      
   E2B5 84F0 0000 0002 1EEA 0002 0310 1690 S---I-iq  RLC  R0    
   
   ; Start of ISR
        
   E2B5 84F0 0000 0002 1EEA 0002 0311 1004 S---I-iQ  PSHR R0         
  >>> Max stack increased to 0312 by code at 1004
   E2B5 84F0 0000 0002 1EEA 0002 0312 1005 S---I---  GSWD R0         
   8080 84F0 0000 0002 1EEA 0002 0312 1006 S---I---  PSHR R0         
  >>> Max stack increased to 0313 by code at 1006
   8080 84F0 0000 0002 1EEA 0002 0313 1007 S---I---  PSHR R1         
  >>> Max stack increased to 0314 by code at 1007
   8080 84F0 0000 0002 1EEA 0002 0314 1008 S---I---  PSHR R2         
  >>> Max stack increased to 0315 by code at 1008
   8080 84F0 0000 0002 1EEA 0002 0315 1009 S---I---  NOP             
   8080 84F0 0000 0002 1EEA 0002 0315 100A S---I-i-  PSHR R3         
  >>> Max stack increased to 0316 by code at 100a
   8080 84F0 0000 0002 1EEA 0002 0316 100B S---I---  PSHR R4         
  >>> Max stack increased to 0317 by code at 100b
   8080 84F0 0000 0002 1EEA 0002 0317 100C S---I---  PSHR R5         
  >>> Max stack increased to 0318 by code at 100c
   8080 84F0 0000 0002 1EEA 0002 0318 100D S---I---  MOVR R7,R5      
   8080 84F0 0000 0002 1EEA 100E 0318 100E ----I-i-  ADDI #$0006,R5  
   8080 84F0 0000 0002 1EEA 1014 0318 1010 ----I-i-  MVII #$0100,R4  
   8080 84F0 0000 0002 0100 1014 0318 1012 ----I-i-  SDBD            
   8080 84F0 0000 0002 0100 1014 0318 1013 ----ID--  MVI@ R4,R7      
   8080 84F0 0000 0002 0102 1014 0318 52B2 ----I-i-  CMPI #$0316,R6  
   8080 84F0 0000 0002 0102 1014 0318 52B4 -C--I-i-  B    $52EF      
   8080 84F0 0000 0002 0102 1014 0318 52EF -C--I-i-  MVI  $01D0,R0   
   0000 84F0 0000 0002 0102 1014 0318 52F1 -C--I-i-  MVO  R0,$0031   
   0000 84F0 0000 0002 0102 1014 0318 52F3 -C--I---  NEGR R0         
   0000 84F0 0000 0002 0102 1014 0318 52F4 -C-ZI-i-  ADDI #$0008,R0  
   0008 84F0 0000 0002 0102 1014 0318 52F6 ----I-i-  CMP  $0119,R0   
   0008 84F0 0000 0002 0102 1014 0318 52F8 -C-ZI-i-  BEQ  $5309      
   0008 84F0 0000 0002 0102 1014 0318 5309 -C-ZI-i-  J    $1126      
   0008 84F0 0000 0002 0102 1014 0318 1126 -C-ZI-i-  PSHR R5         
  >>> Max stack increased to 0319 by code at 1126
  
   ; Welcome to the danger zone...
  
   0008 84F0 0000 0002 0102 1014 0319 1127 -C-ZI---  MVI  $0102,R1   
   0008 0001 0000 0002 0102 1014 0319 1129 -C-ZI-i-  SWAP R1

The ISR firing in this way is entirely normal. So the question then is why is the stack up at $310 when the the EXEC kicks in?

If we now mark up the start (address $1004) and end (address $101c) of all VBLANK interrupt service routines, we see something very interesting. It looks as though the ISR routine that immediately precedes this one does not end! It gets kicked off with a stack value of $303, runs for a 60th of a second and then is interrupted again with R6 equalling $310. Further, apart from the initial stack check code (which would pass as the value is $30b at the time)...

   0000 0000 06A1 0037 0100 1014 030B 1012 ----I-i-  SDBD                   
   0000 0000 06A1 0037 0100 1014 030B 1013 ----ID--  MVI@ R4,R7        
   
   ; Let's do the stack check...
        
   0000 0000 06A1 0037 0102 1014 030B 52B2 ----I-i-  CMPI #$0316,R6
   
   ; Yup everything would be good, even if the following instruction
   ; did not ignore the test result
            
   0000 0000 06A1 0037 0102 1014 030B 52B4 S---I-i-  B    $52EF             
   0000 0000 06A1 0037 0102 1014 030B 52EF S---I-i-  MVI  $01D0,R0          
   0000 0000 06A1 0037 0102 1014 030B 52F1 S---I-i-  MVO  R0,$0031          
   0000 0000 06A1 0037 0102 1014 030B 52F3 S---I---  NEGR R0                
   0000 0000 06A1 0037 0102 1014 030B 52F4 -C-ZI-i-  ADDI #$0008,R0         
   0008 0000 06A1 0037 0102 1014 030B 52F6 ----I-i-  CMP  $0119,R0          
   0008 0000 06A1 0037 0102 1014 030B 52F8 -C-ZI-i-  BEQ  $5309             
   0008 0000 06A1 0037 0102 1014 030B 5309 -C-ZI-i-  J    $1126             
   0008 0000 06A1 0037 0102 1014 030B 1126 -C-ZI-i-  PSHR R5                
   0008 0000 06A1 0037 0102 1014 030C 1127 -C-ZI---  MVI  $0102,R1

...all this time is spent in the EXEC, rather than game code. This is weird, because it suggests that the data used by the EXEC is already corrupt in some way.

With a bit more hacking of JzIntv we can check for examples of one ISR handler crashing into another. It turns out this is very common in WCB. Not only do all games that crash or freeze do it, but so do all games that play through fine. In fact, because games that don't crash last longer, incidences of ISR crashing are typically higher in these games. There were an average of 29 ISR crashes in our original 100 games that played through to completion, verses 27 in the 100 games that crashed originally.

The obvious question is, do any other games exhibit such ISR crashes? Having checked a few titles, here are the results:
  • Original Baseball - no ISR crashes seen 
  • Original Football - a small number of ISR crashes seen when initialising play 
  • Triple Action - one or two ISR crashes seen when initialising the sub games 
  • Utopia - no ISR crashes seen 
  • Tron Deadly Discs - no ISR crashes seen 
  • BurgerTime - no ISR crashes seen 
  • Masters of the Universe - a large number of ISR crashes seen when initialising each screen 
  • Super Pro Football - ISR crash city, significantly worse than WCB 
  • Super Pro Basketball - a small number of ISR crashes seen 
  • Super Pro Hockey - a similar number of ISR crashes seen to WCB 
  • Thunder Castle - no ISR crashes seen 
So a small number of ISR crashes would not seem to be terribly unusual, with later games showing a greater propensity for them. However, in an environment like the Intellivision with such tight memory constraints, allowing any ISR crashes would seem to be a bad idea, especially in normal play where there is likely to be less control over the depth of the stack at any given moment. Surely it is just asking for trouble?

Comments

  1. (Apologies for the double-comment on the other thread. Apparently hitting "Back" is a bad idea?)

    OK, so this confirms the thought I had that the crash is indeed due to bad interrupt nesting.

    One hamfisted way to "fix" this is to have the ISR check for reentrancy and exit. Space Patrol has such a flag, although it doesn't check the flag until just before the "not safe to reenter" portion. It'll still update certain display elements, which avoids obvious hitches in the animations.

    [code]
    ;;------------------------------------------------------------------;;
    ;; Remainder of engine is not safely reentrant. ;;
    ;;------------------------------------------------------------------;;
    DIS ; 4 \
    MVII #1, R1 ; 8 | !!Critical Section!!
    CMP EBSY, R1 ; 10 | Check our "ENGINE IS BUSY" flag,
    ; |-- and if we're busy, just drop a
    MVO R1, EBSY ; 11 | frame. :-( Not that I like that.
    BNEQ E1.not_busy ; 9 /
    ;----
    ; 48
    ; 19
    ;====
    ; 67


    ADD FDROP, R1 ; \ If we drop a frame, disable all
    ; |__ tasks for this frame by just
    MVO R1, FDROP ; | returning Also, increment our
    JE MISCISR.1 ; / frame-drop count. (debugging)

    E1.not_busy:
    [/code]

    It's waaay to easy to run over on cycles when you only get a couple thousand instructions between interrupts, usually fewer.

    On NTSC, you get 14934 cycles between interrupts, and lose about 10% of them to cycle stealing. In practice, the average cycles/instruction is between 8 and 9, meaning you get 1500 - 1600 instructions between retrace interrupts. If you manage to stick to 6 cycle instructions, you get a bit over 2200.

    ReplyDelete
    Replies
    1. Ah, it rejected my pre tag, and it doesn't understand [code] either. So... enjoy ugly code?

      Delete
  2. One last thought: WC Baseball makes many calls to RANDRC (aka. X_RAND2 in dis1600), which is the "random number in range". RANDRC gets a random number of bits from RANDY, and then does a range compare, discarding out-of-range values.

    The BSRs have already implicated this function in other crashes, because it can loop an arbitrarily large number of times in unlucky cases. And RANDY itself is pretty slow,, as I've documented elsewhere—something like 92 cycles/bit.

    One call to RANDRC asked for a random number < $50. That's 7 bits. The odds of a 7-bit number being >= $50 is 48/128, or a bit over 1/3rd. The cost of generating 7 random bits and checking them is going to be close to 700 cycles.

    There other calls, some of which whose thresholds aren't constants. I suspect RANDRC may be an ingredient in the crashes.

    ReplyDelete

Post a Comment