Another Bugfix patch for java.io.StreamTokenizer

Dalibor Topic dtopic at socs.uts.edu.au
Thu Dec 16 06:22:26 PST 1999


Hi,

here is a new patch for java.io.StreamTokenizer fixing a bunch of bugs and
glitches.

Bugs that are fixed with this patch:
* when parsing numbers, a '-' alone without following numbers should be an
ordinary character, not a string. Reference: Java Language Specification.

* when parsing numbers, a single '.' should evaluate to 0.0.
Reference: Sun JDK 1.1.x/1.2.x behaviour.

* when parsing numbers, a single "-." should evaluate to -0.0.
Reference: Sun JDK 1.1.x/1.2.x behaviour.

* it was possible to set characters > 255 or characters < 0 to whitespace
chars, or to word chars, resulting in a corrupted TableEntry ordinary,
which is used for non-ASCII characters. Setting it to white space led to
the character -1 (EOF) being interpreted as white space, so Kaffe would
hang on input forever. Reference: The Java Class Libraries, Second Edition,
Vol. 1, descriptions of the StreamTokenizer wordChars etc. methods.

* there was a bug when parsing numbers in streams that looked something
like .4.4.4: Kaffe wouldn't stop before the second '.'. Reference: Java
Language Specification.

* when parsing quoted strings, Kaffe wasn't able to handle octal escape
sequences. Reference: Java Language Specification.

* when EOL was pushed back, the line number wasn't decreased so the last
token before the EOL had the wrong line number (too high).

* when parsing C or C++ comments and when / is not a comment character,
kaffe would drop the / in /4

* line numbers used to be wrong if the parsed quoted string contained
escaped EOL characters like in "\\\n"

* skipLine would be stuck in an infinite loop if the line ended
with an EOF instead of an EOL

* when parsing quoted strings, Kaffe would drop the newline if the string
ended prematurely

* when parsing numbers, Kaffe would try to parse 1.2- as a number instead
of parsing 1.2 as a number and - as an ordinary character.
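
The number-parsing cases in the list above can be checked against a
reference JDK with a small sketch like the one below. It uses only the
public java.io.StreamTokenizer API with its default settings. The class
name NumberEdgeCases and the helper describe() are made up for
illustration and are not part of the patch or of the attached test
program.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class NumberEdgeCases {

    // Hypothetical helper: tokenizes the input with a default
    // StreamTokenizer and renders each token as text. Numbers become
    // "N(<value>)", words are shown as-is, and any other token is shown
    // as the character returned in ttype.
    static String describe(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        StringBuilder out = new StringBuilder();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.append("N(").append(st.nval).append(')');
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.append(st.sval);
                } else {
                    out.append((char) st.ttype);
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] argv) {
        String[] inputs = { "-", ".", "-.", "1.2-", ".4.4.4" };
        for (int i = 0; i < inputs.length; i++) {
            System.out.println(inputs[i] + " => " + describe(inputs[i]));
        }
    }
}
```

Running the same inputs through Kaffe and through a Sun JDK and comparing
the two outputs should show whether the cases above still differ.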

Other things fixed:

* private class variable pushBack has the value false by default, so I
modified its declaration accordingly

* besides the three cases described ("-", "." and "-."), there should be
no other exceptions in number generation [1], thus I have removed the code
that would set ttype to TT_WORD and return a string with the number that
could not be parsed.

* I removed some code to handle EOL that was redundant in nextTokenType.

* with all the modifications needed to make StreamTokenizer behave more
like the standard Java one, nextTokenType grew too big, so I've split it
into several small functions. The semantics of the function should be
clearer now.

* I also changed the return type of nextTokenType to void, which allowed
me to simplify nextToken, as well as to eliminate all the laborious token
type passing between nextToken, nextTokenType and the token-type-specific
parsing functions.


Things to fix:
* The LineNumberReader eats \r, so we don't see "\\\r" in a string as
"\\\r" but as "\\\n".

* The numbers parsed are sometimes just a little different from what they
should be. For example, .4 comes out as n=0.40000000000000002 instead of
n=0.4.
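
Both open items can be looked at in isolation with a sketch like the
following, which uses only java.io.LineNumberReader and
java.math.BigDecimal. The class ToFixNotes and its two helpers are
hypothetical names. The first helper shows how LineNumberReader.read()
hands back line terminators. The second shows the exact decimal expansion
of the double nearest to 0.4. Whether Kaffe's 0.40000000000000002 comes
from parsing or only from printing is not stated above, so the second
part is just one possible reading.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

class ToFixNotes {

    // Hypothetical helper: reads every character through a
    // LineNumberReader and returns what came out, so the input can be
    // compared with what a tokenizer layered on top would see.
    static String readAll(String input) {
        StringBuilder out = new StringBuilder();
        try (LineNumberReader in =
                 new LineNumberReader(new StringReader(input))) {
            int c;
            while ((c = in.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Hypothetical helper: the exact decimal value stored in a double,
    // without the shortest-representation rounding of Double.toString.
    static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    public static void main(String[] argv) {
        // Backslash followed by a carriage return, as in the item above.
        String seen = readAll("\\\r");
        System.out.println("CR arrives as LF: " + seen.equals("\\\n"));

        System.out.println("Double.toString(0.4) = " + Double.toString(0.4));
        System.out.println("exact value of 0.4   = " + exact(0.4));
    }
}
```

If the exact expansion starts with the digits Kaffe prints, the parsed
value itself may already be the right double and only the conversion to
text would need attention. That is an assumption to verify, not a
diagnosis.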

Attached is an example test program for your Kaffe breaking pleasure :),
and the diffs to the current CVS. If you run the test program with Kaffe
and with Sun's JDK 1.1.x/1.2.x you'll see that we are getting closer :))

Functions that were modified are:
java.io.StreamTokenizer.chrRead
java.io.StreamTokenizer.unRead
java.io.StreamTokenizer.nextToken
java.io.StreamTokenizer.nextTokenType
java.io.StreamTokenizer.ordinaryChars
java.io.StreamTokenizer.skipLine
java.io.StreamTokenizer.whitespaceChars
java.io.StreamTokenizer.wordChars

Functions that were added are:
java.io.StreamTokenizer.parseWhitespaceChars
java.io.StreamTokenizer.parseNumericChars
java.io.StreamTokenizer.parseAlphabeticChars
java.io.StreamTokenizer.parseCommentChars
java.io.StreamTokenizer.parseStringQuoteChars
java.io.StreamTokenizer.parseCPlusPlusCommentChars
java.io.StreamTokenizer.parseCCommentChars
java.io.StreamTokenizer.parseOctalEscape
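
For readers who want the octal escape rule without reading the diff, here
is a standalone sketch of it. It is not the code from the patch: it works
on a String instead of the tokenizer's pushback stream, and the class
OctalEscapeSketch and method decode() are hypothetical names. The rule it
follows is the one from the Java Language Specification: one or two octal
digits, or three when the first digit is between 0 and 3.

```java
class OctalEscapeSketch {

    // Decodes the octal escape whose first digit is at s.charAt(pos),
    // that is, the character right after the backslash. Returns a
    // two-element array: { decoded character code, digits consumed }.
    // If the character at pos is not an octal digit, nothing is
    // consumed and the result is { 0, 0 }.
    static int[] decode(String s, int pos) {
        int first = s.charAt(pos);

        // Three digits are only allowed when the first one is 0..3,
        // which keeps the value at or below \377 (255 decimal).
        int maxDigits = (first <= '3') ? 3 : 2;

        int value = 0;
        int digits = 0;
        while (digits < maxDigits
               && pos + digits < s.length()
               && s.charAt(pos + digits) >= '0'
               && s.charAt(pos + digits) <= '7') {
            value = value * 8 + (s.charAt(pos + digits) - '0');
            digits++;
        }
        return new int[] { value, digits };
    }

    public static void main(String[] argv) {
        String[] samples = { "101", "377", "477", "7x", "08" };
        for (int i = 0; i < samples.length; i++) {
            int[] r = decode(samples[i], 0);
            System.out.println("\\" + samples[i] + " => code " + r[0]
                               + ", digits used " + r[1]);
        }
    }
}
```

With this rule an escape such as \477 stops after two digits, so the
trailing 7 is left in the input as an ordinary character of the string.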

Cheers,
dali

[1] Really huge numbers that can not be entirely represented in a double
are represented as infinity, according to Sun JDK 1.1.x behaviour.
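
The behaviour described in [1] can be probed on a reference JDK with a
sketch along these lines. It only uses the public StreamTokenizer API.
The class HugeNumberSketch and the helper parseDigits() are hypothetical
names, and the digit counts are arbitrary values chosen to sit well below
and well above the range of a double.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class HugeNumberSketch {

    // Hypothetical helper: builds a run of 'count' nines, feeds it to a
    // default StreamTokenizer and returns the parsed numeric value.
    static double parseDigits(int count) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < count; i++) {
            digits.append('9');
        }
        StreamTokenizer st =
            new StreamTokenizer(new StringReader(digits.toString()));
        try {
            st.nextToken();
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        if (st.ttype != StreamTokenizer.TT_NUMBER) {
            throw new IllegalStateException("expected a number token");
        }
        return st.nval;
    }

    public static void main(String[] argv) {
        System.out.println("3 digits:   " + parseDigits(3));
        System.out.println("400 digits: " + parseDigits(400));
    }
}
```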

-------------- next part --------------
import java.io.*;

class StreamTokenizerTest {

    public static void main (String argv[]) {
	//check the values of constants
	//	StreamTokenizer st = new StreamTokenizer(System.in);
	StringBuffer testBuffer = new StringBuffer();

	int i = 0;

	// Create octal strings galore.
	// The highest octal number converted to string is
	// \377 which is equal to 255 decimal.
	// Generate some more to test for correct behaviour
	// with higher octals as well.
	
	testBuffer.append('"');

	for (i = 0; i <= 300; i++) {
	    testBuffer.append('\\');
	    testBuffer.append(Integer.toOctalString(i));
	}

	// Insert a newline escape sequence to be parsed.
	// makes the output look better
	testBuffer.append("\\n");

	for (i = 0; i <= 255; i++) {
	    testBuffer.append('\\');
	    testBuffer.append((char) i);
	}

	// Insert a newline escape sequence to be parsed.
	// makes the output look better
	testBuffer.append("\\n");

	testBuffer.append("44\n4");

	testBuffer.append('"');

	testBuffer.append("44\n4");

	testBuffer.append("blAAA \n");

	testBuffer.append("-.\n-\n.-\n0.23-1..3-4--5\n");

	testBuffer.append(".4\n");

	StringReader testReader = new StringReader (testBuffer.toString());

	StreamTokenizer st = new StreamTokenizer(testReader);

	System.out.println("StreamTokenizer.TT_EOF = "
			   + st.TT_EOF);
	System.out.println("StreamTokenizer.TT_EOL = "
			   + st.TT_EOL);
	System.out.println("StreamTokenizer.TT_NUMBER = "
			   + st.TT_NUMBER);
	System.out.println("StreamTokenizer.TT_WORD = "
			   + st.TT_WORD);

	// check the StreamTokenizer.toString() function

	st.eolIsSignificant(true);
	st.slashSlashComments(true);
	st.slashStarComments(true);
	st.ordinaryChar('/');
	//st.whitespaceChars('A','A');
	//st.wordChars('A','A');

	// kaffe bug
	st.whitespaceChars(276,277);

	while (st.ttype != st.TT_EOF) {
	    try {
		st.nextToken();
		System.out.println(st);
		if (st.ttype == '\"' || st.ttype == '\'')
		    System.out.println("String = " + st.sval);
	    }
	    catch (Exception e) {
		System.exit(0);
	    }
	}
    }
}
-------------- next part --------------
*** kaffe/libraries/javalib/java/io/StreamTokenizer.java	Tue Dec 14 20:20:00 1999
--- /scholar/dtopic/java/code/StreamTokenizer.java	Fri Dec 17 00:32:30 1999
***************
*** 28,40 ****
  private Reader rawIn;
  private TableEntry lookup[] = new TableEntry[256];
  private TableEntry ordinary = new TableEntry();
! private boolean pushBack = false;
  private boolean EOLSignificant;
  private boolean CComments;
  private boolean CPlusPlusComments;
  private boolean toLower;
  private StringBuffer buffer = new StringBuffer();
  private boolean endOfFile;
  
  /**
   * @deprecated
--- 28,41 ----
  private Reader rawIn;
  private TableEntry lookup[] = new TableEntry[256];
  private TableEntry ordinary = new TableEntry();
! private boolean pushBack;
  private boolean EOLSignificant;
  private boolean CComments;
  private boolean CPlusPlusComments;
  private boolean toLower;
  private StringBuffer buffer = new StringBuffer();
  private boolean endOfFile;
+ private boolean EOLPushedBack;
  
  /**
   * @deprecated
***************
*** 56,62 ****
  private int chrRead() throws IOException {
  	if (endOfFile) {
  		return (-1);
! 	} else {
  		return (pushIn.read());
  	}
  }
--- 57,69 ----
  private int chrRead() throws IOException {
  	if (endOfFile) {
  		return (-1);
! 	}
! 	else {
! 	        /* if EOL was pushed back, increase line number again */
! 	        if (EOLPushedBack) {
! 		    EOLPushedBack = false;
! 		    lineIn.setLineNumber(lineIn.getLineNumber() + 1);
! 		}
  		return (pushIn.read());
  	}
  }
***************
*** 67,72 ****
--- 74,85 ----
  		endOfFile = true;
  	} else {
  		pushIn.unread(c);
+ 
+ 		/* decrease line number if EOL is pushed back */
+ 		if (c == '\n') {
+ 		        EOLPushedBack = true;
+ 		        lineIn.setLineNumber(lineIn.getLineNumber() - 1);
+ 		}
  	}
  }
  
***************
*** 92,270 ****
  	if (pushBack == true) {
  		/* Do nothing */
  		pushBack = false;
- 		return (ttype);
  	}
  	else {
! 		return (nextTokenType());
  	}
  }
  
! private int nextTokenType() throws IOException {
  	int chr = chrRead();
  
  	TableEntry e = lookup(chr);
  
  	if (e.isWhitespace) {
  		/* Skip whitespace and return nextTokenType */
! 		do {
! 			if (chr=='\n' && EOLSignificant) {
! 				ttype = TT_EOL;
! 				return (ttype);
! 			}
! 			chr = chrRead();
! 		} while (lookup(chr).isWhitespace);
! 
! 		/* For next time */
! 		unRead(chr);
! 		ttype = nextTokenType();
  	}
  	else if (e.isNumeric) {
  		/* Parse the number and return */
! 	        boolean dotParsed = false;
! 		boolean minusParsed = false;
  
! 		buffer.setLength( 0);
! 		while (lookup(chr).isNumeric) {
! 			buffer.append((char)chr);
! 			if (chr == '-') {
! 			        if (minusParsed)
! 				        break;
! 			        else
! 				        minusParsed = true;
  			}
! 			else if (chr == '.') {
! 			        if (dotParsed)
! 				        break;
! 				else
! 				    dotParsed = true;
  			}
- 			chr = chrRead();
  		}
  
! 		/* For next time */
! 		unRead(chr);
  
! 		try {
! 			nval = new Double(buffer.toString()).doubleValue();
! 			ttype = TT_NUMBER;
  		}
! 		catch ( NumberFormatException x) {
! 		        /* the first character was an '-'
! 		         * but no other numeric characters followed
  			 */
! 		         ttype = '-';
! 	
  		}
  	}
! 	else if (e.isAlphabetic) {
! 		/* Parse the word and return */
! 		buffer.setLength( 0);
! 		while (lookup(chr).isAlphabetic || lookup(chr).isNumeric) {
! 			buffer.append((char)chr);
! 			chr = chrRead();
! 			// what for?
! 			/*
! 			if (chr == '\n' && EOLSignificant)
! 				break;
! 			*/
! 		}
  
! 		/* For next time */
! 		unRead(chr);
  
! 		ttype = TT_WORD;
! 		sval = buffer.toString();
! 		if (toLower) {
! 			sval = sval.toLowerCase();
! 		}
  	}
- 	else if (e.isComment) {
- 		/* skip comment and return nextTokenType() */
- 		skipLine();
  
! 		ttype = nextTokenType();    
  	}
! 	else if (e.isStringQuote) {
! 		/* Parse string and return word */
! 		int cq = chr;
  
! 		buffer.setLength( 0);
! 		chr = chrRead();
! 		while ( chr != cq) {
! 			if ( chr == '\\' ) {
! 				chr = chrRead();
! 				switch (chr) {
! 				case 'a':
! 					chr = 0x7;
! 					break;
! 				case 'b':
! 					chr = '\b';
! 					break;
! 				case 'f':
! 					chr = 0xC;
! 					break;
! 				case 'n':
! 					chr = '\n';
! 					break;
! 				case 'r':
! 					chr = '\r';
! 					break;
! 				case 't':
! 					chr = '\t';
! 					break;
! 				case 'v':
! 					chr = 0xB;
! 					break;
! 				}
! 			}
! 			buffer.append((char)chr);
! 			chr = chrRead();
! 			if ( chr == -1 ) {
  				break;
  			}
  		}
  
! 		/* JDK doc says:  When the nextToken method encounters a
! 		 * string constant, the ttype field is set to the string
! 		 * delimiter and the sval field is set to the body of the
! 		 * string.
! 		 */
! 		ttype = cq;
! 		sval = buffer.toString();      
  	}
! 	else if (chr=='/' && (CComments || CPlusPlusComments)) {
! 		/* Check for C/C++ comments */
! 		int next = chrRead();
! 		if (next == '/' && (CPlusPlusComments)) {
! 			/* C++ comment */
! 			skipLine();
! 
! 			nextTokenType();
! 			return (ttype);
! 		}
! 		else if (next == '*' && (CComments)) {
! 			/* C comments */
! 			skipCComment();
  
! 			nextTokenType();
! 			return (ttype);
! 		}
! 		else {
! 			unRead(next);
! 		}
  	}
  	else {
! 		/* Just return it as a token */
! 		sval = null;
! 		if (chr == -1) {
! 			ttype = TT_EOF;
! 		}
! 		else {
! 			ttype = chr;
! 		}
  	}
  
! 	return (ttype);
  }
  
  public void ordinaryChar(int c) {
--- 105,390 ----
  	if (pushBack == true) {
  		/* Do nothing */
  		pushBack = false;
  	}
  	else {
! 	        /* pushBack is false,
! 		 * so get the next token type
! 		 */
! 		nextTokenType();
  	}
+ 
+ 	return (ttype);
  }
  
! private void nextTokenType() throws IOException {
!         /* Sets ttype to the type of the next token */ 
! 
  	int chr = chrRead();
  
  	TableEntry e = lookup(chr);
  
  	if (e.isWhitespace) {
  		/* Skip whitespace and return nextTokenType */
! 	        parseWhitespaceChars(chr);
  	}
  	else if (e.isNumeric) {
  		/* Parse the number and return */
! 	        parseNumericChars(chr);
! 	}
! 	else if (e.isAlphabetic) {
! 		/* Parse the word and return */
! 	        parseAlphabeticChars(chr);
! 	}
! 	else if (e.isComment) {
! 		/* skip comment and return nextTokenType() */
! 	        parseCommentChars();
! 	}
! 	else if (e.isStringQuote) {
! 		/* Parse string and return word */
! 	        parseStringQuoteChars(chr);
! 	}
! 	else if (chr=='/' && CPlusPlusComments) {
! 		/* Check for C++ comments */
! 	        parseCPlusPlusCommentChars();
! 	}
! 	else if (chr=='/' && CComments) {
! 		/* Check for C comments */
! 	        parseCCommentChars();
! 	}
! 	else {
! 		/* Just return it as a token */
! 		sval = null;
! 		if (chr == -1) {
! 			ttype = TT_EOF;
! 		}
! 		else {
! 			ttype = chr;
! 		}
! 	}
! }
! 
! private void parseWhitespaceChars(int chr) throws IOException {
!         do {
! 	        if (chr=='\n' && EOLSignificant) {
! 		        ttype = TT_EOL;
! 			return;
! 		}
! 
! 		chr = chrRead();
! 	} while (chr != -1 && lookup(chr).isWhitespace);
! 
! 	/* For next time */
! 	unRead(chr);
  
! 	nextTokenType();
! }
! 
! private void parseNumericChars(int chr) throws IOException {
!         boolean dotParsed = false;
! 
! 	buffer.setLength( 0);
! 
! 	/* Parse characters until a non-numeric character, 
! 	 * or the first '-' after the first character, or
! 	 * the second decimal dot is parsed.
! 	 */
! 	do {
! 	        if (chr == '.') {
! 		        if (dotParsed) {
! 			        /* Second decimal dot parsed,
! 				 * so the number is finished.
! 				 */
! 			        break;
  			}
! 			else {
! 			        /* First decimal dot parsed */
! 			        dotParsed = true;
  			}
  		}
  
! 		buffer.append((char)chr);
! 		chr = chrRead();
  
! 	} while (lookup(chr).isNumeric
! 		 && chr != '-'
! 		 && !(chr == '.' && dotParsed));
! 
! 
! 	/* For next time */
! 	unRead(chr);
! 
! 	try {
! 	        nval = Double.parseDouble(buffer.toString());
! 		ttype = TT_NUMBER;
! 	}
! 	catch ( NumberFormatException x) {
! 	        if (buffer.toString().equals("-")) {
! 			/* if the first character was an '-'
! 			 * but no other numeric characters followed
! 			 */
! 		        ttype = '-';
  		}
! 		else if (buffer.toString().equals(".")) {
! 			/* A sole decimal dot is parsed as the 
! 			 * decimal number 0.0 according to what the
! 			 * JDK 1.1 does.
  			 */
! 		        ttype = TT_NUMBER;
! 			nval = 0.0;
  		}
+ 		else {
+ 			/* A minus and a decimal dot are parsed as the 
+ 			 * decimal number -0.0 according to what the
+ 			 * JDK 1.1 does.
+ 			 */
+ 		        ttype = TT_NUMBER;
+ 			nval = -0.0;
+ 		}		
  	}
! }
  
! private void parseAlphabeticChars(int chr) throws IOException {
!         buffer.setLength( 0);
  
! 	while (lookup(chr).isAlphabetic || lookup(chr).isNumeric) {
! 	        buffer.append((char)chr);
! 		chr = chrRead();
  	}
  
! 	/* For next time */
! 	unRead(chr);
! 
! 	ttype = TT_WORD;
! 	sval = buffer.toString();
! 	if (toLower) {
! 	        sval = sval.toLowerCase();
  	}
! }
  
! private void parseCommentChars() throws IOException {
!         skipLine();
! 
! 	nextTokenType();
! }
! 
! private void parseStringQuoteChars(int chr) throws IOException {
!         int cq = chr;
! 
! 	/* Save the correct line number in case the string
! 	 * contains escaped EOL characters. Reset line number
! 	 * later accordingly.
! 	 */
! 	int stringLineNumber = lineIn.getLineNumber();
! 
! 	buffer.setLength( 0);
! 	chr = chrRead();
! 	while ( chr != cq && chr != '\n' && chr != -1) {
! 	        if ( chr == '\\' ) {
! 		        chr = chrRead();
! 			switch (chr) {
! 			case 'a':
! 			        chr = 0x7;
! 				break;
! 			case 'b':
! 			        chr = '\b';
! 				break;
! 			case 'f':
! 			        chr = 0xC;
  				break;
+ 			case 'n':
+ 			        chr = '\n';
+ 				break;
+ 			case 'r':
+ 			        chr = '\r';
+ 				break;
+ 			case 't':
+ 			        chr = '\t';
+ 				break;
+ 			case 'v':
+ 			        chr = 0xB;
+ 				break;
+ 			default:
+ 			        if ('0' <=  chr && chr <= '7') {
+ 				        /* it's an octal escape */
+ 				        chr = parseOctalEscape(chr);
+ 				}
  			}
  		}
+ 		buffer.append((char)chr);
+ 		chr = chrRead();
+ 	}
+ 	if ( chr == '\n' ) {
+ 	        unRead(chr);
+ 	}
  
! 	/* JDK doc says:  When the nextToken method encounters a
! 	 * string constant, the ttype field is set to the string
! 	 * delimiter and the sval field is set to the body of the
! 	 * string.
! 	 */
! 	ttype = cq;
! 	sval = buffer.toString();
! 
! 	lineIn.setLineNumber(stringLineNumber);
! }
! 
! private void parseCPlusPlusCommentChars() throws IOException {
!         int next = chrRead();
! 	if (next == '/') {
! 	        /* C++ comment */
! 	        skipLine();
! 
! 		nextTokenType();
  	}
! 	else {
! 	        unRead(next);
  
! 		ttype = '/';
! 	}
! }
! 
! private void parseCCommentChars() throws IOException {
!         int next = chrRead();
! 	if (next == '*') {
! 	        /* C comment */
! 	        skipCComment();
! 
! 		nextTokenType();
  	}
  	else {
! 	        unRead(next);
! 
! 		ttype = '/';
  	}
+ }
  
! private int parseOctalEscape(int chr) throws IOException {
! 	int value = 0;
! 	int digits = 1;
! 	boolean maybeThreeOctalDigits = false;
! 
! 	/* There could be one, two, or three octal
! 	 * digits specifying a character's code.
! 	 * If it's three digits, the Java Language
! 	 * Specification says that the first one has
! 	 * to be in the range between '0' and '3'.
! 	 */
!         if ('0' <= chr && chr <= '3') {
! 	        maybeThreeOctalDigits = true;
! 	}
! 
! 	do {
! 	        value = value * 8 + Character.digit((char) chr, 8);
! 	        chr = chrRead();
! 	        digits++;
! 
! 	} while (('0' <= chr && chr <= '7')
! 		 && (digits <= 2 || maybeThreeOctalDigits)
! 		 && (digits <= 3));
! 
! 	unRead(chr);
! 
! 	return (value);
  }
  
  public void ordinaryChar(int c) {
***************
*** 279,284 ****
--- 399,412 ----
  }
  
  public void ordinaryChars(int low, int hi) {
+         if (low < 0) {
+ 	        low = 0;
+ 	}
+ 
+ 	if (hi > 255) {
+ 	        hi = 255;
+ 	}
+ 
  	for (int letter=low; letter<=hi; letter++) {
  		ordinaryChar(letter);
  	}
***************
*** 342,350 ****
  }
  
  private void skipLine() throws IOException {
! 	while (chrRead() != '\n')
! 		;
! 	if (EOLSignificant) {
  		unRead('\n');
  	}
  }
--- 470,484 ----
  }
  
  private void skipLine() throws IOException {
!         /* Skip all characters to the end of line or EOF,
! 	 * whichever comes first.
! 	 */
!         int chr = chrRead();
! 
! 	while (chr != '\n' && chr != -1)
! 	        chr = chrRead();
! 
! 	if (chr == '\n') {
  		unRead('\n');
  	}
  }
***************
*** 376,382 ****
--- 510,525 ----
  }
  
  public void whitespaceChars(int low, int hi) {
+         if (low < 0) {
+ 	        low = 0;
+ 	}
+ 
+ 	if (hi > 255) {
+ 	        hi = 255;
+ 	}
+ 
  	for (int letter = low; letter <= hi; letter++) {
+ 	    
  		TableEntry e = lookup(letter);
  		e.isWhitespace = true;
  		e.isAlphabetic = false;
***************
*** 385,390 ****
--- 528,541 ----
  }
  
  public void wordChars(int low, int hi) {
+         if (low < 0) {
+ 	        low = 0;
+ 	}
+ 
+ 	if (hi > 255) {
+ 	        hi = 255;
+ 	}
+ 
  	for (int letter = low; letter <= hi; letter++) {
  		lookup(letter).isAlphabetic = true;
  	}    

