[kaffe] Slow byte to char conversion

Dalibor Topic robilad at yahoo.com
Mon Aug 28 01:01:14 PDT 2000


oops! somehow an unfinished version of this e-mail escaped from my
computer. here's the full text:

Hi Godmar,

sorry for the delay, but I was on holidays last week, and away from my
mail.

On Sat, 19 Aug 2000, you wrote:
> From what I understand, and someone correct me if I'm wrong,
> there shouldn't be any reason not to include the change you suggest -
> if someone implements it, of course.

Done. I have a patched version of Encode.java. I'll
clean it up once a definite solution has stabilized.

> If I understand your proposal right, you'd use an array for
> the first 256 values and a hashtable or something like that 
> for the rest.  I don't think there would be a problem with changing 
> it so that it would both serialize an array and a hashtable.
> One or two objects in *.ser shouldn't make a difference. 

Yes. It should work nicely for ISO-8859 based encodings, and
then some.
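To make the serialization point concrete, here is a minimal standalone sketch (the object layout is made up for illustration, not kaffe's actual *.ser format): the dense table and the sparse table are simply written into the same serialized stream, one after the other.

```java
import java.io.*;
import java.util.HashMap;

// Sketch: serialize a dense char[] and a sparse HashMap together.
// One or two extra objects in the stream cost very little.
public class SerDemo {
    public static void main(String[] args) throws Exception {
        char[] dense = new char[256];
        for (int i = 0; i < 256; i++) dense[i] = (char) i;
        HashMap<Integer, Character> sparse = new HashMap<>();
        sparse.put(0x100, '\u0100');

        // Write both objects into one stream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(dense);
            out.writeObject(sparse);
        }

        // Read them back in the same order.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            char[] d = (char[]) in.readObject();
            @SuppressWarnings("unchecked")
            HashMap<Integer, Character> s =
                (HashMap<Integer, Character>) in.readObject();
            System.out.println(d[65]);          // A
            System.out.println(s.get(0x100));   // the char at U+0100
        }
    }
}
```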

Actually, for byte to char conversion you don't even
need a hash table, since all ISO-8859-X encodings (simply speaking)
assign Unicode chars to byte values in the range 0-255.

For the reverse direction (char to byte conversion) I'd need to do some
experiments to figure out a better way. In most char to byte
encodings, the mapped characters don't fall into a single contiguous
range from character x to character y, so a purely array based
approach is space-inefficient. A combination of arrays and hashmaps
might be interesting. But for the time being, I'm playing around with
java.io.InputStreamReader, so I'm trying to fix byte to char conversion
first.
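The array-plus-hashmap idea for char to byte conversion could look roughly like this. This is a standalone sketch under assumptions of my own (the class name, the block layout, and the fallback byte are made up here, and it does not use kaffe's actual converter API): chars inside one dense block hit an array, everything else falls back to a hash table.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a hybrid char to byte converter: a dense array covers one
// contiguous block of characters, a HashMap covers the stragglers.
public class CharToByteHybrid {
    private final int blockStart;   // first char covered by the array
    private final byte[] block;     // dense map for [blockStart, blockStart + block.length)
    private final Map<Character, Byte> sparse = new HashMap<>();
    private final byte replacement; // emitted for unmappable chars

    public CharToByteHybrid(int blockStart, byte[] block, byte replacement) {
        this.blockStart = blockStart;
        this.block = block;
        this.replacement = replacement;
    }

    public void addSparse(char c, byte b) {
        sparse.put(c, b);
    }

    public byte convert(char c) {
        if (c >= blockStart && c < blockStart + block.length) {
            return block[c - blockStart];   // fast path: array lookup
        }
        Byte b = sparse.get(c);             // slow path: hash lookup
        return (b != null) ? b : replacement;
    }

    public static void main(String[] args) {
        // Hypothetical encoding: identity for ASCII, plus one sparse entry.
        byte[] ascii = new byte[128];
        for (int i = 0; i < 128; i++) ascii[i] = (byte) i;
        CharToByteHybrid conv = new CharToByteHybrid(0, ascii, (byte) '?');
        conv.addSparse('\u20AC', (byte) 0xA4); // euro sign as in ISO-8859-15
        System.out.println((char) conv.convert('A'));      // A
        System.out.println(conv.convert('\u20AC') & 0xFF); // 164
        System.out.println((char) conv.convert('\u0100')); // ?
    }
}
```

The space trade-off is tunable: the wider the dense block, the fewer hash entries, at the cost of a larger array.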

> You could even stick a flag at the beginning if the array shouldn't
> pay off for some encodings.

I'd prefer a more class hierarchy based approach. We already have
kaffe.io.ByteToCharHashBased. We could have ByteToCharArrayBased, too.
Something like this (warning: untested code ahead):

abstract public class ByteToCharArrayBased extends ByteToCharConverter {

	// Maps byte values to chars: byte code b is mapped to
	// character map[b & 0xFF].
	private final char[] map;

	public ByteToCharArrayBased(char[] chars) {
		map = chars;
	}

	public final int convert(byte[] from, int fpos, int flen,
			char[] to, int tpos, int tlen) {
		// Since it's a one-to-one encoding, assume that
		// flen == tlen.
		for (int i = flen; i > 0; i--) {
			to[tpos++] = convert(from[fpos++]);
		}
		return flen;
	}

	public final char convert(byte b) {
		return map[b & 0xFF];
	}

	public final int getNumberOfChars(byte[] from, int fpos, int flen) {
		return flen;
	}
}

Now a (byte to char) conversion class has three choices:
a) it uses all byte values from 0-255 -> it extends
ByteToCharArrayBased, and makes the constructor use the
appropriate char array.
b) the encoded byte values are sparsely distributed through the range
of all legal byte values -> it extends ByteToCharHashBased due to
its space efficiency.
c) there is a huge block of bytes used in the encoding, but there are
also many bytes outside that block's range used in the encoding -> it
extends ByteToCharConverter directly and uses fields for both array
based and hash based conversion. The convert method checks whether a
byte falls within the block and uses the array, or falls back to the
hash table otherwise.
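For case a), the table can even be the identity: ISO-8859-1 byte values coincide with the first 256 Unicode code points. Here is a standalone illustration of the array based lookup (it deliberately does not extend kaffe's actual classes, so the names here are made up):

```java
// Standalone sketch of option a): an ISO-8859-1 style converter whose
// lookup table is simply the identity mapping over 0x00-0xFF.
public class Latin1Demo {
    private static final char[] MAP = new char[256];
    static {
        for (int i = 0; i < 256; i++) {
            MAP[i] = (char) i;  // Latin-1 bytes match Unicode code points
        }
    }

    static char convert(byte b) {
        return MAP[b & 0xFF];   // the mask keeps negative bytes in 0-255
    }

    public static void main(String[] args) {
        byte[] bytes = { 0x48, 0x69, (byte) 0xE9 }; // "Hié" in ISO-8859-1
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(convert(b));
        }
        System.out.println(sb); // prints "Hié"
    }
}
```

A real subclass of ByteToCharArrayBased would just pass such a 256-entry table to the constructor.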
