Tue Jan 22 12:20:42 PST 2008
based to being array based), for byte-to-char conversion, option (a)
takes little memory (256 chars for the table) and is very fast. As I
explained in my previous post, it beats option (b) in time-efficiency.
I suppose it beats it in space-efficiency as well, as long as most
bytes are convertible into characters.
When there are only a few legal byte values that can be converted into
characters, the hash-based conversion could be more space-efficient. On
the other hand, the array-based implementation doesn't waste much
memory in that case. Even in the worst case, a hypothetical encoding that
uses only one specific byte value to encode a single character, there are
255 * 2 = 510 bytes wasted. That's not much, and it can be improved upon
by introducing range checks and similar techniques.
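A minimal sketch of the two lookup strategies, assuming option (a) is a
256-entry char array indexed by the unsigned byte value and option (b) is
a hashtable keyed by byte value. The class name, method names, and the
UNMAPPED marker are made up for illustration; this is not Kaffe's actual
converter code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Kaffe.
final class ByteToCharSketch {
    // Assumed marker for byte values that have no character mapping.
    static final char UNMAPPED = '\uFFFD';

    // Array variant: 256 chars, i.e. 512 bytes of table data,
    // no matter how many byte values are actually legal.
    static char[] buildArrayTable(Map<Integer, Character> mappings) {
        char[] table = new char[256];
        Arrays.fill(table, UNMAPPED);
        for (Map.Entry<Integer, Character> e : mappings.entrySet()) {
            table[e.getKey() & 0xFF] = e.getValue();
        }
        return table;
    }

    // One masked array index per byte; no hashing, no boxing.
    static char decodeWithArray(char[] table, byte b) {
        return table[b & 0xFF];
    }

    // Hash variant: space grows with the number of legal byte values,
    // but every lookup boxes the key and walks a hash bucket.
    static char decodeWithHash(Map<Integer, Character> mappings, byte b) {
        Character c = mappings.get(b & 0xFF);
        return c == null ? UNMAPPED : c.charValue();
    }
}
```

With a single legal byte value, the array still holds 256 entries, of which
255 (510 bytes) are unused, which matches the worst case described above.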
The choices really start to matter when you're going the other way
round, from chars to bytes. Take a look at ISO-8859-8 (a.k.a. Hebrew).
It encodes 220 characters. Of these 220, only 32 do *not*
map a byte value to itself. They are either mappings into the
range between \u05D0 and \u05EA, mappings within the first 256
characters, or mappings to a few special characters like LEFT-TO-RIGHT
> One would have to see what the actual sizes of the .ser files would be;
> keeping those small is certainly desirable. From what I understand,
> they're more compact than any Java code representation.
> Edouard would know more since he wrote that code, I think.
>
> On a related note, this whole conversion thing stinks.
> Why can't people stick to 7-bit ASCII?
> For instance, the JVM98 jack benchmark calls PrintStream.print
> a whopping 296218 times in a single run. Every call results in a new
> converter object being newinstanced, just to convert a bunch of bytes.
> (The new converter was one of the changes done to make the
> charset conversion thread-safe.) This is one of the reasons
> why we're on this test some 7 or 8 times slower than IBM.
> And that's not even using any of the serialized converters, just
> the default one (which is written in
> - Godmar
> > Hi,
> > I wrote a simple program to show a Java charmap (
> > something like Encode.java in developers directory).
> > It essentially creates a byte array with size 1, and
> > creates a string with the appropriate Unicode char
> > using the encoding in question for every value a byte
> > can take.
> > When displaying a serialized converter like 8859_2,
> > the performance is very bad. Comparing current kaffe
> > from CVS running on SuSE Linux 6.4 with jit3 and IBM's
> > JRE 1.3 running in interpreted mode, kaffe is about 10
> > times slower.
> > While I consider the idea to use serialized encoders
> > based on hashtables a great one, it is very
> > inefficient for ISO-8859-X and similar byte to char
> > encodings. These encodings use most of the 256
> > possible values a byte can take to encode characters,
> > so I tried using an array instead. I achieved
> > comparable running times to JRE 1.3.
> > Why was the hashtable based conversion chosen over
> > alternatives (switch based lookup, array based
> > lookup)?
> > Dali
> > =====
> > "Success means never having to wear a suit"