Tuesday 19 January 2010

Unicode, PostgreSQL, JDBC and truncated data

The problem

Unicode
Unicode is a character encoding standard (originally 16-bit, since extended well beyond that) which is destined to replace ASCII as the universal character encoding standard. Unlike ASCII, which was designed for (and is largely limited to) the American character set, Unicode caters for the entire world's character sets, even the awesome array of Chinese characters! If you can type it on a keyboard, you can store it in Unicode. At last! A truly global, properly usable character encoding that doesn't assume that the entire world is an additional state of the USA!

Er, right.

Problem number one is that ASCII is a de facto standard across most of the world's machines and software, and things get seriously tricky when you start trying to map local incarnations of character sets that have been shoe-horned into the basic ASCII encoding (limited to 7 or 8 bits) onto the universal big brother of them all: Unicode.

PostgreSQL

PostgreSQL is the best database around. Yes, we are biased. Yes, there are other excellent databases (you may think of Oracle or Ingres, but we think of MySQL, which very seriously rocks as well). PostgreSQL supports many different character sets (see the CREATE DATABASE command for more details of its encoding option).
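
As a quick illustration of that encoding option, here is a minimal sketch of creating a Latin-1 database through JDBC. The connection URL, credentials and database name are all made up for the example; only the CREATE DATABASE syntax itself comes from the Postgres documentation.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class CreateDbWithEncoding {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://localhost/postgres", "postgres", "secret");
               Statement stmt = conn.createStatement()) {
              // ENCODING sets the character set the server stores text in.
              // TEMPLATE template0 is needed when the requested encoding
              // differs from that of the default template database.
              stmt.executeUpdate(
                  "CREATE DATABASE demo_latin1 ENCODING 'LATIN1' TEMPLATE template0");
          }
      }
  }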

Symptom


The problem is that when you connect to PostgreSQL via JDBC and you select text rows as a string from a table, if those rows contain 8-bit characters (for example a pound sign (£) for the UK), then you may find that the data for that column gets truncated just before that character. That is, this call fails:
 read_rs.getString(1);
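
To make the symptom concrete, here is a minimal sketch of the failing read. The connection details, the invoices table and its description column are hypothetical names for illustration; the point is simply that getString() on a row containing a mis-encoded £ can come back truncated.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class TruncationSymptom {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://localhost/mydb", "user", "pass");
               Statement stmt = conn.createStatement();
               ResultSet read_rs = stmt.executeQuery(
                       "SELECT description FROM invoices")) {
              while (read_rs.next()) {
                  // If the row holds a raw Latin-1 pound sign (byte 0xA3),
                  // the string returned here may stop just before it.
                  System.out.println(read_rs.getString(1));
              }
          }
      }
  }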

Trawling the newsgroups, it seems that this is caused by Java (which also uses Unicode internally) not being able to map the character coming back from Postgres to a valid Unicode character. Often, the problem lies with the original insert into the Postgres table. The character in question is sent from some client software (with its own character set) to Postgres (which stores it in the character set for that database). On insert, Postgres doesn't check that the value stored in the table is a legal mapping from the client's character encoding; it merely stores the raw byte value for that character (there is a mapping option that you can turn on when you build Postgres; see the Postgres manual for more information on this).
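
You can see why that raw byte value is a problem with a few lines of standalone Java (no database involved): the same £ character has different byte representations in different encodings, so a byte stored under one encoding cannot safely be decoded under another.

  import java.util.Arrays;

  public class PoundBytes {
      public static void main(String[] args) throws Exception {
          // £ is the single byte 0xA3 in Latin-1...
          System.out.println(Arrays.toString("£".getBytes("ISO-8859-1"))); // [-93]
          // ...but the two-byte sequence 0xC2 0xA3 in UTF-8.
          System.out.println(Arrays.toString("£".getBytes("UTF-8"))); // [-62, -93]
          // Decoding the lone Latin-1 byte as UTF-8 cannot succeed; Java's
          // String constructor substitutes the replacement character U+FFFD.
          System.out.println(new String(new byte[] { (byte) 0xA3 }, "UTF-8"));
      }
  }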

It appears that the data is truncated due to a fault in the error handling of the JDBC driver.

We don't really agree with this analysis. Although the common consensus is that this is a Postgres fault, we're not actually convinced. There are numerous reports of the same problem with different databases, including Oracle, MySQL and DB2. We think that the problem lies with the JDBC system not being able to determine which character set the data is stored in. This may well come down to the database supplying more information to JDBC, but it may equally be that JDBC needs to examine the environment or use some other resolution mechanism.

The solution

The best solution is to force Java to read the column as a set of byte values and then explicitly tell Java which character set the table is stored in; it can then do the translation without a problem. So, converting this line:
  read_rs.getString(1)
into this line:
  new String(read_rs.getBytes(1),"ISO-8859-1")
does the trick. Of course, in the above example the table was in ISO-8859-1 format. This is the most likely format if Unicode translations are failing, but you do need to check which character set is used in your local software! See here for a list of possible character codes, what they are, and a very handy discussion of the codes and the characters in each coding.
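
Wrapped up as a small helper, the workaround looks like the sketch below. The charset name is an assumption (ISO-8859-1, as above); swap in whatever your data actually uses.

  import java.io.UnsupportedEncodingException;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  public class Latin1Reader {
      // Reads the column as raw bytes and decodes it explicitly,
      // bypassing the driver's own (failing) character conversion.
      public static String readColumn(ResultSet rs, int index, String charset)
              throws SQLException, UnsupportedEncodingException {
          byte[] raw = rs.getBytes(index);
          return raw == null ? null : new String(raw, charset);
      }
  }

With that in place, read_rs.getString(1) becomes readColumn(read_rs, 1, "ISO-8859-1").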
