Python, DB2, and FOR BIT DATA Columns

I’ve been doing some research lately on how well various languages and web application frameworks integrate with IBM DB2 and happened across some less than ideal results. An errant strlen() call nearly killed the project before it ever got off the ground.

As you’d expect, the DB2 drivers for Python and Ruby are written in C and compiled to modules for their respective language targets. What you might not expect is that DB2 has a CHAR/VARCHAR type modifier that makes it difficult, if not impossible to work out of the box with nearly every mainstream framework. It’s called CHAR FOR BIT DATA. Here’s what IBM has to say about it:

A CHAR FOR BIT DATA type allows you to store byte strings of a specified length. It is useful for unstructured data where character strings are not appropriate.

So, these basically take a string that looks like '20100322014820681369000000' and turns it into something like ' \x10\x03"\x01H h\x13i\x00\x00\x00'. DB2 also includes a function called generate_unique() that, not surprisingly, generates IDs for these column types that are guaranteed unique across an entire database. Since the Python C API knows how to take a byte array and turn them into a Python string, everything should be good, right? Wrong.

What happens if you try to compress data for one of these columns, but it looks like '20000000000000000000000000' instead? You get this:

>>> from binascii import a2b_hex
>>> a2b_hex('20000000000000000000000000')
' \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

That’s a lot of null bytes, but it’ll store in the database just fine. Retrieving that from the database, however, has some unexpected results: Python only gives you one of the 26 characters you stored.  Turns out the driver does a strlen on the binary data sent from the database engine, which sees a null byte and treats it as a string terminator, truncating your data.

The good news is the fix was trivial once a convention was agreed upon. The group responsible for the Ruby driver also picked up the fix and will be releasing a comparable patch it in the near future. Big thanks to both groups for helping us get over a substantial hurdle.

Update (3/31/2010): The devs for the Python ibm_db project have released a new version that includes the fix mentioned above. I finally got around to testing the change, and it looks to have solved our problem. Hopefully the Ruby driver is corrected upstream as well shortly.

Comments

Leave a Reply