Friday, March 9th, 2012

Unicode is driving me nuts

Has anyone working in Python experienced what I like to call Unicode Hell? If you're not sure what I am talking about, consider this:

You are developing an application using both Unicode-aware and non-Unicode-aware Python packages. One non-Unicode-aware package is python-ldap. I am currently building a web application in Django that connects to an Active Directory domain to read information, and I am now working on modifying data. When data is sent to Django from the web browser, it arrives as Unicode. python-ldap does not like Unicode and isn't smart enough to convert it to a regular string or list of strings, so I am now writing that conversion into my code.
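The kind of conversion I mean might look roughly like the sketch below. It is only a minimal illustration: the helper name `to_bytes` is made up for this post, UTF-8 is an assumed target encoding, and it assumes the LDAP library wants plain byte strings everywhere (attribute names as well as values).

```python
# Hypothetical helper, not part of python-ldap or Django.
try:
    text_type = unicode  # Python 2: the Unicode text type
except NameError:
    text_type = str      # Python 3: str is already Unicode


def to_bytes(value, encoding="utf-8"):
    """Recursively encode Unicode text to byte strings.

    Dicts, lists and tuples are walked; anything that is not text
    (numbers, None, existing byte strings) is returned unchanged.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, dict):
        return {to_bytes(k, encoding): to_bytes(v, encoding)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type with encoded members.
        return type(value)(to_bytes(v, encoding) for v in value)
    return value
```

With something like this, form data such as `{u"givenName": [u"Ren\xe9"]}` could be passed through `to_bytes()` once, just before it is handed to the LDAP layer, instead of encoding each field by hand.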

Yesterday I was trying to troubleshoot why my class wasn't saving back to LDAP correctly, even though it worked from a standard Python shell. It hit me today that the problem must be related to Unicode: I ran into a similar issue earlier when retrieving user-specified fields from AD, and got a complaint that field names couldn't be in Unicode format.

Hopefully all strings are forced to be Unicode in Python 3. If that is not implemented (most likely because it would break backwards compatibility, though Python 3 breaks that anyway), they should add a way to force everything to Unicode.

I really love Unicode and wish that every language fully supported it (*cough* PHP *cough*); that way, developing multilingual applications would be much easier. Some languages support Unicode, but not to the extent that Python does. There are also some languages that are only now adding Unicode, after centuries of there being multiple languages in the world. Python also has really good internationalization support built in, and Django especially so.

Comment #1: Posted 2 years, 6 months ago by Nick Coghlan

Making all the text Unicode is basically the *reason* Python 3 exists.

The other backward compatibility breaks in Python 3 really just came along for the ride after Guido committed to the big Unicode break :)

Comment #2: Posted 2 years, 6 months ago by Jean-Paul

Python 2.7, "from __future__ import unicode_literals", yay.

Comment #3: Posted 2 years, 6 months ago by Lee

Don't forget Java, which has been unicode from day 1.

Comment #4: Posted 2 years, 6 months ago by Krys Lawrence

My guess is that Active Directory returns strings in Windows-1252 encoding. I would also guess that python-ldap just blindly passes strings through.

So, can you not just decode all strings coming from AD from Windows-1252, and encode all strings going in to that encoding?

A wrapper class/function around python-ldap could even make this all transparent to the rest of your app.

Just a thought. I have not tried it.

Good luck!
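Krys's decode-on-the-way-in, encode-on-the-way-out idea could be sketched as below. This is an illustration only: the function names are hypothetical, the entry is assumed to be a dict mapping attribute names to lists of byte strings, and the encoding is a parameter because Windows-1252 is just the guess made above (LDAPv3 itself defines directory strings as UTF-8, so `"utf-8"` may well be the right value).

```python
# Hypothetical wrapper helpers; they do not call python-ldap themselves.

def decode_entry(attrs, encoding="cp1252"):
    """Turn {bytes: [bytes, ...]} from the directory into Unicode text.

    Truly binary attributes would need to be skipped by the caller,
    since they are not text in any encoding.
    """
    return {
        (key.decode(encoding) if isinstance(key, bytes) else key):
            [value.decode(encoding) for value in values]
        for key, values in attrs.items()
    }


def encode_entry(attrs, encoding="cp1252"):
    """Turn {text: [text, ...]} from the application back into bytes."""
    return {
        (key if isinstance(key, bytes) else key.encode(encoding)):
            [value.encode(encoding) for value in values]
        for key, values in attrs.items()
    }
```

Calling `decode_entry()` on every search result and `encode_entry()` on every outgoing change would keep the rest of the application working purely in Unicode.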

Comment #5: Posted 2 years, 6 months ago by LionKimbro

What Nick Coghlan said. Unicode is THE "3" in Python 3.

Comment #6: Posted 2 years, 6 months ago by Karlo Smid

Hi!
I also had a problem understanding how Unicode works in Python, but the following link (http://www.evanjones.ca/python-utf8.html) finally helped me understand it.

Based on experience on my project, I think Krys stated the solution. I had the same problem, but instead of AD, I received data from an Informix database. The database used code page 'iso-8859-2', and the rest of the system used 'utf-8'.

Regards, Karlo.
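The situation Karlo describes comes down to decoding with the source code page and re-encoding with the target one. A minimal sketch, with a made-up function name and the two encodings from the comment as defaults:

```python
def transcode(data, source="iso-8859-2", target="utf-8"):
    """Re-encode a byte string from one code page to another.

    Going through Unicode text in the middle is what makes this safe:
    bytes -> text (decode) -> bytes (encode).
    """
    return data.decode(source).encode(target)
```

For example, the single ISO-8859-2 byte `b"\xb9"` (the letter š) becomes the two UTF-8 bytes `b"\xc5\xa1"`.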

Comment #7: Posted 1 year, 10 months ago by peter

Maybe you can extend this function for your needs:
http://pastebin.com/wmefFzS3

Python Powered | © 2012-2014 Kevin Veroneau