Unicode Support for WSGI¶
Title: | WSGI Unicode Handling |
---|---|
Author: | Armin Ronacher <armin.ronacher@active-4.com> |
Status: | Rejected |
Created: | 1-Nov-2006 |
Rejected¶
This proposal is rejected mainly because of those reasons:
- It’s easy enough for applications to do that on their own
- Many applications don’t use unicode objects
- there should be an easier and more flexible way for that issue
From Ian Bicking:
I’ll add some commentary here, since I was the primary critic (of the limited audience before Armin withdrew this specification). Leaving this proposal here hopefully will be useful to later people considering this problem.
Changing the response app_iter is pretty heavy, and isn’t really an extension to WSGI, it’s a change to the core specification. Current WSGI implementors really expect
str
responses. Whenstr
goes away in Python 3000, they will have to expectbytes
responses too, but that’s a relatively straight-forward (though not trivial) change. Dealing with backward compatibility is quite difficult.The use cases I personally see in this is avoiding the confusion and overhead of encoding and decoding responses when there are intermediaries which handle the response in its unicode form. This is not uncommon – for instance, XML processing happens on unicode data, and ideally all text responses should be handled as unicode. Deciding the encoding, and then doing the proper decoding, is not completely trivial (though not terribly hard). It is hard enough that people will and have avoided it, potentially working with
str
data when that was not correct. Similarly, it is important to send either properly-encoded data, or to change the encoding in the headers. Since encoding information can show up in multiple places (unfortunately) this can also be error-prone.Despite these problems, sending unencoded data opens up a whole bunch of other problems, and realistically we get the union of all problems because we definitely cannot remove the sending of encoding text data. So everyone has to deal with both cases now, instead of just one case.
Anyway, that’s my take on this. – Ian
Abstract¶
This specification proposes a possible implementation of unicode
support in WSGI. Current all WSGI application have to output str
objects instead.
Motivation¶
Python ships two types of strings subclassing the abstract base class
basestring
. str
and unicode
. In Python 3 unicode
will
replace str
and a new class bytes
will be introduced
(PEP 3100#atomic-types, PEP 3137). Also today many developers
use unicode
objects because support a wider range of characters
and functions like len()
still return the correct output, even
when using multibyte encodings like utf-8
.
But at the moment all WSGI applications have to yield str
objects
which require that uses encoder their data to a special encoding by
hand. WSGI middlewares don’t know about the charset the application is
using etc.
Specification¶
A possible solution would be a new key in the environ called
wsgi.charset
. The WSGI gateway would set this to None
per default which means that yielding of unicode
objects results
in an exception. But if the charset is correctly defined all returned
unicode
objects get encoded in the defined encoding by the WSGI
gateway.
Middlewares could use this value too convert incomming form data to unicode automatically so that the application developer doesn’t have to take care about this issue.
Problem¶
If this environment key is updated by the application middlewares
would still see None
as charset because it’s updated on first
iteration only. So an application developer would need to wrap the
whole application including middlewares afterwards again with a new
middleware that updates this key. Another possibility would be that
the WSGI gateway provides a configuration value for the charset.
If encoding the output of the wsgi application the gateway must also
get the wsgi.charset
key each time a unicode object is
found. Caching won’t work because the application must be able to
change the charset before each iteration:
def app(environ, start_response):
start_response('200 OK', [('Content-Type', 'text/plain')])
environ['wsgi.charset'] = 'utf-8'
yield u'Hällo Wörld'
environ['wsgi.charset'] = 'iso-8895-15'
yield u'Hällo Wörld'
Implementation¶
Here a very simple CGI gateway that implements this functionality:
import os
import sys
def run_with_cgi(app, charset=None):
environ = dict(os.environ.items())
environ['wsgi.charset'] = charset
environ['wsgi.input'] = sys.stdin
environ['wsgi.errors'] = sys.stderr
environ['wsgi.version'] = (1,0)
environ['wsgi.multithread'] = False
environ['wsgi.multiprocess'] = True
environ['wsgi.run_once'] = True
if environ.get('HTTPS','off').lower() in ('on','1'):
environ['wsgi.url_scheme'] = 'https'
else:
environ['wsgi.url_scheme'] = 'http'
headers_set = []
headers_sent = []
def write(data):
if not headers_set:
raise AssertionError('write() before start_response()')
elif not headers_sent:
status, response_headers = headers_sent[:] = headers_set
sys.stdout.write('Status: %s\r\n' % status)
for header in response_headers:
sys.stdout.write('%s: %s\r\n' % header)
sys.stdout.write('\r\n')
if isinstance(data, unicode):
charset = environ['wsgi.charset']
if charset is None:
raise AssertionError('application returned unicode without '
'defined charset')
data = data.encode(charset)
sys.stdout.write(data)
sys.stdout.flush()
def start_response(status,response_headers,exc_info=None):
if exc_info:
try:
if headers_sent:
raise exc_info[0], exc_info[1], exc_info[2]
finally:
exc_info = None
elif headers_set:
raise AssertionError('Headers already set!')
headers_set[:] = [status,response_headers]
return write
result = app(environ, start_response)
try:
for data in result:
if data:
write(data)
if not headers_sent:
write('')
finally:
if hasattr(result,'close'):
result.close()