Abstract
As of version 0.9.6.1, Pylons natively supports large file uploads through some cgi.FieldStorage and tempfile.TemporaryFile magic. It is reasonably efficient, yet not fully optimized. Moreover, there is little or no support for restricting file size, which means that, technically, a user could upload a single extremely large file through a Pylons app to eat up all of your server's free disk space, and you have little or no way to stop it. (If I'm wrong, please, please correct me and tear this article in half!) Here's a hacky way to solve the problem. Source code and a sample app are provided. (I hope to have a chance to merge this into Pylons' trunk.)
Let the Game Begin
There are several ways to restrict the length of a file (form field) that is about to be uploaded. I'm going to create a sample project called Hello first, and then show you how to do the job.
Open a terminal and enter the following commands:
Open Hello/hello/controllers/file.py in your preferred text editor, and modify the code like this:
    import logging

    from hello.lib.base import *

    log = logging.getLogger(__name__)

    class FileController(BaseController):

        def index(self):
            # Return a rendered template
            #   return render('/some/template.mako')
            # or, Return a response
            return 'Hello World'

        def upload(self):
            return """
    <form action="/hello/file/receive" enctype="multipart/form-data"
    method="post">
    <h2>Large File Upload Test</h2>
    File: <input name="myfile" type="file" />
    <input type="submit" />
    </form>
    """

        def receive(self):
            return "We are going to return something meaningful here."
Switch back to the terminal and, under the Hello directory, type:
Visit http://127.0.0.1:5000/file/upload in your web browser; you should see a web page with a file upload form. Now we've got a working environment to continue with.
The Essentials of File Uploading
In the above example, the uploaded file can be accessed in a controller method via request.POST['myfile'].file, which is a file-like object in Python. Here's where request.POST comes from:
    class WSGIRequest(object):
        # ...omitted for brevity...

        def _POST(self):
            return parse_formvars(self.environ, include_get_vars=False)

        def POST(self):
            # ...docstring omitted...
            params = self._POST()
            if self.charset:
                params = UnicodeMultiDict(params, encoding=self.charset,
                                          errors=self.errors,
                                          decode_keys=self.decode_param_names)
            return params
        POST = property(POST, doc=POST.__doc__)

        # ...omitted for brevity...
The UnicodeMultiDict is basically a Python dict wrapper, so the most important part is parse_formvars, which comes from:
    def parse_formvars(environ, include_get_vars=True):
        """Parses the request, returning a MultiDict of form variables.

        (Here's a simplified version for demonstration only!)
        """
        source = environ['wsgi.input']
        if 'paste.parsed_formvars' in environ:
            parsed, check_source = environ['paste.parsed_formvars']
            if check_source == source:
                return parsed
        input = environ['wsgi.input']
        fs = cgi.FieldStorage(fp=input,
                              environ=environ,
                              keep_blank_values=1)
        formvars = MultiDict()
        if isinstance(fs.value, list):
            for name in fs.keys():
                values = fs[name]
                if not isinstance(values, list):
                    values = [values]
                for value in values:
                    if not value.filename:
                        value = value.value
                    formvars.add(name, value)
        environ['paste.parsed_formvars'] = (formvars, source)
        return formvars
environ['wsgi.input'] is a stream representing the HTTP body. It is provided by the WSGI server and passed as the fp argument (short for "file pointer", a file-like object) of the cgi.FieldStorage class. The input stream is then cached and parsed by a cgi.FieldStorage instance, and the result is wrapped in a MultiDict, which is returned.
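To make the "parse once, then reuse" step concrete, here is a minimal sketch in modern Python 3 using only the standard library. It is not Paste's code: the cache key `example.parsed_formvars` and the function name `parse_form_once` are made up for illustration, and it handles only a URL-encoded body rather than multipart data. The point is that the parsed result is stored in environ next to the stream it came from, so a second call does not try to read the already-consumed body.

```python
import io
from urllib.parse import parse_qsl

# Hypothetical cache key, playing the role of 'paste.parsed_formvars'.
CACHE_KEY = "example.parsed_formvars"

def parse_form_once(environ):
    """Parse a URL-encoded body from wsgi.input, caching the result in environ."""
    source = environ["wsgi.input"]
    cached = environ.get(CACHE_KEY)
    if cached is not None and cached[1] is source:
        # Same stream object as last time: the body was already consumed.
        return cached[0]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = source.read(length).decode("utf-8")
    formvars = parse_qsl(body, keep_blank_values=True)
    environ[CACHE_KEY] = (formvars, source)
    return formvars

environ = {"wsgi.input": io.BytesIO(b"name=alice&note="), "CONTENT_LENGTH": "16"}
first = parse_form_once(environ)   # reads and parses the stream
second = parse_form_once(environ)  # served from the cache, no second read
```

This is also why any size limit has to be in place before the first access to request.POST: after that, the body has been read and only the cached result is returned.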
The Quick & Easy Way, with Drawbacks
We can restrict upload length quickly and easily via the cgi module, which defines a module-level variable named maxlen. As the source code comment says, this is the "maximum input we will accept when REQUEST_METHOD is POST", in bytes. It takes effect immediately on several functions and classes of the cgi module, including FieldStorage, which raises ValueError if the input stream exceeds the length limit. The default value is 0, which means "unlimited input".
For our example, request.POST['myfile'] actually returns a cgi.FieldStorage instance; the input stream is parsed the first time request.POST is accessed, and cached in cgi.FieldStorage instances. Let's write some code to set a maximum upload size of 1,000,000 bytes:
        # ...omitted for brevity...

        def receive(self):
            import cgi
            cgi.maxlen = 1000000
            return request.POST['myfile'].filename
Save file.py, visit http://127.0.0.1:5000/file/upload and try to upload a file that is slightly larger than 1MB. You'll get a fancy debug page describing the ValueError exception. You can then write some code to catch the exception and do whatever you like.
However, this scheme has some serious drawbacks:
- The limit is about the entire POST body, not a single field.
- The effect of the restriction is global (for every request).
- Modifying cgi.maxlen is not thread-safe.
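The "global" drawback is easy to reproduce without Pylons or the cgi module at all. The following Python 3 sketch uses a made-up stand-in class, `FakeCgi`, whose `maxlen` attribute plays the role of cgi.maxlen; the three `receive_*` functions are hypothetical controller actions. One action sets a tight limit, and a later, unrelated action is rejected because the limit is still in force.

```python
class FakeCgi:
    """Stand-in for a module holding one process-wide limit, like cgi.maxlen."""
    maxlen = 0  # 0 means "unlimited"

def check_body(body):
    # Mimics the kind of check a parser performs against the shared limit.
    if FakeCgi.maxlen and len(body) > FakeCgi.maxlen:
        raise ValueError("Maximum content length exceeded")
    return len(body)

def receive_upload(body):
    FakeCgi.maxlen = 1000000   # this action wants roughly 1MB
    return check_body(body)

def receive_avatar(body):
    FakeCgi.maxlen = 10        # this action wants a tiny limit
    return check_body(body)

def receive_comment(body):
    # This action never asked for any limit...
    return check_body(body)

receive_avatar(b"tiny")
try:
    receive_comment(b"x" * 500)
    leaked = False
except ValueError:
    # ...but it inherits the limit set by the previous request.
    leaked = True
```

With several threads serving requests at once, the same shared variable can also change between the moment one request sets it and the moment its body is parsed, which is the thread-safety problem.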
Obviously, we need a better solution.
Clean up the Barriers
Before we move on, there are problems to solve. Let's keep the above code unchanged (so the size limit remains) and try to upload a REALLY large file, say, 200MB in size. After you click the "Submit" button, the hard drive starts to churn, the CPU fan spins up noisily, and after a minute or two the fancy debug page pops up. If you upload a 400MB file, it takes twice as long. What happened? It seems that the WHOLE file was uploaded before the exception was raised. What about the 1MB size limit?
The problem comes from the paste.cascade.Cascade middleware, which is set up in Hello/hello/config/middleware.py:
    from paste.cascade import Cascade
    # ...omitted for brevity...

    def make_app(global_conf, full_stack=True, **app_conf):
        # ...omitted for brevity...

        # Static files
        javascripts_app = StaticJavascripts()
        static_app = StaticURLParser(config['pylons.paths']['static_files'])
        app = Cascade([static_app, javascripts_app, app])
        return app
The Cascade middleware copies the whole input stream into a temporary file, and only then does the rest of its work. Therefore, even if we set a file size restriction, a user could still fill up the server's hard drive by uploading an extremely large file, because the whole file is cached on disk before we can stop it. This is a problem.
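The effect of copying first can be measured with a small Python 3 sketch. None of this is Paste code: `CountingStream`, `handler_with_limit` and `buffered_first` are hypothetical names, and `buffered_first` only imitates the copy-to-temporary-file behavior described above. The counter shows how many bytes were pulled from the client before the limit check got a chance to fire.

```python
import io
import tempfile

class CountingStream:
    """Wraps a byte string and counts how many bytes have been read from it."""
    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._stream.read(size)
        self.bytes_read += len(chunk)
        return chunk

def handler_with_limit(stream, limit, chunk_size=1024):
    """Read in chunks and give up as soon as the limit is exceeded."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > limit:
            raise ValueError("Maximum content length exceeded")

def buffered_first(stream, limit):
    """Copy the whole stream to a temporary file, then run the handler on the copy."""
    with tempfile.TemporaryFile() as tmp:
        while True:
            chunk = stream.read(8192)
            if not chunk:
                break
            tmp.write(chunk)
        tmp.seek(0)
        return handler_with_limit(tmp, limit)

body = b"x" * 100000

buffered = CountingStream(body)
try:
    buffered_first(buffered, limit=2000)
except ValueError:
    pass  # rejected, but only after the full copy

direct = CountingStream(body)
try:
    handler_with_limit(direct, limit=2000)
except ValueError:
    pass  # rejected after two chunks
```

In the buffered case all 100,000 bytes are read from the client before the error; in the direct case only 2,048 are.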
Since Cascade is a general-purpose middleware, it needs to cache the input stream so that each app in the cascade can reread it. Luckily, StaticURLParser and StaticJavascripts aren't interested in the request body, so I created a substitute named DirectCascade, which is a subclass of Cascade that doesn't copy the input stream. I put it in Hello/hello/lib/fieldmaxlen.py:
    from paste.cascade import Cascade

    class DirectCascade(Cascade):
        """Cascade-like middleware which doesn't copy wsgi.input.

        When the app handles large file uploads, this will save a considerable
        amount of time and resources.

        (Code mostly pilfered from *paste.cascade.Cascade*)
        """

        def __call__(self, environ, start_response):
            failed = []

            def repl_start_response(status, headers, exc_info=None):
                code = int(status.split(None, 1)[0])
                if code in self.catch_codes:
                    failed.append(None)
                    return _consuming_writer
                return start_response(status, headers, exc_info)

            def _consuming_writer(s): pass

            for app in self.apps[:-1]:
                environ_copy = environ.copy()
                failed = []
                try:
                    v = app(environ_copy, repl_start_response)
                    if not failed:
                        return v
                    else:
                        if hasattr(v, 'close'):
                            # Exhaust the iterator first:
                            list(v)
                            # then close:
                            v.close()
                except self.catch_exceptions, e:
                    pass
            return self.apps[-1](environ, start_response)
And the code of middleware.py becomes:
    #from paste.cascade import Cascade
    from hello.lib.fieldmaxlen import DirectCascade
    # ...omitted for brevity...

    def make_app(global_conf, full_stack=True, **app_conf):
        # ...omitted for brevity...

        # Static files
        javascripts_app = StaticJavascripts()
        static_app = StaticURLParser(config['pylons.paths']['static_files'])
        #app = Cascade([static_app, javascripts_app, app])
        app = DirectCascade([static_app, javascripts_app, app])
        return app
Now the barriers are cleaned up, so let's do something really cool.
Create a series of restricted objects
After spending some time investigating the Pylons source code, I found that the best place to restrict file upload size is the cgi.FieldStorage class. So I created the first version of RestrLenFieldStorage, which is a subclass of cgi.FieldStorage:
    import os
    import cgi
    # ...other imports omitted for brevity...

    maxlendict_key = 'rockallite.maxlen_dict'

    class FieldTooLongError(Exception): pass

    class RestrLenFieldStorage(cgi.FieldStorage):
        """FieldStorage-like class with field length restriction."""

        def __init__(self, fp=None, headers=None, outerboundary="",
                     environ=os.environ, keep_blank_values=0, strict_parsing=0):
            if headers and 'content-disposition' in headers:
                cdisp, pdict = cgi.parse_header(headers['content-disposition'])
                if 'name' in pdict:
                    name = pdict['name']
                    maxlen_dict = environ.get(maxlendict_key, {})
                    if name in maxlen_dict:
                        self.maxlen = maxlen_dict[name]
            # Since cgi.FieldStorage is an old-style class
            cgi.FieldStorage.__init__(self, fp, headers, outerboundary, environ,
                                      keep_blank_values, strict_parsing)

        def read_binary(self):
            maxlen = getattr(self, 'maxlen', 0)
            if maxlen > 0 and self.length > maxlen:
                raise FieldTooLongError, "Maximum field length exceeded, " \
                                         "field name [%s]" % self.name
            # Since cgi.FieldStorage is an old-style class
            cgi.FieldStorage.read_binary(self)

        def _FieldStorage__write(self, line):
            # Override the internal __write() method, thus break encapsulation
            maxlen = getattr(self, 'maxlen', 0)
            if maxlen > 0 and self.file.tell() + len(line) > maxlen:
                raise FieldTooLongError, "Maximum field length exceeded, " \
                                         "field name [%s]" % self.name
            # Since cgi.FieldStorage is an old-style class
            cgi.FieldStorage._FieldStorage__write(self, line)

        def __del__(self):
            fileobj = getattr(self, 'file', None)
            if fileobj: fileobj.close()

        def __repr__(self):
            """Monkey patch for FieldStorage.__repr__

            (Borrowed from Pylons)
            """
            if self.file:
                return "FieldStorage(%r, %r)" % (
                    self.name, self.filename)
            return "FieldStorage(%r, %r, %r)" % (
                self.name, self.filename, self.value)
It does almost the same thing as its superclass, except that it raises a FieldTooLongError when the length of a field exceeds what we specified. In our example, we can specify a maximum length of 1,000,000 bytes for the myfile field in this way:
        # ...omitted for brevity...

        def receive(self):
            # Must be set before accessing request.POST or request.params
            request.environ['rockallite.maxlen_dict'] = {'myfile': 1000000}
            return request.POST['myfile'].filename
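The least obvious trick in RestrLenFieldStorage is the method named _FieldStorage__write. Python rewrites a double-underscore method such as __write inside a class FieldStorage to _FieldStorage__write, so a subclass can override the "private" method by defining it under the mangled name. Here is a minimal Python 3 sketch of that technique in isolation; `Storage` and `LimitedStorage` are hypothetical stand-ins, not the real cgi classes, and the size check mirrors the tell() + len(line) comparison used in the article's class.

```python
import io

class FieldTooLongError(Exception):
    pass

class Storage:
    """Stand-in for cgi.FieldStorage: writes lines through a private method."""
    def __init__(self):
        self.file = io.BytesIO()

    def __write(self, line):
        # Stored on the class as _Storage__write because of name mangling.
        self.file.write(line)

    def read_lines(self, lines):
        for line in lines:
            # Compiled as self._Storage__write(line), so a subclass override
            # under the mangled name is picked up here.
            self.__write(line)

class LimitedStorage(Storage):
    """Overrides the mangled name to enforce a maximum length."""
    def __init__(self, maxlen=0):
        Storage.__init__(self)
        self.maxlen = maxlen

    def _Storage__write(self, line):
        if self.maxlen > 0 and self.file.tell() + len(line) > self.maxlen:
            raise FieldTooLongError("Maximum field length exceeded")
        Storage._Storage__write(self, line)

ok = LimitedStorage(maxlen=10)
ok.read_lines([b"12345", b"67890"])      # exactly 10 bytes: accepted

try:
    LimitedStorage(maxlen=10).read_lines([b"12345", b"678901"])
    raised = False
except FieldTooLongError:
    raised = True                         # the 11th byte trips the limit
```

Because the check happens on every write, the error is raised as soon as the limit is crossed rather than after the whole field has been stored. The price is a dependency on an internal name of the parent class, which is the broken encapsulation the code comment admits to.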
There's still something to do before we can use this. First make a copy of parse_formvars and modify it to take advantage of RestrLenFieldStorage:
    def parse_formvars(environ):
        """Parses the request, returning a MultiDict of form variables.

        (This is a monkey-patched version of paste.request.parse_formvars,
        using RestrLenFieldStorage instead of cgi.FieldStorage.)
        """
        # ...same as the original, omitted for brevity...

        fs = RestrLenFieldStorage(fp=input,
                                  environ=environ,
                                  keep_blank_values=1)

        # ...same as the original, omitted for brevity...
Then, create a subclass of WSGIRequest which utilizes the modified version of parse_formvars. I also added two extra methods, set_field_maxlen and maxlen_fields, which simplify the modification and representation of request.environ['rockallite.maxlen_dict']:
    from paste.wsgiwrappers import WSGIRequest
    # ...other imports omitted for brevity...

    maxlendict_key = 'rockallite.maxlen_dict'

    def parse_formvars(environ):
        # ...omitted for brevity...

    class RestrLenWSGIRequest(WSGIRequest):
        """WSGIRequest-like class with field length restriction."""

        def _POST(self):
            return parse_formvars(self.environ)

        def set_field_maxlen(self, name, length=None, kb=None, mb=None, gb=None):
            """Set maximum length (in bytes) of a field.

            If length <= 0, the maximum length is unlimited (key deleted from the
            environ dict).
            """
            # Check the arguments correctness
            if len(filter(lambda arg: arg!=None, [length, kb, mb, gb])) != 1:
                raise ValueError, \
                    "One and only one of [length, kb, mb, gb] should be specified"

            if length: length = long(length)
            if kb: length = long(float(kb) * 1024)
            if mb: length = long(float(mb) * 1048576)
            if gb: length = long(float(gb) * 1073741824)

            env = self.environ
            if length <= 0:
                env.get(maxlendict_key, {}).pop(name, None)
            else:
                env.setdefault(maxlendict_key, {})[name] = length
            # Remove the dict from environ entirely once it is empty
            if not env.get(maxlendict_key, None):
                env.pop(maxlendict_key, None)

        @property
        def maxlen_fields(self):
            """Return a copy of the maxlen dict."""
            # DO NOT alter the returned dictionary!
            return self.environ.get(maxlendict_key, {}).copy()
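The argument handling in set_field_maxlen (exactly one of length, kb, mb or gb, converted to bytes with binary multipliers) can be tried out on its own. The function below is a Python 3 sketch of just that part, under a made-up name, `field_limit_in_bytes`; it is not part of the article's module and leaves out the environ bookkeeping.

```python
def field_limit_in_bytes(length=None, kb=None, mb=None, gb=None):
    """Convert exactly one of length/kb/mb/gb into a byte count."""
    given = [arg for arg in (length, kb, mb, gb) if arg is not None]
    if len(given) != 1:
        raise ValueError(
            "One and only one of [length, kb, mb, gb] should be specified")
    if length is not None:
        return int(length)
    if kb is not None:
        return int(float(kb) * 1024)
    if mb is not None:
        return int(float(mb) * 1024 ** 2)
    return int(float(gb) * 1024 ** 3)

one_mb = field_limit_in_bytes(mb=1)
one_and_a_half_kb = field_limit_in_bytes(kb=1.5)

try:
    field_limit_in_bytes(kb=1, mb=1)
    rejected = False
except ValueError:
    rejected = True   # two units at once are refused
```

Note that mb=1 means 1,048,576 bytes, which is slightly more than a plain length of 1,000,000.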
Meanwhile, create a subclass of PylonsBaseWSGIApp which utilizes RestrLenWSGIRequest as the global parameter request in controllers:
    from paste.wsgiwrappers import WSGIRequest
    import pylons
    from pylons.wsgiapp import PylonsBaseWSGIApp
    # ...other imports omitted for brevity...

    class RestrLenWSGIRequest(WSGIRequest):
        # ...omitted for brevity...

    class RestrLenBaseWSGIApp(PylonsBaseWSGIApp):
        """PylonsBaseWSGIApp-like class with field length restriction."""

        def setup_app_env(self, environ, start_response):
            # PylonsBaseWSGIApp is a new-style class
            super(RestrLenBaseWSGIApp, self).setup_app_env(environ, start_response)

            registry = environ['paste.registry']
            req = RestrLenWSGIRequest(environ)
            registry.register(pylons.request, req)
Finally, modify middleware.py to take advantage of RestrLenBaseWSGIApp:
    from hello.lib.fieldmaxlen import RestrLenBaseWSGIApp, DirectCascade
    # ...other imports omitted for brevity...

    def make_app(global_conf, full_stack=True, **app_conf):
        # ...omitted for brevity...

        # The Pylons WSGI app
        #app = PylonsApp()
        app = PylonsApp(base_wsgi_app=RestrLenBaseWSGIApp)

        # CUSTOM MIDDLEWARE HERE (filtered by error handling middlewares)

        # ...omitted for brevity...

        javascripts_app = StaticJavascripts()
        static_app = StaticURLParser(config['pylons.paths']['static_files'])
        #app = Cascade([static_app, javascripts_app, app])
        app = DirectCascade([static_app, javascripts_app, app])
        return app
And the controller code can be written as:
        # ...omitted for brevity...

        def receive(self):
            # Must be set before accessing request.POST or request.params
            # A limit of one binary megabyte (1,048,576 bytes) can be written as:
            #   request.set_field_maxlen('myfile', mb=1)
            request.set_field_maxlen('myfile', 1000000)
            return request.POST['myfile'].filename
We have now finished the first version of our restricted app. Visit http://127.0.0.1:5000/file/upload and try to upload a 200MB file. The upload will be interrupted quickly because of the 1MB restriction on the field length, and a URL will be shown in the console pointing to the fancy debug page for the FieldTooLongError exception. But wait! Why is there a "connection reset" page in the web browser? Shouldn't it be the famous fancy debug page?
After googling a bit, I found some interesting links:
LimitRequestBody and closed connections
Hi everybody,
I am looking for advise on using LimitRequestBody in Apache conf file.
So I set LimitRequestBody to a certain number to control mammoth-size posts. When I try to post tons of data via form using a Web browser it just cuts off like the server dropped a connection. I don't see expected 413 status code and the corresponding error doc that I set up. Both IE and FF do that. However when I tried to post the same request via curl it did show me 413 in headers and the error doc as well. Any idea what's going on here? Is it the browser's fault? Ideally, I would like the browser to show 413 error message otherwise the end-user gets an impression that the server just died. Your help is greatly appreciated. Thanks!
Server: Apache/2.0.55 (Ubuntu) mod_ssl/2.0.55 OpenSSL/0.9.8a
And a more detailed explanation:
...
1.5. Problems with ErrorDocument 413 processing
Normally, there are two components in the web server which can detect the post size limit problem:
...
The HTTP protocol also allows the browser to detect the problem. The optimal way to handle such limits is for the browser to implement the optional Expect: 100-continue handshake with the server so that the browser first tells the server the size of the upload, then the server tells the browser whether or not it is okay based on the size, then the browser either continues with the upload or displays an error to the user.
Popular web browsers such as Internet Explorer do not implement the Expect: 100-continue handshake. Instead, they simply start uploading the data, however large, and continue uploading until done. It is only at the end of the upload that they may see a 413 response from IBM HTTP Server (probably sent long before).
During the time that the web browser is still uploading the request body, IBM HTTP Server may drop the connection (after having already sent an error response). If the browser sees that the connection is dropped during the upload, it often will not see the error message which was sent previously, so the user will not see any error message.
IBM HTTP Server will not read unlimited amounts of request body when it has already been identified as too big. That ties up web server resources for too long and could consititute a Denial of Service. Given that the web server may drop the connection before the entire request body has been read, and that the web browser will not process the error response if it finds out that the connection has been dropped before it finishes uploading, the error message from the web server may or may not be displayed, depending on:
- size of upload (smaller uploads increase likelihood of seeing error response)
- size of error response (larger error responses increase likelihood of seeing error response)
- network bandwidth
A plug-in module could be written for IBM HTTP Server 2.0 and above which will cause the web server to read the entire request body before the error is sent. That usually results in the web browser displaying the desired message. However, this is not recommended for production use since it can tie up web server resources for a long period of time (as long as it takes the client to upload an arbitrary amount of data). It could be used for a denial of service attack.
...
The conclusion is that the (improper) HTTP POST implementation of "popular web browsers" such as IE or Firefox stops them from reading the response when the web server decides the upload is too big. After the server does what it should do, that is, send a 413 or 500 or whatever response suits a too-big request body, all it can do is drop the connection in order to free server resources. So the web browser thinks that the server "just died" and shows an unfriendly built-in "connection reset" error page. The behavior is the same for this scheme and for the previous cgi.maxlen approach.
Make it graceful, make it safe
To make the upload size restriction friendlier, there are two main kinds of improvement:
- Client-side: we do some client-side scripting (AJAX, etc.) so that the web browser detects whether the upload finished or was interrupted, and shows an appropriate message.
- Server-side: the web server gracefully accepts the whole request body no matter how large it is, but when the upload exceeds the size limit it silently discards the excess data and leaves an indicator for later processing. This is similar to the "plug-in module" mentioned in the IBM recipe above.
Either approach has drawbacks. The former is more complicated, harder to debug, and has cross-browser issues. The latter may tie up web server resources, although not as seriously as the IBM recipe describes (because we only keep as much data as necessary and throw away the rest). However, the server-side approach is easier to implement. Since this is an article about hacking, I'll leave the first one to you and go for the server-side solution.
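Stripped of everything related to cgi.FieldStorage, the server-side idea looks roughly like the Python 3 sketch below. It is only an illustration of "read everything, keep at most the limit, remember that it overflowed"; the function name `read_capped` and its return shape are assumptions, not the article's implementation.

```python
import io

def read_capped(stream, limit, chunk_size=8192):
    """Consume the whole stream, keep at most `limit` bytes, report overflow."""
    kept = io.BytesIO()
    overflowed = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        room = limit - kept.tell()
        if len(chunk) > room:
            # Over the limit: keep what still fits and drop the rest,
            # but keep reading so the client can finish its upload.
            overflowed = True
            chunk = chunk[:max(room, 0)]
        kept.write(chunk)
    return kept.getvalue(), overflowed

upload = io.BytesIO(b"a" * 20000)
data, too_long = read_capped(upload, limit=10000)
small, small_too_long = read_capped(io.BytesIO(b"abc"), limit=10)
```

Disk and memory use stay bounded by the limit, but the connection and a worker stay busy for as long as the client keeps sending, which is the resource cost mentioned above.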
I modified RestrLenFieldStorage a bit, and created the second version: