Got my first patch in to Python. The important parts are just as I sent them; the cleanup bits were spiffed up and done better. Thanks for the fast work, Neal!
A while back I pointed to http://python.org/sf/1365916 as I lamented how long it was going without response. Since then good things have happened. Armin Rigo and Neal Norwitz weighed in, first to see if Joe was willing to check for other bad format strings in mmapmodule.c, and then when I pointed them to my work they enthusiastically supported me creating a patch.
Creating a patch for this problem which crosses over many parts of the C implementation of Python is no easy task. Having looked at the simple output of my format checker, I expected things to be pretty cut and dry. Some of them are. Parsing an int format into a size_t variable is as bad as parsing it into a long. Scenarios like this need to be changed to use an intermediate variable of an explicitly understood size.
But some of them aren't. The ones that seem most confusing are the signed/unsigned conflicts. There are only subtle differences between signed and unsigned integers, but sometimes these differences are very desirable. Furthermore the difference on the Python parsing layer between signed and unsigned isn't the same as on the C level. If I read the code correctly, when parsing a signed integer, Python will first check that the value is between -2**(bits-1) and 2**(bits-1)-1 and raise an error if it is not. But it treats unsigned integers as bitfields, so does not verify the value is between 0 and 2**bits-1.
For the changes I'm making this is pretty good. It means user code won't suddenly start throwing exceptions after replacing a signed format with an unsigned one. However in the odd chance they were relying on the out of signed-bounds exception on a semantically unsigned value, there could still be issues when Python stops throwing the exception.
For now I'm planning to change each ParseTuple signed-format with unsigned variable to use an unsigned format, but I'll leave alone unsigned-format with signed variable. Fixing bugs like this without impacting everyone is impossible, but hopefully this is the right compromise.