I just received a bug report on JSONTokener.java. It was failing to parse texts longer than 2GB because I had declared an index variable to be an int instead of a long. The bug occurred because it never occurred to me that a text could be that large. At the time I wrote that code, 2GB was a good size for a disk drive. We now have cheap machines with RAM that is much larger.

It took over a decade before someone stumbled over this bug. Ultimately, it occurred because I had a choice of several number types, and I chose the wrong one.
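
A minimal, hypothetical sketch (not the actual JSONTokener code) of the failure mode: an int position counter silently wraps to a negative value once it passes 2,147,483,647 characters, while a long keeps counting.

    public class IndexOverflow {
        public static void main(String[] args) {
            int intIndex = Integer.MAX_VALUE;    // 2,147,483,647 -- the ~2GB ceiling
            intIndex++;                          // wraps to -2,147,483,648
            System.out.println(intIndex);        // a negative "position" in the text

            long longIndex = Integer.MAX_VALUE;  // the same position, held in a long
            longIndex++;                         // 2,147,483,648 -- no overflow
            System.out.println(longIndex);
        }
    }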

When we were programming Z80s, it made sense to have an assortment of int sizes because RAM was so incredibly scarce. We very carefully invested as few bits as possible in each variable. That kind of economy makes no sense today, but most of our languages today still invite that mistake.
 
It seems pretty drastic to throw away different-sized types because of something like this. I still find it useful to minimize memory use on things where I might be dealing with hundreds of millions of records. Like an age field in a struct in a user table, for example. I don't need a 64-bit long long to store human ages. An 8-bit unsigned int is plenty, and it might mean the difference between being able to store all my user records in RAM vs. not.

And yes, when humans start to live beyond 255 years I will have a problem. But then I will have plenty of time to fix it.
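
For what it's worth, a rough Java sketch of the kind of per-record packing described above (Java has no unsigned byte, so the value is masked on the way back out; the User class and its fields are made up for illustration):

    public class User {
        private final byte age;  // 1 byte per record instead of 4 or 8

        public User(int age) {
            if (age < 0 || age > 255) {
                throw new IllegalArgumentException("age out of range: " + age);
            }
            this.age = (byte) age;   // keep only the low 8 bits
        }

        public int getAge() {
            return age & 0xFF;       // mask to recover the unsigned 0..255 value
        }
    }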
 
I really appreciate how Python does away with the distinction between small numbers that fit into a C int and arbitrarily big numbers that require variable-length storage. On those rare occasions when I do need to pack numbers tightly, it is easy to do, in part because Python is never just Python, it is always Python + as much C as you need. I can use any open source tool that I want (almost all written in C), such as BerkeleyDB, JudyTree, sqlite, or write my own tool in C and then use it from Python. (I've used both approaches more than once, successfully.)

I wish Python would likewise de-emphasize and hide binary floating point numbers in favor of decimals.
 
+Zooko Wilcox-O'Hearn Sure, in scripting languages you can do that. PHP does too. But he was talking about compiled languages like Java and C here. It would mean your BerkeleyDB, JudyTree and SQLite would no longer be able to pack stuff efficiently using native types.
 
It might also suffice if the default integer type is arbitrary-length -- the type which is described to new programmers and which is used when the programmer doesn't specify a more specialized type.
 
Shorter ints make sense for structs that occur in great numbers. They no longer make sense for scalars. The hazard overwhelms the utility.
 
Interesting topic. Some thoughts (but no punch line, sorry):
* Size matters but the degree varies with how the data is used, particularly, as Rasmus notes, how ephemeral or persistent it is. Breaking the tokener's operation on new text is one thing, but imagine if all the JSON data out there was potentially impacted and had to be adapted. Y2K fun!
* So, agreed, in some situations carefully engineering data formats is vital. But one could argue that a great proportion of the time it's actually OVER engineering. I recall Alan Kay complaining bitterly long ago about the vast amount of human time and talent that's been wasted worrying about packing bits into boxes instead of delivering value, and I feel he has a point.
* For instance I suspect in Doug's case here all the code REALLY cares about is the semantics of mathematical integers--incrementing, comparing etc. That an "int" declaration also carries a size limit is often an overspecification, a gratuitous gotcha clause in the canned type's contract that's just there waiting to be broken. Ideally, declarations should be exactly necessary and sufficient (the converse case is being unable to assert that, say, it's a POSITIVE integer).
* Moreover, how often does anyone actually want bizarre mod-256 semantics?
* One could imagine a runtime environment where overflow would simply cause the dataflow to adapt, preserving the "integer" semantics without the size encumbrance.
* In fact, the tagged-data architecture used in the old Lisp machines et al supported exactly that. Most small operands just zipped through the hardware ALU, with overflow exceptions triggering a graceful roll-over into bignums. Of course back then folks also argued memory miserliness was a virtue. Today the net advantage of sprinkling in a few tag bits to make data more self-describing seems like it might be worth revisiting.
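
Something like that graceful roll-over can be imitated in library code, if not in hardware; a hedged sketch using Java's overflow-checked arithmetic and BigInteger (an illustration of the idea, not of how the Lisp machines did it):

    import java.math.BigInteger;

    public class GracefulAdd {
        // Add two longs; on overflow, promote to BigInteger instead of
        // silently wrapping around with modular semantics.
        public static Number add(long a, long b) {
            try {
                return Math.addExact(a, b);   // throws ArithmeticException on overflow
            } catch (ArithmeticException overflow) {
                return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
            }
        }

        public static void main(String[] args) {
            System.out.println(add(1, 2));               // 3
            System.out.println(add(Long.MAX_VALUE, 1));  // 9223372036854775808
        }
    }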
 
Python (since a few years ago) automatically promotes ints (some fixed-size type provided by C) to longs (arbitrary-size) when the int overflows.