How To Delay The Apocalypse of 2198
The ID Problem
Business and innovation leaders are often asked during interviews or panels, “What’s your long-term plan?” The response usually looks forward, say, five to twenty years. But what about 50? What about 500?
One looming problem that captures the imagination involves unique identification tokens. These tokens are commonplace in the world of data, and they single out a particular record. For example, Facebook associates a number with each post. Here’s one: 10211035557083058.
That identification number (ID) must be unique so that every time you do some action with a post, Facebook knows which post you are referring to. Because of that uniqueness, as people add more and more posts to Facebook’s system, the number will grow larger and larger.
At some point, one might say, the ID tokens will contain more data than the posts themselves!
And it’s not just Facebook posts. Almost everything that’s stored in a typical database has an ID. All types of data, from each product and sale on Amazon to every book and checkout at the local library.
What happens when, a million years from now, our databases are mostly full of the ID numbers for the actual data we care about? There will be so many Facebook posts, so many tweets, or so many library books that, to stay unique, the IDs will be massive.
In a world charging towards “more data,” there is a tangible benefit from taking a step back and rethinking how we store and retrieve information, what we store, and how we impact our future. Hordes of database engineers are already thinking about this; can thinking hundreds of years into the future help them?
The problem of growing IDs goes beyond the storage the IDs themselves require. Sharing information is also affected. How can I tweet a link to my favorite book when the ID of the book, an important part of the URL, is over 280 characters? What if I retweet a tweet? Now the ID is stored at least twice. This feedback loop of more data causing more data must be resolved.
The Problem is not Limited to Data and Tech
This issue of numbers growing unbounded first impressed itself on me at a commencement dinner I attended. After finishing the meal, we were each presented with a dessert: a piece of cake topped with a chocolate medal, on which the graduating class year was boldly pressed: “2019”. It suddenly occurred to me that this dessert would be much more costly in a few hundred thousand years, as the calendar year gains more and more digits.
The manufacturer of this candied coin has already thought this through. First, they will shrink the font to fit the additional digits, but soon their customers will complain about readability. Next, the candy grows until it’s no longer a little candy on top of a piece of cake; now the candy is a large hat, spilling over all sides of the comparatively tiny piece of cake. Customers can no longer afford both the cake and the candy, and eventually, the table itself is the candy.
All to fit the ever-increasing graduation year.
This scenario ends when the entire universe fills up with this candy that merely wants to display the graduating year of the class. By this time every resource in the universe is required to produce the last candy. The digits are now on both sides of the massive disk: this was a decision made in the year 1 quadrillion. Comprehending the number of digits printed on the intergalactic coin is just as unfathomable as understanding this demise of the universe. Scientists could have seen this coming: Our calendar year gets larger and larger, and yet the space in the universe is unchanging.
Enough about the problem — time to talk about solutions. Specifically, what can we do about the ID problem?
Tempting Solutions to the ID Problem
Before getting to real solutions, let’s dispense with some false solutions you might think of. The goal is to keep the ID numbers of records from growing too big to manage.
Faux Solution 1: Separate The Data
Suppose that we want our largest ID to be less than 1000, and we have already tracked 999 library books. One thing you can do is make two different databases, one holding the first 999 books and one starting out empty for the next round of books. Every time a database fills up, you can just create a new one, and the IDs will start again at 0! Whew, that was a close one.
This does not work because now, to identify a book, you need two things: one ID for the book and one ID for the database that stores it. You can combine those two to form a “Master ID,” but then you’re back where you started: really long IDs.
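A quick sketch makes the point (the 1,000-books-per-database quota is the made-up limit from above):

```python
def master_id(db_id: int, book_id: int, books_per_db: int = 1000) -> int:
    """Combine a database ID and a within-database book ID into one global 'Master ID'."""
    return db_id * books_per_db + book_id

# Book 423 in database 57 becomes global ID 57423 -- the digits just moved around.
print(master_id(57, 423))
```

The combined identifier carries exactly as much information as one long ID would have, so splitting the databases saved nothing.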
Faux Solution 2: Clean Up Old Data
It’s tempting (albeit inconsiderate) to suggest we clean up Facebook posts once someone dies and Coco forgets about them. In this way, you could reuse old IDs and potentially reach some steady state at a fixed number of posts.
After recoiling at this idea’s lack of empathy, there’s still the problem that some data is relevant forever. Three examples of data that will be relevant forever are books, weather patterns, and constitutional amendments. The future is hard to predict, and while there may come a time when a book is thought of for the last time, saying that a book is gone from memory is paradoxical.
Faux Solution 3: Create New Characters
Most IDs nowadays are some combination of numbers and letters. With case sensitivity, that gives 26 + 26 + 10 = 62 different characters. Why not introduce new ones? New emojis are added every year; why not use them in our IDs?
When data enters a database, it is stored not as characters as a human might think of them, but as ones and zeros (bits). Each character is encoded into binary and takes up some number of bits. Adding new characters means all the computers must agree that a particular sequence of, say, seven bits represents a specific character. But a fixed bit-length can represent only a limited number of characters (seven bits give 2^7 = 128 patterns), so eventually new characters demand more bits per character, and nothing is gained.
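The arithmetic behind this is short (a sketch; the alphabet sizes are illustrative):

```python
import math

def bits_per_char(alphabet_size: int) -> int:
    """Minimum bits needed to give every character in the alphabet its own bit pattern."""
    return math.ceil(math.log2(alphabet_size))

print(bits_per_char(62))       # letters + digits: 6 bits per character
print(bits_per_char(128))      # an ASCII-sized alphabet: 7 bits
print(bits_per_char(100_000))  # an emoji-scale alphabet: 17 bits
```

A bigger alphabet makes the IDs shorter in characters, but every doubling of the alphabet costs one more bit per character, so the total bits stored come out the same.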
Faux Solution 4: Stop Creating Data
Good luck with that.
How to Delay the Data Apocalypse
There’s no denying that ultimately we will run out of space to store information. In fact, we know when that happens: the year 2198 (Cambria et al. 2). That’s the year when, if we could use every atom on Earth to encode information, we’d run out of atoms because of our continually growing production of data.
Not to worry, though; data storage and compression improve frequently. Specifically for the ID problem, there is some solace in this observation: if the IDs encode more information than the data itself, it is possible to recycle them without losing the old information.
Recall that the problem is not that we have too much data — we have too much metadata. Can we devise a system such that, for any new piece of data, the ID is not larger than the data itself? Sure we can, and here are the instructions (are you listening, database engineers?):
- Calculate the next ID for a piece of data.
- If the ID is smaller than the piece of data, great! Use it.
- If not, scrap that ID and use the data itself as the ID.
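In code, the three steps might look like this (a minimal sketch; I’m interpreting “smaller” as fewer bytes):

```python
def assign_id(next_id: int, data: str) -> str:
    """Return an ID that is never larger, in bytes, than the data it labels."""
    candidate = str(next_id)  # step 1: the next counter-based ID
    if len(candidate.encode("utf-8")) < len(data.encode("utf-8")):
        return candidate      # step 2: the counter ID is smaller, so use it
    return data               # step 3: otherwise the data is its own ID

print(assign_id(10211035557083058, "short"))
print(assign_id(42, "a much longer Facebook post"))
```

Note that the scheme trades uniqueness for size: two identical posts now receive identical IDs, which is the catch discussed below.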
This solution guarantees that the ID of any piece of data is not larger than the data itself. This means that even after trillions, gazillions of Facebook posts, the ID tokens associated with each post will not get so massively large as to take up most of the storage space in the universe. Let’s head to a café and celebrate our success.
There is a catch, though. Each piece of data stored under this system does not have a unique identifier. If two people post the same string of words to Facebook, their IDs will be the same. If I want to like one person’s post and not the other, we need to have extra metadata to keep track of that, too.
I present this idea outside the “Faux Solutions” heading not because it is a real solution but because I like it better than the other fake solutions. Hopefully it will jog those overworked database engineers’ minds, and they’ll come up with a solution before 2198.
Cambria et al. “Storages Are Not Forever.” Cognitive Computation, 2017. https://www.osti.gov/pages/biblio/1420053-storages-forever
That’s it. Everything else is fallible.
Photos by me.