The “One True” Fallacy – Data Roads Foundation

Although “The ‘One True’ Fallacy” may eventually qualify for a list of logical fallacies, I have neither the Ph.D. nor the time required to present all the reasoning behind adding it. I really just want to talk about data files.

Where is my file?

There is a human tendency to need to equate digital objects with physical ones. We often say “I have the file” even though we know perfectly well that copies of “the file” are strewn all over the Internet. If there is anyone who “has” the file as in legal ownership, then it’s probably a shadowy Copyright holding corporation somewhere, that may exist only in a government file cabinet (and that file cabinet may actually be a digital figment as well). This goes back to early human confusions with the nature of copies. In the same way we claim to “have the file”, a person just after the turn of the prior century might have claimed to “have the newspaper”, knowing full well that they only held one of many reproductions from a particular newsprint source. Just because I have a USB stick with a file stored on it, that I can hand to you, doesn’t mean I “have” any file. All I have is a digital representation of some data that you might find useful, in a medium that I can transfer to you conveniently. The “file” is the method for finding that data in a specific medium — it is not a physical thing unto itself.

I invented language!

This data=object confusion leads to a completely different sort of confusion, over where the “one true file” is. This is like finding the “one true inventor”, or the “original idea” for a new device — neither notion can be confirmed with any reliability, as the muddled history of any human patent law system shows. Ideas (and their sibling notion of data) are viral, progressive, integrative, and transmitted only through abstract communications — they are never created from nothingness, fully formed at inception, like some big-bang theory of thought. All our notions of origination are illusory.

Depending on who you’re talking to, this “one true file” may generally be regarded as the most popular web site where the file is available, or it may be the secret source from which this file was first propagated. The “one true” news column may thus be on a newyorktimes.com server farm, or it may be sitting in the hard drive of a desktop computer in a busy newsroom in New York — behind a firewall, far away from any server that is connected to the public web. This confusion exists even when the file you see in the web browser, the file delivered by the server, the file sitting in the writer’s desktop computer, and the files sitting in some proxy servers in between, are all an exact copy of the same data.

[Geek Alert: Skip the following paragraph if neither Computer Science nor Information Philosophy hold any interest for you… of course if that’s the case then why continue reading at all?]

I’ll grant this exact copy case is rare, since editable file formats tend to differ from read-only display file formats, for very good reasons rooted in their different uses; but those differences and reasons aren’t really germane to this discussion. I would even posit that any two files that have only one (batch) linear or pre-defined transformation step between them are the same, especially when access to said transformation step is equal to file access; but that supposition is more technical than necessary. Let’s just say that if two humans can separately read the same text, or see the same image, even with minor variations due to monitor or equipment differences, then they are essentially viewing the same file. They also implicitly have access to the same file data, because the viewing media always require a local data copy.

[End self-indulgent geekery.]

My true name is Fred, but you can call me Larry.

This “one true” file fallacy extends to the naming systems of the Internet, including both the URL and URI systems we use to browse the web. Both TLAs are short for phrases that start with the two words “uniform” and “resource”. The administrative word “uniform” is a common longhand term for concepts like the “one true” or “central” way of doing things. In computer science, “resource” tends to be shorthand for digital objects (like files). We use such misleading words because we know deep down that these computer concepts are completely made up, so words like “resources” make our roles sound more substantial than they are in reality. When any human system starts out by claiming to be “uniform” or worse “universal”, you should know that it is claiming to hold the “one true”, “central”, or “correct” way of going about things. In the case of naming systems, they are trying to provide the “one true” way to both name things, and find things via their names later.

I can hear one objection already: many are going to claim that by “uniform” they really mean something akin to the phrase “globally unique.” They start going to outer space with words like “global”, and they use such words because everybody likes space stuff, and the universe of stars is much more fun to think about than long division. They really just mean the “unique” thing is contained within the “global” extent of the system described. So what they really really mean is that the name is unique to their system as a whole: past, present, and (hopefully) future. In the case of Internet “Uniform Resource Locators”, is that really true? If http://newyorktimes.com/OJstory.html and ftp://losangelestimes.com/OJstory.html are copies of the same file (let’s just say they got it from the same freelance writer), then are these files and names really “unique”? The further counter-argument I’m hearing now is that the (full path) names have to be unique in the naming system, but we don’t care about the data underneath, because we’re talking about naming systems, not full data systems. Some people might even regard it as a feature that the same data can be found with two or more different names.

Wearing the mask of identity.

Let’s temporarily put all those arguments about digital files aside, and talk about something a little more obvious: digital identity. Most humans have a pretty good sense of their own identity, and know that digital information (data) about them is incomplete at best, or grossly inaccurate at worst. Sometimes the inaccuracies in our online personas are completely intentional — we don’t want them linked to our real life identities at all. Yet this is not how we treat our identity online when conveying it to people in real life. We often say “use my email” or “find my Facebook”, as if either these diverse-open or immense-proprietary systems could be “held” or “owned” enough to use a possessive word like “my”. What we really mean is that we have some sense of exclusive access to these resources. Is this sense of exclusivity ever really true, either? As a Systems Administrator and Programmer with over 16 years in IT management roles, I can tell you this answer confidently: if it’s digital, then it’s never exclusive.

The difference between feeling and reality.

The best IT standards and practices can provide the feeling of exclusive access to your online accounts, or even temporary true exclusivity. If a legal or administrative need arises, then you can be sure the true server owners (the people with physical access to the server hardware) will feel a perfect right to access or even alter your data. Given this fact, can we ever say that we have “one true” identity online? My answer is: never. The best you can hope for is running your own private Internet servers, on your own private property, and hope that everyone else online defers to those servers in cases of contention. Also, even if you have the resources for all the personal property and server technology required, you better pray the servers don’t get hacked into without your knowledge. That’s a form of property protection that the Fortune 500 still haven’t mastered, so you have little hope of attaining such absolute protection as an individual.

I wrote that after you did, but I posted it first!

The implications of this lack of digital identity on any notion of digital “origination” or “authorship” should be obvious. Without any consistent or provable form of identity, how can you ever establish authorship?

It should be clear by now that everything online is a copy of something else. So the real problem is less about tracking or naming what is unique; the real problem is detecting and tracking copies. To complicate the problem: partial, imperfect, and translated copies are both useful and normal. The most common form of translation isn’t even between human languages; it’s between computer software formats. Some of these translations are perfectly reversible, where a perfect copy of the “original” data can be derived from the translated version. Microsoft Office DOC files, DOCX files, OpenOffice.org ODT files, and even PDF and HTML files are all largely interchangeable, based on the available translation tools that can go back and forth between these different formats. For the most part, it’s all just text data, with slightly different ways of representing formatting.

I’ll send that to you again, even though it’s a waste of time and energy.

So how do we track all these copies, in a way that prevents waste? The biggest forms of waste online don’t involve storage space, because storage is plentiful, cheap, and low power. The biggest waste is transferring the same data over the same network repeatedly, or from farther away than necessary. The bandwidth and power wasted are tremendous. In my previous post titled “When Global Agreements Aren’t Necessary”, I posited a general method of naming things online without any central authority like ICANN or IANA required. I’m taking a different route with copy-tracking, where the only authority needed is the “original” author of any given set of data. This won’t prevent people from “forking” or altering data to misrepresent origins, but we can track the “originator” ID conveniently enough that bothering to alter it will seem like it’s not worth the time or trouble. The ability to trace a file to its origins can also be intrinsically linked with a social sense of authenticity. When translations between different formats and languages are involved, we can simply track the translator along with the data source. If a reversal procedure is available for any translator, we can simply use that reverse process to get back to the source data, without needing to waste network resources by retrieving it separately.
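As a rough sketch of tracking the translator along with the data source, the Python below records a translator name and a digest of the source next to each translated copy, then runs the reverse process locally instead of fetching the source again. The names `TRANSLATORS`, `TranslatedCopy`, `translate`, and `recover_source` are hypothetical, and a UTF-8 to UTF-16 re-encoding stands in for a real document format converter, simply because it is perfectly reversible.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

BytesFn = Callable[[bytes], bytes]

# Hypothetical registry: translator name -> (forward, reverse).
# Re-encoding text is used here only as a stand-in for a reversible
# format translation.
TRANSLATORS: Dict[str, Tuple[BytesFn, BytesFn]] = {
    "utf8-to-utf16le": (
        lambda b: b.decode("utf-8").encode("utf-16-le"),
        lambda b: b.decode("utf-16-le").encode("utf-8"),
    ),
}


@dataclass(frozen=True)
class TranslatedCopy:
    data: bytes          # the translated bytes
    translator: str      # which translator produced them
    source_sha256: str   # digest of the source data, for verification


def translate(source: bytes, translator: str) -> TranslatedCopy:
    """Translate the source and remember how it was done."""
    forward, _ = TRANSLATORS[translator]
    digest = hashlib.sha256(source).hexdigest()
    return TranslatedCopy(forward(source), translator, digest)


def recover_source(copy: TranslatedCopy) -> bytes:
    """Rebuild the source locally by running the translator in reverse."""
    _, reverse = TRANSLATORS[copy.translator]
    source = reverse(copy.data)
    if hashlib.sha256(source).hexdigest() != copy.source_sha256:
        raise ValueError("reverse translation did not reproduce the source")
    return source


original = "I wrote that after you did".encode("utf-8")
copy = translate(original, "utf8-to-utf16le")
restored = recover_source(copy)  # no network transfer needed
```

The digest check is one possible design choice: it lets the holder of a translated copy confirm that the reversal really did reproduce the source data, rather than trusting the translator blindly.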

Y’arr matey! Did I already say that? No? Good. Y’arrrrr….

So how do we keep all this information, like translators and “originator” or “owner” IDs, with diverse copies of all these files? The necessary concept is already being used with the better journaling file systems available, and it comes from naval tradition: a manifest. A ship manifest is not the ship, but a log of all information concerning the travels, cargo, and crew of a ship. A file manifest is like a log of all meta-data, or data that is only about a file — how/where/when the data in a file has been created, how/where/when it has changed, and everywhere it has traveled. The file name, contents, and hardware location could all change, yet the top of the manifest (the “unique” object and origination ID) would stay the same for every copy. The best manifest data would be considered read-only, and new manifest data would be append-only (just like the best logs in general).
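A minimal sketch of such a manifest in Python might look like the following. Everything here is an assumption made for illustration: the class and field names (`Manifest`, `ManifestEntry`, `originator_id`, `object_id`), the choice of SHA-256 for content digests, and the free-form `action` and `location` strings. The point is only the shape: a fixed top that every copy shares, plus a log that can grow but is never rewritten.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ManifestEntry:
    """One log record; frozen so it cannot be altered after the fact."""
    timestamp: float      # when the event happened (seconds since epoch)
    action: str           # hypothetical labels: "created", "edited", ...
    location: str         # where it happened (host, path, URL, ...)
    content_sha256: str   # digest of the file data after this event


@dataclass
class Manifest:
    # The top of the manifest: identical for every copy of the file.
    originator_id: bytes
    object_id: bytes
    _entries: List[ManifestEntry] = field(default_factory=list, init=False)

    @classmethod
    def create(cls, data: bytes, location: str) -> "Manifest":
        """Start a manifest with fresh random 32-byte IDs."""
        manifest = cls(secrets.token_bytes(32), secrets.token_bytes(32))
        manifest.append("created", location, data)
        return manifest

    def append(self, action: str, location: str, data: bytes,
               timestamp: Optional[float] = None) -> ManifestEntry:
        """Add a new record; existing records are never changed."""
        when = time.time() if timestamp is None else timestamp
        entry = ManifestEntry(when, action, location,
                              hashlib.sha256(data).hexdigest())
        self._entries.append(entry)
        return entry

    @property
    def entries(self) -> Tuple[ManifestEntry, ...]:
        """Read-only view of the log."""
        return tuple(self._entries)

    def copy(self) -> "Manifest":
        """A copy keeps the same top IDs and the history so far."""
        duplicate = Manifest(self.originator_id, self.object_id)
        duplicate._entries = list(self._entries)
        return duplicate

    def same_origin(self, other: "Manifest") -> bool:
        return (self.originator_id, self.object_id) == (
            other.originator_id, other.object_id)


mine = Manifest.create(b"first draft", "desktop:/home/fred/story.txt")
yours = mine.copy()
yours.append("edited", "server:/var/www/story.html", b"second draft")
```

After these steps `mine.same_origin(yours)` holds even though the file name, contents, and location have all changed in the second copy.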

Any two people with copies of the same manifest top would know they have access to a file from the same source, and it would just take a little coordination to resolve differences between their two copies. Such coordination would eliminate the need for sending whole file copies back and forth. For example, one person might have the same exact manifest up to a certain date, and thus replaying the steps described in the manifest with the newer dates is all that is needed to resolve the two differing versions of the same file to the same state. File changes from the matching manifest date onward, and not the whole file, could be sent to reproduce an exact copy of the same file data. By avoiding retransferring the whole file, time and resources are saved in the process. When all potential transfer sources are known, the file changes can also be sent from only the nearest sources, saving even more latency and transmission resources. The speed of light is a limit that can only be overcome with proximity.
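The coordination described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each manifest step is modeled as a hypothetical `Change` record holding a single find-and-replace edit, and the two logs are assumed to differ only in that one is a longer continuation of the other. The function names are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Change:
    date: str      # ISO date of the change (illustrative values below)
    find: str      # text to look for
    replace: str   # text to put in its place (first occurrence only)


def apply_changes(text: str, changes: Sequence[Change]) -> str:
    """Replay manifest steps, in order, against a copy of the file."""
    for change in changes:
        text = text.replace(change.find, change.replace, 1)
    return text


def missing_changes(sender_log: Sequence[Change],
                    receiver_log: Sequence[Change]) -> List[Change]:
    """Return only the steps the receiver lacks, not the whole file."""
    shared = 0
    for ours, theirs in zip(sender_log, receiver_log):
        if ours != theirs:
            break
        shared += 1
    if shared < len(receiver_log):
        # The receiver has steps we do not: the histories have forked.
        raise ValueError("manifests diverge; a simple replay is not enough")
    return list(sender_log[shared:])


base = "The quick brown fox"
alice_log = [
    Change("2012-01-01", "quick", "slow"),
    Change("2012-01-05", "brown", "red"),
    Change("2012-02-01", "fox", "dog"),
]
bob_log = alice_log[:2]                    # matches up to 2012-01-05
bob_text = apply_changes(base, bob_log)    # "The slow red fox"

delta = missing_changes(alice_log, bob_log)  # one step to transfer
bob_text = apply_changes(bob_text, delta)    # "The slow red dog"
```

Only the single trailing `Change` crosses the network here, and both copies end in the same state.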

It is time to unlearn what you have learned.

So this is all to say that I think we have been taught something wrong about files and data: that the “uniform” or “unique” name is the most important detail. Obviously the data itself is the most important aspect of any file, but that data is too abstract for humans to track consistently. It’s like trying to track all the crabs in a crab trawler by memorizing each and every crab’s appearance: it’s an impossible task for humans. We have tried labeling the crabs, the crates, and the trawlers, but they all keep on reproducing and spreading to new seas, untracked and unabated. The names we gave them don’t help us to tell the reproductions apart from the originals, nor to see that they’re even related to the originals. Instead, we should acknowledge that these names we’re giving everything are all transitory and illusory, form consistent manifest standards that every trawler captain can consistently apply, and then stick to them.

A 256-bit number can take roughly 10^77 distinct values, which approaches common estimates (around 10^80) of the number of atoms in the known universe. Knowing that, with something like a cryptographically random 256-bit (32-byte) originator and object ID at the top of every manifest, we could consistently and efficiently track every file ever produced. With these manifest standards, notions of “origination” would become much more straightforward to trace, and less tied to specific servers and timing. The manifest thus frees the data, letting each file become more traceable, yet without the bonds of any “uniform” naming conventions. The act of naming things, after all, is a matter of human convenience, not of creating any true persistent identity. A manifest history is much closer to the ideal of tracing identity, which is the true goal of naming.
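For a sense of scale, here is a small Python sketch that generates such an ID and estimates how unlikely two random IDs are to collide. The function names are hypothetical, and the estimate uses the standard birthday bound, which gives an upper limit on the collision probability rather than an exact figure.

```python
import secrets

ID_BITS = 256


def new_id() -> bytes:
    """A cryptographically random 256-bit (32-byte) identifier."""
    return secrets.token_bytes(ID_BITS // 8)


def collision_probability_bound(num_ids: int, bits: int = ID_BITS) -> float:
    """Birthday bound: the chance that any two of num_ids random IDs
    collide is at most n * (n - 1) / 2 divided by 2**bits."""
    return num_ids * (num_ids - 1) / 2 / 2**bits


object_id = new_id()
print(object_id.hex())        # 64 hexadecimal characters
print(len(str(2**256)))       # 78 decimal digits, so about 10^77 values

# Even with a trillion trillion (10^24) IDs issued, a collision is
# vanishingly unlikely.
print(collision_probability_bound(10**24))
```

No central registry is needed to hand out these IDs; the size of the space does the work that a naming authority would otherwise do.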