Could you reconstruct a document based on its hash sum?
A hash function is a one-way function that maps a large dataset onto a much smaller one. Used in cryptography (e.g. in digital signatures), hash sums make it easy to verify that a document is indeed the document we expect it to be. The opposite task, reconstructing the document from its hash sum, has been made deliberately difficult. Other examples of one-way or lossy encoding are file formats like JPEG or MP3 and, perhaps surprisingly, natural language.
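For concreteness, here is a minimal sketch of hash-based verification using Python's standard hashlib module; the message text is made up for the example.

```python
import hashlib

# Hash a (made-up) document down to a fixed-size digest.
document = b"Meet me at noon."
digest = hashlib.sha256(document).hexdigest()

# Verification is easy: hash what we received and compare.
received = b"Meet me at noon."
assert hashlib.sha256(received).hexdigest() == digest

# Even a one-character change yields a completely different digest ...
tampered = b"Meet me at noon!"
assert hashlib.sha256(tampered).hexdigest() != digest

# ... and nothing in the digest tells us how to recover the original.
print(len(digest))  # 64 hex characters, whatever the input size
```

Note that the comparison only tells us anything because we already hold a candidate document; the digest on its own is a dead end.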
What all these have in common is loss of information in the encoding process, and reliance on an intelligent receiver.
Discarding information is the very feature of lossy encoding that makes reasonably sized sound, image and video files possible. A large part of the original information is irrecoverably lost: the original image or sound cannot be reconstructed in full detail. Only by drawing on their experience of what originals usually look or sound like can receivers construct a satisfactory idea of what the original must have been. Likewise, a digital signature is only useful if we already have the document it is supposed to verify.
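A toy sketch of this, with made-up numbers: the encoder keeps every other sample of a signal, and the receiver fills the gaps by linear interpolation, i.e. by assuming the signal is smooth. That smoothness assumption is the receiver's prior experience; it is not carried by the encoded data.

```python
# Toy lossy codec: keep every other sample (half the information).
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
encoded = signal[::2]  # [0.0, 2.0, 2.0, 0.0]

# "Intelligent" receiver: reconstruct the gaps by linear
# interpolation -- supplied knowledge, not received data.
decoded = []
for a, b in zip(encoded, encoded[1:]):
    decoded += [a, (a + b) / 2]
decoded.append(encoded[-1])

print(decoded)  # [0.0, 1.0, 2.0, 2.0, 2.0, 1.0, 0.0]
# Plausible, but not the original: the lost sample 3.0 came back as 2.0.
```

The reconstruction is satisfactory exactly to the degree that the receiver's assumption matches what originals are actually like.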
How do you recognise lossy encoding? By comparing the sizes of the sets on the two sides of the function. Going from a larger set to a smaller one, i.e. discarding information, is always possible and in fact easy. To go the opposite way, you must be able, and prepared, to supply the missing information.
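The larger-to-smaller direction can be illustrated with a toy many-to-one code (not a real hash function; the 16-bucket size and the word list are arbitrary choices for the example). With more words than buckets, the pigeonhole principle guarantees collisions, so the bucket number alone can never identify the word.

```python
from collections import defaultdict

def tiny_code(word: str) -> int:
    # Toy stand-in for a hash: collapse a word onto one of 16 buckets.
    return sum(word.encode()) % 16

words = ["leaf", "tree", "branch", "root", "stem", "bark", "bud",
         "twig", "trunk", "shoot", "vine", "moss", "fern", "pine",
         "oak", "elm", "ash", "birch", "willow", "maple"]

buckets = defaultdict(list)
for w in words:
    buckets[tiny_code(w)].append(w)

# 20 words into 16 buckets: some bucket must hold several words,
# so the code alone cannot tell a receiver which word was meant.
collisions = {c: ws for c, ws in buckets.items() if len(ws) > 1}
print(bool(collisions))  # True
```

Encoding (word to bucket) is trivial; decoding (bucket to word) requires knowledge the code does not carry, such as which words the speaker was likely to use.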
In natural language, the set of possible thoughts is much larger than the set of possible expressions. The information content of the word “leaf”, for instance, is tiny compared to the information needed to describe all the leaves in the world, or even just the leaves in the shared experience of a particular speaker and listener. That is why it is easy to express thoughts in words – you simply discard most of the detail and utter a heavily compressed version of your thought.
Reconstructing a speaker’s message from just the words they uttered is exactly as hopeless as reconstructing a document from just its digital signature. Words, as the smaller set, cannot convey information about thoughts, the larger set, in any additive or compositional sense. Hearers only get hints that help them reduce their uncertainty about the speaker’s intentions.
This has at least two consequences for hearers:
- The understanding we get from a text is always richer in information content than the actual sequence of characters or sounds that we received. The hearer always adds information in the comprehension process.
- The hearer must be able to do that, i.e. hold some structured uncertainty about the speaker’s intentions – in other words, be able to fill in the information that was missing from the incoming signal.
For a dictionary system like qlaara, this is part of the rationale behind the design choices. No dictionary can be smart enough to substitute for the hearer’s human intellect. And since the user is smart anyway, the dictionary can afford to be explicitly dumb: instead of attempting to take the burden of thinking away from the user, qlaara only provides quantitative linguistic data to support the user’s own data acquisition (or language learning) process.