[Figure: A sample from the qlaara word relatedness graph]
It is not common for dictionaries to present empirical relatedness measures, and rarer still for these to be the only, or at least the main, type of data the user sees. So why have we chosen this path? Here are the four most important reasons.
Division of responsibility in usage
It is always easiest when someone else tells you what to do. Users consult dictionaries for definitive “correct” answers about which word to use, and dictionary authors aim to provide those answers. Ideally, the dictionary would formalise all the information that people will ever need for choosing words. The ideal dictionary is so smart that no additional knowledge is required of the user. We disagree, on two counts.
First, this ideal is not within reach. Language itself is an idealisation of how people communicate, and even the most comprehensive dictionary is an idealisation of language. There will always be words missing from a dictionary, and nuances missing from the words it does contain. Perhaps even worse, since the authors are human and the compilation process is manual, there will always be errors in dictionaries.
Second, we know that there are people who do not want to be told what to do; we are among them ourselves. Verbal communication has evolved to be extremely robust: people manage to communicate in very noisy environments, with foreigners and with children, in hostile negotiations, and so on. We are convinced that the only way to explain this is that people are smart. They do not really need all this manually formalised knowledge about words; their own intuition often gives results that may even be better, and are at least good enough for the communicative situation at hand. Since the user is smart, the dictionary can afford to be correspondingly stupid, limiting itself to offering hints that trigger the user’s intuition.
Division of labour in compilation
The creation of a dictionary has traditionally been a laborious process, taking years at the least, normally decades, and in outstanding cases even centuries. This kind of work does not scale: adding the 10,005th word takes as long as adding the 5th, and adding the 22nd language takes as long as adding the 2nd. Every word in every language requires the compiler’s personal attention and manual editing. On the positive side, the tools are quite simple: dictionaries can be, and have been, compiled with pen and paper. The analogy from manufacturing physical goods would be a skilled craftsman who produces items of excellent quality in small batches but is completely unable to increase production volume without sacrificing quality.
The obvious alternative is to shift effort from manufacturing each single piece to developing the tools, so that every piece takes less effort in the future, ideally eliminating per-item processing altogether. This is exactly what we are doing with qlaara. Our effort goes into the algorithms, and each dictionary is processed as a whole, regardless of the number of entries or languages. Given the tools and the text corpora, creating a monolingual dictionary with 1000 headwords takes roughly as long as creating a dictionary with 20 languages and 100 000 headwords each. The resulting dictionaries are exactly as good as the algorithms that create them.
Computational feasibility
Relatedness is practically the only approach to semantics, or meanings, that is computationally feasible at present. The current state of the art in natural language processing has no access whatsoever to what words “mean”. All of the processing is entirely formal, dealing only with sequences of characters in texts. Consequently, any approach to compiling dictionaries that contain information about meanings (even relative meanings) must be based on formal measures computed over texts. For this, the methods of distributional semantics, or vector semantics, which qlaara also uses, are the current state of the art.
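The distributional idea can be sketched in a few lines: count which context words each word co-occurs with, then compare the resulting count vectors with cosine similarity. The corpus, window size, and scoring below are illustrative toy choices, not qlaara’s actual algorithm or data.

```python
from collections import Counter
from math import sqrt

# Toy corpus standing in for real text data; purely illustrative.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "a dog barked at the cat",
]

WINDOW = 2  # words within this distance count as co-occurring


def cooccurrence_vectors(sentences, window=WINDOW):
    """Map each word to a Counter of the words seen near it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(corpus)
print(f"cat ~ dog:    {cosine(vectors['cat'], vectors['dog']):.3f}")     # → 0.849
print(f"cat ~ barked: {cosine(vectors['cat'], vectors['barked']):.3f}")  # → 0.559
```

Note that nothing in this pipeline knows what a cat or a dog *is*; “cat” and “dog” come out as related only because they occur in similar character sequences. Real systems refine this with much larger corpora and dimensionality reduction, but the principle is the same.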
A step towards artificial intelligence
Relatedness, especially when computed with cognitively plausible algorithms, is a small step towards AI. Since we cannot formalise how people make sense of what they hear in communication, we are also unable, as a matter of principle, to explicitly program machines to perform this sense-making. On the other hand, generic learning mechanisms in humans and in other animals (starting from Pavlov’s dog) are well researched and understood, and several machine learning approaches mimic the way biological systems learn. Such algorithms may eventually provide a path to artificial systems that understand language, even though their creators do not understand exactly how. The only thing that is expressly programmed is the learning mechanism; the understanding mechanism emerges on its own after sufficient learning has taken place. Just like a child.