q is for quantitative

Once upon a time, there was a termbase system called Klaara. Initially only meant as a teaching and demonstration tool, it ended up being used in production for compiling a series of learners’ dictionaries. The internal data structure was onomasiological, documenting concepts and what they are called. A traditional semasiological view into the database, documenting words and what they mean, was also available for publishing (yes, back in 2004 dictionaries also used to have paper versions).
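
To make the two views concrete, here is a minimal Python sketch of how a concept-centred store can be turned into a word-centred index. It is only an illustration: the record layout, the sample entries and the function name `semasiological_view` are assumptions made up for this example, not Klaara's actual schema or code.

```python
from collections import defaultdict

# Hypothetical onomasiological records: each entry is a concept
# together with the terms that name it (not Klaara's real schema).
concepts = [
    {"id": "c1", "definition": "financial institution", "terms": ["bank"]},
    {"id": "c2", "definition": "land alongside a river", "terms": ["bank", "riverside"]},
]

def semasiological_view(concept_records):
    """Invert concept -> terms into word -> list of definitions."""
    index = defaultdict(list)
    for concept in concept_records:
        for term in concept["terms"]:
            index[term].append(concept["definition"])
    # Sort headwords alphabetically, as a printed dictionary would.
    return dict(sorted(index.items()))

view = semasiological_view(concepts)
for headword, senses in view.items():
    print(headword, "->", "; ".join(senses))
```

The point of the design is that the data is stored once, per concept, and the familiar headword listing is just a derived view of it.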

Using Klaara in production provided valuable feedback, as did the development of other dictionary systems that I witnessed more or less closely over the years: termbases.eu, EELex, kn.eki.ee. In addition to terminology with its relatively limited set of concepts and strong conventions, all of these were also used for creating and/or publishing general language dictionaries. The vagueness and variability of general language provided material for even more interesting case studies of what kind of information can (not) be reliably presented in a dictionary.

Pursuing the practical goal of developing software that would help formalise lexical knowledge in a dictionary, I ran into fundamental theoretical issues. What is lexical knowledge anyway? What is a word? Do words have meanings, and if so, what would be a reliable method for finding out those meanings? The persistent attempts of many lexicographers to control language with their dictionaries, instead of describing it, did not simplify the theoretical picture either.

Such questions are notoriously difficult for mainstream linguistics. Basic concepts like word or meaning are either explicitly left undefined, or defined inconsistently across researchers or even within the work of one researcher. The differences are not merely in the details, but fundamental, like whether meaning is objective or subjective. There is a good reason for this: while words and meanings (and the proposition that words have meanings) make intuitive sense, they are not real objects like flowers or butterflies that could be measured objectively.

They are theoretical constructs themselves, and therefore entirely theory-dependent. Words, meanings and even languages themselves are devices used by linguists and laypersons alike to make sense of communication; they are not psychologically real components of communication.

So the time has come for the next iteration of dictionary software, the name now spelled with q for quantitative. The objective of qlaara is to provide measurable, empirical, quantitative information about how people communicate, without resorting to subjective opinions or theory-dependent constructs. The data originates from text corpora and describes those texts, rather than language (whatever that may be). Instead of saying what words mean or what language is like, qlaara documents how people have been communicating until now.

The similarity graph, for instance, shows what other words have been used in the same neighbourhoods. Have a look, play around with the graph and let us know what you think. Many more quantitative measures are in the pipeline, and I’ll write more about the details in the following posts here.
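
For readers wondering what "used in the same neighbourhoods" can look like in practice, below is a small Python sketch of one common way to measure it: count the words that occur within a few positions of each word, then compare those counts with cosine similarity. This is an assumption-laden illustration, not a description of how qlaara computes its graph; the toy corpus, the window size and the function names are all made up for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=1):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def neighbours(word, vectors, top=3):
    """Rank all other words by how similar their contexts are to `word`'s."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]

# Toy corpus, already tokenised (purely illustrative).
corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]
vectors = context_vectors(corpus, window=1)
print(neighbours("cat", vectors))
```

In this toy corpus "cat" and "dog" never occur together, yet they come out as each other's closest neighbours because both appear between "the" and "drinks". The edges of a similarity graph can be read in the same spirit: they describe the texts, not what the words "mean".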

Have fun
Arvi
