Digital Sonata

 intelligent solutions for language processing

Linguistic Data

Sources of data


There are several approaches to acquiring linguistic data. Besides the well-known (and not very reliable) harvesting of corpora and the manual creation of machine-readable dictionaries, dictionaries can also be obtained by reusing "paper" dictionaries, which have been around for centuries. Remarkably, while developers of machine translation and other NLP software have embraced paper dictionaries for their internal needs, there have been few attempts to create a generic tool that converts them to a machine-readable format. This is not because "paper" dictionaries are unavailable electronically. They are available, but only in their original unstructured form, which is great for humans but useless for natural language applications.

However, "unstructured" is not an accurate description of these dictionaries. Every entry (article) is arranged in a particular way that is fairly universal in lexicography. First comes the source word, delimited by a hyphen, an equal sign, or a space. Next comes a part-of-speech tag, followed by a list of possible meanings grouped into synonym sets, sometimes with short notes pinpointing the precise meaning. Sometimes there are further complications, such as multiple parts of speech, cross-references, or substitutions marked by a tilde, but these too follow a specific pattern.
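To make the entry layout described above concrete, here is a minimal Python sketch of how one such line could be split into its parts. It is not Carabao's parser. The sample line, the regular expressions, and the names Entry, Sense, and parse_entry are assumptions made for illustration, and the sketch covers only the simple case (one part of speech, senses separated by semicolons, synonyms by commas, an optional note in parentheses). References and tilde substitutions are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed layout (illustrative only):
#   headword <hyphen | equal sign | space> pos-tag sense; sense; ...
# Each sense is a comma-separated synonym set with an optional note in
# parentheses, e.g. "balay - n. house, home (dwelling); building".
ENTRY_RE = re.compile(
    r"^(?P<headword>\S+)\s*(?:-|=|\s)\s*(?P<pos>[A-Za-z]+\.)\s+(?P<senses>.+)$"
)
NOTE_RE = re.compile(r"\(([^)]*)\)")


@dataclass
class Sense:
    synonyms: List[str]
    note: Optional[str] = None


@dataclass
class Entry:
    headword: str
    pos: str
    senses: List[Sense] = field(default_factory=list)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None if it does not fit the layout."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    senses = []
    for chunk in match["senses"].split(";"):
        # Pull out the optional note, then split what is left into synonyms.
        note_match = NOTE_RE.search(chunk)
        note = note_match.group(1).strip() if note_match else None
        words = NOTE_RE.sub("", chunk)
        synonyms = [w.strip() for w in words.split(",") if w.strip()]
        if synonyms:
            senses.append(Sense(synonyms, note))
    return Entry(match["headword"], match["pos"].rstrip("."), senses)


if __name__ == "__main__":
    # Invented sample line, not taken from any of the dictionaries listed here.
    print(parse_entry("balay - n. house, home (dwelling); building"))
```

A real dictionary would need its own delimiters and tag set, which is the kind of per-dictionary variation a configurable mapping has to absorb.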

Solution

The main purpose of the Carabao Linguist Edition modules is to transfer bulk lexical data between external lexical resources and Carabao's databases. To enable effective and affordable data acquisition, we also included a module that transforms these semi-structured dictionaries into OLIF XML files. In a nutshell, the user maps all the aspects of the dictionary in a wizard GUI, and the file is then transformed. Because OLIF is an XML format, the results can be refined with XSLT scripts or any other XML transformation tool. We also included a few XSL scripts and an XSL processor to facilitate common tasks, such as refining the generated OLIF files or transforming OLIF XML to the native Carabao eXchange XML format (which allows deep-matching the entries to any language in Carabao's database).
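As an illustration of refining generated XML by means other than XSLT, the following Python sketch trims whitespace and drops duplicate entries using only the standard library. The element names (dictionary, entry, headword, pos, target) and the sample data are placeholders invented for this example. They are not the OLIF schema and not Carabao's output, so a real script would have to use the element names found in the generated file.

```python
import xml.etree.ElementTree as ET

# Placeholder fragment: element names are illustrative, NOT the OLIF schema.
SAMPLE = """<dictionary>
  <entry><headword> balay </headword><pos>n</pos><target>house</target></entry>
  <entry><headword>balay</headword><pos>n</pos><target>house</target></entry>
  <entry><headword>dagan</headword><pos>v</pos><target>run</target></entry>
</dictionary>"""


def refine(xml_text: str) -> str:
    """Strip stray whitespace and remove entries that duplicate an earlier one."""
    root = ET.fromstring(xml_text)
    seen = set()
    # Iterate over a copy, because entries are removed from root in the loop.
    for entry in list(root.findall("entry")):
        for child in entry:
            child.text = (child.text or "").strip()
        key = tuple((child.tag, child.text) for child in entry)
        if key in seen:
            root.remove(entry)
        else:
            seen.add(key)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(refine(SAMPLE))
```

The same cleanup could equally be written as an XSLT script. The point is only that the output is ordinary XML and can be processed with whatever XML tooling is at hand.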

This is what the transformation configuration wizard looks like:

Configuration of the transformation of semi-structured dictionaries to OLIF

Examples

Examples of transformed dictionaries

We compiled a handful of examples for minority languages whose dictionaries are freely available on the internet. If you need an additional example, or are not sure whether our parser can transform your dictionary (or a specific fragment of it), please contact us and we will check.

Languages          | ASCII original                     | Machine readable (OLIF)       | Note
Cebuano -> English | Ceb-EngDictionary.txt [size: 73 K] | CebEngOLIF.xml [size: 1539 K] |
Czech -> English   | slovnik_data.txt [size: 6905 K]    | CzEngOLIF.xml [size: 24365 K] | Based on GNU/FDL English-Czech dictionary project
English -> Swahili | Swahili.txt [size: 2516 K]         | EnSwOLIF.xml [size: 41751 K]  | Based on The Kamusi Project dictionaries

Because Carabao's lexical database is interlingual, once the entries are matched, any combination of languages can be exported in parallel.

Purchase Carabao Linguist Edition
Purchase linguistic data