
Linguistic Data
Sources of data
There are several approaches to acquiring linguistic data. In addition to the well-known (and not very reliable) harvesting of corpora and the manual creation of machine-readable dictionaries, dictionaries can also be obtained by re-using the "paper" dictionaries that have been around for centuries. Remarkably, while developers of machine translation and other NLP software have embraced paper dictionaries for their internal needs, there have been few attempts to create a generic tool that converts these dictionaries to a machine-readable format. This is not because "paper" dictionaries are unavailable electronically. They are available, but only in their original unstructured form, which is fine for human readers but useless for natural language applications.
However, "unstructured" is not an accurate description of these dictionaries. Every article is arranged in a particular way, which is fairly universal in the world of lexicography. First comes the source word, delimited by a hyphen, an equals sign, or a space. Then comes a part-of-speech tag, followed by a list of possible meanings grouped into synonym sets, sometimes with short notes pinpointing the precise meaning. Some articles have further complications, such as multiple parts of speech, cross-references, or substitutions marked by a tilde, but these also follow a predictable pattern.
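To make the regularity of a dictionary article concrete, here is a minimal Python sketch that parses one line of such a dictionary. It is an illustration only, not Carabao's parser: the sample layout (a semicolon between synonym sets, a comma between synonyms, a note in trailing parentheses, a lowercase part-of-speech tag ending in a period) and the names `Entry`, `Sense`, and `parse_entry` are assumptions made for this example. Real dictionaries vary, which is why the mapping has to be configurable.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    synonyms: List[str]          # one synonym set
    note: Optional[str] = None   # optional short note on the precise meaning


@dataclass
class Entry:
    headword: str
    pos: str
    senses: List[Sense] = field(default_factory=list)


# Assumed layout: headword, then a hyphen / equals sign / space,
# then a part-of-speech tag such as "n." or "adj.", then the meanings.
ENTRY_RE = re.compile(
    r"^(?P<head>.+?)\s*[-=\s]\s*(?P<pos>[a-z]+\.)\s+(?P<body>.+)$"
)
# Assumed convention: a note is a parenthesised remark closing a synonym set.
NOTE_RE = re.compile(r"\((?P<note>[^)]*)\)\s*$")


def parse_entry(line: str) -> Entry:
    """Parse one semi-structured dictionary article into an Entry."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised article layout: {line!r}")

    head = match["head"]
    # A tilde in the body stands in for the headword.
    body = match["body"].replace("~", head)

    senses = []
    for group in body.split(";"):
        group = group.strip()
        if not group:
            continue
        note = None
        note_match = NOTE_RE.search(group)
        if note_match:
            note = note_match["note"].strip()
            group = group[: note_match.start()].strip()
        synonyms = [s.strip() for s in group.split(",") if s.strip()]
        senses.append(Sense(synonyms, note))

    return Entry(head, match["pos"], senses)


if __name__ == "__main__":
    # Invented sample lines, not taken from any real dictionary.
    print(parse_entry("house - n. building, dwelling; household (family)"))
    print(parse_entry("dog=n. hound; ~ days (hot period)"))
```

A single regular expression is enough here only because the sample layout is fixed. Articles with multiple parts of speech or cross-references would need additional rules.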
Solution
The main purpose of the Carabao Linguist Edition modules is to transfer bulk lexical data between external lexical resources and Carabao's databases. To make data acquisition effective and affordable, we also included a module that transforms these semi-structured dictionaries into OLIF XML files. In a nutshell, the user maps all aspects of the dictionary's layout in a wizard GUI, and the file is then transformed. Because OLIF is an XML format, the results can be refined with XSLT scripts or any other XML transformation tool. We also included a few XSL scripts and an XSL processor to facilitate common tasks, such as refining the generated OLIF files or transforming OLIF XML to the native Carabao eXchange XML format (which allows deep-matching the entries to any language in Carabao's database).
[Screenshot: the transformation configuration wizard]
Examples
We compiled a handful of examples for minority languages whose dictionaries are freely available on the Internet. If you require an additional example, or are not sure whether our parser can transform your dictionary (or a specific fragment of it), please contact us and we will check it.
Because Carabao's lexical database is interlingual, once the entries are matched, any combination of languages can be exported in parallel.
Digital Sonata Pty Ltd © 2007-2008