|
|
Frequently Asked Questions
|
Introduction
|
- What is Carabao Language Kit - in layman's terms?
Carabao's main purpose is to "understand" text and transform it - from any language in the database to any language in the database. The database is open for editing, and the end user (or a deployment team) can add languages or modify existing ones.
Carabao takes whole sentences and converts the words into numbers corresponding to entries in its dictionary. An entry is not just a word; it is a concept, a set of synonyms. Once the text is analyzed, Carabao can generate an equivalent text with a different style, or in another language, or return the results for use by external software.
Carabao is enhanced with powerful lexicon import tools, as well as innovative techniques for more accurate translation and interpretation.
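As a rough illustration of the idea described above (words mapped to numbered concept entries, then regenerated in another language), here is a minimal Python sketch. The lexicon contents, the ID numbers, and the function names `analyze` and `generate` are invented for illustration; they are not Carabao's actual data model or API.

```python
# Toy lexicon: every (language, word) pair points to a concept ID.
# The IDs and words are made up for illustration only.
LEXICON = {
    ("en", "dog"): 101,
    ("en", "hound"): 101,   # synonym: same concept as "dog"
    ("en", "house"): 202,
    ("fr", "chien"): 101,
    ("fr", "maison"): 202,
}


def analyze(text, lang):
    """Convert words to concept IDs; unknown words map to None."""
    return [LEXICON.get((lang, word)) for word in text.lower().split()]


def generate(concept_ids, lang):
    """Pick the first word in the target language for each concept ID."""
    words = []
    for cid in concept_ids:
        match = next(
            (w for (l, w), c in LEXICON.items() if l == lang and c == cid),
            "?",  # placeholder when the concept has no word in this language
        )
        words.append(match)
    return " ".join(words)


ids = analyze("hound house", "en")
print(ids)                  # [101, 202]
print(generate(ids, "fr"))  # chien maison
print(generate(ids, "en"))  # dog house
```

Because "hound" and "dog" share one concept ID, generating back into English yields a same-language rewording, while generating into French yields a translation.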
- What is Carabao Language Kit - in technical terms?
Carabao is a family of multipurpose linguistic tools for text analytics and automatic translation. It provides the following services:
- Sense disambiguation
- Part of speech tagging
- Named entity extraction
- Detailed, sentence by sentence domain extraction
- Deep morphological analysis and synthesis
- Automatic linguistic profiling
- Idiom extraction
- Universal measure conversion
- Transliteration between scripts
- Machine readability evaluation of texts
- Automatic translation between languages
- Integrated unambiguous (per sense) thesaurus
The most distinctive feature of Carabao is its complete linguistic abstraction. All the linguistic logic resides in a database complete with a powerful GUI data editor. By removing the linguistic logic from the source code, a few goals are achieved:
- Separation of tasks between software developers and linguists
- Faster and more reliable development of new linguistic engines, without requiring the participation of IT people
- Ease of programmatic use and customization
- What kind of components do you provide?
Currently we supply COM / .NET classes. If required, we can provide Java (JNI only), plain C or perhaps another kind of programming interface. Please contact us to inquire about a connectivity feature.
- What are the desktop suites?
The desktop suites are tools for development and testing. While they support the functionality of all the server components, they can't be used from external solutions. You'll have to acquire a server component for that.
- If the desktop suites contain all the functionality encapsulated in components, can't I just download the free edition and use the underlying components in my solutions?
You can't. It is neither legal nor feasible. Trying to reverse engineer our code is, as practice shows, simply a waste of time. Given the relatively low cost of our solutions, it is also pointless.
|
Functionality
|
- Can Carabao components be used to enhance other machine translation packages?
Yes, they can.
Many machine translation packages can select a domain of discourse and thereby improve the accuracy of translation. However, the task of selection is delegated to the end user. If the user selects the wrong domain, or when more than one domain exists in a fragment of text, there will be no quality boost. In fact, things are more likely to go wrong. (For example, consider the computer term 'retry' used in a legal context.)
With the Carabao DeepAnalyzer component, it is possible to detect domains at the sentence level. The domain parameter in the 3rd party package can then be set accordingly, and the translation quality is likely to improve significantly.
Another possible use is in translation workflows. The Machine Readability Index assesses the probability that NLP software can understand a given text correctly. The index is calculated by Carabao Machine Readability Indicator, Carabao DeepAnalyzer and Carabao Translation Server.
Read more on the subject in the whitepaper Boosting Performance of 3rd Party MT Products.
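The sentence-level routing described above can be sketched roughly as follows. This is a simplified stand-in that counts keyword hits per sentence; the keyword table and the function names `detect_domain` and `route` are hypothetical and do not represent the DeepAnalyzer interface.

```python
import re
from collections import Counter

# Hypothetical keyword-to-domain lexicon (illustrative only).
DOMAIN_KEYWORDS = {
    "court": "law",
    "plaintiff": "law",
    "verdict": "law",
    "server": "computing",
    "reboot": "computing",
    "download": "computing",
}


def detect_domain(sentence, default="general"):
    """Return the domain with the most keyword hits in one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = Counter(DOMAIN_KEYWORDS[t] for t in tokens if t in DOMAIN_KEYWORDS)
    return hits.most_common(1)[0][0] if hits else default


def route(text):
    """Split text into sentences and pair each with its detected domain."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, detect_domain(s)) for s in sentences]


text = ("The plaintiff asked the court to retry the case. "
        "Reboot the server and retry the download.")
for sentence, domain in route(text):
    # A real integration would pass `domain` to the 3rd party package here.
    print(domain, "|", sentence)
```

Note how the ambiguous word "retry" lands in a different domain in each sentence, which is exactly the case where a single user-selected domain for the whole text would fail.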
- What's the point of "translating" from one language to itself?
The main purpose of this is to verify the interpretation of a particular text.
A user is presented with an unambiguous (that is, only one sense) thesaurus article for every word. If the thesaurus article reflects the correct sense, the system has interpreted the word (or the idiom) correctly, and the translation will be correct as well. Otherwise, the user can try to re-phrase the sentence, making it less ambiguous, or correcting the errors. By introducing the "pre-editing" stage, Carabao ultimately improves the reliability of the machine translation process, and, with users' cooperation, the accuracy of the results.
It is also possible to manipulate styles, either by avoiding or forcing them. The console is also used for testing, as all functions of Carabao can be run and examined there. Thesaurus articles are not displayed in the online demo, but can be viewed in the FREE Standard Edition.
- Can Carabao paraphrase sentences?
Yes, it can.
When you select styles to be forced or avoided, the engine looks for suitable equivalents of the same concepts. For example, by enforcing the "obsolete" style, one can change "away" to "forth". Avoiding British spelling causes all the verbs to be regenerated with American spelling. And by requesting that longer words be avoided in favour of shorter ones, it is possible to reduce the size of a text.
It is also possible to pick less ambiguous equivalent terms. However, this feature should be used with caution.
The paraphrasing functionality can be tested in the desktop suites (including the free one). Carabao Translation Server must be used if the functionality is to be incorporated in external software.
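To make the force/avoid mechanism more concrete, here is a minimal sketch. It assumes each concept is stored as a list of equivalent words carrying style tags; the data, the tag names, and the `paraphrase` function are invented for illustration and are not the Translation Server API.

```python
# Each concept ("family") lists equivalent words with optional style tags.
# Words and tags are illustrative, not taken from a Carabao database.
FAMILIES = {
    "away": [("away", set()), ("forth", {"obsolete"})],
    "colour": [("colour", {"UK"}), ("color", {"US"})],
}
# Reverse index: word -> family key.
WORD_TO_FAMILY = {w: fam for fam, members in FAMILIES.items() for w, _ in members}


def paraphrase(words, force=frozenset(), avoid=frozenset()):
    """Swap each known word for a family member that matches the style request."""
    result = []
    for word in words:
        family = WORD_TO_FAMILY.get(word)
        if family is None:
            result.append(word)  # unknown word: leave untouched
            continue
        members = FAMILIES[family]
        allowed = [(w, tags) for w, tags in members if not (tags & avoid)]
        forced = [w for w, tags in allowed if tags & force]
        if forced:
            result.append(forced[0])      # a forced style wins
        elif any(w == word for w, _ in allowed):
            result.append(word)           # the original is acceptable: keep it
        elif allowed:
            result.append(allowed[0][0])  # first equivalent not to be avoided
        else:
            result.append(word)           # nothing acceptable: keep the original
    return result


print(paraphrase(["go", "away"], force={"obsolete"}))  # ['go', 'forth']
print(paraphrase(["the", "colour"], avoid={"UK"}))     # ['the', 'color']
```

The sketch only substitutes whole words; it does not regenerate inflected forms, which a real engine would have to do.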
- How do I add words to an existing dictionary?
You need to use Carabao Data Manager. In the main menu, navigate to Data -> Dictionary -> Words. This is where the words are stored.
If you want to add synonym(s) to an existing word, look for it by using either the Locate field or the Find substring button. Make sure it is the correct one by checking its sense (examine its thesaurus article and grammatical information). Highlight the record and click Copy. Clear all the grammar units. Set the correct part of speech. Enter your word in the Lemma field. Make sure the grammar has been correctly identified (if the autodetection is configured for this language and part of speech). Press OK to save.
If you want to add a new sense ("family" in Carabao lingo), click Insert. Select your language in the Language field, and if it is different from the default, click New to generate a new family ID. Now Insert a new language independent rule unit: a part of speech for your new word. Then enter your word in the Lemma field. Make sure the grammar has been correctly identified (if the autodetection is configured for this language and part of speech). Press OK to save. Follow the instructions in the next passage to create translations.
If you want to add a translation to an existing word, look it up in the dictionary. Make sure it is the correct one by checking its sense (examine its thesaurus article and grammatical information). Highlight the record and click Copy. Clear all the grammar units except for the part of speech, and change the language to the one you want to add the translation for. Enter your translation in the Lemma field, overwriting the existing text. Make sure the grammar has been correctly identified (if the autodetection is configured for this language and part of speech). Press OK to save.
- When I select a language in Carabao Translation Console, a message box pops up saying that the language is not marked as active!
This means that the language is still in development and not ready for use. You can use it, but the results will probably be of very low quality.
Note that the multilingual translation-enabled dictionaries are due to be released in the coming months.
- How do I add a new language?
The effort involved depends greatly on the purpose of adding the language. If the purpose is transliteration, the effort is minimal. The steps are as follows:
- Add a language record to the Languages table.
- In the Phonemes table, copy phonemes one by one from any other language you know (English would be the best choice), editing the text and the language fields and creating equivalents in your script. You should be done within an hour or two.
Setting up a new language for morphology is more difficult. The steps are as follows:
- Add a language record to the Languages table.
- Create metarules: Required rule units, Rule units required on data input and, optionally, Generation of inflected forms and Lemmatized form.
- Create affixes for the required parts of speech (typically these are nouns, verbs, adjectives, and adverbs).
With these steps you can already build a fully functional morphological processor, and if you know your way around, this can be accomplished in a matter of a few days.
Adding a language for translation or domain / style extraction is the most complex task, and the guidelines published here are merely an overview. Read the chapter Designing a database in the user's guide for a more detailed explanation.
- Add a language record to the Languages table.
- Copy syntax delimiters from an existing language, skipping unnecessary ones and adding new ones.
- Design the morphological rules by setting up the Metarules table and Affixes for the newly created language.
- Set up syntactic structures by creating sequences.
- Import the data you have using the automatic import facilities (the easy way), or enter dictionary entries one by one (the hard way).
- How do I design transfer rules for automatic translation?
The transfer rules are implicitly generated from the sequences. This saves you the trouble of designing transfer rules for every possible combination of languages. For example, once you have set up equivalent sequences for English, Russian, Mandarin and French, you will be able to translate Mandarin to French, French to Russian, Mandarin to Russian, English to Mandarin, or any other combination among the four.
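The saving can be shown with simple arithmetic: with a shared interlingual description, n languages need n setups but yield n × (n − 1) translation directions. The snippet below only counts the combinations for the four languages named above; it does not model sequences themselves.

```python
from itertools import permutations

languages = ["English", "Russian", "Mandarin", "French"]

# With a shared interlingual description, each language is set up once ...
setups_needed = len(languages)

# ... yet every ordered (source, target) pair becomes available.
pairs = list(permutations(languages, 2))

print(setups_needed)                     # 4
print(len(pairs))                        # 12
print(("Mandarin", "French") in pairs)   # True
```

Writing explicit transfer rules per direction would instead require 12 rule sets for these four languages, and the gap widens quickly as languages are added.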
- How do I extract named entities?
Both single-word and multi-word entities are associated with the lexicon. The multi-word entities are defined as sequences; the detected instances of these are stored in the list of 'idioms' (visible in the translation console and accessible in the Carabao DeepAnalyzer and Carabao Translation Server components).
The single-word entities (including variable patterns, e.g. time, date, surnames of specific ethnic origin) are extracted by inspection of the tokens which are associated with the concepts in the dictionary.
- What is linguistic profiling?
Linguistic profiling is the use of linguistic characteristics or dialect to identify an author's traits, such as social origin or native tongue.
Carabao provides basic linguistic profiling capabilities by detecting words and expressions peculiar to a specific social group. For example, if a text contains a lot of verbs ending with "-ise" rather than "-ize" (initialise, recognise, etc.), nouns ending with "-our" rather than "-or" (flavour, neighbour, etc.), or certain words like "indeed" or "rather", it is likely to have been written by a British or Australian English speaker. Using the sequence mechanism, it is possible to detect specific constructs favoured by certain groups, or characteristic mistakes.
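A toy version of the spelling cue mentioned above might look like this. It assumes a small hand-made table of spelling variants tagged "UK" or "US"; the table and the `profile` function are illustrative only, and a real profiler would weigh many more signals.

```python
import re
from collections import Counter

# Hand-picked spelling variants with a regional tag (illustrative only).
VARIANTS = {
    "initialise": "UK", "initialize": "US",
    "recognise": "UK", "recognize": "US",
    "flavour": "UK", "flavor": "US",
    "neighbour": "UK", "neighbor": "US",
}


def profile(text):
    """Count how many tagged spelling variants of each region occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(VARIANTS[t] for t in tokens if t in VARIANTS))


print(profile("We recognise the flavour, but they recognize it too."))
# {'UK': 2, 'US': 1}
```

A lookup table is used instead of matching on word endings because a plain suffix test would misfire on words such as "rise" or "doctor".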
- How accurate are Carabao's interpretations?
Carabao performs best with texts related to warfare, politics, crime, religion, business, music, and general news. It does somewhat worse on texts full of technical or scientific jargon, colloquial language, and abbreviations not defined in the lexicon. The most difficult texts are essays rich in metaphors and analytical pieces with long sentences.
Generally, the accuracy ranges from 85% to as high as 97% in the "easy domains". We have never seen it go lower than 75%. Needless to say, the accuracy can be increased for a particular domain by carefully mapping important keywords and providing special parameters at runtime.
- Why do I so often encounter the domains 'linguistic communication', 'written language' and 'mind'?
Nearly every text (especially when information is relayed from a firsthand source by someone else) contains clauses like 'they say', 'I think', 'I believe', 'according to someone's report', and others. These are references to the use of language, or mind. People normally ignore them, as this information is not useful in most cases. However, the references still exist.
|
Architecture
|
- So what kind of architecture is used in Carabao? Is it statistical or rule-driven?
Carabao's kernel is neutral in its approach. The lexical database can be example-based, rule-based, or anything else; you can import it all via the XML import facility. However, we recommend (and use in our engines) a hybrid approach.
Humans use grammar, context of discourse, common sense, etc. Carabao mimics these processes with a combination of rule-driven and statistical methods, where the final result is determined by a neural network-like weighted response among several factors. The weight of each component is determined per language, which allows for greater flexibility and even customization per installation.
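One simple way to picture a weighted response among several factors is a weighted sum with per-language weights. The sketch below assumes exactly that; the factor names, the weights, the scores, and the functions `score` and `pick_sense` are all hypothetical, and the actual combination used by the kernel may differ.

```python
# Per-language weights for each factor (hypothetical values).
WEIGHTS = {"en": {"grammar": 0.5, "domain": 0.3, "frequency": 0.2}}


def score(candidate_factors, lang):
    """Weighted sum of factor scores (each assumed to lie in [0, 1])."""
    weights = WEIGHTS[lang]
    return sum(weights[name] * value for name, value in candidate_factors.items())


def pick_sense(candidates, lang):
    """Return the candidate sense with the highest weighted score."""
    return max(candidates, key=lambda sense: score(candidates[sense], lang))


# Two senses of "bank" in a sentence about money (made-up factor scores).
candidates = {
    "bank/finance": {"grammar": 0.9, "domain": 0.8, "frequency": 0.7},
    "bank/river": {"grammar": 0.9, "domain": 0.1, "frequency": 0.3},
}
print(pick_sense(candidates, "en"))  # bank/finance
```

Keeping the weights in a per-language table mirrors the point above: tuning a language or an installation means changing data, not code.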
- What is the difference between freely available n-gram statistical libraries for tokenization and segmentation, and full morphological analysis?
Tokenization is the process of breaking content into words. While it is trivial in English, other languages may not have 'white spaces', or have joined words, etc. The process of breaking a non-delimited chunk of text into words is called segmentation.
Put simply, n-gram statistical libraries are probabilistic machines, and their results are very rough approximations. Morphological analysis, on the other hand, uses heuristics based on a language's grammar. Here is an excellent whitepaper comparing the two approaches. When it comes to Semitic languages, which attach prepositions and conjunctions as prefixes, or Germanic languages, which use compounds, n-gram segmentation is simply useless.
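For a feel of what non-statistical segmentation of a compound involves, here is a deliberately simple sketch that splits a word into known lexicon entries by longest match with backtracking. It is a plain lexicon lookup, much cruder than grammar-driven morphological analysis, and the `segment` function and the tiny German word list are assumptions made for illustration.

```python
def segment(word, lexicon):
    """Split `word` into known lexicon entries, or return None if impossible.

    Tries the longest prefix first and backtracks when the rest cannot be split.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


# Tiny illustrative lexicon of German word stems.
LEXICON = {"donau", "dampf", "schiff", "fahrt", "schiffe"}
print(segment("dampfschifffahrt", LEXICON))  # ['dampf', 'schiff', 'fahrt']
print(segment("dampfxyz", LEXICON))          # None
```

The result is deterministic and either fully explained by the lexicon or rejected outright, in contrast to the approximate guesses of a probabilistic segmenter.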
- What is the difference between statistical classifiers and your dictionary-based domain / style extraction?
Both have strengths and weaknesses. Statistical tools require training on the customer's datasets and draw conclusions based on similarity. Dictionary-based domain extractors such as Carabao DeepAnalyzer do not require training and can categorize a much wider array of domains. The downside of the latter is that a dictionary must be built for every new language, while statistical extractors can be trained quickly.
Statistical domain extractors, generally speaking, are the quick and dirty solution. In our experience, they should be used only under the following conditions:
- The input has a limited number of domains
- There are extensive training sets for each of the domains being recognized (creating an "etcetera" or "unknown" domain does not work)
- Every domain has certain distinctive formatting features or a different set of character frequencies.
For example, statistical classifiers work well when it is required to distinguish between resumes, technical texts, and medical articles. Another field where they can be used efficiently is language guessing.
When these conditions are not met, the results are extremely unreliable, to the point where they appear random. Of course, extraction of domains per sentence (as done by Carabao) is out of the question.
- Is it possible to automatically populate the dictionaries of Carabao?
Yes.
While some parts of the data must be set up manually, the bulk is imported from dictionaries in "human-readable" format. People have been building dictionaries for millennia, and most of them follow a certain pattern. Human "hand-picked" dictionary data is much more accurate than the results of any statistical data harvester, and much more widely available, even for minority languages.
Among other features, Carabao is able to parse plain ASCII data (not XML or any other markup) structured after the regular human dictionary pattern, such as:
entryInSourceLanguage1 - I. part of speech 1 1. translationSense1Synonym1, translationSense1Synonym2 2. translationSense2 3. translationSense3Synonym1, translationSense3Synonym2, translationSense3Synonym3 II. part of speech 2 translationSense4
entryInSourceLanguage2 - part of speech entry2Translation
The parsing tool is configurable for maximum flexibility. More information is available on the Linguistic Data page.
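To show how such a pattern can be taken apart, here is a small sketch of a parser for lines of this shape. It is not Carabao's import tool: it assumes " - " separates the headword from the body, Roman numerals open part-of-speech blocks, Arabic numerals open senses, commas separate synonyms, and the part of speech comes from a fixed, made-up label list.

```python
import re

POS_LABELS = ("noun", "verb", "adjective", "adverb")  # assumed label set
POS_RE = re.compile(r"^(%s)\b\s*(.*)$" % "|".join(POS_LABELS))


def parse_entry(line):
    """Parse 'headword - [I.] pos [1.] syn, syn [2.] ...' into a dict."""
    headword, _, body = line.partition(" - ")
    # Split on Roman-numeral markers ("I.", "II.", ...) that open each POS block.
    blocks = [b.strip() for b in re.split(r"\b[IVX]+\.\s+", body) if b.strip()]
    senses = []
    for block in blocks:
        match = POS_RE.match(block)
        if not match:
            continue  # skip blocks whose part of speech is not recognised
        pos, rest = match.groups()
        # Split on Arabic-numeral sense markers ("1.", "2.", ...).
        for sense in re.split(r"\b\d+\.\s+", rest):
            synonyms = [s.strip() for s in sense.split(",") if s.strip()]
            if synonyms:
                senses.append({"pos": pos, "synonyms": synonyms})
    return {"headword": headword.strip(), "senses": senses}


entry = parse_entry(
    "bank - I. noun 1. shore, riverside 2. financial institution II. verb rely"
)
for sense in entry["senses"]:
    print(entry["headword"], sense["pos"], sense["synonyms"])
```

Each numbered sense becomes its own record, which matches the idea that a dictionary entry is a concept with a set of synonyms rather than a single word.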
- Carabao and its components run natively only on Windows. Are they robust and scalable enough for a large number of users?
While Carabao runs natively only on Windows, it requires far fewer resources than most NLP software packages. Typically, natural language applications (and machine translation in particular) load the entire set of idioms and the lexicon. For every request, the software has to search through thousands of unrelated words. Particularly ill-designed ones use XML DOM to go through tons of data.
Carabao uses a proprietary AdaptiveLoading™ technology which loads only the words, idioms and even grammatical rules used in the current session. The data is then cached for faster access. While Linux distributions can be more robust and cost effective than Windows servers, the high level of optimization in Carabao outweighs the costs of Windows. Furthermore, with the relatively stable ReactOS, a free, open source, Windows-compatible system, it is possible to reduce the cost even more and enjoy the best of both worlds.
- What are the 'sequences'?
'Sequences' are an interlingual means of describing any combination of words, syntax delimiters, and more. Their purpose is to determine the connections between words in a sentence (or, in technical jargon, to create dependency trees in the sentences). They also serve to disambiguate and to generate transfer rules when translating. Sequences are represented by a mini-language similar to regular expressions, which describes grammar, style, ID numbers, or any other aspect of a word in a given context (a 'word' does not have to be white-space delimited). There is no need to know this language, as sequences are normally created with a GUI tool.
- Is 'domain' the same as a topic or subject?
Not always, and not necessarily. A domain is a logical sphere to which most of the terms in a specific fragment belong. A subject is more like a "plot": a main idea or set of main ideas in a text.
The purpose of detecting domains is to give an NLP application a hint on how to disambiguate or interpret a particular word. Knowing a subject might not be of any help in this issue.
- Are 'styles' the same as domains?
Not exactly. Domains are fields of activity, geographical areas, or complex structures where terms, geographical points, components are used. Styles (or linguistic profiling tags) are, simply put, different ways of using the same concepts. For example, the word barrister is a British way of saying lawyer. Both words are related to the domain law. Barrister has a style tag UK, but it is generally not related to the domain of UK. That is, if someone uses the term 'barrister', it doesn't mean they are talking about Great Britain.
|
Licensing
|
- What are the licensing options for Carabao?
The standard edition of the Carabao development suite, which includes a translation console and access to all the tables in the database, is FREE. You can process data on an unlimited number of desktops, create your own databases and even sell them. In fact, we encourage this, and we plan on setting up a virtual marketplace of Carabao databases.
Note: we do not provide any kind of technical or linguistic support for this version by default. It is possible, however, to purchase support services separately.
Carabao Professional Edition includes the same features as Carabao Standard Edition plus Instant Dictionary Creator. The Dictionary Creator module allows for printing customized dictionaries, phrasebooks and thesauri based on the underlying database. For example, a user can print a dictionary of words for a certain domain (e.g., specialized accounting dictionary) or a phrasebook of the most common words. We plan on including integration plug-ins to provide Carabao text processing functionality to internet browsers, office suites and instant messaging software.
Carabao Linguist Edition has mass import and export functionality in addition to the above. It is aimed at the linguists and translators who plan to build and sell their own translation engines. One of the powerful features included in Carabao Linguist Edition is the ability to transform unstructured ASCII dictionaries to XML. While it is possible and completely legal to create and sell databases with other editions, too, they do not have the powerful data manipulation utilities necessary to control large amounts of lexical data.
Components used either by developers or as servers have to be purchased. The price is per server. However, developer and unlimited licenses are available. Typically, an unlimited license costs about 10 times the cost of a regular license. You must purchase this type of license if you plan on renting your Carabao server to your customers.
All non-free products are supplied with a technical and linguistic support package for the first 2 years. In subsequent years, the support costs 12% of the price of the purchased license. Special discounts are available for resellers.
- Why are some of your components so expensive?
As the complexity involved in natural language processing is very high, so are the development costs. However, when compared to other vendors of analogous analytics products, our products are very affordable.
Other vendors never set a price below US$25,000. As license costs on this level are rarely published, we can only refer to 3rd party information sources, such as:
As always with corporate products like these, the bulk of the cost is not even the licensing, but the deployment. For example, Inxight's training cost about US$8,000 per person per day before they were acquired by SAP. Usually this doubles or triples the cost of ownership.
If you get a quote for a text analytics solution lower than ours, bring it to our attention, including the details, and we will beat the price.
In addition, we offer a quality beat guarantee.
- How does your "quality beat guarantee" work?
If you find software that extracts entities or domains more accurately than ours, we will give you a massive discount of 80% off the current price. You need to prove it with at least two fragments of text.
|