Digital Sonata

 intelligent solutions for language processing

Text Analytics and Transformation Solutions

Task

Text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from unstructured machine-readable documents. (From Wikipedia, the free encyclopedia.)

Business data does not always come prepackaged in spreadsheets, relational databases or XML. Developers in many industries frequently face challenging tasks that involve processing unstructured text. As expectations that computers be 'smarter' persist, text mining and text analytics are becoming increasingly important parts of business intelligence.

Solution

Our products "shred" content into a list of numbers, each corresponding to a particular concept behind a word or an idiom. The concepts are semantically interconnected. In addition to the usual phone numbers, emails, locations, addresses, and other kinds of recognizable patterns, the flexible architecture of our products offers a choice of nearly 120,000 entities. It is possible to find, for example, names of illicit drugs (including street names), locations in the US, components of explosive devices, terms related to nuclear physics, or words with British spelling. The same applies to the domains of discourse. As our software is dictionary-based, it does not require training sets.

Thorough text analysis requires much more than simple string matching. For example, the word "nice" in a sentence like 'Nice to see you here' is not a city, while 'Nice is a great place to relax' does mention a city. Our products are able to distinguish between different meanings of the same word.

Sample customer scenarios

Scenario 1: Travel advisory

Situation: A travel advisory website aggregates news feeds. An alert needs to be sent to a supervising editor when a disease outbreak or armed conflict is reported in a particular region.

Solution: Carabao DeepAnalyzer processes the news feeds. The customer's source code searches the collection of ID numbers returned by Carabao for any kind of 'disease', or anything related to 'riots', 'conflicts' or 'terrorism'. If a match is found, the relevant news feed is emailed to the supervisor. For example, out of 10,000 news articles received daily, the supervisor receives one or two alerts.

Note that there is no need to explicitly look for every disease or conflict; as the concepts are linked, the customer's source code only searches for a common parent, e.g. 'disease'.
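The parent lookup described above can be pictured with a minimal Python sketch. It is an illustration only and does not show the Carabao API: the concept IDs, the PARENT table and the function names are all hypothetical, standing in for whatever IDs and links the product actually returns.

```python
# Hypothetical concept IDs and parent links (not real Carabao IDs).
PARENT = {
    101: 100,  # cholera  -> disease
    102: 100,  # malaria  -> disease
    100: 1,    # disease  -> root
    201: 200,  # riot     -> conflict
    200: 1,    # conflict -> root
    301: 1,    # hotel    -> root
}

# The customer's code only lists the common parents it cares about.
ALERT_PARENTS = {100, 200}  # 'disease', 'conflict'


def has_ancestor(concept_id, targets, parent=PARENT):
    """Return True if concept_id or any of its ancestors is in targets."""
    seen = set()
    current = concept_id
    # Walk up the parent chain; 'seen' guards against accidental cycles.
    while current is not None and current not in seen:
        if current in targets:
            return True
        seen.add(current)
        current = parent.get(current)
    return False


def needs_alert(concept_ids, targets=ALERT_PARENTS):
    """Return True if any concept found in an article falls under a target."""
    return any(has_ancestor(c, targets) for c in concept_ids)
```

With these made-up IDs, an article yielding the concepts [301, 102] would trigger an alert because 102 (malaria) sits under 100 (disease), even though 'malaria' was never listed explicitly.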

Costs involved: 1 license of Carabao DeepAnalyzer + about 3 hours of development by an average software developer.

Scenario 2: Corporate semantic search

Situation: A real estate agency holds free-flowing natural language descriptions of its real estate agents. Because the agents' strengths and specializations differ, the agency needs to search their profiles by features rather than keywords, such as expertise with particular demographic groups (adults, young couples, immigrants), special linguistic skills (e.g., Spanish speaker), and personal characteristics (educated, patient, etc.).

Solution: Carabao DeepAnalyzer is used to index word senses in each profile. Using Carabao MorphoLogic's built-in thesaurus, the user is presented with a dropdown list of possible meanings for each term entered in the search query. The indexed profiles are then searched for the disambiguated meanings. At a later stage, it is possible to enhance the result page with a custom dropdown list presenting only features found in the profiles (e.g., sort by demographic group match).
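One way to picture the indexing step is an inverted index keyed by sense ID. The sketch below is an assumption about how a customer might store the results, not a description of Carabao internals; the sense IDs, profile names and function names are hypothetical.

```python
from collections import defaultdict


def build_index(profiles):
    """Map each sense ID to the set of profiles that contain it.

    profiles: dict of profile_id -> iterable of sense IDs
    (hypothetical output of a word-sense analysis step).
    """
    index = defaultdict(set)
    for profile_id, senses in profiles.items():
        for sense in senses:
            index[sense].add(profile_id)
    return index


def search(index, wanted_senses):
    """Return the profiles that contain every disambiguated sense requested."""
    wanted = list(wanted_senses)
    if not wanted:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(wanted[0], set()))
    for sense in wanted[1:]:
        result &= index.get(sense, set())
    return result
```

Because the lookup is by sense ID rather than by string, a profile indexed under one meaning of a word is not returned for a query about a different meaning of the same word.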

Costs involved: 1 license of Carabao DeepAnalyzer + 1 license of Carabao MorphoLogic + 6 to 10 hours of development by an average software developer.

Scenario 3: Newspaper article categorization

Situation: A news agency aggregates news feeds from different sources, and needs to assign labels according to geographical region(s) and relevant subject(s).

Solution: Carabao DeepAnalyzer is used to extract dominant domains of discourse and geographical information. The customer's program simply calls a method of the Carabao DeepAnalyzer class to process the data and extract the associated domains of discourse. Carabao's accuracy and robust algorithms allow it to distinguish between homonymous names, such as Paris, Texas and Paris, France.
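As a rough illustration of what "dominant" could mean on the customer's side, the sketch below counts domain tags and keeps those above a share threshold. This is an assumed post-processing step, not the method Carabao uses; the function name, the tag strings and the min_share parameter are hypothetical.

```python
from collections import Counter


def dominant_domains(domain_tags, min_share=0.3):
    """Return domains whose share of all tags is at least min_share.

    domain_tags: list of domain labels, one per recognized concept
    (hypothetical output of an analysis step).
    """
    counts = Counter(domain_tags)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() yields the domains from most to least frequent.
    return [domain for domain, n in counts.most_common() if n / total >= min_share]
```

The threshold is a design choice: a lower value assigns more labels per article, a higher value keeps only the strongest subject.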

Costs involved: Site license of Carabao DeepAnalyzer + 1 hour of development by an average software developer.

Scenario 4: Bulk contact details extraction

Situation: A customer service operator needs to search for loosely defined resellers of a particular product and contact them by all available means (email, fax, phone) to communicate emergency announcements, such as a product recall, from the manufacturer. Phone and fax assignments are distributed on a geographical basis.

Solution: Carabao DeepAnalyzer is used to extract names, phone numbers, emails and locations. Names are associated with the phone numbers, emails and locations. If information is missing or seems to be incomplete, an alert is sent to a supervisor. The bulk of the faxing and emailing operations is automated, minimizing personnel costs.
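The completeness check mentioned above might look like the following Python sketch. It substitutes two simple regular expressions for the actual extraction, so it is far cruder than a dictionary-based analyzer and is not the Carabao API; the patterns (North American style phone numbers only) and the function names are assumptions for illustration.

```python
import re

# Deliberately simple patterns; real-world contact data needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def extract_contacts(text):
    """Collect every email address and phone number found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


def missing_fields(record):
    """List the contact fields that came back empty, to flag for a supervisor."""
    return [field for field, values in record.items() if not values]
```

A record for which missing_fields returns a non-empty list is one where the supervisor alert would be raised.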

Costs involved (mining phase only): 1 license of Carabao DeepAnalyzer + 8 hours of development by an average software developer.

Scenario 5: Style manipulation

Situation: A word processor's grammar checker incorporates functionality to offer substitutes for certain words and phrases (obsolete terms, obscenities, slurs, regionalisms, etc.).

Solution: Carabao MorphoLogic is used to obtain stylistic information and to list possible alternatives to the specified words. The word processor's GUI lists the alternative words along with a thesaurus entry.
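On the word processor side, the flow could resemble the sketch below. The tiny STYLE_DICT stands in for the stylistic information and alternatives that the product would supply; its entries, the labels and the suggest function are hypothetical and do not reflect the MorphoLogic API.

```python
import re

# Hypothetical style dictionary: word -> stylistic label and alternatives.
STYLE_DICT = {
    "whilst": {"label": "regional", "alternatives": ["while"]},
    "thou": {"label": "obsolete", "alternatives": ["you"]},
}

WORD_RE = re.compile(r"[A-Za-z']+")


def suggest(text, style_dict=STYLE_DICT):
    """Return (offset, word, label, alternatives) for each flagged word."""
    suggestions = []
    for match in WORD_RE.finditer(text):
        entry = style_dict.get(match.group().lower())
        if entry:
            suggestions.append(
                (match.start(), match.group(), entry["label"], entry["alternatives"])
            )
    return suggestions
```

Returning the character offset lets the GUI underline the flagged word in place and attach the list of alternatives to it.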

Costs involved: Developer license of Carabao MorphoLogic + about 10 hours of development by an average software developer.

Conclusion

Our products are built on a powerful generic basis, and the possibilities are numerous. If you are not sure whether our software can handle your task, or want to evaluate the costs involved, please feel free to request a free evaluation with no obligation.