What Recognant Is
Recognant is the single most feature-complete NLP engine available, and the fastest. It is built full stack, without license-restrictive third-party libraries. More importantly, it is a means of unlocking unstructured data.
The vast majority of data is not in tables or spreadsheets; it is in text. Computational linguistics can pull some of this data out, like word frequency or numbers in very specific formats, but overall, text is a mystery to big-data mining solutions.
Recognant converts text to data and metadata. Depending on the operation that needs to be performed, the information from text can be stored for querying later or converted to structured data for use in real time. The data Recognant generates has some fairly geeky labels, a result of the people who work in this space being very technical. In some cases the feature names are misnomers: the functionality is named after how other systems generate the data, and while Recognant can generate the same information, the mechanism may be very different. It is also worth noting that not all of the functionality is uniform across products. Sentiment, for example, is subjective because of both cultural bias and industry bias: calling a wrestler small is an insult, but calling a silicon chip small is a compliment. As a result, not all systems with a given feature set are interchangeable, and not all features are commoditized.
Part of Speech Tagging
If you remember diagramming sentences in school, you have done this. Part-of-speech tagging is useful for tasks like determining which adjectives are being used to describe which nouns, as part of business intelligence where you might want to know that people describe your product as awesome in 6% of articles but underpowered in 7%. Part-of-speech tagging is the fundamental feature of an NLP engine; any system that doesn't have one is not doing NLP, it is doing computational linguistics.
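The adjective-to-noun use case above can be sketched with a toy tagger. The lexicon, tag names, and noun fallback here are hypothetical illustrative choices, not how Recognant (or any production engine) actually tags:

```python
# Minimal illustrative part-of-speech tagger: a tiny hand-built lexicon
# with a noun fallback for unknown words. Real engines use statistical
# models; this only demonstrates what the output enables.
LEXICON = {
    "the": "DET", "a": "DET",
    "car": "NOUN", "product": "NOUN",
    "is": "VERB", "seems": "VERB",
    "awesome": "ADJ", "underpowered": "ADJ",
}

def pos_tag(tokens):
    return [(t, LEXICON.get(t.lower(), "NOUN")) for t in tokens]

def adjectives_for(tokens):
    """Collect adjectives that directly follow a verb of being."""
    tagged = pos_tag(tokens)
    return [word for (word, tag), prev in zip(tagged[1:], tagged)
            if tag == "ADJ" and prev[1] == "VERB"]

print(adjectives_for(["the", "product", "is", "awesome"]))  # ['awesome']
```

Counting such adjective hits across a corpus is what yields the "awesome in 6% of articles" style of metric.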
Sentence Disambiguation
A period is not always the end of a sentence, and as simple as it is for humans, even figuring out where sentences start and end is a chore for computers, and it makes a big difference in how documents are tagged. If content extraction is going to be done, this is even more important: building a summary or pulling a relevant sentence out of a document doesn't work if the computer thinks U. S. S. R. is four sentences. Sentence Disambiguation has applications even in simple tasks like creating text snippets around a word or phrase in search, or calculating reading difficulty based on average sentence length.
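An abbreviation-aware splitter illustrates why this is harder than "split on periods." The abbreviation list below is a small hypothetical sample; production systems use much larger lexicons plus statistical cues:

```python
# Sentence-boundary sketch: a period ends a sentence only if the token
# carrying it is not a known abbreviation. Toy abbreviation list.
ABBREVS = {"u.", "s.", "r.", "mr.", "dr.", "u.s.s.r."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("The U.S.S.R. fell in 1991. Mr. Smith wrote about it."))
```

Without the abbreviation check, the same input would come back as five "sentences" instead of two.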
Word Sense Disambiguation
Knowing whether someone is using stock to describe shares in a company, a broth for soup, the inventory on hand, or animals ready for market is essential when someone is searching for how to invest in stock or how to use stock in cooking. The ability to disambiguate is limited by the context in which words appear, just as it is for a human. Determining what stock means in the sentence “I am going to buy some stock this afternoon” is next to impossible, so the most common sense, “shares in a company,” is used. A system is also not as smart as a human: if an article is primarily about financial instruments and then contains “Bob’s mom stopped by to see if he wanted lunch, and brought an amazing stock of items from her garden,” it is unlikely that a system will detect the shift. Recognant is far better than average because it has access to the prominence of words on the web, and it leverages this data to get the answer more often in short texts where there is less to build context from.
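The context-overlap idea can be sketched with a simplified Lesk-style scorer. The sense signatures and the fallback-to-most-common rule are toy assumptions, not Recognant's mechanism:

```python
# Word-sense disambiguation sketch: score each candidate sense of
# "stock" by overlap between its signature words and the sentence's
# context; fall back to the most common sense when there is no signal.
SENSES = {
    "shares": {"buy", "sell", "market", "invest", "company"},
    "broth": {"soup", "cook", "simmer", "kitchen", "recipe"},
    "inventory": {"warehouse", "hand", "shelf", "supply"},
}

def disambiguate(context_words, default="shares"):
    ctx = {w.lower() for w in context_words}
    scores = {sense: len(sig & ctx) for sense, sig in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(disambiguate("how to use stock in cooking a soup".split()))  # broth
```

Note how “I am going to buy some stock this afternoon” resolves to “shares” only because of the single signal word buy; with no signal at all, the sketch falls back to the default, exactly the behavior described above.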
Keyword Extraction
Really dumb systems determine keywords by counting all of the words. Better systems discount the stop words (the, a, of, etc.). The best systems look at the frequency of words relative to their frequency in all documents. Recognant looks at the relationship of words in context and the prominence of words in popular use on the Internet, and calculates based on that, but it also gives the output for those other methods. The result is keyword tagging that is much better than any other system’s, and because it supports Named Entities, it can also look at variations on the use of words so that a mention of a first name or last name counts toward the importance of the full name in the index.
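The "frequency relative to all documents" approach is essentially TF-IDF, sketched below. The toy background corpus stands in for the web-scale prominence data the paragraph describes:

```python
import math
from collections import Counter

# TF-IDF keyword sketch: weight a word by its in-document frequency
# times the log-inverse of how many background documents contain it,
# so ubiquitous words like "the" score near zero.
def tfidf_keywords(doc, corpus, top_n=3):
    tf = Counter(w.lower() for w in doc.split())
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d.lower().split())
        scores[word] = count * math.log((1 + n_docs) / (1 + df))
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

corpus = ["the cat sat", "the dog ran", "the fish swam"]
print(tfidf_keywords("the telescope saw the nebula", corpus, top_n=2))
```

Even though "the" appears twice in the document, its ubiquity in the corpus drives its score to zero, which is the stop-word discounting effect described above falling out for free.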
Summarization and Sentence Significance
Summarization is what it sounds like: the ability to shorten a document. This is done using Sentence Significance and Sentence Dependency. Sentence Significance ranks the importance of each sentence in a document, and Sentence Dependency determines whether a sentence can stand alone or whether the previous or next sentence is required for context. Dependency arises when a sentence only uses “he,” “she,” or “it” and doesn’t name the entity being talked about, or when a sentence starts with “Conversely,” so that it is a continuation of the thought from the previous sentence.
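Both halves of that pipeline can be sketched crudely. Scoring by summed word frequency and the particular pronoun/connective lists are illustrative assumptions only:

```python
from collections import Counter

# Summarization sketch: rank sentences by summed word frequency
# (a crude Sentence Significance), and flag sentences that need a
# neighbor for context (a crude Sentence Dependency check).
PRONOUNS = {"he", "she", "it", "they"}
CONNECTIVES = {"conversely", "however", "therefore"}

def sentence_significance(sentences):
    freq = Counter(w.lower() for s in sentences for w in s.split())
    return [sum(freq[w.lower()] for w in s.split()) for s in sentences]

def depends_on_neighbor(sentence):
    words = [w.strip(".,").lower() for w in sentence.split()]
    return words[0] in CONNECTIVES or words[0] in PRONOUNS

print(depends_on_neighbor("Conversely, the price rose."))  # True
```

A summarizer would keep the top-ranked sentences, pulling in a dependent sentence's neighbor whenever the dependency check fires.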
Sentiment Analysis
Sentiment Analysis is determining whether a sentence is positive or negative with regard to its subject. It has to be done phrase by phrase rather than document by document, because a sentence that says, “WonderCar fails to deliver on the promise that BlinkyCar achieves” is negative for WonderCar and positive for BlinkyCar. Such sentence constructs, along with sarcasm, are the reason most competitors in this space achieve only about 60% accuracy. If a system isn’t 80% or better, one might as well flip a coin rather than rely on it for analysis.
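The WonderCar/BlinkyCar example can be reproduced with a toy per-entity scorer. The polarity lexicon and the nearest-verb heuristic are illustrative assumptions, not a real sentiment model:

```python
# Phrase-level sentiment sketch: give each entity the polarity of the
# nearest sentiment-bearing verb, instead of one score per document.
POLARITY = {"fails": -1, "achieves": 1, "delivers": 1, "disappoints": -1}

def entity_sentiment(sentence, entities):
    tokens = sentence.replace(",", "").split()
    verb_positions = [(i, POLARITY[t.lower()]) for i, t in enumerate(tokens)
                      if t.lower() in POLARITY]
    result = {}
    for entity in entities:
        pos = tokens.index(entity)
        nearest = min(verb_positions, key=lambda vp: abs(vp[0] - pos))
        result[entity] = nearest[1]
    return result

s = "WonderCar fails to deliver on the promise that BlinkyCar achieves"
print(entity_sentiment(s, ["WonderCar", "BlinkyCar"]))
```

A document-level scorer would average the two verbs to roughly neutral, losing exactly the per-entity signal the paragraph says matters.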
Alliteration Detection
Alliteration is the repetition of initial letters for effect, but it can create tongue twisters or documents that are difficult to read aloud. Alliteration Detection finds these phrases so that they can be used to characterize writing style, or to judge the odds that a phrase in a sales or call script will cause problems for the speaker.
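Detection itself can be sketched as a scan for runs of same-initial words; the run-length threshold of three is an arbitrary illustrative choice:

```python
# Alliteration sketch: report any run of three or more consecutive
# words that share an initial letter.
def find_alliteration(text, min_run=3):
    words = [w for w in (t.strip(".,!?").lower() for t in text.split()) if w]
    runs, start = [], 0
    for i in range(1, len(words) + 1):
        # Close the current run at end of text or on an initial change.
        if i == len(words) or words[i][0] != words[start][0]:
            if i - start >= min_run:
                runs.append(" ".join(words[start:i]))
            start = i
    return runs

print(find_alliteration("She sells sea shells by the shore"))
```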
Named Entity Recognition
This is also called Noun Entity Extraction. There is a difference between Sarah from Austin and Sarah Austin. Named Entity Recognition is the ability to correctly tag documents based on multi-word nouns rather than the single words that make up an entity’s name.
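The Sarah Austin distinction can be shown with a naive grouping rule: consecutive capitalized words form one entity. Real NER uses far more evidence (and this rule mishandles sentence-initial words); it is only a sketch:

```python
# Named-entity sketch: merge consecutive capitalized words into one
# multi-word entity, so "Sarah Austin" is a single name while
# "Sarah from Austin" yields two separate entities.
def extract_entities(sentence):
    entities, current = [], []
    for token in sentence.strip(".").split():
        if token[0].isupper():
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(extract_entities("Sarah Austin met Sarah from Austin"))
```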
Clustering
Clustering is a fancy word for identifying related content. Clustering looks at how similar the topics of two documents are. Rather than doing the actual clustering, NLP tools really provide the metadata necessary to determine the strength of the relationship, because a cluster around a verb may not be relevant to someone researching a given noun.
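One common form of that relationship-strength metadata is cosine similarity over term vectors, sketched here on raw bag-of-words counts as an illustrative simplification:

```python
import math
from collections import Counter

# Relationship-strength sketch: cosine similarity between bag-of-words
# vectors, the kind of score an NLP layer can hand to a downstream
# clustering step.
def cosine_similarity(doc_a, doc_b):
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("stock market prices rose", "stock prices fell"))
```

A clustering step would then group documents whose pairwise similarity clears some threshold, possibly after filtering the vectors to nouns only for the reason given above.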
Author Profiling
A variation of Plagiarism Detection, Author Profiling builds a vocabulary and writing-style profile for an author. This can be used to detect forgeries, to detect when an author was sponsored to say something, or to detect when the author is going through a psychologically draining life event.
Psychographic Profiling
There is a vocabulary and a set of euphemisms representative of any given psychographic. By using data from a body of work produced by that psychographic, it can be determined whether an author is also of that psychographic.
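A profile of the kind both of these features rely on can be sketched with two crude stylometric signals. Average sentence length and vocabulary richness are illustrative stand-ins; real profiling uses many more features:

```python
# Stylometry sketch: a tiny fingerprint built from average sentence
# length and vocabulary richness (distinct words / total words).
# Comparing fingerprints across documents is the profiling step.
def style_profile(text):
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.replace(".", " ").lower().split()
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "vocab_richness": len(set(words)) / len(words),
    }

print(style_profile("Short words win. Short words always win."))
```

A sudden shift in an author's fingerprint relative to their own body of work is the signal behind forgery and sponsorship detection.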
Fact and Statistic Extraction
Unstructured data contains lots of information that could be made into structured data. “Nearly half of all people are female” is equivalent to ~50%. These kinds of extractions are only possible through NLP. By finding a fact and then searching other documents to corroborate it, fact checking and validation can be done.
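The "nearly half → ~50%" normalization can be sketched with a phrase table plus a percent pattern. The phrase table is a small hypothetical sample:

```python
import re

# Statistic-extraction sketch: normalize quantity phrases and explicit
# percentages into structured approximate values.
QUANTITY_PHRASES = {
    "nearly half": 0.5,
    "about half": 0.5,
    "a third": 1 / 3,
    "a quarter": 0.25,
}

def extract_statistic(sentence):
    lowered = sentence.lower()
    for phrase, value in QUANTITY_PHRASES.items():
        if phrase in lowered:
            return {"phrase": phrase, "approx_value": value}
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", sentence)
    if match:
        return {"phrase": match.group(0), "approx_value": float(match.group(1)) / 100}
    return None

print(extract_statistic("Nearly half of all people are female"))
```

Once facts are in this structured form, corroborating them is a matter of extracting the same statistic from other documents and comparing values.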
Character Language Modeling
Character Language Modeling measures the statistical probability of a given text existing. This is essentially a way to determine whether a document is gibberish, highly technical, or a series of cliché phrases. Each word’s prominence is used in relation to other words to determine the odds of the text occurring.
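The idea is easiest to see at the character-bigram level. The tiny training corpus and add-one smoothing below are illustrative choices:

```python
import math
from collections import Counter

# Character-bigram language model sketch: train counts on a tiny
# corpus, then score a string by its smoothed log-probability.
# Gibberish scores far lower than ordinary text.
def train_bigrams(corpus):
    counts, totals = Counter(), Counter()
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def log_prob(text, counts, totals, alpha=1.0, vocab=27):
    lp = 0.0
    for a, b in zip(text, text[1:]):
        # Add-one smoothing over an assumed 27-symbol alphabet.
        lp += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * vocab))
    return lp

counts, totals = train_bigrams(["the cat sat on the mat", "the dog sat"])
print(log_prob("the cat", counts, totals) > log_prob("xqz jvk", counts, totals))  # True
```

Unusually low probability flags gibberish; unusually high probability across long spans flags cliché-heavy text.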
Isms
Isms are simple structures in language that define something as another thing. “Obama is the president” would be an example. Isms are not limited to “is” but can use any verb of being, for example “are,” “was,” or “were.” The best isms are reciprocal isms, where the ism works in both directions, as in “George Washington was the first president of the United States,” which works just as well as “The first president of the United States was George Washington.”
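The pattern lends itself to a simple extraction sketch. The regex below is deliberately naive and only handles a single clause:

```python
import re

# "Ism" extraction sketch: match "<subject> <verb of being> <object>"
# and return the triple. Misses negation, subclauses, etc.
ISM_PATTERN = re.compile(r"^(.+?)\s+(is|are|was|were)\s+(.+?)\.?$")

def extract_ism(sentence):
    match = ISM_PATTERN.match(sentence)
    return (match.group(1), match.group(2), match.group(3)) if match else None

print(extract_ism("Obama is the president"))  # ('Obama', 'is', 'the president')
```

For a reciprocal ism, running the extractor on both orderings yields the same subject/object pair with the roles swapped, which is what makes those triples the most useful.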