
Challenges to Big Data


Big Data adoption has been slow, for reasons ranging from the difficulty of analyzing the data to the difficulty of simply getting it all normalized. GE conducted a study of the top challenges facing Big Data implementation, and the chart below shows the results.

While some of these issues, such as the #1 concern of security, are legal and user issues more than technology issues, consolidation of disparate data, collection, and quality are all things that Recognant addresses. The talent required to interface with the data is also addressed by NLP. All in all, 6 of the top 10 issues above are addressed by Recognant, and for 54% of respondents their #1 issue is addressed by Recognant. That is a powerful step toward reaching the promise of Big Data. Thus far the barriers to using NLP have been cost and speed.

Big Data has had the most impact in industries with lots of structured data and smaller volumes of it. Real Estate, Supply Chains, and Procurement, where data is straightforward and easily obtained, have seen strong growth in the use of Big Data. Sales, Finance, and Production are seeing products that leverage Big Data. These are areas of great opportunity, but the greatest potential markets, and the ones where the least impact has been realized so far, are Customer Service, Marketing, and BizDev. These areas haven't seen the same growth because the metrics are softer and the data gathering is harder. Knowing what makes a good customer is often contained in the words on their website rather than in a column on a spreadsheet. A company that works in Cosmetics could sell them, manufacture them, or apply them, and telling the difference from data alone is not easy. Trying to sell a product to a company that isn't a fit, or is a competitor, is a waste of both companies' time.


Insights you can see. Actions you can take.

Correlation and causation can be hard to determine, but the ability to visualize data extracted from text can drastically improve the likelihood that the insights you draw from data are actual trends, not just the result of coincidence.


Knowing where your users are is far less helpful than knowing which ones actually like your product, as opposed to which ones just happened to buy it because of the marketing.

The above image is a map of the sentiment of comments about a large dating app. Only comments where the person mentioned a city, county, or other location were included, which left 4,763 data points.

Sentiment analysis is mapped with red being negative, green neutral, and blue positive.
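As a sketch of the color mapping described above, this groups geocoded comments by city and assigns each city a color from its average sentiment score. The thresholds and the (city, score) input format are illustrative assumptions, not Recognant's actual output.

```python
def sentiment_color(score):
    """Map a sentiment score in [-1, 1] to a map color.

    The cutoffs here are assumptions for illustration, not the
    values used for the actual dating-app map.
    """
    if score < -0.2:
        return "red"    # negative
    if score > 0.2:
        return "blue"   # positive
    return "green"      # neutral


def city_colors(comments):
    """Average the scores per city, then color each city."""
    totals = {}
    for city, score in comments:
        s, n = totals.get(city, (0.0, 0))
        totals[city] = (s + score, n + 1)
    return {city: sentiment_color(s / n) for city, (s, n) in totals.items()}
```

A heat-map tool then only needs the resulting city-to-color table to render the map.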

There is a lot more blue on the East Coast, while the West Coast is mostly neutral, with some positive and some negative.

This chart shows a lot of information about adoption, market fit, and user experience. No machine learning required.

Where ML shines is that if this map is combined with the number of single men vs single women in a given area, an ML system can see that the highest satisfaction comes from places with the most single women.
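The relationship described above can be quantified with something as simple as a Pearson correlation between the two per-city series. The numbers below are made up for illustration; only the technique is the point.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-city pairs: share of singles who are women,
# and average comment sentiment in that city.
women_share = [0.42, 0.48, 0.51, 0.55]
satisfaction = [-0.3, 0.0, 0.2, 0.5]
r = pearson(women_share, satisfaction)  # close to 1.0 for this toy data
```

An ML system does this kind of cross-referencing automatically, across many candidate variables at once.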

This map was generated from the data Recognant output after processing 35,000 web pages, then visualized with OpenHeatMap.

We are under NDA with the client, but they agreed to allow us to share the visualization.

Natural Language Processing and Understanding Applied to Search

Google focuses on Internet search because their algorithm doesn't really work for internal documents; they rely on the sheer volume of documents to provide answers to questions. Essentially, they rely on the fact that everything on the Internet is written multiple times, so the various ways you might search for something are all likely to find it. This is not true of most corporate intranets, nor is it true for topics of research.

Often what we are searching for and what someone wrote are very different. Consider the example of a computer processor. Say you work at Intel and someone asks how fast the new Octium processor is. "Fast" could be the clock speed, 6.2 GHz; or it could be the number of petaflops, 1.7; or it could be a multiple of the Pentium it replaced, 9.5x. It is unlikely that the docs say "fast" anywhere near these metrics.

Natural Language Processing and Understanding solve this issue. When combing through the docs on the intranet or the Internet, the software indexes sentences that contain speeds, whether expressed as MHz, GHz, petaflops, teraflops, or "9.5x the speed of." There are dozens of ways to say something is faster.
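A crude version of this indexing step can be sketched with a regular expression over sentence splits. The patterns below cover only the phrasings mentioned above; a real indexer would recognize far more, and the sample document is invented for illustration.

```python
import re

# Illustrative patterns only: numeric values followed by a speed unit,
# or the "Nx the speed of" comparison phrasing.
SPEED_RE = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:MHz|GHz|petaflops|teraflops)\b"
    r"|\b\d+(?:\.\d+)?x the speed of\b",
    re.IGNORECASE,
)

def speed_sentences(text):
    """Return the sentences that mention a speed, however it is phrased."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if SPEED_RE.search(s)]

doc = ("The Octium runs at 6.2 GHz. It ships next fall. "
       "It delivers 1.7 petaflops, which is 9.5x the speed of the Pentium.")
hits = speed_sentences(doc)  # the first and third sentences
```

The same idea, done properly, is what lets a single "how fast" query hit every unit a writer might have used.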

When results are returned, the relevant sentences can be highlighted so that the person searching can pick which measure is most appropriate for their needs, without having to do a bunch of searches for things like "Octium GHz," which isn't "natural" at all.

This extends to non-technical searches as well. Consider asking "which presidents died in office." To answer this question you have to look for presidents whose year of death matches the year their term ended. Finding this answer requires parsing either structured data, or structuring data from documents and then running the calculation against it.
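Once the documents have been structured, the calculation itself is trivial. The records below are a hypothetical extract (real presidents, but only three of them, with years in place of full dates), applying the same heuristic the text describes.

```python
# Hypothetical structured records extracted from documents; a complete
# answer would need all presidents and actual dates, not just years.
presidents = [
    {"name": "Abraham Lincoln", "term_end": 1865, "died": 1865},
    {"name": "Ulysses S. Grant", "term_end": 1877, "died": 1885},
    {"name": "Franklin D. Roosevelt", "term_end": 1945, "died": 1945},
]

def died_in_office(records):
    """Apply the heuristic above: year of death equals final term year."""
    return [p["name"] for p in records if p["died"] == p["term_end"]]
```

The hard part is not this filter; it is producing the structured records from free text in the first place.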

Currently Google can't even distinguish between a question and a phrase to be found. "Why does…" is not a phrase a user is seeking in a document; it is the preface to a question they want answered. Without NLP, answering direct questions is not possible.
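Even the first step, telling a question from a search phrase, can be sketched with a simple heuristic. This toy classifier (the word list is an assumption, and real NLP goes far beyond it) shows the distinction the text is drawing.

```python
# A minimal interrogative-word list; real systems parse the full sentence.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which",
                  "is", "are", "can", "do", "does", "did"}

def is_question(query):
    """Crude heuristic: a query that starts with an interrogative word,
    or ends with '?', is a question to answer, not a phrase to find."""
    q = query.strip().lower()
    if q.endswith("?"):
        return True
    words = q.split()
    return bool(words) and words[0] in QUESTION_WORDS
```

Routing a query down an "answer this" path versus a "find this phrase" path is exactly the decision this heuristic approximates.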

Speaking the Same Language

Recognant is a solution for converting English into intents and metrics for consumption by data processing, action systems, and query systems.

If someone who doesn't speak your language asks you a question, it doesn't matter how smart you are: you can't help them. This is the problem facing everyone working on AI, Data Mining, Search, and Voice Interfaces. The computer can't understand what you are asking, all the data on the web, or even a simple command, because it doesn't "speak" English.

You can't have cognitive computing without Natural Language Processing and Understanding. For a computer to give a cognitive response, it needs to be able to parse the language. Every one of our competitors is using a language engine that is more than 30 years old, has "topped out" on how far it can go, and is lacking. Our engine is just getting started and already surpasses the old engines in many ways. Our NLP is based on cognitive theories, and this makes us uniquely positioned to have a huge impact on the field.

One of our favorite stories about why Recognant exists is told by our CTO:

AI and Deep Learning can't happen without NLP/NLU. IBM's Watson and Microsoft's Cortana Predictions use a small amount of NLP, but for the most part they work like Google: they just look at key phrases and do n-gram analysis.

Philosophiae Naturalis Principia Mathematica

I have a copy of Philosophiae Naturalis Principia Mathematica by Sir Isaac Newton on my night stand. It is in Latin.

About once a month someone picks it up and says “What’s that?”

And I resist the urge to say “As the most famous work in the history of the physical sciences there is little need to summarize the contents.” (J. Norman, 2006)

And instead I say “That is the book that every mathematician and scientist who has ever done something great in the past 200 years read.”

Then they say “It’s not in English.”

And I say “Nope.”

And then I say, “That’s a reminder of why I don’t have to fear computers for a while. I am seemingly the only person working to teach computers enough English that they can learn from the most important books in the history of mankind.”

And then they say, “But this is in Latin.”

And I say, "There is a good translation into English of almost everything worth reading."

And they say, "Oh, I don't get it."

Most people don't get it. It seems obvious once you point it out, but language is so fundamental to everything we do, and computers so constantly display text we can read (and Siri talks to us), that people never stop to think about how all the unstructured data in the world is completely inaccessible to computers.