Most companies make great efforts to listen to their customers. They may sponsor a survey, assemble a focus group, or even hire a market research firm. Such efforts emphasize getting specific answers to specific questions, i.e., structured data. Collecting structured data requires that you choose the structure before you collect it. But what’s on a customer’s mind may not fit into the categories you define. And of course, structured data can’t possibly answer questions that you don’t even know you want to ask!
What if you could find out what your customers are thinking, without even asking them? What if you could do it in real time? What if you got information about new, critically important ideas before you even knew they existed? What if you didn’t have to pay somebody to type it into readable format, because your customers were doing that for you?
In all likelihood, you’re receiving such data from your customers already. If you have a feedback form on your website, host a customer community, or just have an e-mail address your customers can write to, then you’re getting this data on a daily basis. Moreover, customer conversations are taking place all the time in online forums like discussion groups, review sites, social networks, blogs, and Twitter. In every channel, your customers are continuously expressing what’s on their minds, and while they don’t necessarily expect a response, they do expect you to listen.
However, the greatest challenge in leveraging such information is that it’s unstructured and difficult to measure. Customers don’t express their thoughts and emotions on a numerical scale. Rather, they will often provide loosely associated details about their experiences, telling you “how” and “why,” revealing how they feel about things – in a myriad of fluid, informal, and uncontrollable formats – and telling you what you never even knew you’d want to know.
Using Unstructured Data
Obtaining immediate value from unstructured comments is not a problem when the volume is small; all you have to do is read the verbatim text. But for a company discussed in thousands of social media posts, or receiving countless comments through direct feedback channels, reading every comment is extremely expensive and time-consuming. Nevertheless, there’s so much useful information in unstructured customer comments that many companies would rather bear the expense of having them reviewed and classified manually than ignore the voice of their customers altogether.
However, there are several limitations to this manual approach. First, it’s extremely expensive, with an estimated cost of $1 to $4 per record manually processed. Second, it’s time-consuming – much faster than waiting for the results of a market research study, but not fast enough to be considered actionable. Third, manual review by necessity limits its focus to a small sample of relevant messages, and the results must then be statistically extrapolated to the entire population of relevant messages to achieve insights. Finally, the process of manually reading and categorizing such text is subject to human error and fatigue. As a result, leading companies have been turning to automated, less expensive, and more accurate ways to evaluate the mountains of unstructured data at their disposal.
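To make the cost and sampling trade-off concrete, here is a rough back-of-the-envelope sketch. The $1 to $4 per-record range comes from the estimate above; the monthly volume, the sample size, and the function names are hypothetical, and the margin-of-error formula is the standard approximation for a simple random sample rather than anything specific to a particular review process.

```python
import math

def manual_review_cost(records, low_per_record=1.0, high_per_record=4.0):
    """Return the (low, high) cost range for manually processing `records` comments."""
    return records * low_per_record, records * high_per_record

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

monthly_comments = 50_000  # hypothetical volume, for illustration only
low, high = manual_review_cost(monthly_comments)
print(f"Reviewing everything: ${low:,.0f} to ${high:,.0f} per month")

sample = 400  # hypothetical sample size
sample_low, sample_high = manual_review_cost(sample)
print(f"Reviewing a sample of {sample}: ${sample_low:,.0f} to ${sample_high:,.0f}, "
      f"with a margin of error near +/-{margin_of_error(0.5, sample):.1%}")
```

Sampling keeps the bill manageable, but every figure it produces is an estimate with an error band, and rare topics may not appear in the sample at all.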
Automated Text Analysis: Linguistics vs. Statistics
The development of algorithms for finding meaning in unstructured text has become a dedicated discipline of artificial intelligence known as “text analysis.” There are two principal approaches. The first is to have the system automatically extract the concepts and meaning in text messages. Such “linguistics-based” approaches typically involve building a comprehensive taxonomy of the language, and clustering the text data to find characteristics that map to pre-defined rules and dictionaries. The advantage of this method is that, after initial configuration, it’s entirely automatic.
However, such “linguistic” approaches to text analysis have several unique disadvantages. First and foremost, they require extensive – and expensive – tuning prior to deployment. Linguists and other experts must first define the concepts, themes, and rules by which unstructured text will be evaluated. These parameters and dictionaries must then be tuned to process industry-specific terms, slang, and the grammatical errors and poor spelling inherent in most consumer-generated text. The result is a large and often-unwieldy collection of custom concept libraries that require even larger sets of unstructured data (up to hundreds of thousands of records) in order to give the linguistic taxonomy enough observations to reveal themes of interest.
The themes that such language-oriented systems identify may also not relate to what you’re most interested in. If the concept “customer dissatisfaction” doesn’t naturally emerge from the automated analysis, you won’t get any information about it from this approach. Defining the classification rules independently of the type of data being evaluated creates an analytical bias for determining what is significant.
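A minimal sketch of the dictionary-driven idea shows why tuning matters so much. The concept library, its terms, and the `tag_concepts` function below are invented for illustration; a real linguistic system would add taxonomies and grammatical parsing rules on top of anything this simple.

```python
import re

# Hypothetical concept library: each concept maps to a hand-picked set of terms.
CONCEPT_LIBRARY = {
    "legal issues": {"lawsuit", "lawyer", "attorney", "sue"},
    "billing": {"invoice", "refund", "charge"},
}

def tag_concepts(text, library=CONCEPT_LIBRARY):
    """Return the concepts whose dictionary terms appear in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(concept for concept, terms in library.items() if tokens & terms)

print(tag_concepts("My lawyer will hear about this charge"))
print(tag_concepts("u guys r gonna get sewed lol"))
```

The first message matches both concepts. The second, a misspelled legal threat, matches nothing until someone adds the misspelling to the dictionary, which is the kind of tuning effort described above.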
A second, more accurate and cost-effective approach is to use human knowledge to “train” a computer to recognize concepts of interest. Such “machine-learning” methods apply statistics and artificial intelligence to identify important patterns and correlations inherent in unstructured text that are sometimes impossible to deduce manually. For example, a business user interested in measuring “customer dissatisfaction” can choose that as one of the concepts they wish to identify, and can then train advanced algorithms to recognize that concept by examining example verbatim representative of that thematic category, and creating a model of the words and patterns that are characteristic of that category. The machine learning process only requires a small sample of data (roughly 1,000 records) and produces a model that has “learned” how to classify customer dissatisfaction when presented with new, unknown data.
With statistics-driven machine-learning, automation takes over the classification effort, using the samples provided to identify how to distinguish between those verbatim that fit a category and those that don’t.
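The training step can be sketched in a few lines. The example below uses a naive Bayes classifier, which is one common statistical technique and is chosen here only for illustration; the source does not say which algorithm any particular product uses. The class name, the category labels, and the six training comments are all hypothetical, and a real model would be trained on far more examples.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesTextClassifier:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # prior for this category
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labeled examples supplied by a business user.
training = [
    ("the printer stopped working and support never called back", "dissatisfaction"),
    ("terrible service, I want a refund", "dissatisfaction"),
    ("very disappointed with the late delivery", "dissatisfaction"),
    ("great product, works perfectly", "other"),
    ("thanks for the quick and friendly support", "other"),
    ("how do I change my shipping address", "other"),
]
texts, labels = zip(*training)
model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("I am disappointed and want a refund"))
```

The user never writes a rule. The labeled examples define the category, and the model works out which words distinguish it.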
Supervised machine-learning enables text categorization that accurately classifies unstructured verbatim text based on how customers are actually expressing themselves about their brand, products, or service. This includes the ability to handle unique expressions, grammatical variations, and other characteristics of speech. The approach benefits from reduced analytical bias, real-time processing, rapid deployment, and increased accuracy. However, the biggest advantage of a supervised learning approach is that users can define any category whatsoever, from the very concrete (“problem with our latest printer cartridge”) to the most abstract (“good poetry” vs. “bad poetry”).
Statistical text analysis is further enhanced by unsupervised learning algorithms. These algorithms examine data outside of any “training set” or pre-defined categories to identify clusters of data that are inherently similar. In some cases we don’t necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and grouping them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data where data points belonging to a cluster are as similar to each other as possible, while making the difference between the clusters as high as possible. By finding relationships inherent in clustered data, unsupervised learning helps companies identify new categories and emerging topics of conversation that they may not have been listening for in the first place.
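The clustering objective can be illustrated with a small sketch. It uses average-linkage agglomerative clustering over word-overlap (Jaccard) similarity, one of several possible unsupervised techniques and not necessarily the one any given product uses. The function names, the similarity threshold, and the four comments are assumptions made for the example.

```python
import re
from itertools import combinations

def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Share of words two comments have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def average_similarity(c1, c2, docs):
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(docs[i], docs[j]) for i, j in pairs) / len(pairs)

def cluster(texts, min_similarity=0.2):
    """Repeatedly merge the two most similar groups until none are similar enough."""
    docs = [tokens(t) for t in texts]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), average_similarity(clusters[a], clusters[b], docs))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]

comments = [
    "the app crashes after the new update",
    "new update makes the app crash on startup",
    "shipping took three weeks to arrive",
    "my order took weeks to arrive, slow shipping",
]
print(cluster(comments))
```

No categories were defined in advance, yet the comments about the update and the comments about shipping end up in separate groups, each of which a user could then name and track.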
Overtone’s OpenMic is built upon the industry’s most advanced proprietary machine learning and clustering algorithms. The result is a statistics-based classification system that is fast, accurate, and inexpensive compared to manual classification and linguistic analysis approaches. OpenMic’s core text analysis engine reads and processes every incoming verbatim. This means that even if the statistical model of the concept is imperfect (e.g., most implementations produce 80% accuracy), it will still give a much better picture of what people are talking about than a limited sample. Furthermore, OpenMic is designed to conduct its own internal audit, enabling you to evaluate its effectiveness. Finally, unsupervised learning and clustering algorithms also allow the user to easily identify emerging issues, create new concepts based on new information, and to apply that concept not only to future incoming verbatim, but to historical data as well.
Key advantages of a statistical approach to text analysis
The Power of Machine Learning
As described, automatic extraction is based to a large extent on linguistic expertise, human intuition and experience. As valuable as that may be, it is no match for the statistical machine learning techniques used by Overtone’s OpenMic. Indeed, many of the clues to category membership are counterintuitive. For example, one of the categories examined in a comparative study conducted by Overtone was “legal issues.” The concept library for a legal issues category can be assembled to include all the usual suspects and their variants (e.g., “law suit,” “lawyer,” “brief,” etc.) as well as grammatical parsing rules.
However, when machine learning is applied, it quickly becomes clear that many clues exist which would likely never occur to the designer of an automated extraction system, even an expert. It turns out that two of the best clues to messages about legal issues are “sir” and “madam” – messages beginning with the phrase “Dear Sir or Madam” are far more likely to pertain to legal issues than not. The surprises cut both ways: some valuable clues wouldn’t occur to a human (like “Dear Sir or Madam”), while other clues which human intuition would strongly indicate turn out to degrade classification accuracy.
In terms of sentiment analysis, linguistics-oriented solutions perform tagging functions based upon universal libraries of positive and negative words. These “one size fits all” libraries offer the lowest level of accuracy, and are no match for a brand-specific set of Positive and Negative Tone categories trained by machine learning. OpenMic leverages machine learning to understand positive and negative expressions and tone within a specific category, product, or industry (e.g., recognizing that words like “sick” or “bad” are positive in some contexts and negative in others).
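One simple way to surface such clues is to rank words by how much more often they appear in messages from the category than in everything else. The sketch below uses a smoothed log-odds score, which is an assumption chosen for illustration, not a description of Overtone's method. The six messages are invented toy data, not data from the study.

```python
import math
import re
from collections import Counter

def top_clues(texts, labels, target, n=4):
    """Rank words by smoothed log-odds of appearing in `target` messages vs. all others."""
    in_docs, out_docs = Counter(), Counter()
    n_in = n_out = 0
    for text, label in zip(texts, labels):
        words = set(re.findall(r"[a-z']+", text.lower()))
        if label == target:
            in_docs.update(words)
            n_in += 1
        else:
            out_docs.update(words)
            n_out += 1

    def log_odds(word):
        p_in = (in_docs[word] + 1) / (n_in + 2)
        p_out = (out_docs[word] + 1) / (n_out + 2)
        return math.log(p_in / p_out)

    vocab = set(in_docs) | set(out_docs)
    # Highest score first; ties broken alphabetically so the ranking is stable.
    return sorted(vocab, key=lambda w: (-log_odds(w), w))[:n]

# Invented toy messages, for illustration only.
messages = [
    ("Dear Sir or Madam, I am writing regarding a breach of contract", "legal"),
    ("Dear Sir or Madam, please forward this notice to your counsel", "legal"),
    ("Dear Sir or Madam, I dispute the charges on my account", "legal"),
    ("hey, my account login is broken", "other"),
    ("please help, the charges on my bill look wrong", "other"),
    ("love the new phone, thanks", "other"),
]
texts, labels = zip(*messages)
print(top_clues(texts, labels, "legal"))
```

In this toy set the salutation words outrank obviously legal vocabulary such as "contract" or "counsel", because they appear in every legal message and in none of the others.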
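The difference between a universal word list and tone learned from a specific domain can be sketched as follows. The generic lexicon, the `learn_polarity` function, and the snowboard reviews are all hypothetical, and the scoring (positive mentions minus negative mentions) is a deliberately crude stand-in for a trained tone model.

```python
import re
from collections import Counter

# Hypothetical "one size fits all" lexicon of negative words.
UNIVERSAL_NEGATIVE = {"sick", "bad", "terrible"}

def learn_polarity(examples):
    """Score each word as (positive comments containing it) minus (negative ones)."""
    pos, neg = Counter(), Counter()
    for text, tone in examples:
        words = set(re.findall(r"[a-z']+", text.lower()))
        (pos if tone == "positive" else neg).update(words)
    return {word: pos[word] - neg[word] for word in set(pos) | set(neg)}

# Invented, hand-labeled comments from one product category.
snowboard_reviews = [
    ("that board is sick, best ride ever", "positive"),
    ("sick graphics and great edge hold", "positive"),
    ("bindings broke on day one", "negative"),
]
polarity = learn_polarity(snowboard_reviews)

print("Universal lexicon says 'sick' is negative:", "sick" in UNIVERSAL_NEGATIVE)
print("Learned score for 'sick' in this category:", polarity["sick"])
```

The generic list flags "sick" as negative, while the score learned from this category's own comments is positive.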
Theory and Practice
While it’s true that text analysis methods have been around for decades, most such algorithms have not been available for practical business applications. This is because an algorithm without an interface requires prohibitive expense and expertise to implement. For example, programs for many cutting-edge algorithms in text categorization are freely available for download from university research departments. Such programs are fine for research purposes, but poorly suited to business application. To implement them, you’d be well advised to hire a statistician and/or computer scientist, and when all was said and done, you’d end up with a system that did one task well, but utterly failed to interface efficiently with the rest of your information systems.
Some large corporations have taken the next step by providing an interface and incorporating them into their analytical software. Even so, they usually require considerable expertise to be put to most effective use; the complexity of their implementation and operation makes their text-analysis solutions impractical for all but the largest companies. And, most of them rely on taxonomy and clustering rather than statistically-based machine learning; practical experience shows that the concepts which emerge from automated extraction and cluster analysis alone often don’t meet the needs of business decision-makers.
Overtone’s OpenMic combines state-of-the-art text categorization, a user interface so simple anyone can use it, and a suite of analytical and insight creation tools to enable you to put the information to good use. OpenMic delivers:
