Sentiment Analysis: When Machines Can Beat Humans

Guest blogger Dr. Taras Zagibalov of Brandwatch addresses criticisms leveled at automatic sentiment analysis used for social media monitoring. Humans, he posits, often struggle to determine the sentiment of a piece of text because, just like the machine, they do not have the relevant knowledge available. Have current developments in machine analysis improved the process when compared with human analysis? And what of its own, inherent limitations? One perspective — after the jump.

Dr. Taras Zagibalov holds a doctorate in Informatics and heads the natural language processing research team at social media monitoring company Brandwatch. His work centers on improving real-time sentiment analysis to deliver accurate and up-to-date information for brands.

It’s not hard to find criticism of automatic sentiment analysis. Many of the most persuasive examples focus on illustrating how poor machines are at understanding emotions expressed through the complexities of human language. In some ways, they’re right.


This is because these expressions are often only fully comprehensible with additional contextual and background information, which may not be available to the machine.

As humans, we might like to think we’re better than machines in this sense; that it’s always easy for us to accurately decide what is positive, what is negative and what is neutral. However, this isn’t always the case. Humans often struggle to determine the sentiment of a piece of text because, just like the machine, they do not have the relevant knowledge available. Two people with different perspectives, knowledge, life experiences and frames of reference can analyse the same text and reach very different conclusions.

Obstacles to both humans and machines

Here are some examples of issues that face humans, as well as machines:

“It has grown by 10%”

Is this good or bad? Firstly of course, the answer depends entirely on what “it” is (for instance, income or unemployment) and secondly what we know about growth in that context; is 10% a good or bad amount to grow by? Is growth a good thing at all? Ambiguities like this are not rare; it is extremely common that, to be analysed accurately, pieces of text require some expertise or knowledge that is not commonly possessed.

“The delivery was good”

An academic study showed that, in the context of eBay user feedback, the word ‘good’ is in fact a slight indicator of negativity. Someone without much online selling experience may conclude that the above is positive while the same review may upset a seasoned eBay seller. Similarly, for an ultra-luxury brand ‘good’ might not be good enough.

“The price has dropped, it’s really cheap now”

A final example to illustrate the perspective-dependent nature of any sentiment analysis – the above may be good news for those interested in buying the product, but shareholders of the company selling it will be less pleased about the implications of the statement.

Examples like the above are often mistakenly cited solely as obstacles to automatic sentiment analysis, when really they are just as applicable to human analysis. To perform accurately and judge statements like these correctly, humans need to be fully informed of the related context, background, standards and so on – machines are no different.

Human-specific issues: time, boredom, concentration

So, there are many cases where humans may find it difficult to agree on what sentiment a text has, and the situation gets worse when they are required to make quick decisions while processing large amounts of data. Though we might not like to admit it, humans often get tired, bored or annoyed by the work they do. Humans are not “designed” for doing monotonous work. We are only good at consistent and accurate evaluation of textual content for limited sessions and in the right environment.

Our studies on inter-annotator agreement have shown how difficult it is to get a high level of agreement between two humans (yes, just two). I witnessed less than 30% agreement on sentiment annotation on a not-very-large dataset annotated by trained and educated native speakers. I also remember a triple-annotated corpus used in one academic workshop in which one of the annotators seemed to be simply pressing the same button all the time. Perhaps they were bored or unable to concentrate, or perhaps they were trying to complete the task as quickly as possible – the manual nature of human analysis means time is a particularly significant factor. Not only does the analysis take a long time, but pressure to hit targets and complete tasks in certain timeframes may seriously affect accuracy when work is rushed.
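To make the idea of inter-annotator agreement concrete, here is a minimal sketch of two common measures: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The article does not say which measure the figures above refer to, so the choice of kappa, the function names and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy annotations, not data from any real study.
annotator_1 = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos"]
annotator_2 = ["pos", "neu", "neu", "neg", "neu", "neg", "pos", "pos"]

print(percent_agreement(annotator_1, annotator_2))       # 0.5
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.238

# An annotator who "presses the same button all the time":
# raw agreement looks moderate, but kappa drops to zero.
same_button = ["neu", "neu", "neu", "neu"]
attentive = ["pos", "neu", "neg", "neu"]
print(percent_agreement(same_button, attentive))  # 0.5
print(cohens_kappa(same_button, attentive))       # 0.0
```

The last pair shows why a chance-corrected measure is useful: an annotator who always picks the same label can still match a colleague half the time, yet contributes no real agreement.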

Where machines can help

What about machines? They never get bored or tired, and they never lose interest in their job; we know that. And when it comes to time, the speed of machines adds a dimension that is entirely unattainable with human analysis on its own. But still, they aren’t much use if they continuously produce inaccurate analysis. Can they understand if 10% growth is good or bad? Can they understand if “good” is actually “bad” in a certain context? Actually, yes they can. The secret of effective automatic sentiment analysis lies in understanding its danger areas: domain-dependency and time-dependency.

Domain-dependency means that a classifier (or ‘machine’) designed to classify reviews of teapots will not perform well on political debates, or even on reviews of teacups. Time-dependency refers to when a classifier becomes ineffective after a certain period of time: the language or topics might have changed so that the classifier doesn’t “understand” the data in the same way as before. For example, a classifier that was trained during mass recalls of Toyota cars because of a technical fault will not understand what’s going on several months later with Toyota when the data is engulfed by news reports relating to the tsunami.

How to deal with the issues

For an automatic sentiment classification system to be accurate it must handle domain-dependency and time-dependency by being kept specific and current. We should not rely on generic or one-size-fits-all sentiment classifiers but instead use tailored and up-to-date ones.

There are two ways sentiment analysis systems can be built and maintained – one is based on knowledge/linguistic resources and the other on machine learning (see this Brandwatch blog post on this distinction). Most social media monitoring tools currently use systems of the first kind, and there are various criteria that need to be met for them to perform accurately. Firstly, such a system needs to be made very domain-specific. For example, ambiguous words that are not good indicators of sentiment in the domain need to be filtered out, or their sentiment weights manually edited (e.g. “good” should perhaps have a slightly negative weight for accurate analysis of eBay reviews). It is also important to include domain-specific sentiment markers (e.g. “predictable” is negative in book reviews, as in “predictable ending”, but positive in car reviews, as in “predictable steering”). Finally, these lists and archives need to be checked regularly to ensure they remain valid and up-to-date.
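As a rough illustration of the knowledge-based approach, the sketch below scores text against a word list whose weights can be overridden per domain. The lexicon entries, weights and domain names are invented for this example and do not come from any real tool; an actual system would use far larger resources and handle negation, phrases and so on.

```python
import re

# Hypothetical general-purpose weights; real lexicons are far larger.
BASE_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.5, "predictable": 0.0}

# Hypothetical per-domain adjustments, following the examples in the text.
DOMAIN_OVERRIDES = {
    "ebay_feedback": {"good": -0.25},
    "book_reviews": {"predictable": -1.0},
    "car_reviews": {"predictable": 1.0},
}


def build_lexicon(domain):
    """Start from the base weights and apply the domain's manual edits."""
    lexicon = dict(BASE_LEXICON)
    lexicon.update(DOMAIN_OVERRIDES.get(domain, {}))
    return lexicon


def score(text, domain):
    """Sum the weights of all known words in the text."""
    lexicon = build_lexicon(domain)
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def label(text, domain):
    """Map the numeric score to positive, negative or neutral."""
    s = score(text, domain)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


print(label("The delivery was good", "general"))        # positive
print(label("The delivery was good", "ebay_feedback"))  # negative
print(label("Predictable ending", "book_reviews"))      # negative
print(label("Predictable steering", "car_reviews"))     # positive
```

Every entry in the override table is a manual decision that somebody has to make and revisit, and that is where the maintenance cost of this approach comes from.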

Sounds quite tedious and labour-intensive, doesn’t it? That’s why, at Brandwatch, we use a system based on machine learning. The maintenance required by this kind of system is much simpler and can be done far more efficiently: it mainly consists of annotating fresh data, either to retrain and update current classifiers or to train new ones in new domains.
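To show what "annotating fresh data to update a classifier" can look like in practice, here is a minimal sketch using a small Naive Bayes classifier written from scratch. The article does not describe which learning algorithm Brandwatch uses, so Naive Bayes is only a stand-in, and the class name, method names and training sentences are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def update(self, annotated):
        """Fold freshly annotated (text, label) pairs into the counts."""
        for text, label in annotated:
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in any training data
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesSentiment()
clf.update([
    ("good product, works well", "positive"),
    ("good value and fast", "positive"),
    ("broken on arrival", "negative"),
    ("never arrived, awful", "negative"),
])
print(clf.predict("The delivery was good"))  # positive

# Fresh annotations from a domain where "good" is faint praise.
clf.update([
    ("delivery was good", "negative"),
    ("good, i suppose", "negative"),
    ("item was good but slow delivery", "negative"),
    ("excellent seller, perfect delivery", "positive"),
])
print(clf.predict("The delivery was good"))  # negative
```

Nobody edits a weight by hand here: the change in behaviour comes entirely from the newly annotated examples, which is the maintenance advantage described above.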

Keeping the classifiers up to date and as domain-specific as possible in this way (at Brandwatch we have classifiers in 500 domains/industries) is the best way to achieve accurate automatic sentiment classification. The core question facing operators of machine learning sentiment systems is when to update the classifier. This is not a trivial question; such maintenance requires a lot of time and money and there is no one rule that would work for all domains. Is it possible to automate this? Yes, and that’s what we are developing now at Brandwatch: a system that tracks changes in the input data so that it can notify us when a classifier is becoming ineffective. This system will also be capable of looking at a query and automatically choosing the most appropriate classifier (of the many available) for the automatic sentiment analysis process, based on the phrases used in the query – currently in Brandwatch this is a manual choice.
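The system under development is not described in any detail here, so the following is only a sketch of the general idea, using two deliberately simple heuristics that I am assuming for illustration. Input change is measured as the share of recent words the classifier never saw in training, and a classifier is picked by keyword overlap with the query. The function names, the 0.5 threshold and the keyword sets are all hypothetical.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def unseen_token_rate(training_texts, recent_texts):
    """Share of tokens in recent data that never occurred in the training data."""
    known = {tok for text in training_texts for tok in tokenize(text)}
    recent = [tok for text in recent_texts for tok in tokenize(text)]
    if not recent:
        return 0.0
    return sum(tok not in known for tok in recent) / len(recent)


def needs_update(training_texts, recent_texts, threshold=0.5):
    """Flag a classifier when too much of the incoming vocabulary is new."""
    return unseen_token_rate(training_texts, recent_texts) > threshold


def choose_classifier(query, domain_keywords):
    """Pick the domain whose keywords overlap most with the query, if any."""
    if not domain_keywords:
        return None
    query_tokens = set(tokenize(query))
    overlaps = {d: len(query_tokens & kws) for d, kws in domain_keywords.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


# Invented snippets echoing the Toyota example earlier in the article.
training = ["toyota recall over faulty pedal", "recall hits toyota sales"]
same_period = ["toyota recall widens", "faulty pedal recall"]
months_later = ["toyota plants closed after tsunami",
                "tsunami disrupts toyota supply chain"]

print(needs_update(training, same_period))          # False
print(unseen_token_rate(training, months_later))    # 0.8
print(needs_update(training, months_later))         # True

domains = {"automotive": {"toyota", "lexus", "car"}, "books": {"novel", "author"}}
print(choose_classifier("toyota OR lexus recall", domains))  # automotive
```

A production system would need a sturdier drift measure and a sensible threshold per domain, but even this crude check separates the recall-era data from the post-tsunami data.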

From what I’ve seen, a lot of the cynicism towards automatic sentiment analysis originates from a feeling that social media monitoring companies overpromise on it, exaggerating its accuracy and the extent to which it can replace human analysis as a generic tool that is capable of processing almost any data at any time. Automatic sentiment analysis is a tool, and like any tool it has its limitations and best-practice methods; tuning and maintenance are required to make the most of it.

Of course it’s not a replacement for human sentiment analysis, just like a printer doesn’t replace an artist although both produce pictures. It should never be claimed that automatic analysis can be solely depended on or that human analysis is no longer necessary; automatic analysis in its current state should be thought of as a supplement to human analysis rather than a substitute. The reason for this is that both forms have their advantages and disadvantages; hopefully this article has gone some way to clarifying some of them.

Related discussions can be found on Why You Aren’t Getting What You Need Out of Social Media Monitoring and the five-part series on Sentiment Analysis and Social Media Marketing. Social Times recently profiled Brandwatch and several other social media monitoring solutions in our “What You Should Know” series.