Policing hate speech is something nearly every online communication platform struggles with. Because to police it, you must detect it; and to detect it, you must understand it. Hatebase is a company that has made understanding hate speech its primary mission, and it provides that understanding as a service — an increasingly valuable one.
Essentially, Hatebase analyzes language use on the web, structures and contextualizes the resulting data, and sells (or provides) the database to companies and researchers that lack the expertise to build it themselves.
The Canadian company, a small but growing operation, emerged out of research at the Sentinel Project into predicting and preventing atrocities based on analyzing the language used in a conflict-ridden region.
“What Sentinel discovered was that hate speech tends to precede escalation of these conflicts,” explained Timothy Quinn, founder and CEO of Hatebase. “I partnered with them to build Hatebase as a pilot project — basically a lexicon of multilingual hate speech. What surprised us was that a lot of other NGOs [non-governmental organizations] started using our data for the same purpose. Then we started getting a lot of commercial entities using our data. So last year we decided to spin it out as a startup.”
You might be thinking, “what’s so hard about detecting a handful of ethnic slurs and hateful phrases?” And sure, anyone can tell you (perhaps reluctantly) the most common slurs and offensive things to say, in their language… that they know of. But there’s much more to hate speech than just a couple of ugly words. It’s an entire genre of slang, and the slang of a single language would fill a dictionary. What about the slang of all languages?
A shifting lexicon
As Victor Hugo pointed out in Les Misérables, slang (or “argot” in French) is the most mutable part of any language. These words can be “solitary, barbarous, sometimes hideous words… Argot, being the idiom of corruption, is easily corrupted. Moreover, as it always seeks disguise so soon as it perceives it is understood, it transforms itself.”
Not only are slang and hate speech voluminous, but they are ever-shifting. So the task of cataloguing them is a continuous one.
Hatebase uses a combination of human and automated processes to scrape the public web for uses of hate-related terms. “We go out to a bunch of sources — the biggest, as you might imagine, is Twitter — and we pull it all in and turn it over to Hatebrain. It’s a natural language program that goes through the post and returns true, false, or unknown.”
True means it’s pretty sure it’s hate speech; as you can imagine, there are plenty of examples of this. False means no, of course. And unknown means it can’t be sure; perhaps it’s sarcasm, or academic chatter about a phrase, or someone using a …
Read more here: https://techcrunch.com/social/