Thursday, September 30, 2010

Collecting the Dirty Words

We have collected dirty words and short phrases from several different sources and in several different languages to add to our database. These were then annotated manually with various types of information. The original intended use was humor generation and humor recognition in English and Japanese, so these two languages received the most focus.
The single largest source of dirty words was a list collected by George Carlin, containing about 2,400 dirty word expressions in English. Most of these are euphemisms, tending towards joke-like expressions, for example “trouser anaconda”. For Japanese, we extracted all words in the EDICT dictionary (Breen, 1995) marked with the “vulgar” flag, and also added various short lists of dirty words found on the Internet. We also had several native speakers of Japanese write down as many dirty words as they could come up with while looking at the other words in the list. We also found useful information in the Alternative Dictionaries, the Swearsaurus, and Wikicurse, which are collections of “bad words” in many languages. These resources contain many bad words in other languages that we have not added to our database, mainly because we lack native speakers to check whether the words are really of the kind we want. These could of course be added later. Once collected, the dirty words were annotated by hand with different types of information. Not all words are annotated with all types of information yet. Annotation regarding the meaning, nuance, and ambiguity of a word or phrase is possible.
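
The EDICT extraction step mentioned above could be sketched roughly as follows. This is only a minimal illustration, not the script actually used. It assumes the plain-text EDICT layout of one entry per line (`HEADWORD [READING] /gloss/gloss/`) and assumes that the “vulgar” flag appears as a `vulg` tag inside parentheses in the gloss field. The function names and the sample lines are made up for the example.

```python
import re

# Assumed EDICT line layout: HEADWORD [READING] /gloss 1/gloss 2/
# The bracketed reading is optional (kana-only entries omit it).
ENTRY_RE = re.compile(
    r"^(?P<headword>\S+)\s+(?:\[(?P<reading>[^\]]+)\]\s+)?/(?P<glosses>.*)/\s*$"
)

# Matches parenthesised tag groups such as "(n)" or "(n,vulg)".
TAG_GROUP_RE = re.compile(r"\(([^()]*)\)")


def is_vulgar(glosses):
    """Return True if any parenthesised tag group contains the tag 'vulg'.

    Assumption: the "vulgar" flag is written as the tag 'vulg'.
    """
    for group in TAG_GROUP_RE.findall(glosses):
        tags = [tag.strip() for tag in group.split(",")]
        if "vulg" in tags:
            return True
    return False


def extract_vulgar(lines):
    """Collect (headword, reading) pairs for entries flagged as vulgar."""
    results = []
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        if is_vulgar(match.group("glosses")):
            headword = match.group("headword")
            # Kana-only entries have no separate reading; reuse the headword.
            reading = match.group("reading") or headword
            results.append((headword, reading))
    return results


# Made-up lines in the assumed layout, for illustration only.
sample_lines = [
    "くそ [くそ] /(int) (vulg) damn/",
    "猫 [ねこ] /(n) cat/",
    "あほ /(n,vulg) fool/",
]

vulgar_entries = extract_vulgar(sample_lines)
```

Working on an iterable of already-decoded lines keeps the sketch independent of how the dictionary file is stored; reading and decoding the real file would be a separate step.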
