More Google Books OCR errors: Of Arms and the Anus

If you’re a little sensitive to NSFW language, you might want to stop now. Apparently, in a number of otherwise innocent books, Google’s OCR is reading “arms” as “anus.” The romance blogger Sarah Wendell of Smart Bitches, Trashy Books (direct links don’t seem to work) pointed out the OCR problem in a tweet: “So if the text is old, and it says ‘arms’, the OCR scanner will see it as ‘anus.’ OMG.”

The Guardian’s Alison Flood picked up the story on May 1, 2014 and compiled some of the more… disturbing errors. She then searched for the phrase “wound her anus” in Google Book Search and turned up “a wealth of other examples.”

At Melville House, director of marketing Zeljka Marosevic quotes the Guardian piece and remarks, “Parents should keep their children away from the ebook edition of the 1882 children’s book Sunday Reading for the Young. It all seems perfectly innocent until… ‘Little Milly wound her anus lovingly around Mrs Green’s neck and begged her to make her home with them. At first Mrs Green hesitated.’ That’s technology for you, always making an ass out of someone.”

Mainstream blogger Andrew Sullivan picked up the story on May 3, with an elbow and a** headline, so the story may gain some traction as oddball news. Yes, this finding is amusing because it is so egregious and transgressive, but it won’t really change the fact that automated OCR has a history of trouble with pre-twentieth-century typefaces and printing. Nineteenth-century book printing should not produce such a problem. (Arguably, nineteenth-century newspapers might require more human intervention; I don’t think they are so easily scanned.) Eventually Google may seek to fix these errors because of the publicity, but I wonder how errors like this one have affected, and will continue to affect, research by digital humanists.

(hat tip: Andrew Sullivan)




By Paul Romaine

Paul Romaine is a grant writer and independent curator in New York City.
