Optical character recognition, or OCR, is an amazingly powerful tool for processing character-based text such as Chinese. This technology allows you to take an image containing text, such as a scanned medical record, and convert it into “live” text that can be imported into a word processor, copied, cut, pasted, and otherwise manipulated.

Without OCR, foreign-language texts in image-based formats, such as scanned PDF (Adobe Acrobat) files, would need to be hand-keyed into a word processor. A translator could only look up words by retyping them into an electronic dictionary or finding them in a hard-copy dictionary. Chinese presents an additional challenge: to type or look up a character, you often need to know its pronunciation, and that is often exactly what you do not know when you are trying to look it up. Catch-22.

While OCR is a tremendous time- and labor-saver, it can also lead to some hilarious (and potentially dangerous) gaffes. An OCR program must be configured for the language it is recognizing so that it can match the graphic input to the written symbols of that language. For example, if you run Japanese through an OCR system set to recognize Chinese, it will map some of the characters correctly, but the Japanese syllabic symbols (kana) that do not exist in Chinese will show up as Chinese gibberish as the system tries to “retrofit” them to Chinese.

I recently ran a scientific journal article through OCR software set to recognize Chinese and English, and the result was laughable. This article discussed the “p-value,” which is a statistical concept for determining whether a difference between data sets is statistically significant. Like English papers, Chinese papers use the letter “p” to designate this value. While the OCR system was set to recognize Chinese and English, it did not “know” that the input “p” was an English letter and instead mapped it to the closest Chinese character: “尸.” This character means “corpse.” Thus, the translation read “Corpse < 0.05 is regarded as statistically significant,” and “The difference between the 2 groups was not significant (cadaver = 0.6).”
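
A gaffe like this one is regular enough to catch with a simple check on the OCR output before it goes any further. The following is only a minimal sketch, not a feature of any OCR package: the function name `fix_p_values` and the pattern are my own illustrative assumptions. It assumes that “尸” followed directly by a comparison sign and a number is really a misread “p,” and it leaves the character alone everywhere else.

```python
import re

# Assumption for illustration: "尸" immediately followed by a comparison
# operator (ASCII or full-width) and a number is a misrecognized "p".
SUSPECT_P = re.compile(r"尸(?=\s*[<>=≤≥＜＞＝]\s*\d*[.．]?\d)")


def fix_p_values(text: str) -> tuple[str, int]:
    """Replace suspected misreads of "p" and report how many were changed."""
    # subn returns the corrected text together with the replacement count,
    # so a human reviewer knows how many spots to double-check.
    return SUSPECT_P.subn("p", text)


if __name__ == "__main__":
    corrected, count = fix_p_values("尸<0.05为差异有统计学意义")
    print(corrected, count)
```

A check like this only flags one known confusion, so it supplements proofreading against the source image rather than replacing it.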

While this example is humorous (if perhaps a bit morbid), it would be deeply embarrassing if it appeared in the translation of a professional journal article. At any rate, I got a chuckle from it. OCR can save a great deal of time and money, but please practice it responsibly!