Optical character recognition (OCR), or the conversion of images of handwritten or printed text into machine-readable text, is a science that dates back to the early ’70s. But algorithms have long struggled to make out text that doesn’t run along a straight, horizontal line, which is why researchers at Amazon developed what they call TextTubes. They’re detectors for curved text in natural images that model each text instance as a tube around its medial (middle) axis, and in a paper describing their work, the coauthors claim that their approach achieves state-of-the-art results on a popular OCR benchmark.
As the researchers explain, reading scene text is typically broken down into two successive tasks: text detection and text recognition. The first involves localizing characters, words, and lines using contextual clues, while the second aims to transcribe their content to the extent that it’s possible. Both are easier said than done: text in the wild is affected not only by deformations, but also by viewpoint changes and arbitrary fonts.
The team’s solution is a “tube” representation of the text reference frame that captures most of the variability, taking advantage of the fact that target text is usually a concatenation of characters of similar size. It’s formulated as a mathematical function that enables the training of machine learning scene text detectors, in contrast to traditional approaches that use overlap- and noise-prone rectangles and quadrilaterals to capture text information.
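To make the tube idea concrete, here is a minimal Python sketch of the geometry, not the paper’s actual parametrization or training objective. It assumes the medial axis is given as a list of (x, y) points and that a single average radius applies along the whole instance; the function name `tube_polygon` and both of its parameters are hypothetical, chosen only for illustration.

```python
import math


def tube_polygon(axis, radius):
    """Outline a tube of constant radius around a polyline medial axis.

    `axis` is a list of (x, y) points and `radius` is the half-thickness.
    Both are illustrative assumptions, not the paper's formulation.
    """
    if len(axis) < 2:
        raise ValueError("medial axis needs at least two points")
    upper, lower = [], []
    last = len(axis) - 1
    for i, (x, y) in enumerate(axis):
        # Tangent estimated from the neighbouring points: a central
        # difference in the interior, a one-sided difference at the ends.
        x0, y0 = axis[max(i - 1, 0)]
        x1, y1 = axis[min(i + 1, last)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty)
        if norm == 0:
            raise ValueError("neighbouring axis points must not coincide")
        # Unit normal: the tangent rotated by 90 degrees.
        nx, ny = -ty / norm, tx / norm
        upper.append((x + radius * nx, y + radius * ny))
        lower.append((x - radius * nx, y - radius * ny))
    # Walk along one side, then back along the other, to close the outline.
    return upper + lower[::-1]


# A straight three-point axis with radius 0.5 yields a 2 x 1 rectangle.
outline = tube_polygon([(0, 0), (1, 0), (2, 0)], 0.5)
```

A curved axis produces an outline that bends with the text, which is the intuition behind preferring tubes to axis-aligned rectangles or quadrilaterals that overlap neighboring instances and take in background noise.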
TextTubes’ performance was evaluated on two data sets. CTW-1500 consists of 1,500 images collected from natural scenes and image libraries, with over 10,000 text instances and at least one curved instance per image. Total-Text contains 1,255 training images and 300 test images with one or more curved text instances. The researchers report that they achieved industry-leading results with 83.65% accuracy on CTW-1500, compared with the closest method’s 75.6% accuracy.
“Modeling an instance’s medial axis and average radius … captures information about the instance overall,” wrote the paper’s coauthors. “On datasets that consist of individual words, such as Total-Text, our model is able to achieve state-of-the-art performance. On datasets that have line-level annotations, such as CTW-1500, our model is able to better capture textual information along an instance’s separate words.”
Assuming TextTubes makes its way into production someday, it could be a boon for enterprises that rely heavily on OCR to conduct business. It’s estimated that paper remains in over 80% of digital processes; roughly 97% of small businesses still use paper checks. That’s perhaps why the OCR solutions market is anticipated to be worth $13.38 billion by 2025, according to Grand View Research.
Author: Kyle Wiggers
Source: VentureBeat