Recognizing Chinese Characters in Scene Text: 2012

Wednesday, June 6, 2012

Tuesday, June 5, 2012

Project Recap

You can view a summary of the project in this presentation.

Monday, June 4, 2012

Trying a New Technique for Generating Features

I'm trying to implement the unsupervised learning algorithm in the following paper:
Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning

Wednesday, May 23, 2012

One question I've been pursuing relates to the limits of the resolution / fidelity of the HOG as the feature vector for my classifier. For example, what kinds of detail can it represent, and for which values of its parameters (spatial bin size, number of orientation bins, size of training image as a multiple of the spatial bin size)?

I created a simple dataset featuring three classes shown below. I assumed that the detector would perform fine for circles and squares. I was curious primarily about how it would perform in distinguishing squares from nested squares.

Geometric shapes class exemplars

Geometric shapes test example image

The following images show some representative detection results.

As expected, the detector seems to perform well distinguishing circles from squares (but without perfect recall even in this simple case - see the third image). Somewhat as expected, it does not perform well distinguishing squares from nested squares.

I'm curious about which characteristics explain the detection of nested squares as squares in most cases, but as a nested square in the one case.

The figures below show the HOGs for each of the 3 classes at two spatial bin sizes.

HOGs for Circle, Square, and Nested Squares at sBin = 8 and sBin = 6

As one would expect, the HOG for a circle is visually very distinct from those of squares and nested squares. The HOGs for squares and nested squares are much more similar to each other, but they are still visibly distinct. My intuition is that the differences should matter statistically given enough samples, so is it possible that ~1000 training examples are not sufficient for the random ferns classifier? On the other hand, a nested square is in fact 2 squares at slightly different scales, so there is an argument that the correct behavior for a scale-insensitive detector would be to detect 2 squares as well as a single nested square. In this case, some downstream component would have to distinguish between the 2 detected squares and the single nested square.

Saturday, May 19, 2012

Tuning the HOGs

The following images show: the Chinese character 一 (pinyin: yi1), the HOG for that character with spatial bin size = 6, and the HOG for that character with spatial bin size = 8. There is detail evident with sBin = 6 that is not visible with sBin = 8. I think we can conclude that sBin is important in capturing certain visual details.

一 (yi1)

HOG: sBin = 6, oBin = 8

HOG: sBin = 8, oBin = 8

Note a side effect of reducing sBin to 6: the top and bottom rows of HOGs appear to have no information in them.

With sBin = 6, 1000 known negative images, and 1000 hard negatives, the detector performance is given in the following graph. Note that there is a range of thresholds that gives perfect precision and recall! Before getting too excited, look at the following image showing the detection results.

So, although the detector has fired on the horizontal lines, there's a question about resolution/fidelity of the bounding boxes, scale, and sensitivity. Why do the bounding boxes exhibit so much variability in their size and location relative to the actual detected pixels? Why does the detector fire at multiple scales? Why is the selection threshold that works for this sample so different from the one that works for the other data sets? At the threshold that works for this example, the previous data set would have extremely poor precision.

Wednesday, May 16, 2012

Increasing the Number of Trained Characters

In the previous runs, we had trained 11 characters. In this run, we increased the number of trained characters to 25 by adding 14 of the 100 most common Chinese characters to the 11 used in the prior runs.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .57 at threshold = 160. This is a slightly lower max F-score than the case where only 11 characters are trained.

Tuesday, May 15, 2012

Increasing the Number of Orientation Bins

I increased the number of HOG orientation bins to 12 from 8.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .61 at threshold = 150. This is a slightly lower max F-score than using the original 8 orientation bins.

Increasing the Size of the Target Window

The detector works by sliding a target window over the image being searched. For each target window, the feature vector is computed. The computed feature vector is classified using the random ferns classifier. In this case, the feature vector is the set of HOGs computed for sub-windows of the target window, concatenated into a single vector for the target window.
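The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code used in these experiments: the function names, the 8-pixel stride, and the assumption that the window is aligned to the HOG cell grid are all hypothetical, and scanning at multiple scales is left out.

```python
def sliding_windows(image_w, image_h, window=48, stride=8):
    """Yield (x, y, w, h) for every window that fits inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, window, window)


def window_feature(cell_grid, x, y, window, cell_size):
    """Concatenate the per-cell histograms covered by one window.

    cell_grid[cy][cx] is the histogram (a list of floats) for one cell.
    Assumes x, y, and window are multiples of cell_size (hypothetical).
    """
    cx0, cy0 = x // cell_size, y // cell_size
    cells_per_side = window // cell_size
    feature = []
    for cy in range(cy0, cy0 + cells_per_side):
        for cx in range(cx0, cx0 + cells_per_side):
            feature.extend(cell_grid[cy][cx])
    return feature
```

Each feature vector produced this way would then be handed to the classifier (here, the random ferns) to score the window.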

The original code used a 48x48 pixel window size. One hypothesis is that a larger window size might preserve more discriminative information. By increasing the number of points at which gradients are computed, the computed HOG may exhibit higher fidelity. I'm not convinced of this, but it's an easy thing to check empirically.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48 (this is the value used in the original paper). The best F-score is .69 at threshold = 150.

I changed the window size to 72x72, a 50% increase along each axis and a 125% increase in area (2.25 times the original). Without changing the HOG parameters, this makes the feature vector 2.25 times as large. A feature vector 2.25 times as large describing a region 2.25 times as large seems to net out to the same bits / area, so I remain dubious of this hypothesis.
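The arithmetic can be checked with a small sketch. It assumes the simplest possible layout, one histogram of oBin values per sBin x sBin cell with no block normalization or extra channels; the actual HOG implementation may produce a different absolute length, but the ratio between the two window sizes should be the same.

```python
def feature_length(window, s_bin, o_bin):
    """Length of the concatenated HOG vector under the simple layout above."""
    cells_per_side = window // s_bin
    return cells_per_side * cells_per_side * o_bin


small = feature_length(48, s_bin=8, o_bin=8)   # 6 * 6 * 8
large = feature_length(72, s_bin=8, o_bin=8)   # 9 * 9 * 8

size_ratio = large / small                     # growth of the feature vector
area_ratio = (72 * 72) / (48 * 48)             # growth of the window area

# Feature values per pixel of window area, before and after.
density_small = small / (48 * 48)
density_large = large / (72 * 72)
```

Under these assumptions both ratios come out to 2.25 and the density is unchanged, which is the "same bits / area" observation.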

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 72x72. The best F-score is .69 at threshold = 110.

So although the behavior of the detector is clearly changed by differing window sizes (that is, the precision-recall curve changes), the basic shape of the precision, recall, and F-score graphs is similar, and the maximum F-score is the same for both.

Wednesday, May 9, 2012

Challenges

Characters with Simple Geometry

There are some characters that have strong similarity to very basic features of most images. For example, the characters 一 , 二 , 三 , 十 all have very basic geometry. We will be on the lookout for techniques that can be successful for characters like these.

Scale Variation
Many ideographs and pictographs that comprise Chinese characters ( 汉字 ) are characters in their own right, and also appear as sub-components of other characters. The ideograph/pictograph reuse occurs at multiple scales. For example,

文 and 这

illustrate the issue at a single scale, while the characters

一 , 口, 寸, 肉 and 豆腐

illustrate the issue at multiple scales. My initial approach to this problem is to determine whether the detected bounding boxes actually nest the way they would with "perfect" recognition, and then whether a specific application of non-maximal suppression can correctly discard the nested character elements in favor of the larger character's bounding box.
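One way this nesting-aware suppression might look is sketched below. It is only an illustration of the idea: the box format (x, y, w, h), the function names, and the rule "drop any box fully contained in another detected box" are assumptions, and the rule ignores scores entirely, so a large false detection would wrongly suppress a correct small one.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer` (x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ox <= ix and oy <= iy and
            ix + iw <= ox + ow and iy + ih <= oy + oh)


def suppress_nested(detections):
    """Keep only detections whose box is not nested inside another box.

    detections: list of (box, label, score) tuples.
    Identical boxes do not suppress each other.
    """
    kept = []
    for i, (box, label, score) in enumerate(detections):
        nested = any(
            j != i and other != box and contains(other, box)
            for j, (other, _, _) in enumerate(detections)
        )
        if not nested:
            kept.append((box, label, score))
    return kept


# Example: a 文 detected inside the bounding box of 这 is discarded.
detections = [
    ((0, 0, 48, 48), "这", 0.9),
    ((20, 10, 20, 20), "文", 0.8),
    ((100, 0, 48, 48), "口", 0.7),
]
result = suppress_nested(detections)
```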

General Training Sets

The experiments I've run so far have used tiny training sets. A key result of this work should be to characterize the performance of the technique using training sets comprising several hundred to several thousand characters.

Preliminary Performance

Training Data

500 images per character = (100 images per font per character * 5 fonts). Rotation = random value in -PI/8 to +PI/8.
100 background images picked at random from a set of images known to have no Chinese characters.
100 background 'hard negative' images saved from bootstrap step.
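The positive-sample recipe above can be written down as a sampling plan. This sketch only enumerates (character, font, rotation) triples; it does not render any images, and the font names and the fixed random seed are placeholders rather than the ones actually used.

```python
import math
import random


def training_plan(characters, fonts, per_font=100,
                  max_angle=math.pi / 8, seed=0):
    """Return one (character, font, angle) triple per synthetic image."""
    rng = random.Random(seed)
    plan = []
    for ch in characters:
        for font in fonts:
            for _ in range(per_font):
                angle = rng.uniform(-max_angle, max_angle)  # radians
                plan.append((ch, font, angle))
    return plan


fonts = ["font1", "font2", "font3", "font4", "font5"]  # hypothetical names
plan = training_plan("向前", fonts)
per_character = sum(1 for ch, _, _ in plan if ch == "向")
```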

Set1: characters = { 向前一小步文明大英发服饰 }
Set1-yi4: characters = { 向前小步文明大英发服饰 } (Set1 without the character 一 )

Image Set

Ground Truth (character, count)

left image	right image
英,1 发,1 服,1 饰,1	前,1 一,2 小,1 步,2 文,1 明,1 大,1

Results (with training data Set1 - without 一)

Observations of these results include:

Multiple detections of patches that appear fairly flat and uniform to the naked eye.
All true characters in the right image are detected (along with a number of false detections), but only 1 of 4 true characters is detected in the left image.

Results (with training data Set1 - with 一 )

Observations of these results include:

Results are blown out by inclusion of 一 in the training set. It is a very hard class given its similarity to horizontal edges.

Wednesday, May 2, 2012

Improved Training

With the training upgraded as I described in the last post, I get the following results.

total # of characters present: 10
# of unique characters present: 9
# of characters trained: 8
total # of recognitions: 14
# of character instances correctly recognized: 8
# of false recognitions: 6

precision: 8 / 14 = .57
recall: 8 / 10 = .8
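For reference, the same numbers can be reproduced with a few lines of code, along with the F-score that the later posts report. The function name and signature are just for illustration; the F-score here is the standard F-beta, with beta = 1 weighting precision and recall equally.

```python
def precision_recall_f(true_pos, false_pos, false_neg, beta=1.0):
    """Return (precision, recall, F-beta) from raw detection counts."""
    detections = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detections if detections else 0.0
    recall = true_pos / actual if actual else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# 14 recognitions, 8 of them correct, 10 characters present in the image.
p, r, f = precision_recall_f(true_pos=8, false_pos=6, false_neg=2)
# p is about 0.57, r is 0.8, and the F1 score is about 0.67.
```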

The training set consists of 向前小步文明大.
Note that this case is still somewhat artificial. The training set did not include 一, because this character matches every horizontal edge in the image. This is a very hard case that I'm not sure how to handle. So for now I'm cheating and not including it in the training set. Note also that this case has no 'distractors' in the training set.

Monday, April 30, 2012

Improving the Training

No pictures this week (but I do have 210 words, or about 21% of a picture). In my initial attempts to reproduce the previous results, I had simplified the training of the classifier for various reasons. This week I have been focused on getting the training to be equivalent to the scheme used in the prior work. Specifically,

1. Added the step in which known negative images are sampled and patches collected to serve as the "background" or "NOT a character" class during training.
2. I'm still working on adding back the inclusion of "hard negative" cases. In this step, we run the detection against known negative images and record the false positives. The false positive patches are saved as additional training samples for the "NOT a character" class. Finally, the ferns are retrained using good character training images for each character (class), and both "easy" and "hard" negative images for the "NOT a character" class.
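The hard-negative step in item 2 can be sketched as follows. This is an outline under assumptions, not the prior work's implementation: `score_patch` stands in for whatever function scores a patch with the current classifier, and capping the number of saved patches with `limit` is a choice made here for illustration.

```python
def mine_hard_negatives(score_patch, negative_patches, threshold, limit):
    """Return up to `limit` known-negative patches the detector accepts.

    score_patch: hypothetical callable returning the detector score.
    negative_patches: patches cut from images with no characters in them.
    Any patch scoring at or above `threshold` is a false positive.
    """
    scored = [(score_patch(p), p) for p in negative_patches]
    false_positives = [(s, p) for s, p in scored if s >= threshold]
    # Most confident mistakes first; sort on the score only.
    false_positives.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in false_positives[:limit]]
```

The returned patches would be added to the "NOT a character" class alongside the randomly sampled "easy" negatives before retraining the ferns.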

The minor challenge in this step is determining the selection threshold used to select candidate bounding boxes. In the prior work, this value was determined by manual experimentation. I'm hoping to develop a routine to systematize the determination of this value, perhaps by optimizing the F-score to a user-specified bias for recall vs. precision.
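A simple version of such a routine might sweep candidate thresholds and keep the one with the best F-beta score, where beta expresses the bias (beta > 1 favors recall, beta < 1 favors precision). Everything here is a sketch: the input format is an assumption, and it presumes each true character is matched by at most one detection.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; returns 0.0 when undefined."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def best_threshold(scored, n_positives, thresholds, beta=1.0):
    """Return (threshold, F-beta) maximizing F-beta over `thresholds`.

    scored: list of (score, is_correct) pairs for candidate detections
    on a labeled validation set (hypothetical input format).
    n_positives: number of true characters in that set.
    """
    best = (None, -1.0)
    for t in thresholds:
        kept = [ok for score, ok in scored if score >= t]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / n_positives if n_positives else 0.0
        f = f_beta(precision, recall, beta)
        if f > best[1]:
            best = (t, f)
    return best


scored = [(200, True), (180, True), (150, False),
          (140, True), (100, False), (90, False)]
balanced = best_threshold(scored, 3, [100, 140, 160], beta=1.0)
precise = best_threshold(scored, 3, [100, 140, 160], beta=0.5)
```

In this toy data the balanced setting picks the middle threshold, while the precision-biased setting picks the highest one.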

Sunday, April 22, 2012

Detecting Multiple Hanzi in a Single Image

The following image shows the results of running the character detector on the sign13.jpg image with the classifier trained for 5 of the characters present in the image ( 向，前，小，大，文 ). Note that it fails to detect 文. Also, this set of bounding boxes is selected using a hand-picked threshold value. The next image shows that with another value for the threshold, the detector returns a lot of false detections, and a lot of noise (extra bounding boxes at different scales) for true detections.

So, it's apparent that I need to figure out both a) how to "tune" the detector and b) how to select the "best" candidate bounding box.

threshold hand-picked for good results

results with poorly chosen threshold

results with previous "poorly chosen" threshold, and NMS applied
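For readers unfamiliar with NMS (non-maximal suppression), here is a minimal sketch of the common greedy variant: keep the highest-scoring box, discard any remaining box that overlaps a kept box too much, and repeat. The box format (x, y, w, h) and the 0.5 overlap threshold are assumptions, and this is not necessarily the exact variant used to produce the image above.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs."""
    kept = []
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    for box, score in ordered:
        if all(iou(box, k) <= iou_threshold for k, _ in kept):
            kept.append((box, score))
    return kept


candidates = [
    ((0, 0, 48, 48), 0.9),
    ((4, 4, 48, 48), 0.8),      # heavy overlap with the first box
    ((100, 100, 48, 48), 0.7),  # no overlap
]
survivors = nms(candidates)
```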

Friday, April 20, 2012

Visualizing the HOGs

The histograms of oriented gradients are computed over the entire image by dividing the image into cells that are small relative to the image size. So you end up with a grid of HOGs covering the entire image. Variables in this step include: cell size, overlap between adjacent cells, number of spatial bins, number of orientation bins. The following image shows a visualization of the HOGs computed for the sign13.jpg image using the default values.
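To make the grid-of-histograms idea concrete, here is a stripped-down sketch in plain Python. It is not the toolbox implementation used for the visualization: it uses non-overlapping cells, unsigned orientations, hard binning, and no block normalization or interpolation, and all names and defaults are illustrative.

```python
import math


def cell_histograms(image, cell_size=8, n_orient=8):
    """Return grid[cy][cx] = orientation histogram for one cell.

    image: 2D list of grayscale values (rows of equal length).
    Gradients use central differences; border pixels are skipped.
    """
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell_size, cols // cell_size
    grid = [[[0.0] * n_orient for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned: [0, pi)
            b = min(int(angle / math.pi * n_orient), n_orient - 1)
            cy, cx = y // cell_size, x // cell_size
            if cy < n_cy and cx < n_cx:
                grid[cy][cx][b] += magnitude  # vote weighted by magnitude
    return grid
```

With this layout, a vertical edge puts all of its energy into the first orientation bin of the cells it passes through, which is the kind of structure the visualization makes visible.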

Friday, April 13, 2012

It's alive

To borrow from Shelley, "it's alive". The following image shows the first glimmer of the basic character detection working. The caveats are too numerous to enumerate, but after a week of wrestling with Matlab this is a small victory.

Sunday, April 8, 2012

Synthetic Character Training Images

I haven't figured out how to get MATLAB to display Unicode strings, despite a significant amount of research. I gave up and wrote a Java program to create the synthetic training images used with the Ferns classifier. The following image is an example.

Most Common Chinese Characters

One obvious challenge in recognizing Chinese characters (hanzi) and words in images is the tremendous number of characters in written Chinese. For now, I will be relying on work by Professor Jun Da at Middle Tennessee State University. In particular, the character frequency lists I'm using are here.

The frequency lists allow me to very simply use the top n most frequently occurring hanzi. One key desired outcome of this project is to see if I can characterize the performance of the recognizer in terms of n.

Wednesday, April 4, 2012

References

Histograms of Oriented Gradients is used for feature detection. This is the primary paper on HOGs. This presentation discusses HOGs.

A nice presentation on Random Forests and Ferns.

This post collects the references I've looked at for this project. So far, I'm only skimming the abstracts.

Video Character Recognition Through Hierarchical Classification
Text Detection in Natural Scene Images by Stroke Gabor Words
Enhanced Active Contour Method for Locating Text
A New Feature Optimization Method Based on Two-directional 2DLDA For Handwritten Chinese Character Recognition
Efficient Cut-off Threshold Estimation for Word Spotting Applications

Sample Images

The following images give an intuitive understanding of the problem.

A busy Chinese city street

Fumin Lu street sign in the French Concession

Fumin Lu street sign close up

A neighborhood tire dealer

Tire dealer sign close up

A sign at a tourist attraction

Circular text is beyond the scope of this project!

Tuesday, April 3, 2012

Tool Chain Setup and GrOCR code (Plex v1.02)

I had to wrestle with my Matlab environment for quite a while to get the Plex code running at all. It relies on 2 libraries (LIBSVM and Piotr Dollar's Matlab Toolbox). Neither would install or build for me without errors. I still do not have it running on OS X, but I've gotten Kai's "quick demo" working on Windows 7. I'm using Matlab R2010a. The next step is to see if I can get the walkthrough of the evaluation code working.

[Update - 4/8/2012] The problem on Mac OS X turned out to be an assumption by Matlab about the target version of the OS (e.g. 10.5 vs. 10.6) that editing a config file seemed to resolve. I also have been able to verify Kai's code on my Windows 7 + Matlab R2010a (32-bit Student Version) box to the point of running the demoSVT() script successfully.

Motivation - GrOCR at UCSD

I try to keep an eye on the website for the UCSD Computer Vision group. In early 2011, I became aware of Kai Wang's GrOCR project. During the summer of 2011 I lived in Shanghai for 3 months to begin studying Mandarin. As an aside, if you want to study Mandarin in Shanghai I highly recommend John Pasden's company AllSet Learning. I continue to study Mandarin at UCSD. My fascination with Chinese characters (hanzi) intersected my fascination with Kai's work, and this project was conceived.

Monday, April 2, 2012

Abstract

In this project I'm trying to recognize Chinese characters in scene text in unconstrained images. My initial approach is to apply the techniques of Wang, Babenko, and Belongie described in End-to-end Scene Text Recognition and Word Spotting in the Wild.

The project proposal is here.