The project final report is here.
Recognizing Chinese Characters in Scene Text
Wednesday, June 6, 2012
Tuesday, June 5, 2012
Monday, June 4, 2012
Trying a New Technique for Generating Features
I'm trying to implement the unsupervised learning algorithm in the following paper:
Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning
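As a first step, here is a rough sketch of the pipeline that paper describes (my own simplification, not the paper's actual code, and plain k-means standing in for their spherical k-means variant): sample random patches, normalize and whiten them, then learn a feature dictionary by clustering.

```python
import numpy as np

def learn_patch_features(images, patch_size=8, n_features=64, n_patches=10000, seed=0):
    """Sketch of unsupervised patch-based feature learning:
    extract random patches, normalize, ZCA-whiten, then k-means."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch_size)
        x = rng.integers(img.shape[1] - patch_size)
        patches.append(img[y:y + patch_size, x:x + patch_size].ravel())
    X = np.asarray(patches, dtype=float)
    # Per-patch brightness/contrast normalization
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
    # ZCA whitening with a small regularizer on the eigenvalues
    cov = np.cov(X, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + 0.1)) @ U.T
    Xw = X @ W
    # Plain k-means (the paper uses a spherical k-means variant)
    centroids = Xw[rng.choice(len(Xw), n_features, replace=False)]
    for _ in range(10):
        d = ((Xw ** 2).sum(1, keepdims=True)
             - 2 * Xw @ centroids.T
             + (centroids ** 2).sum(1))
        assign = d.argmin(1)
        for k in range(n_features):
            members = Xw[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids, W
```

The learned centroids then act as a filter bank: responses of image patches to the centroids become the features fed to the classifier.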
Wednesday, May 23, 2012
Geometric Shapes Test
I created a simple dataset featuring three classes shown below. I assumed that the detector would perform fine for circles and squares. I was curious primarily about how it would perform in distinguishing squares from nested squares.
The following images show some representative detection results.
As expected, the detector performs well at distinguishing circles from squares (though without perfect recall even in this simple case; see the third image). Also somewhat as expected, it does not perform well at distinguishing squares from nested squares.
[Figure: geometric shapes class exemplars]
[Figure: geometric shapes test example image]
[Figures: representative detection results]
I'm curious which characteristics explain why the nested squares are detected as squares in most cases, but as a nested square in the one case.
The figures below show the HOGs for each of the 3 classes at two spatial bin sizes.
[Figure: HOGs for each class (Circle, Square, Nested Squares) at sBin = 8 and sBin = 6]
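For reference, the HOG computation can be approximated with a minimal sketch (hypothetical code, not the actual feature extractor used in these runs). It shows how sBin sets the cell grid over which orientation histograms are pooled, which is why a smaller sBin preserves finer detail.

```python
import numpy as np

def hog(img, s_bin=8, n_orient=8):
    """Minimal HOG sketch: unsigned gradient-orientation histograms pooled
    over s_bin x s_bin cells. No block normalization; just enough to see
    how s_bin controls the cell grid."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientations in [0, pi)
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
    ny, nx = img.shape[0] // s_bin, img.shape[1] // s_bin
    H = np.zeros((ny, nx, n_orient))
    for cy in range(ny):
        for cx in range(nx):
            m = mag[cy * s_bin:(cy + 1) * s_bin, cx * s_bin:(cx + 1) * s_bin]
            b = bins[cy * s_bin:(cy + 1) * s_bin, cx * s_bin:(cx + 1) * s_bin]
            for o in range(n_orient):
                H[cy, cx, o] = m[b == o].sum()
    return H

def square(size=48, lo=8, hi=40):
    """Synthetic square outline like the test shapes above."""
    img = np.zeros((size, size))
    img[lo:hi, lo] = img[lo:hi, hi] = 1.0        # vertical edges
    img[lo, lo:hi + 1] = img[hi, lo:hi + 1] = 1.0  # horizontal edges
    return img
```

On a 48x48 window, sBin = 8 gives a 6x6 cell grid while sBin = 6 gives 8x8, so the smaller bin size has more cells over which to resolve structure.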
As one would expect, the HOG for a circle is visually very distinct from those of the square and nested squares. The HOGs for the square and nested squares are much more similar, but still visually distinguishable. My intuition is that the differences should matter statistically given enough samples, so this may be evidence that ~1000 training examples is not sufficient for the random ferns classifier. On the other hand, nested squares are in fact two squares at slightly different scales, so one could argue that for a scale-insensitive detector the correct behavior would be to detect two squares plus a single nested square, leaving some downstream component to distinguish the two detected squares from the single nested square.
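For context, a minimal random-ferns classifier along these lines might look as follows. This is a sketch under my own assumptions, not the detector's actual implementation: binary features here are random pairwise comparisons of feature-vector entries, and the ferns are combined semi-naive-Bayes style.

```python
import numpy as np

class RandomFerns:
    """Sketch of a random-fern classifier: each fern packs a few random
    binary comparisons into a code, and per-class code distributions are
    combined as a product (sum of logs) across ferns."""

    def __init__(self, n_ferns=16, depth=6, seed=0):
        self.n_ferns, self.depth = n_ferns, depth
        self.rng = np.random.default_rng(seed)

    def _codes(self, X):
        # Each fern evaluates `depth` random feature-pair comparisons and
        # packs the resulting bits into one integer code per fern.
        bits = X[:, self.i] > X[:, self.j]                    # (n, n_ferns, depth)
        return (bits * (1 << np.arange(self.depth))).sum(-1)  # (n, n_ferns)

    def fit(self, X, y):
        y = np.asarray(y)
        d = X.shape[1]
        self.i = self.rng.integers(d, size=(self.n_ferns, self.depth))
        self.j = self.rng.integers(d, size=(self.n_ferns, self.depth))
        self.classes = np.unique(y)
        codes = self._codes(X)
        n_codes = 2 ** self.depth
        self.logp = np.zeros((len(self.classes), self.n_ferns, n_codes))
        for ci, c in enumerate(self.classes):
            cc = codes[y == c]
            for f in range(self.n_ferns):
                counts = np.bincount(cc[:, f], minlength=n_codes) + 1.0  # Laplace smoothing
                self.logp[ci, f] = np.log(counts / counts.sum())
        return self

    def predict(self, X):
        codes = self._codes(X)                                 # (n, n_ferns)
        # per_fern[c, s, f] = log P(code of fern f in sample s | class c)
        per_fern = self.logp[:, np.arange(self.n_ferns), codes]
        return self.classes[per_fern.sum(-1).argmax(0)]
```

With only ~1000 examples per class, each fern's 2^depth code histogram is sparsely populated, which is one concrete way limited training data could blur the square vs. nested-square distinction.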
Saturday, May 19, 2012
Tuning the HOGs
The following images show the Chinese character (pinyin: yi1), the HOG for that character with spatial bin size (sBin) = 6, and the HOG for that character with sBin = 8. There is detail evident at sBin = 6 that is not visible at sBin = 8. I think we can conclude that sBin is important in capturing certain visual details.
Note one side effect of reducing sBin to 6: the top and bottom rows of HOG cells appear to contain no information.
[Figures: the character, its HOG at sBin = 6, and its HOG at sBin = 8]
With sBin = 6, 1000 known negative images, and 1000 hard negatives, the detector performance is given in the following graph. Note that there is a range of thresholds that gives perfect precision and recall! Before getting too excited, though, look at the following image showing the detection results.
So, although the detector has fired on the horizontal lines, there are questions about the resolution/fidelity of the bounding boxes, about scale, and about sensitivity. Why do the bounding boxes exhibit so much variability in size and location relative to the actual detected pixels? Why does the detector fire at multiple scales? And why is the selection threshold that works for this sample so different from the other data sets? At the threshold that works for this example, the previous data set would have extremely poor precision.
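For the record, the precision/recall/F-score sweeps in these charts amount to thresholding detection scores against ground-truth labels. A small helper (hypothetical, not the actual evaluation code) makes the computation explicit:

```python
def pr_curve(scores, labels, thresholds):
    """Precision, recall, and F-score at each selection threshold.
    scores: detector responses; labels: 1 for true characters, 0 otherwise."""
    out = []
    for t in thresholds:
        pred = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum((not p) and l for p, l in zip(pred, labels))
        # Convention: precision is 1.0 when nothing fires at this threshold
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((t, prec, rec, f))
    return out
```

Sweeping `thresholds` over the detector's score range and taking the row with the highest F-score reproduces the "best F-score at threshold = N" numbers reported below.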
Wednesday, May 16, 2012
Increasing the Number of Trained Characters
In the previous runs, we had trained 11 characters. In this run, we increased the number of trained characters to 25 by adding 14 of the 100 most common Chinese characters to the 11 used in the prior runs.
The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .57 at threshold = 160. This is a slightly lower max F-score than the case where only 11 characters are trained.
Tuesday, May 15, 2012
Increasing the Number of Orientation Bins
I increased the number of HOG orientation bins to 12 from 8.
The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .61 at threshold = 150. This is a slightly lower max F-score than using the original 8 orientation bins.