Wednesday, May 23, 2012

Geometric Shapes Test

One question I've been pursuing relates to the limits of the resolution / fidelity of the HOG as the feature vector for my classifier. For example, what kinds of detail can it represent, and for which values of its parameters (spatial bin size, number of orientation bins, size of training image as a multiple of the spatial bin size)?
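To make those parameters concrete, here is a minimal HOG sketch in plain NumPy. It is an illustration only (no block normalization or interpolation, so not a faithful reimplementation of the features used in these experiments), but it shows how sBin and the number of orientation bins determine the feature vector:

```python
import numpy as np

def hog_features(img, s_bin=8, o_bins=8):
    """Minimal HOG sketch: one o_bins-long gradient-orientation histogram per
    s_bin x s_bin cell. No block normalization or interpolation; for
    illustrating parameter effects only."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                       # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * o_bins).astype(int), o_bins - 1)
    ny, nx = img.shape[0] // s_bin, img.shape[1] // s_bin
    feat = np.zeros((ny, nx, o_bins))
    for i in range(ny):
        for j in range(nx):
            cell_m = mag[i*s_bin:(i+1)*s_bin, j*s_bin:(j+1)*s_bin]
            cell_b = bins[i*s_bin:(i+1)*s_bin, j*s_bin:(j+1)*s_bin]
            for b in range(o_bins):
                feat[i, j, b] = cell_m[cell_b == b].sum()
    return feat.ravel()

img = np.zeros((48, 48))
img[12:36, 12:36] = 1.0                                    # a white square
print(hog_features(img, s_bin=8).size, hog_features(img, s_bin=6).size)  # 288 512
```

For a 48x48 training image, dropping sBin from 8 to 6 nearly doubles the number of cells (36 to 64) and hence the length of the feature vector.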

I created a simple dataset featuring three classes shown below. I assumed that the detector would perform fine for circles and squares. I was curious primarily about how it would perform in distinguishing squares from nested squares.
Geometric shapes class exemplars
Geometric shapes test example image
The following images show some representative detection results.
As expected, the detector seems to perform well distinguishing circles from squares (but without perfect recall even in this simple case - see the third image). Somewhat as expected, it does not perform well distinguishing squares from nested squares.

I'm curious which characteristics explain why nested squares are detected as squares in most cases, but as a nested square in one case.

The figures below show the HOGs for each of the 3 classes at two spatial bin sizes.
sBin = 8
sBin = 6
Circle Square Nested Squares

As one would expect, the HOG for a circle is visually very distinct from those of squares and nested squares. The HOGs for squares and nested squares are much more similar to each other, though still visually distinguishable. My intuition is that the differences should matter statistically given enough samples, so is it possible this illustrates that ~1000 training examples are not sufficient for the random ferns classifier? On the other hand, a nested square is in fact two squares at slightly different scales, so there is an argument that for a scale-insensitive detector the correct behavior would be to detect two squares as well as a single nested square. In that case, some downstream component would have to distinguish between the two detected squares and the single nested square.

Saturday, May 19, 2012

Tuning the HOGs

The following images show: the Chinese character 一 (pinyin: yi1), the HOG for that character with spatial bin size (sBin) = 6, and the HOG for that character with sBin = 8. There is detail evident at sBin = 6 that is not visible at sBin = 8, so I think we can conclude that sBin is important in capturing certain visual details.
yi4
HOG: sBin = 6, oBin = 8
HOG: sBin = 8, oBin = 8
Note a side effect of reducing sBin to 6: the top and bottom rows of HOG cells appear to contain no information.

With sBin = 6, 1000 known negative images, and 1000 hard negatives, the detector performance is given in the following graph. Note that there is a range of thresholds that gives perfect precision and recall! Before getting too excited, look at the following image showing the detection results.


So, although the detector has fired on the horizontal lines, there are questions about the resolution/fidelity of the bounding boxes, scale, and sensitivity. Why do the bounding boxes exhibit so much variability in their size and location relative to the actual detected pixels? Why does the detector fire at multiple scales? And why is the selection threshold that works for this sample so different from that of the other data sets? At the threshold that works for this example, the previous data set would have extremely poor precision.

Wednesday, May 16, 2012

Increasing the Number of Trained Characters

In the previous runs, we had trained 11 characters. In this run, we increased the number of trained characters to 25 by adding 14 of the 100 most common Chinese characters to the 11 used in the prior runs.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .57 at threshold = 160. This is a slightly lower max F-score than the case where only 11 characters are trained.


Tuesday, May 15, 2012

Increasing the Number of Orientation Bins


I increased the number of HOG orientation bins to 12 from 8.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48. The best F-score is .61 at threshold = 150. This is a slightly lower max F-score than using the original 8 orientation bins.


Increasing the Size of the Target Window

The detector works by sliding a target window over the image being searched. For each target window, the feature vector is computed. The computed feature vector is classified using the random ferns classifier. In this case, the feature vector is the set of HOGs computed for sub-windows of the target window, concatenated into a single vector for the target window.
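The loop described above can be sketched as follows. This is a minimal illustration, not the actual detector code: `featurize` and `classify` are hypothetical stand-ins for the HOG computation and the random-ferns classifier, and the stride and threshold values are assumptions.

```python
import numpy as np

def sliding_windows(image, win=48, stride=8):
    """Yield (y, x, patch) for every win x win patch at the given stride."""
    H, W = image.shape[:2]
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            yield y, x, image[y:y + win, x:x + win]

def detect(image, featurize, classify, win=48, stride=8, threshold=150):
    """Score every window and keep those above the selection threshold.
    `featurize` concatenates the sub-window HOGs into one vector;
    `classify` returns the classifier's score for that vector."""
    hits = []
    for y, x, patch in sliding_windows(image, win, stride):
        score = classify(featurize(patch))
        if score >= threshold:
            hits.append((y, x, win, score))
    return hits
```

Scale is handled outside this sketch, by rerunning the same fixed-size window over resized copies of the image.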

The original code used a 48x48 pixel window size. One hypothesis is that a larger window size might preserve more discriminative information: by increasing the number of points at which gradients are computed, the computed HOG may exhibit higher fidelity. I'm not convinced of this, but it's an easy thing to check empirically.

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 48x48 (this is the value used in the original paper). The best F-score is .69 at threshold = 150.



I changed the window size to 72x72, a 50% increase in each axis and a 125% increase in area (2.25x the original). Without changing the HOG parameters, this results in a feature vector 2.25x as large. But 2.25x the feature bits describing a region 2.25x larger nets out to the same bits per unit area, so I remain dubious of this hypothesis.
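The arithmetic, assuming one histogram of oBin orientations per sBin x sBin cell (block normalization would change the constants but not the 2.25x ratio):

```python
# Feature-vector growth when the window scales from 48x48 to 72x72,
# with HOG parameters held fixed at sBin = 8, oBin = 8.
s_bin, o_bins = 8, 8
sizes = {}
for win in (48, 72):
    cells = (win // s_bin) ** 2        # 36 cells at 48x48, 81 cells at 72x72
    sizes[win] = cells * o_bins
print(sizes[48], sizes[72], sizes[72] / sizes[48])   # 288 648 2.25
```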

The chart below shows precision, recall, and F-score for different values of selection threshold with the target window size 72x72. The best F-score is .69 at threshold = 110.


So although the behavior of the detector is clearly changed by differing window sizes (that is, the precision-recall curve changes), the basic shape of the precision, recall, and F-score graphs is similar, and the maximum F-score is the same for both.

Wednesday, May 9, 2012

Challenges

Characters with Simple Geometry

There are some characters that have strong similarity to very basic features of most images; for example, some characters have very simple geometry. We will be on the lookout for techniques that can be successful for characters like these.

Scale Variation
Many of the ideographs and pictographs that make up Chinese characters ( 汉字 ) are characters in their own right, and also appear as sub-components of other characters. This reuse occurs at multiple scales. For example,

and

illustrate the issue at a single scale, while the characters

寸 and 豆腐

illustrate the issue at multiple scales. My initial approach to this problem is to determine whether the detected bounding boxes actually nest the way they would under "perfect" recognition, and then whether a specific application of non-maximal suppression can correctly discard the nested character elements in favor of the larger character's bounding box.
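A sketch of that containment test and suppression step. The box format (x1, y1, x2, y2) and the 0.9 containment tolerance are assumptions for illustration, not the code used in these experiments:

```python
def contains(outer, inner, tol=0.9):
    """True if `inner` lies (mostly) inside `outer`. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(outer[0], inner[0]), max(outer[1], inner[1])
    ix2, iy2 = min(outer[2], inner[2]), min(outer[3], inner[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    inner_area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inner_area > 0 and inter / inner_area >= tol

def suppress_nested(boxes):
    """Drop any detection whose box nests inside another detection's box,
    keeping the larger character's bounding box."""
    return [b for b in boxes
            if not any(o is not b and contains(o, b) for o in boxes)]

boxes = [(0, 0, 100, 100), (10, 10, 40, 40)]
print(suppress_nested(boxes))   # [(0, 0, 100, 100)]
```

Whether this is the right policy depends on the ground truth: if the nested sub-character is itself a legitimate character in the scene, suppression would delete a true positive.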

General Training Sets

The experiments I've run so far have used tiny training sets. A key result of this work should be to characterize the performance of the technique using training sets comprised of several hundred to several thousand characters.

Preliminary Performance

Training Data

500 images per character ( = 100 images per character per font * 5 fonts ). Rotation = random value in -PI/8 to +PI/8.
100 background images picked at random from a set of images known to have no Chinese characters.
100 background 'hard negative' images saved from bootstrap step.

Set1: characters = { 向 前 一 小 步 文 明 大 英 发 服 饰 }
Set1-yi4: characters = { 向 前 小 步 文 明 大 英 发 服 饰 } (Set1 without the character 一 )

Image Set: left image, right image
Ground Truth (character, count): 英 1, 发 1, 服 1, 饰 1, 前 1, 一 2, 小 1, 步 2, 文 1, 明 1, 大 1
Results (with training data Set1-yi4, i.e. without 一)
Observations of these results include:
  • Multiple detections of patches that appear fairly flat and uniform to the naked eye.
  • All true characters in the right image are detected (along with a number of false detections), but only 1 of 4 true characters detected in the left image.
Results (with training data Set1, i.e. with 一)
Observations of these results include:
  • Results are blown out by inclusion of 一 in the training set.  It is a very hard class given its similarity to horizontal edges.

Wednesday, May 2, 2012

Improved Training

With the training upgraded as I described in the last post, I get the following results.

total # of characters present: 10
# of unique characters present: 9
# of characters trained: 8
total # of recognitions: 14
# of character instances correctly recognized: 8
# of false recognitions: 6

precision: 8 / 14 = .57
recall: 8 / 10 = .8
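As a sanity check on those numbers, here is the same arithmetic plus the balanced F-score (F1) they imply:

```python
# Precision / recall / F1 from the counts reported above
tp, detections, actual = 8, 14, 10
precision = tp / detections
recall = tp / actual
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.57 0.8 0.67
```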

The training set consists of 向前 小步文明大.
Note that this case is still somewhat artificial. The training set did not include 一, because this character matches every horizontal edge in the image. This is a very hard case that I'm not sure how to handle. So for now I'm cheating and not including it in the training set. Note also that this case has no 'distractors' in the training set.