# An Introduction to Morphological Operations for Digital Image Text Classification

Before you begin, make sure to follow the article using the Colab notebooks for LineCount and WordCount.

Given a digitized text document image, a common question one might ask is:

How can we take an input image like this and localize and count the number of lines and words in it?

To answer this question, we describe a method to localize lines and their respective words in digitized text images. The method is based on digital imagery morphological operations. You can see the word counting results in the cover image.

#### Morphological Image Processing

In summary, morphological image processing refers to a set of image processing algorithms that act on image pixels using pre-defined kernels. These kernels, known as structuring elements, define patterns that are used to process images. Below, you can see some examples of common structuring elements.

You can imagine these elements as simple matrices. Here, black circles represent the value 1 (foreground). Also, structuring elements are defined based on some local origin (marked as Xs).

The most common morphological operations are dilation and erosion. In both methods, a kernel slides over the image and substitutes the target (origin) pixel according to a criterion.

Intuitively, erosion has the effect of deteriorating the boundary pixels of foreground elements. We can think of it as expanding the background pixels and reducing the shape of foreground elements.

In short, a pre-defined kernel is passed over the image as a 2D convolution. Then, assuming a binary image, the output pixel will be 1 only if all the pixels under the kernel are 1. Otherwise, it gets a 0.

We can also think of erosion as applying a minimum function over the pixels under the kernel. In this way, the output pixel will be the smallest value among all the pixels under the kernel. As we will see shortly, erosion has also the effect of removing white noise.

Dilation acts in the opposite way. It expands the boundaries of foreground elements. As a consequence, it also reduces the number of background pixels. As opposed to erosion, given an input pixel, the output is 1 if at least one of the input pixels (under the kernel) is 1. Otherwise, the output is 0.

Similarly, dilation applies a maximum operation over the pixels under the kernel. In other words, the output pixel will be the maximum value among all pixels that fall within the kernel.
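This min/max view of erosion and dilation can be sketched in a few lines of NumPy. This is only a toy illustration of the definitions above, not the optimized implementation OpenCV provides:

```python
import numpy as np

def erode(img, kernel):
    """Erosion: output pixel = minimum value under the kernel's 1-entries."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            patch = padded[y:y + kh, x:x + kw]
            out[y, x] = patch[kernel == 1].min()
    return out

def dilate(img, kernel):
    """Dilation: output pixel = maximum value under the kernel's 1-entries."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            patch = padded[y:y + kh, x:x + kw]
            out[y, x] = patch[kernel == 1].max()
    return out

img = np.zeros((5, 5), dtype=np.uint8)
img[1:4, 1:4] = 1                        # a 3x3 foreground square
kernel = np.ones((3, 3), dtype=np.uint8)

print(erode(img, kernel))   # only the center pixel survives
print(dilate(img, kernel))  # the square grows to fill the 5x5 image
```

Note how erosion shrinks the 3x3 square down to a single pixel, while dilation grows it outward, exactly the "shrink foreground / expand foreground" intuition described above.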

We can also combine these two operations. For instance, Opening is the application of erosion followed by dilation. Similarly, dilation followed by erosion is called Closing.

Intuitively, Opening is widely used as a noise removal transformation. It is used to smooth object boundaries, and to remove thin connections between objects.

Closing, on the other hand, is used to get rid of small gaps within the foreground objects. Also, it has the effect of merging narrow separation between objects.

#### Finding Lines on Text documents

As you will see, to find lines and words, we are going to develop a common recipe for both tasks. Let’s dive into it.

In order to allow the morphological operators to act as expected, we transform the input image to a binary representation. Then, the pixel values are inverted to guarantee two properties:

• Black Pixels (0s) represent the background.
• White Pixels (1s) represent the foreground.

The first step is to create a structuring element 100 pixels wide and 1 pixel tall. With this element, we apply a dilation operation to the input image. We expect the dilation to expand the boundaries of the foreground elements (the words in the image).

The elongated shape of the kernel (100 pixels wide) has the effect of stretching the foreground pixels in the horizontal direction. As a result, the pixels representing the words in the text get merged into one wide block of foreground pixels. Take a look at Figure 1 (a) to see the result. Figure 1: Result of horizontal dilation (a) and erosion (b) using a structuring element 100 pixels wide and 1 pixel tall.

Next, we erode the result of the dilation using the same structuring element. The goal is to refine the boundaries of the blocks that represent the foreground. Here, erosion removes a few pixels from the boundaries of each block. As a result, we get smaller blocks as shown in Figure 1 (b).

This first series of operations allowed us to find blocks of foreground pixels in the horizontal direction. Now, we are going to perform the same operations but in the opposite direction.

We start off by creating a different structuring element. This time, it is 200 pixels tall and 1 pixel wide. When we dilate the original image using this kernel, we get a similar effect to the first dilation. However, note that the blocks of foreground elements that get merged are in the vertical direction. Again, we take the result from dilation and apply erosion with the aim of getting finer results. Figure 2 displays the results of vertical dilation and erosion. Figure 2: Results of vertical dilation (a) and erosion (b) applied to the input image.

The intermediate results representing horizontal and vertical blocks can now be merged using a pixel-wise AND operation. The idea is to consolidate the findings in both directions. Intuitively, a line of text has to appear in both representations to be considered valid. As we might expect, this step removes many candidates that were only detected horizontally or vertically in the input image. Since we represent foreground pixels as 1 (white) and background pixels as 0 (black), a pixel-wise AND operation can be expressed as a simple multiplication.
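Since both intermediate maps are binary, the merge is a one-liner. A tiny example with 0/1 arrays:

```python
import numpy as np

# Stand-ins for the eroded horizontal and vertical maps (foreground = 1).
horizontal = np.array([[1, 1, 0],
                       [1, 1, 1]], dtype=np.uint8)
vertical   = np.array([[0, 1, 1],
                       [1, 1, 1]], dtype=np.uint8)

# With 0/1 values, pixel-wise AND is just multiplication; with 0/255
# images, cv2.bitwise_and(horizontal, vertical) does the same job.
merged = horizontal * vertical  # keeps pixels present in BOTH maps
print(merged)
```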

Lastly, merging the vertical and horizontal representations might add some holes inside the foreground objects. To make sure we minimize this problem, we apply a Closing operation with a kernel shape of 30 pixels wide and 1 pixel tall. The result can be seen in Figure 3. Figure 3: Final result after merging the horizontal and vertical representations and removing small gaps.

We want to find the bounding boxes surrounding the connected components in the resulting representation of Figure 3. A connected component is a set of pixels that share a common pre-defined rule.

To find such regions we can use the OpenCV connectedComponentsWithStats() function. For binary images, it returns a segmented image where the pixels of each component are labeled with a unique value. In other words, the pixels of the first component may have labels 1, pixels of the second component may be 2, and so on. Also, for each component, it returns the smallest rectangle that fits that component — the bounding boxes.

Having the bounding boxes of each component, we can use the pixel information (within the boxes) to calculate some features. In turn, we will use these features to design a set of rules to classify a given component as text or non-text.

Basically, for each bounding box, we are going to calculate two features:

• The ratio of black pixels to the total number of pixels in the box.
• The ratio of white-to-black transitions to the total number of pixels in the box.
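A possible implementation of these two features. The scan order for counting transitions is an assumption (here they are counted row by row, left to right):

```python
import numpy as np

def box_features(binary, x, y, w, h):
    """Features of one bounding box in an inverted binary image
    (foreground = 1, background = 0)."""
    # Cast to int so np.diff can produce -1 (uint8 would wrap around).
    patch = (binary[y:y + h, x:x + w] > 0).astype(int)
    total = patch.size
    # Feature 1: fraction of black (background) pixels in the box.
    black_ratio = np.count_nonzero(patch == 0) / total
    # Feature 2: white -> black transitions, scanned row by row,
    # normalized by the box size.
    transitions = np.count_nonzero(np.diff(patch, axis=1) == -1)
    return black_ratio, transitions / total

box = np.array([[1, 1, 0, 1, 0],
                [0, 1, 1, 0, 0]], dtype=np.uint8)
black, trans = box_features(box, 0, 0, 5, 2)
print(black)  # 0.5  (5 black pixels out of 10)
print(trans)  # 0.3  (3 white->black transitions out of 10)
```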

The idea is that these features will be very similar for components that represent lines of text. In contrast, other components will express a different distribution of these features. Thus, we can design heuristics that classify a component as text or non-text by comparing feature values. Take a look at the final results in Figure 4. Figure 4: Line counting result. Note that lines with large spacing between words are counted as 2 instances.

As we can see, the results in Figure 4 are very robust. Note that some lines in the text got split in two. That is because, at certain points, some words are very far apart from each other. As a result, the horizontal dilation wasn't able to merge them into a single block. Also, because of this extra gap, the vertical operations weren't able to detect them as a single entity.

Considering lines of text as document lines (regardless of the spacing among words), the input document has 28 lines. The algorithm, however, counted a total of 35 lines.

#### Counting Words in Text Documents

To localize individual words in the text, the process is very similar. The major difference lies in the shape of the structuring elements.

For the first set of operations, we choose a 12x1 (12 pixels wide, 1 pixel tall) structuring element. Note that we reduced the horizontal reach of this kernel (compared to the one used to identify lines). The idea is that, while most words have an elongated (rectangular) shape, they are not as wide as lines are.

Following our recipe, dilation will expand the foreground pixels in the horizontal direction, and erosion will refine the new blocks of text. Because we chose a narrower structuring element, dilation has the effect of merging letters of the same word into a single block. You can see the results in Figure 6.

To find vertical patterns, we define a structuring element 1 pixel wide and 4 pixels tall. Again, we take the original input image and apply dilation and erosion using this kernel.

Lastly, we merge the horizontal and vertical representations using a pixel-wise AND operation and apply closing to fill in possible gaps. The result, after merging, is depicted in Figure 7. We can see that the combination of horizontal and vertical representations resulted in clean, well-defined blocks of words. The next step is to find the bounding boxes surrounding each connected component. Figure 7: Final result after merging the horizontal and vertical representations and removing small gaps.

For each bounding box, we can compute the pixel ratio and the white-to-black transitions. Finally, we can hand-design heuristics to classify each bounding box as text or non-text. The final result can be seen in Figure 8.

Considering numbers as words, and discarding special characters, the text has approximately 241 words. The algorithm was able to detect a total of 244 words, out of 316 possible candidates (number of initial bounding boxes).

#### Discussions and Conclusions

It is important to note that although this method seems robust, it is not general. Indeed, the shape of the structuring elements does not generalize to other documents. Also, the method we used to classify foreground components as text or non-text does not generalize either.

If we only use simple heuristics for classification, the proposed solution will probably output different (inaccurate) results for different document images. That is due to the chaotic nature of printed documents. Although they seem well behaved in terms of structure, the morphological structuring elements are very sensitive to things like font size, line spacing, and font type.

A better way of achieving a generalizable set of rules would be to add a learnable component to the system. One approach is to learn, in an unsupervised way, regions where similar objects are clustered together based on some measure of similarity.

To this end, we could use a clustering algorithm such as K-Means on the image patches of the components. This way, we could learn to group patches of text together (because they are similar). In the same way, we would learn different clusters for image patches that are not similar to text. Nevertheless, we would still be bounded by our capacity to find the connected components.

That is all for morphological operations.