I. INTRODUCTION
Many graphics, such as text and channel logos, are superimposed on broadcast videos by human producers. Text inserted into a video in this way is called overlay text. It differs from scene text, which occurs naturally in the scene being recorded, for example on advertising boards, street signs, and clothing. Overlay text supplements the audiovisual content with additional information and draws the audience's attention; it is superimposed on the video frames during the editing stage of production. Extracting video text is therefore important for further semantic understanding, and many approaches have been studied for scene understanding, indexing, browsing, and retrieval [1-5].
In news videos in particular, overlay text provides a concise and direct description of the content. For instance, it gives the names of people and places, or describes objects and the current issue.
Overlay text is therefore the most reliable clue for constructing a news video indexing system, provided that the text can be accurately transcribed. The detection and recognition of overlay text have become an active topic in news video analysis, with applications such as identifying people and places, the names and dates of newsworthy events, stock market figures, other news statistics, and news summaries [6-9].
Among these applications, identifying a person from overlay text has attracted considerable interest in the information research community. Identification using overlaid person names (OPN) began to be investigated in [10]. Since then the area has generated a large body of work, especially on face clustering, face naming in captioned images, and, more recently, automatic naming in broadcast videos [11-16].
This paper, however, focuses on building an automatic person indexing system based on the OPN in news interview videos. The name and title information in the interview videos of TV news programs is valuable for building information retrieval and data mining systems.
As a first step toward this goal, this paper proposes a method that detects only the name text line among all the overlay texts in a frame. The framework of the proposed method is shown in Figure 1. The method relies on rules that are commonly followed in the production of TV news programs. Because broadcast videos are produced by professionals, many accepted production rules apply to TV content. For example, text and logos are often overlaid on the natural content in a structured manner, such as aligned text lines at the bottom or in the upper corners of the screen, to minimize the chance of covering important content.
The system consists of three main parts: reference frame selection, overlay text region detection, and overlaid name text line detection. Section II explains the proposed method in detail. Section III presents the results of the overlaid name text line detection and analyzes them. Section IV gives a brief conclusion and outlines future work.
II. PROPOSED FRAMEWORK
Because overlay text appears and disappears over the course of a video, processing every frame wastes time. First, sub-clips that contain overlay text are extracted based on the corner characteristics of the text. Second, to reduce processing time, reference frames are selected, because the same overlay text stays in the same position for a few seconds or more. Third, the reference frames are converted to grayscale images. Fourth, a logical AND operation is applied to the edge map images. Based on this result, the overlay text region is obtained from the number of black-and-white transitions, and the horizontal projection histogram is computed. The detected overlay text region and the horizontal projection analysis are then combined to restrict the region of interest (ROI) among all the text lines. Finally, the overlaid name text line is detected by overlaying the ROI text line mask on one of the four reference frames. Details are given below.
Observation of a large number of TV news programs shows that the overlay text in most news videos has the following characteristics. Because overlay text appears and disappears either abruptly or gradually, as in Fig. 2, not all frames in a video sequence contain overlay text. The position of the overlay text is fixed, generally within the bottom half of the frame. The overlay text is aligned horizontally and overlaid on an opaque or translucent background matte. For readability against a complex scene, the same overlay text stays in the same position for a few seconds or more.
First, the frames of the incoming video sequence are stored in the RGB color space. Then, according to the characteristics above, the method decides for each frame whether it contains overlay text and produces sub-clips that contain overlay text. Because running full overlay text detection on every frame is time-consuming, this decision uses the corner density map based on the Harris corner detector proposed in previous work [17].
Next, the reference frames are selected from the sub-clips by the method proposed in [7]. If a video is played at f frames per second, the overlay text is assumed to stay in a fixed location for at least 2f consecutive frames. Let k be the smallest integer that is not less than f. Every k consecutive frames are defined as one round. To simplify the calculation, only the first round is used, and the four reference frames are frames 1, ⌊k/3⌋, 2⌊k/3⌋, and 3⌊k/3⌋, as in Fig. 3, because the same overlay text is fixed in the same position throughout every k consecutive frames.
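The selection rule can be illustrated with a short sketch. The function name `reference_frame_indices` is hypothetical and not taken from [7]; only the rule itself (k as the smallest integer not less than f, frames 1, ⌊k/3⌋, 2⌊k/3⌋, 3⌊k/3⌋ of the first round) comes from the text.

```python
import math


def reference_frame_indices(fps):
    """Return the four 1-based reference frame numbers of the first round."""
    k = math.ceil(fps)   # smallest integer that is not less than f
    step = k // 3        # floor(k / 3)
    return [1, step, 2 * step, 3 * step]


print(reference_frame_indices(29.97))  # [1, 10, 20, 30]
```

For a 29.97 frames-per-second video, k is 30, so the four reference frames span one second of video.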
The next step detects all the overlay texts in the frame using the four reference frames selected above. The method uses the temporal information of the video and a logical AND operation on Canny edge maps to detect the overlay text region [7].
First, the four color reference frames are converted to grayscale images by (1),
where Y is the intensity value and R, G, and B are the values of the red, green, and blue channels of the pixel, respectively. Fig. 4 shows the conversion result.
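A minimal sketch of such a conversion follows. The weights used here are the widely used ITU-R BT.601 luma coefficients; they are an assumption for illustration, since the coefficients of (1) are not reproduced in this text. The function name `to_grayscale` is also hypothetical.

```python
def to_grayscale(rgb_frame, weights=(0.299, 0.587, 0.114)):
    """Convert rows of (R, G, B) tuples into rows of intensity values Y.

    The default weights are the BT.601 luma coefficients (an assumption,
    not necessarily the coefficients used in the paper).
    """
    wr, wg, wb = weights
    return [[round(wr * r + wg * g + wb * b) for (r, g, b) in row]
            for row in rgb_frame]


# One row with a black, a white, and a pure red pixel.
frame = [[(0, 0, 0), (255, 255, 255), (255, 0, 0)]]
gray = to_grayscale(frame)
```

Plain Python lists are used only to keep the sketch dependency-free; an array library would be the usual choice for full-resolution frames.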
Second, edge map images are produced by applying the Canny edge detector and a simple line deletion to each grayscale image. The simple line deletion removes long lines in the Canny edge image that are unlikely to belong to characters. The Canny edge image is scanned from left to right and top to bottom, and a horizontal or vertical line is removed if its length exceeds the presumed character width w or height h, respectively. The resulting edge map images are shown in Figure 5.
Third, a logical AND operation is applied to the four Canny edge map images. The resulting image is called the Multiple-Edge-Map. After the AND operation, a position (i, j) is an edge pixel only if all four edge maps have an edge pixel at (i, j). Most background edge pixels are therefore removed, whereas the static overlay text remains, because the same overlay text appears in the same location over many successive frames, while background edge pixels may shift by a few pixels between frames. Figure 6 shows the Multiple-Edge-Map image. The result shows that multiple-frame integration alleviates the difficulty of deciding whether the detected edges really come from overlay text.
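The AND step can be written compactly. This is a sketch on binary edge maps stored as nested lists; the function name `multiple_edge_map` is chosen here for illustration.

```python
def multiple_edge_map(edge_maps):
    """Pixel-wise logical AND of equally sized binary edge maps."""
    # zip(*edge_maps) groups row i of every map; zip(*rows) groups pixel (i, j).
    return [[int(all(pixels)) for pixels in zip(*rows)]
            for rows in zip(*edge_maps)]


maps = [
    [[1, 1, 0], [0, 1, 1]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 1], [0, 1, 0]],
    [[1, 1, 0], [1, 1, 1]],
]
mem = multiple_edge_map(maps)  # only pixels set in all four maps remain
```

Only the two positions that are edge pixels in every one of the four maps survive, which mirrors how static text edges are kept and moving background edges are discarded.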
Fourth, the overlay text candidate region is detected using the number of black-and-white transitions. As in (2), the value Ntrans is computed while a window of the presumed character size w × h slides from left to right and top to bottom over the Multiple-Edge-Map image,
where w and h are the width and height of the window and b(·) is the binary image. If Ntrans is larger than the threshold Ttrans, the window is masked. The union of all masked windows is the overlay text candidate region. The threshold depends on the character size and is given by Ttrans = β(w × h), where β is an empirically determined constant.
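The windowed thresholding can be sketched as below. Because (2) is not reproduced in this text, the sketch assumes that Ntrans counts horizontally adjacent pixel pairs with different values inside the window; the function names and the `step` parameter are hypothetical.

```python
def count_transitions(binary, top, left, w, h):
    """Count horizontally adjacent pixel pairs that differ inside one window.

    This definition of Ntrans is an assumption; the paper's equation (2)
    may count transitions differently.
    """
    n = 0
    for i in range(top, top + h):
        row = binary[i]
        for j in range(left, left + w - 1):
            n += row[j] != row[j + 1]
    return n


def candidate_mask(binary, w, h, beta, step=1):
    """Mask every w x h window whose transition count exceeds beta * (w * h)."""
    rows, cols = len(binary), len(binary[0])
    t_trans = beta * (w * h)
    mask = [[0] * cols for _ in range(rows)]
    for top in range(0, rows - h + 1, step):
        for left in range(0, cols - w + 1, step):
            if count_transitions(binary, top, left, w, h) > t_trans:
                for i in range(top, top + h):
                    mask[i][left:left + w] = [1] * w
    return mask


# Left half alternates (text-like), right half is flat background.
binary = [[1, 0, 1, 0, 0, 0, 0, 0] for _ in range(4)]
mask = candidate_mask(binary, w=4, h=4, beta=0.5)
```

In this toy image only the leftmost window, which contains the densely alternating pixels, exceeds the threshold and is masked.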
Finally, because characters lose some pixels in the AND operation, a morphological closing is applied, followed by a dilation. A closing with a horizontal structuring element of size ⌈w/3⌉ fills holes, and a dilation with a structuring element of size ⌈w/4⌉ × ⌈h/4⌉ connects the characters. The resulting image is the overlay text region, as in Figure 7.
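The following dependency-free sketch shows the closing and dilation with rectangular structuring elements of the stated sizes. The handling of image borders (pixels outside the image are ignored) and all function names are assumptions made for illustration; an image processing library would normally be used instead.

```python
import math


def _offsets(size, reflect):
    """Offsets covered by a 1-D structuring element anchored at its center."""
    offs = range(-(size // 2), size - size // 2)
    return [-o for o in offs] if reflect else list(offs)


def _morph(img, kw, kh, combine, reflect):
    rows, cols = len(img), len(img[0])
    dys, dxs = _offsets(kh, reflect), _offsets(kw, reflect)
    # Pixels outside the image are ignored (an assumption about borders).
    return [[int(combine(img[i + dy][j + dx]
                         for dy in dys for dx in dxs
                         if 0 <= i + dy < rows and 0 <= j + dx < cols))
             for j in range(cols)] for i in range(rows)]


def dilate(img, kw, kh):
    return _morph(img, kw, kh, any, reflect=True)


def erode(img, kw, kh):
    return _morph(img, kw, kh, all, reflect=False)


def close_then_dilate(mask, w, h):
    """Closing with a ceil(w/3) x 1 element, then dilation with ceil(w/4) x ceil(h/4)."""
    kc = math.ceil(w / 3)
    closed = erode(dilate(mask, kc, 1), kc, 1)  # closing = dilation, then erosion
    return dilate(closed, math.ceil(w / 4), math.ceil(h / 4))


# Two isolated pixels with a one-pixel hole between them.
mask = [[0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
region = close_then_dilate(mask, w=9, h=4)
```

The closing fills the hole between the two pixels, and the dilation then widens the filled segment by one pixel on each side.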
In general, a single video frame can contain many overlay texts. To detect the name text line, it is not necessary to analyze all of them. This paper constrains the detection region based on news program production rules, which apply because TV content is produced by professionals. In news interview sequences, the interviewee's story is shown over a few lines at the bottom of the frame, consistent with the characteristics noted in the previous section. The interviewee's name and title are placed on the top line of the interviewee's first story lines and stay in the same position for a few seconds. Therefore, only the top line of the first story lines is the region of interest (ROI).
To detect the ROI text line, the horizontal projection histogram is first computed from the Multiple-Edge-Map image obtained above. The image is scanned from top to bottom and the number of edge pixels in each row is counted, which gives the horizontal projection histogram as in (3).
Figure 8 shows the resulting image. The projection is analyzed within the bottom half of the frame. The topmost area of the horizontal projection histogram within the bottom half is selected as the region of interest. The start and end points of the ROI along the vertical (height) axis are applied to the overlay text region image, which yields the ROI text line mask image in Figure 9(a). Finally, applying this mask to one of the four reference frames yields the overlaid name text line image in Figure 9(b).
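The projection analysis can be sketched as follows. The text does not spell out how the topmost area is delimited, so the sketch assumes it is the first run of rows in the bottom half whose edge count exceeds a threshold; the function names are hypothetical.

```python
def horizontal_projection(edge_map):
    """Number of edge pixels in each row of a binary edge map."""
    return [sum(row) for row in edge_map]


def first_text_band(hist, threshold):
    """Return (start, end) of the topmost band in the bottom half, end exclusive.

    A band is a run of consecutive rows whose count exceeds the threshold
    (an assumed criterion). Returns None if no such row exists.
    """
    start = None
    for i in range(len(hist) // 2, len(hist)):
        if hist[i] > threshold and start is None:
            start = i
        elif hist[i] <= threshold and start is not None:
            return start, i
    return (start, len(hist)) if start is not None else None


def roi_mask(text_region, band):
    """Keep the overlay text region only inside the rows of the band."""
    start, end = band
    return [row[:] if start <= i < end else [0] * len(row)
            for i, row in enumerate(text_region)]


edge_map = [
    [1, 1, 1, 1],  # text in the top half is ignored
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],  # first text line in the bottom half
    [0, 0, 0, 0],
    [1, 1, 1, 1],  # a later text line
]
band = first_text_band(horizontal_projection(edge_map), threshold=2)
```

Here the band covers only row 5, so the later text line in row 7 is excluded from the ROI mask.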
III. EXPERIMENTAL RESULTS AND ANALYSIS
Since there is no standard dataset for overlay text in videos, the test videos used in the experiment were captured from interview sequences of a TV news program in Korea. The resolution of the videos was 720×480 and the frame rate was 29.97 frames per second. The presumed character size w×h was 20×20 pixels. The threshold parameter for Ntrans was set to 0.15, and the threshold for the horizontal projection histogram analysis was set to the presumed character width, 20.
Figure 10 and Table 1 show the results of the overlaid name text line detection. To evaluate the performance of the proposed algorithm, this paper reports the block-level accuracy of the detection results and uses precision and recall as performance measures. For blocks of the actual name text line, TP (true positive) is a block predicted as positive and FN (false negative) is a block predicted as negative. For blocks of the actual non-name text, FP (false positive) is a block predicted as positive and TN (true negative) is a block predicted as negative. The measures are calculated by (4) and (5), and the results are shown in Table 1.
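Since (4) and (5) are not reproduced in this text, the sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The block counts in the example are made up for illustration and are not results from Table 1.

```python
def precision_recall(tp, fp, fn):
    """Standard block-level precision and recall from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 3 name-line blocks found, 1 false block, none missed.
precision, recall = precision_recall(tp=3, fp=1, fn=0)
```

TN does not enter either measure, so blocks correctly rejected as non-name text do not affect the reported scores.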
In Figure 10, column (b) shows the overlay text regions, column (c) shows the ROI name line masks based on horizontal projection analysis, and column (d) shows the detected overlaid name text lines. The proposed method accurately detects the position of the overlaid name text line in many examples. However, the results in the fourth and fifth rows contain false blocks in the detected name text line. In these cases, the non-text region contains many edges and is detected as a text region. A refinement step is therefore needed to remove this noise.
In general, the name text line contains not only the name but also the age, degree, job, and address. This information plays an important role in developing an automatic person indexing system. It is therefore all the more necessary to detect the text line precisely, so that it can be used as input to conventional OCR (optical character recognition).
IV. CONCLUSION
This paper proposes a method for detecting the name text line among the many overlay texts in a frame, for automatic person indexing of interview videos. The overlay text region is obtained by edge detection and multiple-frame integration, and the overlaid name text line is detected using production rules and horizontal projection histogram analysis.
The resulting image can be used as input to OCR, which recognizes the text. As a result, a retrieval system that manages large volumes of news interview videos effectively and automatically can be developed.