News Video Indexing and Retrieval System Using Feature-Based Indexing and Inserted-Caption Detection Retrieval
Akshay Kumar Singh, Soham Banerjee, Sonu Kumar and Asst. Prof. Mr. S. Ghatak Computer Science and Engineering, Sikkim Manipal Institute of Technology, Majitar, India.
Abstract—Data compression, coupled with the availability of high-bandwidth networks and large storage capacity, has led to an overwhelming production of multimedia content. This paper briefly describes techniques for content-based analysis, retrieval and filtering of news videos, and focuses on basic methods for extracting features and information that enable indexing and search of any news video based on its content and semantics. The major themes covered by the study include shot segmentation, key-frame extraction, feature extraction, clustering, indexing, and video retrieval by similarity, probabilistic, transformational, refinement and relevance-feedback methods. A new caption text extraction algorithm that takes full advantage of the temporal information in a video sequence is also developed.
Keywords—Shot Boundaries Detection, Inserted Caption Detection, Machine Learning, Face Annotation, Edge/Field Detection.
I. INTRODUCTION
Effective techniques for video indexing and searching are required for large visual information systems such as video databases and video servers. In addition to traditional methods that allow users to search video based on keywords, video query by example and feature-based video search provide powerful tools that complement existing keyword-based search techniques. Due to the
rich information contained in caption text, video-caption-based methods have increasingly been used for efficient video content indexing and retrieval in recent years. Caption text routinely provides valuable indexing information such as scene locations, speaker names, program titles, sports scores, dates and times. Compared with other video features, the information in caption text is highly compact and structured, and is therefore well suited to video indexing. However, extracting captions embedded in video frames is a difficult task. Compared with OCR for document images, caption extraction and recognition in video involves several new challenges. First, captions in videos are often embedded in complex backgrounds, making caption detection much more difficult. Second, characters in captions tend to have very low resolution, since they are usually kept small to avoid obstructing scene objects in a video frame. Indexing can be classified into two types: I. feature-based and II. annotation-based.
Here we give a brief explanation of feature-based indexing, which can be further classified into: 1. segment-based, 2. object-based, and 3. index-based. Finding the required video and retrieving it can be carried out successfully by identifying and using the best video query scheme among the group of schemes available. Some query schemes for video retrieval worth mentioning are: 1. query by content, 2. query using matching, 3. query function, 4. query behavior, and 5. query temporal unit. In a content query, we specify the content of the video in the query in order to retrieve the best-matching video; it is further classified by the type of content used in the query. A semantic (information) query is the most complex type of query in a video database and depends on technologies such as computer vision, machine learning and artificial intelligence (AI); for example, finding scenes with Actor = "Naseeruddin Shah" and Emotion = "fighting". An audiovisual (AV) query depends on the AV features of the video; for example, finding shots where the camera is stationary and the lens action is zoom-in. A meta query attempts to extract information about the video data; for example, finding a video directed by Steven Spielberg and titled "Jurassic Park". In a query using matching, we extract matching objects from the database; AV features such as sound and image analysis are used to match the query sample against the video data. An exact-match query requires an exact match between the query and the video, whereas a similarity-match query (often known as query by example) is more appropriate because of the complex nature of video data. Query function and query behavior depend on the functions that queries perform, such as location-deterministic queries, browsing queries, iterative queries, tracking queries and statistical queries. The query temporal unit classifies the granularity of video data required to satisfy the query: a unit (or video-stream-based) query deals with complete units of video, for example finding a sports video that features player A, while a sub-unit query deals with parts of the video data, such as frames, clips and scenes, for example finding the scenes in which actor X appears, or the shots in which a player with a particular type of face appears.
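As a concrete illustration of a similarity-match (query-by-example) search, the following sketch ranks stored clips against a query clip using a global colour-histogram feature and cosine similarity. This is a minimal sketch rather than the system described in this paper; the feature choice, the helper names (clip_histogram, query_by_example) and the parameters are illustrative assumptions.

```python
# Minimal query-by-example sketch: rank stored clips by the similarity of a
# global grayscale-histogram feature to the query clip's histogram.
import cv2
import numpy as np

def clip_histogram(video_path, bins=32):
    """Average normalized grayscale histogram over all frames of a clip."""
    cap = cv2.VideoCapture(video_path)
    hist_sum, n = np.zeros(bins), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        h, _ = np.histogram(gray, bins=bins, range=(0, 256))
        hist_sum += h / max(h.sum(), 1)
        n += 1
    cap.release()
    return hist_sum / max(n, 1)

def similarity(a, b):
    """Cosine similarity between two feature vectors (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def query_by_example(query_path, database_paths, top_k=5):
    """Return the top_k database clips most similar to the query clip."""
    q = clip_histogram(query_path)
    scored = [(similarity(q, clip_histogram(p)), p) for p in database_paths]
    return sorted(scored, reverse=True)[:top_k]
```

An exact-match query would instead require equality of stored metadata; the ranking above tolerates the variability that is inherent in video data, which is why similarity matching is usually preferred.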
II. VIDEO REPRESENTATION

A. What is Video?
The term video (from the Latin verb videre, "I see") commonly refers to several storage formats for moving pictures: digital video formats, including Blu-ray Disc, DVD, QuickTime and MPEG-4, and analog videotape formats, including VHS and Betamax. Video can be recorded and transmitted in various physical media: on magnetic tape when recorded as PAL or NTSC electrical signals by video cameras, or in MPEG-4 or DV digital media when recorded by digital cameras. In other words, any video stored in a given format is simply a large collection of image frames captured in sequential order with a very small time difference between consecutive frames.

B. Video Analysis
Since a video can be considered a collection of image frames, analyzing a video can be seen as analyzing one image at a time and then combining the results. This concept of analysis is also referred to as shot boundary detection, as discussed in [7]. Some of the methods for image and audio analysis can be used in video analysis; for example, image analysis and retrieval methods can be applied to selected representative frames extracted from video clips. We can distinguish different ways of analyzing and searching video: video summarization, video parsing, and motion and event analysis. Each of these methods has its own challenges.

C. Video Summarization
Video summarization is the process of extracting an abstract representation that compresses the essence of the video in a meaningful manner. The simplest video summarization is a pictorial summary built from selected frames of the video [18]. That is, we need to extract a subset of video data from the original video that has key frames or highlights as entries for shots, scenes or stories. The result of the abstraction process forms the basis not only for video content representation but also for content-based video browsing. A successful approach is to utilize information from multiple sources, including sound, speech, transcript and image analysis of the video. To summarize video, many research papers suggest video shot detection methods. Some of these methods are based on comparing pixel differences between frames, histograms, edge content or DCT coefficients. One of the first approaches, proposed by Arman et al. [3], uses a DCT approach on both JPEG and MPEG streams. For MPEG streams, only I-frames are analyzed. This implementation employed a two-step approach: video frames are compared based on their representation as a vector of a subset of DCT coefficients, then the normalized inner product is subtracted from one and compared to a threshold. If a potential cut is detected, the images can be decompressed for further processing. A multi-pass approach has been used by Zhang et al., but their technique also analyzes the B- and P-frames in an MPEG stream. The first two passes compare the images based on DCT coefficients with different skip factors on I-frames. In another pass, the number of motion vectors is compared to a threshold; if there are fewer motion vectors than the threshold, a scene break is declared.
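The following sketch illustrates the DCT-coefficient comparison described above: each frame is represented by a subset of its block-DCT coefficients, and a cut is flagged when one minus the normalized inner product of consecutive feature vectors exceeds a threshold. It is a simplified sketch that operates on decoded frames rather than directly on compressed I-frames; the block size, the coefficient subset and the threshold value are illustrative assumptions.

```python
# Sketch of DCT-based shot boundary detection (after Arman et al.):
# frames are represented by a subset of block-DCT coefficients and compared
# using 1 - normalized inner product against a threshold.
import cv2
import numpy as np

def dct_feature(gray, block=8, keep=4):
    """Concatenate the top-left keep x keep DCT coefficients of each block."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    gray = np.float32(gray[:h, :w])
    coeffs = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            d = cv2.dct(gray[y:y + block, x:x + block])
            coeffs.append(d[:keep, :keep].ravel())
    return np.concatenate(coeffs)

def detect_cuts(video_path, threshold=0.15):
    """Return frame indices where the DCT-feature dissimilarity exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        feat = dct_feature(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if prev is not None:
            sim = np.dot(prev, feat) / (np.linalg.norm(prev) * np.linalg.norm(feat) + 1e-9)
            if 1.0 - sim > threshold:   # potential cut between frame idx-1 and idx
                cuts.append(idx)
        prev, idx = feat, idx + 1
    cap.release()
    return cuts
```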
III. CAPTION (DIS)APPEARANCE DETECTION
In order to obtain the temporal feature vector, we need to segment the video sequence into segments that contain the same caption text. First, we can use a conventional shot boundary detection technique to segment a video sequence into camera shots. Since there is relatively little change in content between adjacent frames within each shot, it is easier to detect caption changes within a shot. We use the quantized spatial difference density (QSDD) metric to detect the caption transition frame. We first compute the direct difference between two neighboring frames. We observe that a small movement of the scene between adjacent frames produces many residual edge pixels at object boundaries in the difference image, and a direct summation of these edge pixels may result in a value higher than that caused by a caption transition. Fortunately, most of these edge pixels are sparsely distributed, while the residual pixels produced by a caption are highly concentrated because of the dense stroke pattern of characters. We therefore compute a feature that measures the residual pixel density distribution. The QSDD metric is defined by a two-step thresholding of the difference image between a pair of adjacent frames. We first compute the difference between the pair of adjacent frames. For each pixel with a difference value higher than a binarization threshold, the value 1 is assigned to the same location in a binary difference map; otherwise the value 0 is assigned. The binary map is then uniformly partitioned into a number of small blocks. A block is labeled a significant change (SC) block if the sum of binary values in the block exceeds a second threshold. The QSDD metric then simply counts the number of SC blocks. Since caption residual pixels are densely distributed in the difference image, they tend to produce more blocks with significant changes. Therefore, the difference image at a caption transition tends to produce a high QSDD value and can thus be identified. Finally, to check whether there is a caption transition at a shot boundary, we compare the caption regions in the two frames immediately before and after the shot boundary. If they are the same, no caption transition has occurred; otherwise, we consider them two different captions.
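A minimal sketch of the QSDD computation follows. The binarization threshold, the block size and the SC-block threshold used here are illustrative assumptions, not the values used in our experiments.

```python
# Sketch of the quantized spatial difference density (QSDD) metric:
# threshold the frame difference, partition the binary map into blocks,
# and count blocks whose number of changed pixels exceeds a second threshold.
import numpy as np

def qsdd(frame_a, frame_b, diff_thresh=30, block=16, sc_thresh=64):
    """frame_a, frame_b: 2-D uint8 grayscale arrays of the same size.
    Returns the number of Significant Change (SC) blocks."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    binary = (diff > diff_thresh).astype(np.uint8)           # step 1: binarize the difference
    h, w = binary.shape
    sc_blocks = 0
    for y in range(0, h - block + 1, block):                 # step 2: count dense blocks
        for x in range(0, w - block + 1, block):
            if binary[y:y + block, x:x + block].sum() > sc_thresh:
                sc_blocks += 1
    return sc_blocks

# A caption transition between adjacent frames is declared when the QSDD
# value spikes above a decision threshold chosen for the video.
```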
IV. TEXT EXTRACTION (CONTENT BASED)

A. Indexing Considering Image Contents
Considering the issues mentioned above, we propose an automatic news video indexing system that considers correspondences between indices derived from texts in the video and the image contents. This paper focuses especially on the character regions of image frames retrieved from a news video and on indexing the video using the information conveyed by that text.

B. Term Definition
A video consists of still images called frames, and a sequence of graphically continuous frames is called a shot. The discontinuous gap between shots is called a cut. The proposed indexing system aims to apply digital image
processing techniques in order to achieve a high level of accuracy in indexing news video.
V. TEXT BASED ANALYSIS

A. Pixel Filtering
During or after a key event in most news videos, text with a surrounding box is usually inserted to draw the viewers' attention to some content-sensitive information. For example, after a flood is covered in a news report, text is usually displayed to inform viewers about the current status and the affected areas. Our area of interest is the textual region of an image frame extracted from the news video. In this context, pixel filtering refers to the removal of unwanted pixel information from the video frame in order to minimize the area of interest. A very simple approach based on correlation coefficient analysis is used: consecutive frames separated by a fixed time gap are extracted, and the correlation coefficient between the two frames is calculated. Pixels having a coefficient value less than 0.95 are removed from the frame; this eliminates some of the unwanted pixel information. In the second step of pixel filtering we exploit the fact that most of the textual information in a news video appears in the bottom part of the image frame, so applying the subsequent steps only to the bottom part makes the process faster and more efficient. The process is predicated on the maximization of a correlation coefficient that is determined by examining pixel intensity array subsets in two or more corresponding images and extracting the deformation mapping function that relates the images. A sketch of this step is given below.
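The following is a minimal sketch of the first pixel-filtering step: for each pixel, a local correlation coefficient between the two frames is computed over a small window, and pixels whose coefficient falls below 0.95 are removed (set to zero). The window size, the function names and the bottom-third restriction are illustrative assumptions.

```python
# Sketch of pixel filtering by local correlation between two frames taken a
# fixed time apart: pixels with correlation below 0.95 are discarded, so that
# static regions (such as inserted captions) are retained.
import numpy as np

def local_correlation_filter(frame_a, frame_b, win=5, min_corr=0.95):
    """frame_a, frame_b: 2-D grayscale arrays of the same size.
    Returns frame_b with low-correlation (changing) pixels set to zero."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    h, w = a.shape
    out = np.zeros_like(frame_b)
    r = win // 2
    for y in range(r, h - r):
        for x in range(r, w - r):
            pa = a[y - r:y + r + 1, x - r:x + r + 1].ravel()
            pb = b[y - r:y + r + 1, x - r:x + r + 1].ravel()
            denom = pa.std() * pb.std()
            corr = 0.0 if denom == 0 else np.mean((pa - pa.mean()) * (pb - pb.mean())) / denom
            if corr >= min_corr:                 # static pixel: likely caption or background
                out[y, x] = frame_b[y, x]
    return out

def bottom_region(frame, fraction=1 / 3):
    """Second step: keep only the bottom part of the frame, where captions usually appear."""
    h = frame.shape[0]
    return frame[int(h * (1 - fraction)):, :]
```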
B. Binary Conversion
The filtered image is now converted into binary images, which is essential for detecting the continuous text characters in the image frame. Since a binary image uses only two colors, black and white, some of the textual information is lost during the conversion; to recover the lost content, the complement of the current picture frame reveals the hidden information. The clarity of the text depends on the threshold value chosen during the conversion of the grayscale frame to a binary frame. In order to obtain clear text pixels, one more binary image frame is created using a new threshold value equal to (1 − threshold). The idea behind creating four binary images from a single image frame is to minimize the chance of losing any textual information during the conversion process.
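A sketch of this step is given below: a grayscale frame is thresholded at a chosen value t and at (1 − t), and the complement of each result is also kept, giving the four binary images mentioned above. The threshold value used here is an illustrative assumption.

```python
# Sketch of the binary conversion step: threshold at t and at (1 - t) and keep
# the complement of each, producing four binary images so that text lost in one
# is preserved in another.
import numpy as np

def four_binary_images(gray, t=0.6):
    """gray: 2-D uint8 grayscale frame. Returns four boolean images."""
    g = gray.astype(np.float64) / 255.0          # normalize intensities to [0, 1]
    b1 = g > t                                   # binary image at threshold t
    b2 = ~b1                                     # its complement
    b3 = g > (1.0 - t)                           # binary image at threshold (1 - t)
    b4 = ~b3                                     # its complement
    return b1, b2, b3, b4
```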
C. Image Segmentation and Extraction
Image segmentation is the process of separating the required objects from the original image and extracting the corresponding object images. Here, the aim is to extract the individual text characters from the original image. The idea is to label the connected components in the binary image and to plot bounding boxes around them; the object image inside each bounding box is then extracted using the coordinate specification of that box, as illustrated by the sketch below.
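The following is a minimal sketch of this segmentation step: connected components are labelled in the binary image, a bounding box is computed for each component, and the sub-image inside each box is cropped for recognition. The minimum-area filter used to suppress noise components is an illustrative assumption.

```python
# Sketch of image segmentation and extraction: label connected components in the
# binary frame, compute their bounding boxes, and crop each candidate text region.
import cv2
import numpy as np

def extract_character_regions(binary, min_area=20):
    """binary: 2-D uint8 image with text pixels set to 255. Returns cropped regions."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    regions = []
    for i in range(1, n):                              # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                           # discard tiny noise components
            regions.append(binary[y:y + h, x:x + w])   # crop the bounding box
    return regions

# Each cropped region can then be passed to a conventional OCR engine
# to recover the caption text used for indexing.
```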
D. Optical Character Recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. A conventional OCR engine is used to recognize the characters in the object images segmented from the picture frames. In this way, the textual information stored in the news video is extracted and used to index the video.
VI. TEXT DISPLAY DETECTION
To detect text in video, Wernicke and Lienhart [8] used the gradient of the color image to calculate a complex-valued edge orientation image. It is defined so as to map all edge orientations to between 0 and 90 degrees, thereby distinguishing horizontal, diagonal and vertical lines. Similarly, Mita and Hori [34] localized character regions by extracting strong still edges and pixels with a stable intensity over two seconds; strong edges are detected by Sobel filtering and verified to be still by comparing four consecutive gradient images. Thus, current text detection techniques detect the edges of the rectangular box formed by text regions in color video frames and then check whether the edges persist for more than two seconds. Based on these concepts, the essence of our text display detection method is that news videos use only horizontal text in 99% of cases, so if we can detect a strong horizontal line in a frame, we can locate the starting point of a text region. The main steps involved in text display detection are explained below. First, the video track is segmented into one-minute clips. One frame per second is pre-processed to optimize performance by converting the color scheme to grayscale and reducing the size to a smaller preset value. Second, a Sobel filter is applied to calculate the edge (gradient) image of the current frame, and the Hough transform is used on the gradient image to detect line spaces (R) between 0 and 180 degrees. Threshold1 is applied to the R values to detect potential candidates for strong lines, which are usually formed by the box surrounding text displays. After these lines are detected, the system calculates the r (rho) and t (theta) values from the peak coordinates. The r value indicates the location of the line in terms of the number of pixels from the center, and t indicates the angle of the line. In order to verify that the detected lines are candidates for a text display, the lines are filtered using two criteria: 1. the absolute value of r is less than n% of the maximum y-axis value, and 2. the corresponding t is equal to 90 degrees (horizontal). This n-value is regarded as threshold2. The first criterion ensures that the location of the lines is within the usual location of a text display. The second criterion ensures that the line is horizontal, because strong lines can also be detected from areas other than the text display, such as the boundary between the field and the crowd.
Finally, for each of the lines detected, the system checks that their location (i.e., the r values) is consistent for at least m seconds. This m-value is regarded as threshold3. The purpose of this check is to ensure that the location of the lines is consistent across subsequent frames, as text displays always appear for at least two seconds to give viewers enough time to read; in fact, when a text display is large and contains a lot of information, it is displayed even longer. Since there are not many text displays during a news broadcast (otherwise they would be distracting), it takes longer to check for text displays in all video frames. Instead, specific domain knowledge can be used to predict text appearances in sports videos, which are quite typical from one video to another. However, during the experiments to detect generic events, the system still checks all frames, since a text display can reveal some events that cannot be detected from a particular sound or noise.
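The text display detection steps above can be sketched as follows: a downscaled grayscale frame is edge-filtered with a Sobel operator, the Hough transform returns (rho, theta) pairs for strong lines, and the two criteria on rho and theta are applied. Note that OpenCV measures rho from the image origin rather than from the center, so the location criterion is adapted to keep lines in the bottom part of the frame, where captions usually appear. The Hough vote count (threshold1), the n% location bound (threshold2) and the angular tolerance are illustrative assumptions; the temporal-consistency check (threshold3) is omitted for brevity.

```python
# Sketch of the text display detection step: Sobel gradient, Hough transform,
# then filtering of line candidates by location and horizontality.
import cv2
import numpy as np

def detect_text_display_lines(frame, hough_votes=150, n_percent=40, angle_tol_deg=2):
    """Return (rho, theta) pairs of horizontal lines located in the caption region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (320, 240))                       # reduce size for speed
    gx = cv2.Sobel(small, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(small, cv2.CV_64F, 0, 1)
    edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))
    edges = cv2.threshold(edges, 80, 255, cv2.THRESH_BINARY)[1]
    lines = cv2.HoughLines(edges, 1, np.pi / 180, hough_votes)  # threshold1 = vote count
    candidates = []
    if lines is None:
        return candidates
    h = small.shape[0]
    for rho, theta in lines[:, 0]:
        horizontal = abs(np.degrees(theta) - 90) < angle_tol_deg      # criterion 2: horizontal
        in_caption_area = abs(rho) > h * (1 - n_percent / 100.0)      # criterion 1 (threshold2)
        if horizontal and in_caption_area:
            candidates.append((float(rho), float(theta)))
    return candidates
```

In the full method, a line detected here would additionally be required to stay at a consistent rho value for at least m seconds before being accepted as a text display.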
VII. CONCLUSION
A novel news story clip retrieval method is proposed in this paper. In this method, the change in the local frame difference between adjacent frames is used to detect the frames at which captions appear and disappear. We have thus proposed a very simple yet efficient method for news video indexing based on the extraction of text from video clips.
VIII. REFERENCES
[1] Davenport, G., Smith, T. A., & Pincever, N. (1991). Cinematic primitives for multimedia. IEEE Computer Graphics and Applications, 11(4), 67–74.
[2] Davis, M. (1994). Media streams: Representing video for retrieval and repurposing. Proceedings of the Second ACM International Conference on Multimedia, 478–479.
[3] Davis, M. (1995). Media streams: Representing video for retrieval and repurposing. PhD thesis.
[4] Dorai, C., & Venkatesh, S. (2002). Media Computing: Computational Media Aesthetics. The Kluwer International Series in Video Computing.
[5] Dorai, C., & Venkatesh, S. (2003). Bridging the semantic gap with computational media aesthetics. IEEE MultiMedia, 10, 15–17.
[6] Snoek, C., & Worring, M. (2005). Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 25(1), 5–35.
[7] Tjondronegoro, D. W. (2005, May). Content-based Video Indexing for Sports Applications. 320.
[8] Wernicke, A., & Lienhart, R. (2000). On the segmentation of text in videos. Proceedings of the IEEE International Conference on Multimedia and Expo.