Comparison of Spoken and Visual Natural Language Content in Video

Ehry MacRostie, Jonathan Watson, Prem Natarajan

Human language content in video typically appears either as spoken audio that accompanies the visual content, or as text that is overlaid on the video or contained within the video scene itself. The bulk of research and engineering in language extraction from video thus far has focused on spoken language content. More recently, researchers have also developed technologies capable of detecting and recognizing text content in video. Anecdotal evidence indicates that in rich multimedia sources such as Broadcast News, spoken and textual content provide complementary information. Here we present the results of a recent BBN study in which we compared named entities, a critically important type of language content, between aligned speech and videotext tracks. These new results show that videotext content provides significant additional information that does not appear in the speech stream.

Submitted: Sep 9, 2008