Overcoming Barriers to Progress in Multimodal Fusion Research

John Garofolo

Government programs have often led the way in research on information detection, recognition, and extraction technologies. Over the last 20 years, significant performance improvements have been achieved in automatic speech recognition; spoken language understanding; text, image, and video retrieval; textual structural and semantic extraction; machine translation; speaker recognition; language recognition; face recognition; video person/object detection and tracking; and other extractive technologies. Performance peaks are now being observed in these technologies, and the pace of breakthrough algorithmic improvements is slowing. Initial efforts to combine these technologies into complex applications have relied on pipelining with progressive distillation or on time-lining, with limited success. It has become clear that tightly coupled fusion of these technologies is the next logical step in the progression toward human-like computational intelligence capabilities. Unfortunately, both multidisciplinary and technical barriers have impeded research and development in multimodal fusion, and the field has yet to gain significant momentum. The enormous volume of multimedia now appearing on the Internet and elsewhere makes it necessary to accelerate work in this area. We identify the challenges that must be overcome to enable critical development and speed progress in multimodal fusion technologies.

Submitted: Sep 12, 2008