MSS National Symposium on Sensor and Data Fusion
Kitware will make two presentations at this event.
Interactive, Content-Based Exploration of Large Visual Archives through Feature Set Fusion
Analyzing and exploring a large collection of images and videos to identify relevant content is a highly challenging problem encountered frequently in ISR, counterterrorism, intelligence, and law enforcement applications. We present a novel framework for visual data search and exploration for social multimedia. Our framework combines computer vision, data fusion, and graph-based interactive visualization applied to content commonly found on the web, such as YouTube videos. From a large archive of visual data, our framework creates clusters of images and videos (visual items) based on their content similarity, factored by semantically distinct feature sets. The user can interactively adjust the association weights of clusters and visual items for each feature set, yielding a dynamic, associative browsing capability with intuitive controls.
Upon data ingest, computer vision algorithms characterize the content of visual items with multiple feature types. Our feature types capture high-level information such as object and scene types, as well as low-level information such as color, space-time gradients, motion strength, and texture patterns. Each feature type yields a set of features that can be weighted independently.
Pairwise similarities between visual items are computed for each feature set, and the overall similarity is a weighted sum of the per-feature-set similarities. In a supervised learning framework, event types can be learned by training classifiers from exemplars within each feature set, then fusing the classifier outputs with a second-level classifier. For unsupervised clustering, the overall similarity is used directly.
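The weighted-sum fusion described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or whether the weights are normalized, so the cosine similarity, the division by the total weight, and the names `cosine`, `fused_similarity`, `video_1`, and `video_2` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_similarity(item_a, item_b, weights):
    """Weighted sum of per-feature-set similarities.

    Dividing by the total weight is an assumption; it keeps the result
    in a comparable range when the user changes the weights.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    weighted = sum(w * cosine(item_a[name], item_b[name])
                   for name, w in weights.items())
    return weighted / total


# Hypothetical visual items: feature-set name -> feature vector.
video_1 = {"color": [1.0, 0.0], "motion": [0.0, 1.0]}
video_2 = {"color": [1.0, 0.0], "motion": [1.0, 0.0]}

# Same colors, unrelated motion: equal weights give a middling score,
# while weighting only color makes the two items look identical.
equal = fused_similarity(video_1, video_2, {"color": 1.0, "motion": 1.0})
color_only = fused_similarity(video_1, video_2, {"color": 1.0, "motion": 0.0})
print(equal, color_only)
```

In this toy case the two items share a color vector but have orthogonal motion vectors, so shifting weight toward color raises their overall similarity.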
Interactively, graph-based visualization enables users to identify and explore clusters of visual items. Each node corresponds to an image or a video, and a weighted edge between two of them represents their similarity. A spring-loading rendering technique is used to visually arrange the nodes such that similar ones are close together, naturally revealing clusters without explicit algorithmic clustering. Users can rapidly identify promising clusters early in the search process, then focus on those clusters while disregarding most of the data.
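To show how a spring-based layout can reveal clusters without an explicit clustering step, here is a minimal sketch of a force-directed arrangement. It is not the rendering technique used in the application: the rest length of `1 - similarity`, the step size, the circular starting positions, and the names `spring_layout` and `distance` are all assumptions chosen to keep the example small.

```python
import math


def spring_layout(similarity, iterations=500, step=0.05):
    """Arrange nodes in 2D so that similar items end up close together.

    Every pair of nodes is joined by a spring whose rest length is
    1 - similarity (an assumption), and positions follow the net force.
    """
    n = len(similarity)
    # Start on a unit circle so the result is deterministic.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                rest = 1.0 - similarity[i][j]
                # Hooke's law: attract when stretched, repel when compressed.
                pull = (dist - rest) / dist
                forces[i][0] += pull * dx
                forces[i][1] += pull * dy
        for i in range(n):
            pos[i][0] += step * forces[i][0]
            pos[i][1] += step * forces[i][1]
    return pos


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical similarities: items 0 and 1 are alike, as are items 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
layout = spring_layout(sim)
print(distance(layout[0], layout[1]), distance(layout[0], layout[2]))
```

With these inputs the two similar pairs settle close together and the pairs sit farther from each other, so the grouping is visible from node positions alone.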
Initially, default or prior weights between feature sets are used to display the graph. The user is given interactive sliders to control these weights. As the user changes a weight value, the graph display immediately changes to reflect the new overall pairwise similarities by moving and rearranging the nodes. This capability enables the user to set the weights according to the content important for their task, such as searching for objects of a particular type embedded in scenes of a particular type. The clusters rearrange accordingly, clarifying which kinds of video similarity each feature set highlights, and the user can easily see which visual items may be of interest. This level of interactivity avoids the challenging problem of automatically predicting optimal feature fusion parameters.
Clicking on a video provides a drill-down into feature values and playback in the browser. The application makes use of modern web prototyping tools including Tangelo, D3.js, and HTML5 video.
Probabilistic Sub-Graph Matching for Video and Text Fusion
FMV analysts typically produce text to call out the entities and activities of interest observed in a video feed. In parallel, automated video analytics algorithms can detect various objects and activities. Associating video with text can provide useful information in both streaming and forensic modes, leading to improved mission reports and more accurate descriptions. For instance, the association of a video track with a textual element can be used to search for other instances of the same object through the archive more effectively than either modality alone.
Our goal is to associate textual descriptions of entities and activities with automatically extracted visual entities (e.g. tracks) and activities (e.g. "person walking"), while also inferring the presence of missing details. Textual data provides a rich source of intelligence for describing activities in the video, but does not specifically localize objects or scene elements such as buildings, and can be incomplete. On the other hand, video analytics provides highly accurate localization of objects and scene elements, but might produce false alarms or fail to detect objects and activities. The accuracy and completeness of both intelligence sources can be improved by fusing them, such that they fill in intelligence gaps and provide accurate localization.
We accomplish this by converting the chat text to an attributed probabilistic graph model that has activities (e.g. walking, stationary) as nodes, relational features as attributes between pairs of nodes (e.g. walking towards, near), and object attributes (e.g. black, building, vehicle) as node attributes. This graph model is then used to efficiently perform sub-graph matching through probabilistic inference in the much larger observation graph derived from video-based activities and objects. This process is complicated by high levels of track fragmentation on moving objects, high levels of background clutter (e.g. other co-occurring activities), and poor spatial-temporal localization of the chat text. Existing techniques for text-to-video event fusion can address the track fragmentation and localization challenges to some extent, but either fail when there are multiple moving objects near each other or are computationally intensive.
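The idea of matching a small query graph against a larger observation graph can be sketched as follows. This is a deliberately simplified illustration, not the inference method of the paper: it scores every injective assignment exhaustively, uses only activity labels and one relation, and treats the probabilities as independent. All names and numbers (`query_nodes`, `obs_nodes`, `track_1`, and so on) are hypothetical.

```python
from itertools import permutations

# Query graph built from a hypothetical chat message such as
# "person walking toward vehicle": nodes carry an activity label,
# edges carry a relation label.
query_nodes = {"q_person": "walking", "q_vehicle": "stationary"}
query_edges = {("q_person", "q_vehicle"): "toward"}

# Observation graph from video analytics: each track carries
# P(activity), each ordered track pair carries P(relation).
obs_nodes = {
    "track_1": {"walking": 0.8, "stationary": 0.2},
    "track_2": {"walking": 0.1, "stationary": 0.9},
    "track_3": {"walking": 0.6, "stationary": 0.4},
}
obs_edges = {
    ("track_1", "track_2"): {"toward": 0.7},
    ("track_3", "track_2"): {"toward": 0.2},
}


def match_score(assignment):
    """Product of node and edge probabilities under one assignment."""
    score = 1.0
    for q, activity in query_nodes.items():
        score *= obs_nodes[assignment[q]].get(activity, 0.0)
    for (qa, qb), relation in query_edges.items():
        pair = (assignment[qa], assignment[qb])
        # A missing observed edge contributes probability zero.
        score *= obs_edges.get(pair, {}).get(relation, 0.0)
    return score


def best_match():
    """Try every injective mapping of query nodes onto tracks."""
    names = list(query_nodes)
    candidates = (dict(zip(names, tracks))
                  for tracks in permutations(obs_nodes, len(names)))
    return max(candidates, key=match_score)


best = best_match()
print(best, match_score(best))
```

Exhaustive enumeration grows factorially with graph size, which is one reason an efficient probabilistic inference procedure matters for a large observation graph.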
To overcome these challenges, we developed a novel fusion algorithm that instantiates complex graphs of interacting objects from simple textual cues such as "load vehicle" or "enter building". Each cue imposes well-defined interactive behaviors between the person and the vehicle or building that are known a priori, so the algorithm can deduce the presence of entities and their associated behaviors (e.g. walking toward vehicle) that would otherwise have been missed. Our probabilistic graph model also incorporates low-level uncertainties associated with the textual data and video activities to produce a completely probabilistic sub-graph matching framework. Qualitative and quantitative analysis will be presented on an unclassified ITAR FMV dataset with promising results.
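One way to picture instantiating a graph from a short textual cue is a lookup of predefined templates, sketched below. The template contents are illustrative assumptions, not the behaviors encoded by the actual algorithm, and the names `TEMPLATES` and `instantiate` are hypothetical.

```python
# Hypothetical templates: each cue expands into activity nodes and
# relations that are assumed to be known a priori for that cue.
TEMPLATES = {
    "load vehicle": {
        "nodes": {"person": "walking", "vehicle": "stationary"},
        "edges": [("person", "vehicle", "toward"),
                  ("person", "vehicle", "near")],
    },
    "enter building": {
        "nodes": {"person": "walking", "building": "stationary"},
        "edges": [("person", "building", "toward")],
    },
}


def instantiate(cue):
    """Expand a textual cue into a small query graph.

    Unknown cues yield an empty graph rather than a guess.
    """
    template = TEMPLATES.get(cue.strip().lower())
    if template is None:
        return {"nodes": {}, "edges": []}
    # Copy so callers can modify the graph without touching the template.
    return {"nodes": dict(template["nodes"]),
            "edges": list(template["edges"])}


graph = instantiate("Load vehicle")
print(graph["nodes"])
print(graph["edges"])
```

The cue mentions only a vehicle, yet the expanded graph also contains a walking person and a "toward" relation, which is the kind of implied detail a matcher could then look for in the video.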
This paper includes ITAR material, which prevents it from being presented in open meetings. It addresses many of the challenges also present in operational data, such as closely spaced co-occurring activities and background clutter activities.
Matt Turek