Unified framework for recognition, localization and mapping using wearable cameras

Cogn Process. 2012 Aug:13 Suppl 1:S351-4. doi: 10.1007/s10339-012-0496-2.

Abstract

Monocular approaches to simultaneous localization and mapping (SLAM) have recently addressed with success the challenging problem of quickly computing dense reconstructions from a single, moving camera. While these approaches initially relied on the detection of a reduced set of interest points to estimate the camera pose and the map, they are now able to reconstruct dense maps from a handheld camera while the camera coordinates are simultaneously computed. However, these maps of three-dimensional points usually remain meaningless, that is, they contain no memorable items and provide no way of encoding spatial relationships between objects and paths. In humans and in mobile robotics, landmarks play a key role in the internalization of a spatial representation of an environment. They are memorable cues that can serve to define a region of space or the location of other objects. In a topological representation of space, landmarks can be identified and located according to their structural, perceptual or semantic significance and distinctiveness. On the other hand, landmarks may be difficult to locate in a metric representation of space. Restricted to the domain of visual landmarks, this work describes an approach in which the map resulting from point-based, monocular SLAM is annotated with the semantic information provided by a set of distinguished landmarks. Both kinds of features are obtained from the image; hence, they can be linked by associating with each landmark all the point-based features that are superimposed on that landmark in a given image (key-frame). Visual landmarks are obtained by means of an object-based, bottom-up attention mechanism, which extracts a set of proto-objects from the image. These proto-objects cannot always be associated with natural objects, but they typically constitute significant parts of scene objects and can be appropriately annotated with semantic information. Moreover, they are affine covariant regions, that is, invariant to affine transformations, so they can be detected under different viewing conditions (viewpoint angle, rotation, scale, etc.). Monocular SLAM is solved using the parallel tracking and mapping (PTAM) framework of Klein and Murray (Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality, 2007).
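
The landmark-to-map-point association described above can be illustrated with a minimal sketch. It assumes hypothetical data structures for SLAM map points (each carrying its projection into the current key-frame) and for proto-object landmarks represented by a bounding box; these names and the bounding-box representation are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MapPoint:
    pid: int            # map point identifier
    xyz: tuple          # 3-D position in the SLAM map (assumed available)
    uv: tuple           # projection (pixel coordinates) in the current key-frame

@dataclass
class ProtoObject:
    label: str                              # semantic annotation of the landmark
    bbox: tuple                             # (u_min, v_min, u_max, v_max) region in the key-frame
    point_ids: set = field(default_factory=set)

def annotate_landmarks(map_points, proto_objects):
    """Attach to each proto-object landmark every map point whose
    key-frame projection falls inside the landmark's image region."""
    for po in proto_objects:
        u0, v0, u1, v1 = po.bbox
        for mp in map_points:
            u, v = mp.uv
            if u0 <= u <= u1 and v0 <= v <= v1:
                po.point_ids.add(mp.pid)
    return proto_objects
```

In this sketch a simple point-in-rectangle test stands in for the test of whether a point feature is superimposed on the landmark region; an actual affine covariant region would call for a point-in-region test against the detected region's shape.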

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Humans
  • Imaging, Three-Dimensional
  • Pattern Recognition, Visual / physiology*
  • Photic Stimulation
  • Photography / methods*
  • Recognition, Psychology / physiology*
  • Signal Detection, Psychological*
  • Video Recording