In this laboratory, we investigate the role of common sense in perception: we want to identify the prospective machinery that efficiently drives our interpretation of the world, and to show how it overcomes the high uncertainty resulting from the severe temporal, spatial, and informational limitations of sensor data as an embodied agent interacts with its environment. Perception in complex scenes (e.g., safety-critical, dynamic ones) not only goes beyond processing sensor data for classical tasks such as object classification, commonly framed as the what- and where-object questions, but also faces the what-, how-, and why-happen questions (e.g., task execution verification, estimation of physical parameters and quantities, and detection of states such as fullness or stability). We therefore generalize the problem of perception by placing events, rather than objects, at the center of scene understanding: perception takes place as a loop that predicts, on the one hand, the effects of events (anticipation) and, on the other hand, their causes (explanation).
Description
NaivPhys4RP (Naive Physics for Robot Perception) is a causal and transparent generative perception model that emulates human perception by capturing key aspects of human common sense, formulated and coined in this work under the term Probabilistic Embodied Scene Grammars (PESG), which invisibly (darkly) drive our interpretation of the world from poor observations. PESG constitutes the foundation of NaivPhys4RP: it enables the model to probabilistically emulate how the physical scene evolves and to adjust these emulations against the few available sensor data (e.g., in the desert we can imagine the entities likely to be found there, and if something moves, it is likely a camel).

Highlighting the high uncertainty of sensor data in complex worlds, NaivPhys4RP first points out the necessity of generalizing scene understanding and centering it around events instead of objects, handling what/how/why-happen queries (e.g., task execution verification, physical parameters, quantities, states such as fullness or stability) rather than just what/where queries (e.g., object detection). NaivPhys4RP then regards perception as a loop. On the one hand, it answers anticipatory queries, where anticipation uses PESG to look ahead and predict the consequences of events (e.g., will the glass spill while pouring? what image results from observing a half-full glass?); this can be seen as generating more information through PESG to augment sensor data. On the other hand, it answers explanatory queries, where explanation looks back and predicts the causes of events (e.g., how full must the glass be for it to be observed as such?); this can be seen as parsing evidence through PESG.

Mathematically, NaivPhys4RP computes a Contextual Partially-Observable Markov Process with Decision (CPOMP-D). In this particular POMP, although the scene state is the central target, the decision making of agents in the scene is also modeled, since the perceiving agent can decide to act mentally without physically realizing the action (e.g., deciding actions from context and imagining them). The figure below illustrates PESG and its superiority over natural language grammars.
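To make the anticipation/explanation loop concrete, here is a minimal, runnable sketch of it as a particle filter over scene states, illustrated on a toy one-dimensional scene (the fill level of a glass being poured into). Everything here (the state representation, `sample_scene`, `simulate_step`, `render`, `likelihood`, and all parameter values) is an illustrative assumption, not the actual NaivPhys4RP API.

```python
import math
import random

def sample_scene(context):
    # Anticipation prior: imagine a plausible initial fill level (0..1) for
    # this context (a stand-in for sampling from a PESG).
    lo, hi = context["plausible_fill_range"]
    return random.uniform(lo, hi)

def simulate_step(fill, pour_rate=0.05):
    # Naive-physics emulation: pouring raises the level until the glass is
    # full (1.0); beyond that, additional fluid would spill.
    return min(1.0, fill + pour_rate)

def render(fill):
    # Imagine the sensor reading this state would produce (identity here;
    # in the real system this would be image synthesis).
    return fill

def likelihood(observed, imagined, noise=0.05):
    # Explanation: how well does the imagined scene cause the observation?
    return math.exp(-0.5 * ((observed - imagined) / noise) ** 2)

def perceive(context, observations, n=500):
    # Anticipation: populate the belief with context-specific imaginations.
    particles = [sample_scene(context) for _ in range(n)]
    for obs in observations:
        particles = [simulate_step(p) for p in particles]          # look ahead
        weights = [likelihood(obs, render(p)) for p in particles]  # look back
        total = sum(weights) or 1.0
        particles = random.choices(particles, [w / total for w in weights], k=n)
    return sum(particles) / n  # posterior mean fill level

belief = perceive({"plausible_fill_range": (0.0, 0.5)}, observations=[0.30, 0.35, 0.40])
print(f"estimated fill level: {belief:.2f}")
```

The design point is that imagination (sampling and simulating scenes) generates candidate causes, while sensor data only reweights them, which is why the loop remains usable when observations are poor or sparse.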
Core Inference Tasks - Anticipation and Explanation through Cognitive Emulation (Imagination)
This video demonstrates the application of NaivPhys4RP to the learning-free and safe recognition and 6D-pose estimation of (transparent) objects.
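One way to read such learning-free pose estimation is as render-and-compare through imagination: hypothesize the object at candidate poses, render what the sensor would see, and keep the pose whose rendering best explains the observation. The sketch below is a toy 2-D version (pose = x, y, theta) with an assumed point-based object model and scoring function; it is an illustration of the principle, not the system's actual pipeline.

```python
import math
import random

# Toy object model: four body corners plus a handle point (metres).
CUP_POINTS = [(-0.04, -0.05), (0.04, -0.05), (0.04, 0.05), (-0.04, 0.05), (0.06, 0.0)]

def render(pose):
    # Imagine the points the sensor would observe for this pose.
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in CUP_POINTS]

def score(observed_pts, imagined_pts):
    # Explanation: average nearest-neighbor distance (lower is better).
    return sum(min(math.dist(o, i) for i in imagined_pts)
               for o in observed_pts) / len(observed_pts)

def estimate_pose(observed_pts, n_hypotheses=2000):
    best, best_err = None, float("inf")
    for _ in range(n_hypotheses):
        # Anticipation: sample plausible poses; a context prior (PESG)
        # would narrow this search space considerably.
        pose = (random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5),
                random.uniform(-math.pi, math.pi))
        err = score(observed_pts, render(pose))
        if err < best_err:
            best, best_err = pose, err
    return best, best_err

true_pose = (0.10, -0.20, 0.6)
observation = render(true_pose)  # pretend these points came from the sensor
pose, err = estimate_pose(observation)
print(f"estimated pose {pose}, residual {err:.4f}")
```

Because the comparison happens in imagined sensor space rather than in a learned feature space, the approach does not depend on appearance cues that transparent objects lack.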
Learning to Imagine in NaivPhys4RP - Amortization (Speeding up and Narrowing Imagination)
Despite the considerable reduction of the world state space achieved through context-specific imagination, some vagueness remains, for instance in the number of objects and their concrete spatial configurations. To amortize this combinatorial explosion, we employ a greedy, direct (unconscious) perception of the scene, neurally trained on imagined datasets, to compress the state space. The optimistic results of the neural learner are then filtered based on the context and the available sensor data (e.g., if a knife is detected, a spoon is also likely, because the context is coffee drinking), as sketched below. We developed RobotVQA for this purpose.
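The following is a minimal sketch of the filtering step under toy assumptions: a fast learned detector (trained offline on imagined scenes) proposes an optimistic set of objects, which a common-sense context prior then prunes. The context table, detector output, and threshold are illustrative stand-ins, not RobotVQA's actual interface.

```python
# Plausibility of each object class under a given activity context
# (a stand-in for what a PESG would provide).
CONTEXT_PRIOR = {
    "coffee_drinking": {"cup": 0.9, "spoon": 0.8, "knife": 0.3, "wrench": 0.01},
}

def filter_detections(detections, context, threshold=0.05):
    """detections: list of (label, confidence) from the neural proposal."""
    prior = CONTEXT_PRIOR[context]
    kept = []
    for label, conf in detections:
        # Combine optimistic neural confidence with the context prior.
        posterior = conf * prior.get(label, 0.0)
        if posterior >= threshold:
            kept.append((label, posterior))
    return sorted(kept, key=lambda x: -x[1])

# Hypothetical output of a detector trained on imagined datasets.
proposals = [("cup", 0.95), ("spoon", 0.40), ("wrench", 0.85)]
print(filter_detections(proposals, "coffee_drinking"))
# The implausible 'wrench' is suppressed despite its high neural confidence.
```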
Safety with NaivPhys4RP - Trace, Verify, (Recover,) Report
RobAuditor is a safety-centric wrapper around NaivPhys4RP and a flexible framework for automated task execution verification and audit trail generation in safety-critical processes. It performs (1) task execution verification, (2) audit trail generation, and (3) failure recovery and prevention, while scaling to (a) the diversity of task execution structures, (b) the diversity of task execution contexts, and (c) the availability of computational resources.
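A minimal sketch of such a wrapper is shown below; the class and callable names (`AuditedTask`, `perceive`, `postcondition`, `recover`) are hypothetical placeholders, not RobAuditor's actual API. Each step is traced, its expected postcondition is verified against the perceived scene, recovery is attempted on failure, and an audit record is accumulated.

```python
import time

class AuditedTask:
    def __init__(self, perceive):
        self.perceive = perceive   # callable: returns current scene beliefs
        self.trail = []            # audit trail of verified steps

    def execute(self, step, postcondition, recover=None):
        step()                                    # (1) trace: run the step
        state = self.perceive()
        ok = postcondition(state)                 # (2) verify its outcome
        if not ok and recover is not None:
            recover()                             # (3) attempt recovery
            state = self.perceive()
            ok = postcondition(state)
        self.trail.append({"step": step.__name__, "time": time.time(),
                           "verified": ok, "state": state})
        return ok

    def report(self):
        # Audit trail generation: one record per executed, verified step.
        return self.trail

# Toy usage with stubbed perception and a single pouring action.
world = {"glass_fill": 0.0}
auditor = AuditedTask(perceive=lambda: dict(world))

def pour():
    world["glass_fill"] = 0.6

auditor.execute(pour, postcondition=lambda s: 0.4 <= s["glass_fill"] <= 0.9)
print(auditor.report())
```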
Perception of Flexibles in NaivPhys4RP - Generative Models of Fluids [Coming soon]
The model allows agents that manipulate fluids (e.g., pouring) to answer difficult perceptual queries, such as how full the container is, whether the container will spill, or whether the container will be missed.
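One plausible way to answer such queries, sketched below under toy assumptions, is Monte Carlo emulation: simulate many noisy pouring rollouts and read off the expected fullness and the probabilities of spilling or missing. The pouring model, noise levels, and geometry here are illustrative stand-ins for a particle-based fluid simulator.

```python
import random

def simulate_pour(initial_fill, pour_volume, aim_error_std=0.02,
                  container_radius=0.04, capacity=1.0):
    # Miss: the stream lands outside the container opening.
    if abs(random.gauss(0.0, aim_error_std)) > container_radius:
        return initial_fill, "missed"
    # Otherwise the level rises by a noisy poured volume; overflow spills.
    fill = initial_fill + random.gauss(pour_volume, 0.05 * pour_volume)
    if fill > capacity:
        return capacity, "spilled"
    return fill, "ok"

def answer_queries(initial_fill, pour_volume, n=10000):
    outcomes = [simulate_pour(initial_fill, pour_volume) for _ in range(n)]
    fills = [f for f, _ in outcomes]
    return {
        "expected_fullness": sum(fills) / n,
        "p_spill": sum(o == "spilled" for _, o in outcomes) / n,
        "p_miss": sum(o == "missed" for _, o in outcomes) / n,
    }

print(answer_queries(initial_fill=0.5, pour_volume=0.45))
```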