Human actions manipulating articulated objects, such as opening and closing a drawer, can be categorized into multiple modalities that we define as interaction modes. Traditional robot learning approaches lack discrete representations of these modes, which are crucial for empirical sampling and grounding. In this paper, we present ActAIM2, which learns a discrete representation of robot manipulation interaction modes in a purely unsupervised fashion, without expert labels or simulator-based privileged information. Trained on data collected through a novel simulator-rollout procedure, ActAIM2 consists of an interaction mode selector and a low-level action predictor. The selector generates discrete representations of potential interaction modes via self-supervision, while the predictor outputs the corresponding action trajectories. We validate our method through its success rate in manipulating articulated objects and its robustness in sampling meaningful actions from the discrete representation. Extensive experiments demonstrate ActAIM2’s effectiveness in enhancing manipulability and generalizability over baselines and ablations.
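To make the two-component design concrete, the following is a minimal PyTorch sketch of a discrete mode selector paired with a mode-conditioned action predictor. All module names, dimensions, and the categorical-codebook formulation are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hypothetical sketch of the selector/predictor split described above.
import torch
import torch.nn as nn


class ModeSelector(nn.Module):
    """Maps an observation embedding to a categorical distribution over
    K discrete interaction modes and returns a learned mode embedding."""

    def __init__(self, obs_dim: int, num_modes: int, mode_dim: int):
        super().__init__()
        self.logits_head = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, num_modes)
        )
        # Learned codebook: one embedding per discrete interaction mode.
        self.codebook = nn.Embedding(num_modes, mode_dim)

    def forward(self, obs_feat: torch.Tensor):
        logits = self.logits_head(obs_feat)
        # Sample a discrete mode, then look up its embedding.
        mode_id = torch.distributions.Categorical(logits=logits).sample()
        return mode_id, self.codebook(mode_id)


class ActionPredictor(nn.Module):
    """Predicts a horizon-step action trajectory conditioned on the
    observation features and the sampled mode embedding."""

    def __init__(self, obs_dim: int, mode_dim: int, action_dim: int, horizon: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + mode_dim, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )

    def forward(self, obs_feat: torch.Tensor, mode_emb: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([obs_feat, mode_emb], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)


if __name__ == "__main__":
    obs = torch.randn(4, 128)  # batch of observation features (assumed shape)
    selector = ModeSelector(obs_dim=128, num_modes=8, mode_dim=32)
    predictor = ActionPredictor(obs_dim=128, mode_dim=32, action_dim=7, horizon=16)
    mode_id, mode_emb = selector(obs)   # sample a discrete interaction mode
    traj = predictor(obs, mode_emb)     # (4, 16, 7) action trajectory
```

Sampling different `mode_id` values for the same observation is what yields distinct behaviors (e.g., opening vs. closing a drawer), which is the role the abstract attributes to the discrete mode representation.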