Gesturing Toward Abstraction: Multimodal Convention Formation in Collaborative Physical Tasks

¹Princeton University   ²University of California, San Diego   ³Brown University   ⁴MIT   ⁵Stanford University
[Teaser figure: shifts in multimodal signals across repetitions]

Figure 1. Shifts in multimodal instructions (speech and gesture) from the first repetition (R1) to the final repetition (R4). A) Instructions for block position and orientation shift from redundant in R1 to complementary in R4. B) For abstract tower-level instructions, no position or orientation information is provided when the convention is established in R1, but redundancy is introduced in R4 to emphasize position and orientation changes. C) The virtual target tower on the 2×2 grid.

Abstract

A quintessential feature of human intelligence is the ability to create ad hoc conventions over time to achieve shared goals efficiently. We investigate how communication strategies evolve through repeated collaboration as people coordinate on shared procedural abstractions. To this end, we conducted an online unimodal study (n = 98) using natural language to probe abstraction hierarchies. In a follow-up lab study (n = 40), we examined how multimodal communication (speech and gesture) changed during physical collaboration. Pairs collaborated through an augmented reality interface that isolated each partner’s hand and voice: one participant viewed a 3D virtual tower and sent instructions to the other, who built the physical tower. Participants became faster and more accurate by establishing linguistic and gestural abstractions and by using cross-modal redundancy to emphasize key changes from previous interactions. Based on these findings, we extend probabilistic models of convention formation to multimodal settings, capturing shifts in modality preferences. Our findings and model provide building blocks for designing convention-aware intelligent agents situated in the physical world.
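To make the modeling direction concrete, below is a minimal sketch of a multimodal convention-formation speaker, assuming a rational-speech-act-style trade-off between informativity and per-modality production cost. It is not the paper’s actual model: the meaning space, signal inventories, costs, and reinforcement rule are all hypothetical placeholders.

# Minimal sketch (not the paper's model): a speaker that scores utterances by
# log-informativity minus per-modality production cost, with a pseudo-count
# lexicon that sharpens after each communicative success.
import math

MEANINGS = ["place", "rotate"]                      # toy meaning space
SIGNALS = {"speech": ["word_a", "word_b"],          # hypothetical signal inventories
           "gesture": ["point", "twist"]}
COST = {"speech": 0.10, "gesture": 0.15}            # assumed per-modality effort costs

# Pseudo-count lexicon with weak initial signal-meaning biases (illustrative).
lexicon = {("word_a", "rotate"): 3.0, ("word_a", "place"): 1.0,
           ("word_b", "place"): 3.0, ("word_b", "rotate"): 1.0,
           ("twist", "rotate"): 3.0, ("twist", "place"): 1.0,
           ("point", "place"): 3.0, ("point", "rotate"): 1.0}

def listener(signals, meaning):
    """P(meaning | signals): combine modalities multiplicatively, then normalize."""
    def score(m):
        p = 1.0
        for s in signals:
            p *= lexicon[(s, m)] / sum(lexicon[(s, mm)] for mm in MEANINGS)
        return p
    return score(meaning) / sum(score(m) for m in MEANINGS)

def candidate_utterances():
    """Speech alone, gesture alone, or redundant speech + gesture."""
    for w in SIGNALS["speech"]:
        yield {"speech": w}
    for g in SIGNALS["gesture"]:
        yield {"gesture": g}
    for w in SIGNALS["speech"]:
        for g in SIGNALS["gesture"]:
            yield {"speech": w, "gesture": g}

def speaker_choice(meaning):
    """Pick the utterance maximizing log-informativity minus summed modality costs."""
    def utility(utt):
        info = math.log(listener(list(utt.values()), meaning) + 1e-9)
        return info - sum(COST[mod] for mod in utt)
    return max(candidate_utterances(), key=utility)

def reinforce(utt, meaning, lr=1.0):
    """After communicative success, strengthen the used signal-meaning pairs."""
    for s in utt.values():
        lexicon[(s, meaning)] += lr

for rep in range(4):
    utt = speaker_choice("rotate")
    print(f"R{rep + 1}: {utt}")
    reinforce(utt, "rotate")

With these illustrative numbers, the speaker prefers redundant speech-plus-gesture utterances in early repetitions and shifts to a cheaper speech-only utterance once the convention is entrenched, qualitatively echoing the shift away from redundancy across repetitions in Figure 1.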

Viewing the Dataset

The viewing application replays the study data, showing Instructors’ 4D hand movements synchronized with the audio transcript of their verbal instructions, as shown in the Example Data Video below. The viewing app also shows the target tower and the reference 2×2 grid. To view the multimodal study data:

  1. Download the viewing app linked above.
  2. Unzip the archive and open the macOS application.
  3. Set the metadata fields on the top left for the participant, trial, and step you want to view (all valid combinations are enumerated in the sketch after this list):
    • Participant ID: "P<i>", i ∈ {1, …, 21} and i ≠ 18
    • Trial ID: "<t>", t ∈ {1, …, 12}
    • Step ID: "<s>", s ∈ {1, 2, 3, 4}
  4. Click Load, then Play.
  5. Right-click and drag to change the viewing angle.
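For batch review, the snippet below enumerates every valid participant, trial, and step combination; it is purely illustrative, since the viewing app itself takes these values through its UI.

# Enumerate all valid metadata combinations for the viewer (P18 is excluded).
participants = [f"P{i}" for i in range(1, 22) if i != 18]  # P1..P21, skipping P18
trials = range(1, 13)                                      # trial IDs 1..12
steps = range(1, 5)                                        # step IDs 1..4

combos = [(p, t, s) for p in participants for t in trials for s in steps]
print(len(combos))  # 20 participants × 12 trials × 4 steps = 960 viewable steps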

Example Data Video

*The original audio file has been removed for anonymity and replaced, for demonstration purposes, with a re-recording of the participants’ transcribed instructions.

BibTeX

@inproceedings{Maeda2026GesturingTowardAbstraction,
  author = {Maeda, Kiyosu and McCarthy, William P. and Tsai, Ching-Yi and Mu, Jeffrey and Wang, Haoliang and Hawkins, Robert D. and Fan, Judith E. and Abtahi, Parastoo},
  title = {Gesturing Toward Abstraction: Multimodal Convention Formation in Collaborative Physical Tasks},
  booktitle = {Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems},
  series = {CHI '26},
  year = {2026},
  location = {Barcelona, Spain},
  numpages = {15},
  url = {https://doi.org/10.1145/3772318.3790618},
  doi = {10.1145/3772318.3790618},
  publisher = {ACM},
  address = {New York, NY, USA}}