In my MDes thesis, I took the museum audio guide as a starting point and reimagined it as a context-appropriate, audio-based AR experience driven by voice and gaze interaction. The design gives visitors greater control over the experience and opens up the possibility of presenting multiple narratives about a work of art, rather than just one.
This design centers on the famously complex Arnolfini Portrait by Jan van Eyck, and the design concept is inspired by its multiple interpretations.
After the narrator introduces the painting, the visitor is invited to verbally select one of three possible theories about its meaning: a depiction of the couple’s wedding, a memorial portrait for the wife, or simply an elaborate demonstration of the couple’s wealth. The visitor actively drives the exploration, using their gaze to uncover hotspots on the surface of the painting. A different narrator discusses the painting in each theory, explaining the meaning of symbols that can have multiple interpretations depending on which theory is selected.
This project originated in work that I did as an undergraduate student, in which I imagined an augmented reality experience for the art museum that superimposed virtual touch targets onto the surface of a painting. This application of AR has become a reality in the years since, but there’s a fundamental awkwardness about interacting with a work in the gallery this way — seeing it in the flesh, but interacting with it on a screen-based reproduction.
“Augmented reality”, as a term, tends to describe visual experiences. However, an early definition by Milgram et al. broadly defines AR as “augmenting natural feedback to the operator with simulated cues,” leaving open the idea that this augmentation could involve, or even center on, senses other than vision. The new audio guide at the San Francisco Museum of Modern Art, for instance, could be considered a form of AR that augments the real-time museum experience with audio rather than with visuals.
This led me to the question that drove my thesis exploration: How can visual and non-visual elements of augmented reality be used to make the museum audio guide richer and more interactive, yet still appropriate to its context?
It was important to me that this thesis yield an actual, testable prototype rather than a video concept. For that, I needed a prototyping tool.
As impressive as the HoloLens was, its hefty physical profile would make it unsuitable for actual deployment in a museum in its current form. But it had what I needed for a prototype: image and voice recognition, visual augmentation (holograms), and audio feedback. I developed the HoloLens prototype with Unity and C# in collaboration with my good friend and developer partner, Bryan Oltman.
I wanted to minimize visual augmentation, so I didn’t make much use of the actual hologram functionality of the HoloLens. However, I did use it for very simple wayfinding in the final design, to display the three selectable theories. They appeared as holograms on the right side of the painting, indicating which theory was selected and reminding visitors how to select a different one.
Initially, I proposed an open-ended Q&A model with a conversational UI, in which the visitor could simply ask questions as they would ask a museum docent. I also proposed a line-of-sight gaze cursor, so the visitor could look at something and say, “Who is this?”
Before developing for the HoloLens, I designed low-fidelity prototypes to test — and ultimately challenge — that working hypothesis.
In this test, users wore a head-mounted camera with an attached laser pointer, which simulated the gaze cursor in their line of sight. It was a “Wizard of Oz” test because of the man behind the curtain, so to speak: me, sitting behind the user and acting as the voice of the computer by reading pre-written responses (and pre-written error messages for when I was stumped). I superimposed cue words on the painting as suggestions for what to ask about; these changed depending on how close users came to the screen, showing general terms when they were far away and finer details when they were close.
In parallel, I created a voice-driven chatbot called “Docent” that ran on a Google Home device. I programmed it with intents meant to catch as many likely questions from the user as possible; any utterance that contained “dog”, for instance, would trigger a response about the different possible symbolic meanings of the dog at the woman’s feet.
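Keyword-triggered intents like the “dog” example can be sketched in a few lines. The following is a minimal Python sketch, not the Docent implementation and not any Google Home API: the intent table, keywords, placeholder responses, and the `match_intent` and `respond` helpers are all hypothetical, and matching is assumed to be a simple case-insensitive substring check.

```python
# Hypothetical intent table for a keyword-matching chatbot. Keywords and
# responses are illustrative placeholders, not the actual Docent content.
INTENTS = {
    "dog": {
        "keywords": ("dog", "puppy"),
        "response": "[Response about the symbolic meanings of the dog]",
    },
    "mirror": {
        "keywords": ("mirror", "reflection"),
        "response": "[Response about the mirror on the back wall]",
    },
}

# Placeholder error message for utterances that match no intent.
FALLBACK = "[Fallback message suggesting things to ask about]"


def match_intent(utterance: str) -> str:
    """Return the first intent whose keyword appears anywhere in the utterance.

    Matching is a case-insensitive substring check, so "dog" matches
    "the dog's collar" as well as "dog". Returns "fallback" if nothing matches.
    """
    text = utterance.lower()
    for name, intent in INTENTS.items():  # dicts preserve insertion order
        if any(keyword in text for keyword in intent["keywords"]):
            return name
    return "fallback"


def respond(utterance: str) -> str:
    """Map an utterance to the canned response for its matched intent."""
    name = match_intent(utterance)
    return FALLBACK if name == "fallback" else INTENTS[name]["response"]


if __name__ == "__main__":
    print(respond("Why is there a dog?"))
    print(respond("Who painted this?"))
```

Substring matching is crude by design here: it catches “dog’s” as well as “dog”, but it would also fire on any longer word that happens to contain a keyword, and anything outside the table falls through to the fallback message.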
In addition to generating a lot of observations that shaped the final design, the tests showed that open-ended Q&A was not the right approach.
In the testable HoloLens prototype, I abandoned open-ended questioning in favor of a more guided approach centered on the theories of interpretation. The user could adopt different interpretations by saying “It’s a memorial,” “It’s a wedding,” or “It’s just a portrait.” The gaze cursor lit up when the visitor looked at something that had information attached, and selecting it played a short audio clip about that detail.
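The interaction model described above, a spoken theory selection combined with gaze-selected hotspots, can be sketched as a small state object. This is a minimal Python sketch, not the Unity/C# code used in the prototype: the theory IDs, hotspot names, clip file names, and the `GuideState` class are hypothetical, and it assumes each hotspot’s clip is looked up per theory, so the same detail can be narrated differently under each interpretation.

```python
from typing import Optional

# The three spoken commands from the prototype, mapped to hypothetical theory IDs.
THEORY_COMMANDS = {
    "it's a wedding": "wedding",
    "it's a memorial": "memorial",
    "it's just a portrait": "portrait",
}

# Hypothetical clip table: (hotspot, theory) -> audio clip file name.
CLIPS = {
    ("dog", "wedding"): "dog_wedding.wav",
    ("dog", "memorial"): "dog_memorial.wav",
    ("dog", "portrait"): "dog_portrait.wav",
    ("candle", "memorial"): "candle_memorial.wav",
}


class GuideState:
    """Tracks the active theory and resolves gazed-at hotspots to clips."""

    def __init__(self) -> None:
        self.theory: Optional[str] = None  # no theory until the visitor picks one

    def hear(self, utterance: str) -> bool:
        """Switch theory if the utterance is a theory command; return True if it was."""
        # Normalize case, trailing punctuation, and curly apostrophes.
        key = utterance.strip().lower().rstrip(".!").replace("\u2019", "'")
        theory = THEORY_COMMANDS.get(key)
        if theory is None:
            return False
        self.theory = theory
        return True

    def cursor_lit(self, hotspot: Optional[str]) -> bool:
        """Light the gaze cursor only if the hotspot has a clip for the active theory."""
        return hotspot is not None and (hotspot, self.theory) in CLIPS

    def select(self, hotspot: Optional[str]) -> Optional[str]:
        """Return the clip to play for the gazed-at hotspot, or None if there is none."""
        if not self.cursor_lit(hotspot):
            return None
        return CLIPS[(hotspot, self.theory)]
```

For example, after `hear("It’s a memorial.")`, selecting the `"dog"` hotspot returns `"dog_memorial.wav"`, while unrecognized speech such as “who is this” leaves the active theory unchanged.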
Unfortunately, footage recorded from the HoloLens was of very poor quality and, although I’ve edited it slightly to make it more accurately represent reality, it does not do justice to the experience of using the prototype. However, it is useful to see the choices that were made in contrast with the previous prototypes and the subsequent final design.
Using the HoloLens prototype as the basis, I created a video prototype demonstrating the final design vision.
It was important to choose the right voices, as they constituted the bulk of the experience. I worked to match voice actors with the theories they represented — a smooth, refined female voice for the wedding theory, for instance, and a deeper, older male voice for the memorial theory. In testing, many users commented that the different voices added variation to the experience and helped them to sort different pieces of information when remembering them later. A few even commented that it sounded like the narrators were debating, trying to convince the user of their chosen theory.
I also matched background music to each theory. Given the relative lack of visual cues or traditional navigation, the music further helped visitors identify and distinguish between the different modes of this experience.