The Workflow Challenge: Ambisonics or Stereo?
Spatial audio enhances the sense of immersion in VR. Despite its importance, designing spatial audio for VR is not easy: there is no unified workflow covering recording, postproduction and final delivery to the target platform. Sound quality also varies greatly between audio engines, as I will discuss later.
Spatial audio is used primarily in games and cinematic VR. Audio in cinematic VR puts a strong emphasis on timed playback and synchronisation with video, and depends little on user interaction beyond head tracking during rendering. VR games, on the other hand, demand more interaction and are less time-based. As our project is a semi-cinematic VR app made with a game engine, I preferred to leave room for interactivity that could be easily controlled in Unity, rather than mixing down a pre-baked 360 soundscape in a digital audio workstation. DAWs currently offer very little support for the two major first-order ambisonic formats: the traditional FuMa B-format and the popular AmbiX format (ACN channel ordering with SN3D normalisation).
Enda had proposed four types of audio for the app earlier: voiceover, music, UI tones and general sound effects. He took care of the narration and produced the music and UI tones. I focused on recording and processing ambience, and on implementing spatial audio in Unity with third-party sound engines. The narration, music and UI tones throughout the tour are not meant to be spatialised. The other audio in the app responds to head movement, and playback depends on user interaction. Enda had asked me to prepare an ambience background for each static locale. I discovered later that even sounds as subtle as ambience can have HRTF rendering enabled and respond to head rotation, if this is planned properly from the beginning.
The workflow for designing spatial audio starts with recording. With the ZOOM H2n we can record sound in a 4-channel AmbiX format, or capture regular stereo audio with perfect mono compatibility. A major problem for spatial audio arises in postproduction. Currently only Pro Tools, Reaper and Nuendo have plugin support for ambisonics, and mainstream media players such as QuickTime Player and iTunes cannot play ambisonic files at all. By contrast, stereo files have no such support problems in playback or postproduction. After mixing in a DAW, the audio is imported into Unity and mapped to head-rotation data in the background for real-time rendering of directional sounds. Unity has native support for ambisonics, but the sound quality is far from ideal compared with third-party audio engines. The Google VR SDK enhances audio spatialisation with the GvrAudioSoundfield API, which is aimed specifically at ambisonics. The GvrAudioSoundfield API was mentioned in a Google I/O talk in May 2016, but the feature was not released until 4 August in the v0.9.0 SDK, with the official documentation updated the next day. Had the GvrAudioSoundfield API been available when I made the spatial audio demo in mid-July, I would probably have recorded the audio in the AmbiX B-format, processed it in Reaper and configured it in Unity with the GvrAudioSoundfield API. Given the time constraints, I did not have time to swap the ambience over to the GvrAudioSoundfield API in Unity. Instead, I made a YouTube demo to show that we are able to implement ambisonics in Unity with the Google VR SDK.
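The difference between the two first-order formats mentioned above is small but easy to get wrong: FuMa orders the channels W, X, Y, Z and attenuates W by 3 dB, while AmbiX uses ACN ordering (W, Y, Z, X) with SN3D normalisation. As a minimal illustration (not code from our project), converting one first-order frame from FuMa to AmbiX only requires reordering the channels and undoing the W attenuation:

```csharp
// Sketch: convert one frame of first-order FuMa B-format
// (channel order W,X,Y,Z; W attenuated by 1/sqrt(2)) into
// AmbiX (ACN order W,Y,Z,X with SN3D normalisation).
// For first order, the X/Y/Z scaling is identical in both formats.
static float[] FuMaToAmbiX(float[] fuma)
{
    float w = fuma[0] * (float)System.Math.Sqrt(2.0); // undo FuMa's -3 dB on W
    return new float[] { w, fuma[2], fuma[3], fuma[1] }; // W, Y, Z, X
}
```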
In light of the compatibility and support problems for ambisonics, I chose an easy workaround for audio spatialisation. I recorded the audio in stereo and processed it in Audacity, then imported the clips into Unity. These audio clips were sprinkled through the scenes as invisible game objects to give a sense of spatialisation when users interact with them. As the GvrAudioSource API supports stereo and mono audio and spatialises better than Unity's native audio engine, it was used for the directional sound effects in our app.
Recording and Editing Ambience
The first two weeks of August were primarily spent recording and editing ambience. I recorded the ambience with the ZOOM H2n recorder, which has five built-in microphones: two arranged as an XY pair and the other three configured in an MS pattern. The ambience was recorded in 4-channel mode, with the signals from the XY mics and the MS mics written to two separate stereo tracks, giving two synchronised stereo files for each location in the tour. I also upgraded the recorder to Firmware v2.0. After the upgrade, the ZOOM H2n in the lab can produce first-order ambisonic surround sound and write the 4-channel audio onto a single track in AmbiX B-format. For lack of time I did not re-do the ambience recordings in AmbiX B-format, but settled for the traditional stereo recordings I had made in early August. With the Google VR SDK for Unity, the rendering of these audio clips can adapt to head movement in real time.
The ambience recording and editing took more time than expected, spanning 19 July to 13 August. I first recorded the ambience during the daytime, with plenty of human activity and conversation; the panorama in the July demo was also packed with people. However, the formal 360 photos and videos were shot in the early morning when the campus was quiet and empty, and it sounded very strange to have human activity in the ambience for those serene morning scenes. So I did the field recordings again in the early mornings of 4, 5 and 6 August. As the environment was very quiet, I turned up the gain, for fear of missing the bird tweets and rustling leaves in the background. That turned out to be another mistake: raising the gain on the recorder also introduced noise that was difficult to remove in postproduction. Jill suggested not turning up the gain during recording, but only bringing the level up in postproduction when necessary. So I went out for another early-morning field recording on 13 August with the gain set at 0 or 1. This batch turned out to be very quiet, with RMS levels around -70 dB, so I normalised the tracks to somewhere between -28 dB and -36 dB before putting them into Unity. The August recordings formed the bulk of the ambience ready for use; the July recordings were mostly discarded. In the postproduction stage, Enda also gave me the very useful reminder to remove the low-frequency rumble in the ambience with a high-pass filter.
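For readers unfamiliar with the arithmetic behind normalisation: raising a track from one RMS level to another corresponds to multiplying every sample by 10^(ΔdB/20). A minimal sketch (illustrative only, not part of the project's code):

```csharp
// Linear gain needed to move a signal from one dB level to another.
// Bringing a -70 dB RMS recording up to a -32 dB target, for example,
// requires a gain of 10^((-32 - (-70)) / 20) = 10^1.9, roughly 79x --
// which is why noise recorded at high gain becomes so audible.
static float LinearGain(float currentDb, float targetDb)
{
    return (float)System.Math.Pow(10.0, (targetDb - currentDb) / 20.0);
}
```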
Sample recording 1: Regent House ambience with conversation
Sample recording 2: Regent House ambience without conversation
Sample recording 3: Front Square ambience with conversation
Sample recording 4: Front Square ambience without conversation
Building Spatial Audio and Collaboration in Unity
I had proposed to Enda in July that we use Git for version control and GitHub for remote collaboration to keep everything in sync. Enda thought Git was not very helpful for Unity. Git is great at handling text-based files like scripts, but can have problems with scenes and prefabs, as these are normally stored as a mix of binary and text in Unity. Although we had the option of storing the scene and prefab assets as text-based files, they were too important for us to risk breaking. It would also be much easier for Enda to integrate my work in Unity if it was packed into prefabs and exported as a Unity package, so we eventually settled on the prefab way of collaborating. A minor problem with this approach was that Unity did not support updating nested prefabs: a nested prefab would appear missing and had to be adjusted every time a new set of prefabs was imported into the project. When Enda finished a mature prototype in Unity with photospheres, videospheres, navigation markers and multimedia placeholders in early August, I started to build on his master copy of the project. The following is a snapshot of the structure of game objects in a scene of our project.
1. Photosphere (with Narration attached)
   1) Marker
      a) Navigation (NAV)
      b) Points of Interest (POI)
   2) Audio
      a) Ambience (AMB)
      b) Directional Sound Effects (SFX)
2. Videosphere
3. UI Sounds
4. Music
I took care of everything grouped under Audio in the above hierarchy. My work in Unity concerned the placement and control of ambience sounds and directional sound effects. I used the GvrAudioSource API to trigger ambience and directional sounds with the event trigger component, while Enda used Unity's AudioSource API to control narration, music and UI sounds, as these did not need to be spatialised. The manipulation of audio in Unity was therefore clearly separated by class and functionality, and my scripts would not break Enda's work.
The ambience audio was dropped at a location close to the main camera, and playback was triggered when the camera entered the active photosphere. At first I used the GvrAudioSource for the ambience sounds, and offset it slightly from the camera to heighten the feeling of spatialisation. If the location of the ambience audio overlapped with the camera to which the GvrAudioListener was attached, the spatial illusion no longer worked; the GvrAudioSource only creates a sense of spatialisation if the audio is played somewhere away from the GvrAudioListener. The Google VR SDK for Unity also tended to attenuate the audio level and muffle the ambience, which was very subtle already. After I swapped in better-quality ambience tracks, I realised the degradation of sound quality in Unity might not come from the audio alone, but could be caused by the rendering algorithm in the Google VR SDK for Unity. Considering the excessive spatialisation and the SDK's tendency to muffle the original tracks, we preferred Unity's native audio engine for the ambience, for a few good reasons. First, the output from the native audio engine seemed much louder than playback through the GvrAudioSource. Second, the output from the native engine can be routed to a mixer in Unity and brought to a desired level in a batch. As the GvrAudioSource sidesteps Unity's mixer workflow, the mixer approach gave us the further advantage of adjusting the levels of narration, UI sounds, music and ambience all in one place.
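The mixer routing described above can be sketched in a few lines. This is an illustrative component, not the project's actual script; the mixer asset and the "Ambience" group name are assumptions:

```csharp
using UnityEngine;
using UnityEngine.Audio;

// Sketch: route an ambience AudioSource into an "Ambience" group of an
// AudioMixer, so its level can be balanced against narration, UI sounds
// and music in one place. Group name and mixer asset are assumptions.
public class AmbienceRouting : MonoBehaviour
{
    public AudioMixer mixer; // assigned in the inspector

    void Start()
    {
        var source = GetComponent<AudioSource>();
        source.outputAudioMixerGroup = mixer.FindMatchingGroups("Ambience")[0];
        source.loop = true;   // ambience beds play continuously
        source.Play();
    }
}
```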
The directional sounds were sprinkled around the scenes, and their playback was triggered by gaze-over interaction. Although the GvrAudioSource was a poor fit for the ambience in our project, it suited the directional sound effects spread out through the virtual world. Generally there was an empty object in the scene representing the audio source, and an event trigger that decided when playback should start. The event trigger could be attached to the object holding the audio source or kept separate from it. An advantage of the latter approach was that we could have the user look in one direction while triggering a sound from another. Also, after the crosshair exited the colliding zone, the trigger object could be set inactive, its collider disabled, or its event trigger component disabled, which guaranteed the audio would be played only once. In the July spatial audio demo, the reticle grew immediately after hitting an audio object. In the later build, I disabled the reticle when it entered the colliding zone and enabled it again on exit; this did not affect the interaction between the GvrReticle and the UI markers elsewhere in the scene. As the crosshair stayed unchanged after colliding with audio objects, users had to explore the soundscape without the aid of visual cues.
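The play-once pattern above can be sketched as a small component. This is a hedged illustration of the idea, not our project's code; the component name and wiring are assumptions, and a GvrAudioSource could stand in for the AudioSource:

```csharp
using UnityEngine;
using UnityEngine.EventSystems;

// Sketch: a gaze trigger that plays a directional sound once. The
// audio source may sit on a different object, so the user can look at
// one spot while the sound comes from another direction. Disabling the
// collider afterwards guarantees the clip fires only once.
public class PlayOnceTrigger : MonoBehaviour, IPointerEnterHandler
{
    public AudioSource directionalSound; // possibly on another object

    public void OnPointerEnter(PointerEventData eventData)
    {
        directionalSound.Play();
        GetComponent<Collider>().enabled = false; // never triggers again
    }
}
```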
Another issue was that I had to be careful with UI canvases and other game objects that might block the raycast from the camera to the audio triggers in the scene. After Enda changed the projection meshes from spheres to octahedrons, many audio triggers lay outside the projected mesh, and their transform coordinates had to be tweaked again. The audio design for each location also changed as more multimedia content was brought into Unity. For instance, if Enda put a movie clip behind the user, I would place an audio trigger at a location that could not be missed, such as the area near the forward navigation marker, and use directional sound to draw attention to the position of the movie clip.
GoogleVR, the FB360 Rendering SDK (Two Big Ears' 3Dception) and Compatibility Issues in Unity
In the last week of August, I started to consider implementing the ambience surround sound with the newly released GvrAudioSoundfield API or another third-party plugin for Unity. 3Dception had been highly praised before its company, Two Big Ears, was recently acquired by Facebook. I had the false impression from the official website that Two Big Ears had ceased support for the Unity plugin. It turned out that the rendering engine had been incorporated into the FB360 Rendering SDK, which is free to download. I tried both the FB360 rendering engine (henceforth the TwoBigEars rendering engine for brevity) and the GvrAudioSoundfield API.
Here is a brief comparison between the GvrAudioSoundfield in the Google VR SDK for Unity and the TwoBigEars rendering engine in FB360, in terms of sound quality, supported audio formats and ease of use.
The GvrAudioSoundfield, as part of the Google VR SDK for Unity, tends to attenuate sound levels, and the developer has to turn up the gain in the inspector to get louder surround sound; still, its average levels were higher than those of the GvrAudioSource. The TwoBigEars rendering engine tends to raise the sound to the highest possible level and lets the developer scale it down in the range between 0 and 1.
The GvrAudioSoundfield supports mono, stereo and AmbiX B-format audio, which packs 4 channels into one track. One can simply drag and drop two audio files in the Unity inspector, and the Google VR SDK will render them with head-tracking data.
The TwoBigEars rendering SDK requires encoding into its custom *.tbe format, B-format (FuMa), B-format (AmbiX) or a quad-binaural format before the audio can be decoded and rendered in Unity. It supports traditional mono and stereo audio indirectly, through a rebranded spatialiser plugin for popular DAWs such as Reaper and Pro Tools. In Reaper, sound designers can use the FB360 plugin to spatialise mono or stereo tracks and render the master 3D track to an 8-channel audio file, which is then run through the FB360 encoder before being brought into Unity. I tried the custom *.tbe encoding and decoded the audio in Unity. The spatialisation was slightly better than the GvrAudioSoundfield's, but more software is involved, and the TwoBigEars audio engine in the FB360 Rendering SDK exposes only a few playback parameters in the Unity inspector. The TBE rendering engine does not even provide a loop option in the inspector, so I had to write a simple C# script to loop an audio asset. The TBE APIs can be queried from C#, but the number of exposed APIs is limited. In short, the GvrAudioSoundfield has the easier-to-use interface in Unity, while TwoBigEars has the better audio rendering quality.
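A looping workaround of the kind mentioned above could look roughly like the following. The `TBSpatDecoder` component name comes from the FB360 SDK, but the `IsPlaying`/`Play` members below are hypothetical stand-ins for its actual API, which I am paraphrasing from memory; treat this as a sketch of the approach rather than working code:

```csharp
using System.Collections;
using UnityEngine;

// Sketch: poll the FB360 decoder and restart playback whenever the
// clip has finished, emulating the loop toggle the inspector lacks.
// `TBSpatDecoder` is the SDK's decoder component; the member names
// used here are hypothetical.
public class TbeLoop : MonoBehaviour
{
    public TBSpatDecoder decoder; // FB360 Rendering SDK decoder

    IEnumerator Start()
    {
        while (true)
        {
            if (!decoder.IsPlaying())       // hypothetical query
                decoder.Play();             // hypothetical restart
            yield return new WaitForSeconds(0.5f); // poll twice a second
        }
    }
}
```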
Last but not least, the Google VR SDK cannot work together with the FB360 rendering engine in the same project, although both work well with Unity's native audio engine. Considering that all the directional sounds in Unity used the Google VR SDK, migrating them to the TwoBigEars rendering engine would have been a large amount of work. Given the time limits, Enda and I decided to leave the ambience rendering to Unity's native audio engine, while sticking with the GvrAudioSource for the directional sound effects.
Sample ambience soundtrack playback in Reaper
Soundtrack encoded with FB360 Encoder and decoded with FB360 Rendering SDK (TwoBigEars’ TBSpatDecoder) in Unity
Soundtrack playback with Audio Source in Unity
Soundtrack playback with Google VR SDK’s GvrAudioSoundfield
Soundtrack playback with Google VR SDK’s GvrAudioSource