This blog post is based on the interview on XR audio that I gave for XROM, which you can watch here:
Extended reality (XR) is the umbrella term for virtual reality (VR), augmented reality (AR) and mixed reality (MR). Spatial audio plays an entirely different role in each of these fields. Covering the complexity of 3D sound is probably impossible in a single post, which is why I have already written multiple articles on it. Anyway, here is an overview that hopefully helps you!
I will try to explain why it’s so exciting. We already have mono (0 dimensions), stereo (1D) and surround (2D). Now you can also play sound from above and below, as you may already know from cinemas, where loudspeakers are no longer only at the front and at the back. These days, loudspeakers even hang from the ceiling. This is where you have probably heard the term Dolby Atmos.
But what’s the benefit? Finally, we can experience sound the way we are used to. Once you pay attention, you realize that sound is always three-dimensional, no matter where you are and what you do. For example, we know how it sounds when a car drives past us from behind to the front, or when somebody knocks at the door. This feels very natural.
Find out more about the cinematic aspect of sound. Or as some people like to say:
“It’s like being there.”
Now we can have this natural feeling with all the new algorithms and microphone technologies. This is enabling us to feel more immersed than ever before because everything just sounds as we would expect it. So our brain doesn’t have to decode where the sounds are coming from. With spatial 3D audio we eventually forget about technology. And this is where immersion kicks in.
One of my projects was for the European Commission. They have funds that help people worldwide. We went to different refugee camps, e.g. in Kenya and Bangladesh, to film everyday life there in 360°: what people are doing, how food is provided and how education is run.
I really love the perspective of 360° video because you feel like a visitor. You can’t interact, which makes you feel helpless. This is exactly where 360° video makes so much sense: it stirs people’s emotions but stays objective. Besides this, I am still a big fan of 360° videos because, if they are done right, they are a lot of fun and easy for new users.
To get back to the recordings: as you can imagine, it’s quite chaotic in a refugee camp. There are so many people, which means a lot of unwanted background noise. For (spatial) audio, I need silence to make decent recordings. This is why outdoor productions are so challenging. You can hardly control the surroundings, such as the weather or curious people who don’t feel like being quiet while we are recording.
For spatial audio, it’s really crucial to record objects as isolated as possible. To give an example: we were recording a kid. We followed him all day and had to record his voice. As you can guess, the environment around him is loud, which is a big struggle. Since we are doing XR sound, we also need another microphone, for example an Ambisonics microphone. It is a bit like a 360° camera because it has multiple microphone capsules pointing in different directions. This is sort of a backbone for my whole soundtrack because it records the child, but it also records the noisy background.
I like to say: “Don’t make it realistic, make it believable. What would the viewer imagine that the scenery would sound like?”
This multichannel audio provides a diffuse sound field that works as a backbone. But the child wouldn’t be as present as necessary. Since the child is one or two meters away from the microphone, his voice gets a bit too quiet and is hard to understand. However, the kid is the star of the scene, so we need to do some spatial audio magic. The trick is basically to have both microphones running at the same time. I like to combine an isolated microphone signal from the child with a spatial sound microphone that captures everything in parallel. Afterwards I add some sound effects and, when it makes sense, a bit of music, and there you go: a decent 360° soundtrack.
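To make the two-microphone trick a bit more concrete, here is a minimal Python sketch of the mixing step: the isolated voice is panned into the Ambisonics sound field and then summed onto the ambience bed. It assumes first-order Ambisonics in the AmbiX convention (channel order W, Y, Z, X with SN3D normalization), which is only an assumption for illustration and not necessarily the format used on that shoot. The function names `encode_foa` and `mix_with_bed` and all sample values are made up for this example; in a real production a DAW plug-in does this for you.

```python
import math

def encode_foa(mono, azimuth_deg, elevation_deg=0.0):
    """Pan a mono signal into first-order Ambisonics (AmbiX: W, Y, Z, X; SN3D).

    Azimuth is counter-clockwise (positive = left), elevation positive = up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left/right
        math.sin(el),                 # Z: up/down
        math.cos(az) * math.cos(el),  # X: front/back
    )
    return [[g * s for s in mono] for g in gains]

def mix_with_bed(bed, spot_foa, bed_gain=1.0, spot_gain=1.0):
    """Sum the encoded spot microphone onto the Ambisonics bed, channel by channel."""
    return [
        [bed_gain * b + spot_gain * s for b, s in zip(bed_ch, spot_ch)]
        for bed_ch, spot_ch in zip(bed, spot_foa)
    ]

# Hypothetical data: a 3-sample voice signal and a quiet 4-channel ambience bed.
voice = [0.5, -0.5, 0.25]
bed = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.05, 0.05, 0.05]]

# Place the voice straight ahead and turn the ambience down a little.
soundtrack = mix_with_bed(bed, encode_foa(voice, azimuth_deg=0.0), bed_gain=0.7)
```

The point of the sketch is the balance: the bed keeps the scene believable, while the separately encoded voice can be made as loud and as precisely placed as the story needs.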
So from a technical point of view, I would say yes. Spatial audio is already great, and not only in XR. But from a content perspective, I think we can find better use cases than just having music running around your head, like 8D audio does. In the audio engineering scene (AES), people were freaking out because there is so much high-quality content, such as delicate orchestral recordings that people put so much work into. And then somebody just runs the music through a spatializer and uploads it to YouTube.
But what I can say is: I appreciate that millions of people were listening to it. I think this is a good sign that people really want to experience sound three-dimensionally. If there wasn’t ASMR (Autonomous Sensory Meridian Response) or 8D audio content, there wouldn’t be many other spatial audio experiences for people to listen to. There would just be the work of my colleagues and me, which are very niche projects.
For me, this proved that spatial audio and XR have a right to exist. It’s a reason for me to do my job, because there are millions of people who enjoy it. For now it’s just 8D music and ASMR-type content. But we can do better in the future and it will be great. So yes, there will be a new standard for sound that will replace two-channel stereo as we know it.
Well, you can just look up the Virtual Barber Shop on YouTube. Somebody had a great idea years ago, before spatial audio and XR were a thing. There’s a microphone that looks like a human head, called a binaural dummy head. It immediately gives you the 3D audio sound we are used to. I showed it in the XROM video at 36:19.
So somebody got creative and came up with the sound of getting your hair cut. Without spoiling too much: put your headphones on, it’s really fun. This is where people really get excited about spatial audio, when it’s combined with great stories. Not just a showcase of “this is how the technology works”, but technology put into a fun story.
Another thing I always mention when asked for inspiration is Notes on Blindness. It’s like the best thing ever. It was one of the early VR experiences and you can still watch it for free. It still holds up against current productions. It’s the story of somebody losing their eyesight. Of course, this automatically makes you listen more closely to the sound, which is a brilliant idea. It brings together so many things that work for a 3D experience, and it’s just amazing.
There is so much more terrain for spatial audio in XR and beyond. So be sure to check out the overview full of inspiration.
I would recommend getting an overview of what’s already out there. Which XR experiences with spatial audio work well? From there you can start developing your own stories or thinking about your own ideas. Basically, mess around with the tools that are already available. So fail hard, fail early.
There are so many experiences which don’t make any sense and they would have just worked fine with stereo. But that’s totally OK because then you learned something about it. You should just get started. Luckily the barrier is not as high anymore as there are free tools out there. The microphones are not too expensive and you have many manufacturers who are putting out spatial audio microphones as stated above.
My recommendation: get involved in the Audio Engineering Society (AES). At the moment they have free webinars, and sometimes you have to pay a small fee. But in return you get quality, structured content that tells you everything about spatial audio. Associations like the German VDT (Verband Deutscher Tonmeister) or BVFT (German Association for Film Sound Professionals) are doing something similar.
Regarding shameless self-promotion: you can also have a look at my blog and see what I write. I really try to make 3D audio accessible and show cool use cases of how the technology is being used. Maybe you can just stop by and see what interests you. The 3D audio world is bigger than you think. Learn more about the endless possibilities of sound for Extended and Mixed Reality.
The biggest mistake you can make is thinking that producing a movie is just these steps: you have a director, a cameraman and a sound guy, then you do all the visual parts, and after that you add the soundtrack.
This is where you lose so much potential, because you should really start from scratch, even before shooting and recording anything. When developing an XR story with spatial audio, you should be thinking about how you can use sound to help drive the story. So to get started you really need to reconsider sound in general. There are so many workflows you know from mono or stereo that don’t work in spatial audio for XR.
I’m not saying sound is more important than the visuals or other media. But you definitely have more potential if you also pay attention to the sound. In XR you are not just looking in one direction, and with sound you can always hear everything. I can always hear what’s beside or behind me, but I cannot see it. This is why I realized that the sweet spot for me is consulting people on their ideas throughout the whole process, from before the recording to post-production.
I have worked on a lot of VR experiences that really made sense because they didn’t make the mistake of just doing sound afterwards, with the rest of the budget. Instead, we did it the other way round and thought about how the visual part and the sound part can help each other out. This is where you can create good experiences. Both parts have to be of good quality and at least work hand in hand.
To keep it simple: I can use most of the tools that I would be using for stereo sound in a DAW. It was different more than five years ago. There weren’t many tools, e.g. plug-ins. I had to do so much by hand and go into Unity to write my own scripts or plug-ins. This is when Ambisonics became a thing.
YouTube then said: we are supporting Ambisonics. For over five years now, you have been able to upload 360° videos to YouTube with spatial 3D audio. Thinking about it, this is pretty amazing. They already knew back then that it was going to be the future, and there’s still so much happening. I would say most of the tools you need for spatial audio in VR are now available. But it really depends on what you are going for.
Since then, a lot of manufacturers have developed new types of microphones. Mostly Ambisonics, but also ORTF or quad-binaural formats. I listed all possible multichannel 360° microphone arrays in this overview.
Currently, I have a VR360 project where we want the sound to come from loudspeakers. Not just 5.1 or 7.1, but 22.2! At the moment there is only a single plug-in out there that can do that as an object-based workflow. No other tool can do that, which is a shame. You would think: I’m doing the same thing as always. I’m putting 3D audio objects inside a virtual room and then it all gets rendered to loudspeakers.
My point is, as soon as you have some special needs, you may hit a boundary fairly quickly. You always have to rely on a VR headset with its proprietary hardware and software. Also, which video player could you even use with decent quality? Does it support your spatial audio format? There isn’t really a standard at the moment.
Ambisonics still is and will remain a thing, and I’m happy about it. In the future there will be something new trying to replace it. Still, it is always a bit of a pain to use new microphones and different plug-ins, because you immediately run into problems. This is why I always skip the technical part for the client and tell them to leave this problem to me. I will fix it somehow. Let’s not talk too much about technology; I will find a solution. And then they are just happy that I can take them through the whole process of an XR production with spatial audio.
The main difference is that 360° videos, or films in general, are linear. This means you have a fixed start and a fixed end, and the duration is always the same. When it’s spatial instead of stereo, you “just” add more audio channels. Your final piece still has the same duration, no matter if it’s mono, stereo, surround or spatial 3D audio.
But for game audio, you don’t really add such long files with a fixed timing. You add more short files which are called assets. Those assets can be combined with all the interactive possibilities you have inside of your XR experience.
For example, if somebody is opening a door, you can attach a “door open” sound to the door object. If you code it right, the sound gets triggered automatically as soon as somebody opens the door. So you can basically attach sounds to every object in the scene of a game engine. Still, there are a lot of similarities where both worlds complement each other. Find out how game audio influences film.
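As a rough illustration of this asset idea, here is a small Python sketch of an object that carries its own sounds and fires them on events. This is not the Unity, Unreal, Wwise or FMOD API; the class `SceneObject`, its methods and the file name `door_open.wav` are all hypothetical and only show the principle of binding a short asset to an object and an event.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """A scene object that knows which sound asset belongs to which event."""
    name: str
    position: tuple  # (x, y, z) in scene coordinates
    sounds: dict = field(default_factory=dict)  # event name -> asset file

    def attach_sound(self, event, asset):
        """Bind a short audio asset to an interaction event."""
        self.sounds[event] = asset

    def trigger(self, event):
        """Return a playback request for the event, or None if no sound is attached."""
        asset = self.sounds.get(event)
        if asset is None:
            return None
        # The object's position travels with the request so a spatializer
        # can render the sound from the right direction.
        return {"asset": asset, "position": self.position}

door = SceneObject("door", position=(2.0, 0.0, 1.0))
door.attach_sound("open", "door_open.wav")

request = door.trigger("open")  # fired by the interaction, not by a timeline
```

Unlike a linear soundtrack, nothing here has a fixed timing: the sound plays whenever the user decides to open the door, and from wherever the door happens to be.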
This is the key difference between linear and interactive content. When working in linear or cinematic worlds, you work with DAWs (digital audio workstations) such as Reaper, Pro Tools or Nuendo. But if you work interactively, you need to handle a game engine like Unity or Unreal. Or even middleware like Wwise, FMOD or Steam Audio.
Nowadays with immersive media, the screen is not just in front of you, but all around you. This is where suddenly everything changes. I realized that when I started working with spatial sound for XR. I had to rethink: how am I even using sound here? For example, if you add music to a VR experience, where is the music coming from? This is the storytelling aspect I mentioned. You can’t just throw music at your 360° video and say: that’s the soundtrack. You could do that with classical TV or documentary content.
But now with XR audio, you have to reconsider. Does it even make sense to have music? If so, where is it coming from? Is there a radio inside the room or is it music in my head? Or if you have a narrator, a so-called voice-over, who is the person inside my head? Am I that person? Is it somebody else?
With spatial audio, you can help people distinguish who is even part of the XR scene. For example, if you have a 3D audio effect of somebody walking inside your room, you can localize that person, for instance behind you. Now you know there’s really somebody else in this VR environment. With plain stereo sound (so no binaural audio) on headphones, that is not possible, because our brain always thinks it’s happening inside our head. This is called in-head localisation.
Now with 3D audio you can, despite them just wearing headphones, give people the feeling that something is coming from the outside. This is called externalization, and it makes us forget that we are even wearing headphones. Suddenly, sound becomes an intuitive way to guide users through storytelling.
You could use Ambisonics everywhere, but does it make sense? This is the real question you should be asking. I would say there are use cases, like 360° VR, where using 3D audio or Ambisonics is fine. Multichannel formats that have more than the two audio channels of stereo go a long way. More channels give you a better spatial resolution and a more precise localization. So for VR it definitely makes sense to have Ambisonics. For films, I would say it’s nice to have loudspeakers at the top, but it doesn’t change the film the way it changes a 360° video.
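To put a number on “more channels”: a full-sphere Ambisonics signal of order N carries (N + 1)² channels, so the channel count grows quickly as you raise the spatial resolution. The little Python helper below just prints that relationship; the function name `ambisonic_channel_count` is made up for this illustration.

```python
def ambisonic_channel_count(order):
    """Number of channels of a full-sphere Ambisonics signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2

for order in (1, 2, 3):
    print(f"Order {order}: {ambisonic_channel_count(order)} channels")
```

First order needs 4 channels, second order 9 and third order 16, which is one reason to ask whether a given project really benefits from the extra resolution.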
With music, there is so much going on at the moment. Recently Apple said they are now supporting Dolby Atmos Music on Apple Music with their AirPods Pro headphones, which also have some sort of head tracking and spatial audio already enabled. Now people are able to listen to music spatially, and this makes everybody excited. Currently, it is a marketing hype that I have already investigated further and found disappointing.
For my part, I also like stereo. Stereo is not broken! But I’m on my mission to find use cases where it really makes sense to have 3D audio. To give people a new experience and help them to hear sounds like never before.
Yes, I think how we experience sound in general will change in the future. And this is exactly why I do this, because I know it’s going to happen. It’s just a question of when it starts. I think that Apple jumping onto the topic sends a big message, because everybody loves Apple and they are so innovative. So it is already a big step that the huge companies are putting millions of dollars into new spatial audio technology for XR.
What I can already tell you is how we are going to experience sound in the future. Right now we are used to listening to MP3. Everybody knows what MP3 is: basically a technology that compresses WAV files to a tiny size, so that you could put them on your Walkman and send them to your friends or whatever. We will have a similar format in the future, but it will be more interactive and more immersive. This is where object-based audio comes in. It will also be one file, similar to an MP3, but it will give you more ways of interacting with it. It’s called NGA, Next Generation Audio, and here are the details.
For example, imagine you’re watching football (soccer). Currently, you can just raise or lower the volume. But with object-based audio, the whole soundtrack gets split into objects that you can interact with. This means, for example, that you can switch off the commentator and just keep the ambience of the stadium. Isn’t that amazing? Imagine you don’t care what the commentator from the TV station is saying and you want to make your own comments. It is now possible! And you still hear the thrill of the crowd and feel the emotional impact. Here is a short display of what’s possible with the object-based audio format MPEG-H:
So it still keeps you excited, and it is what we are already doing at the moment. When we make films, we can add this metadata for music and for sound effects, so that people can adjust the mix to their taste. As we know, everybody has a different taste and listens to different kinds of music.
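The football example can be sketched in a few lines of Python. This is not the MPEG-H format or any real decoder API, just the underlying idea: the soundtrack stays split into named objects, and the listener's preferences are applied as per-object gains at playback time. The function `render_mix`, the object names and the sample values are all made up for illustration.

```python
def render_mix(objects, user_gains=None):
    """Sum named audio objects into one signal, applying per-object user gains.

    objects:    dict of object name -> list of samples
    user_gains: dict of object name -> gain (objects not listed keep gain 1.0)
    """
    user_gains = user_gains or {}
    length = max((len(samples) for samples in objects.values()), default=0)
    mix = [0.0] * length
    for name, samples in objects.items():
        gain = user_gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    return mix

# Hypothetical broadcast with two objects.
broadcast = {
    "commentary": [0.5, 0.5, 0.5],
    "stadium": [0.1, 0.2, 0.1],
}

default_mix = render_mix(broadcast)                        # what the TV station mixed
ambience_only = render_mix(broadcast, {"commentary": 0.0})  # commentator switched off
```

The key design point is that the decision happens on the listener's device: the same delivered file yields both mixes.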
In the future, the personalized sound experience will be way bigger. We will have just one delivery master file, for example the soundtrack of a film, and it won’t matter whether it’s played back on a soundbar, on headphones or on 10 loudspeakers. The file will automatically be decoded to the best possible format. Currently, binauralisation is still a big issue for Dolby Atmos and similar technologies, for instance.
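Here is a deliberately simplified Python sketch of the “one master, many playback systems” idea. An object is stored once with its direction, and the renderer decides at playback time how to map it to the available layout. It only covers mono and a constant-power stereo pan; real renderers for soundbars, loudspeaker arrays or binaural headphones are far more involved. The names `stereo_gains` and `render_object` and the assumed loudspeaker angle of ±30° are assumptions for this example, not part of any specific format.

```python
import math

def stereo_gains(azimuth_deg, width_deg=30.0):
    """Constant-power pan between two loudspeakers at +/- width_deg (positive = left)."""
    az = max(-width_deg, min(width_deg, azimuth_deg))
    pan = (width_deg - az) / (2.0 * width_deg)  # 0.0 = hard left, 1.0 = hard right
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

def render_object(samples, azimuth_deg, layout):
    """Render one audio object from the master to the requested playback layout."""
    if layout == "mono":
        return [list(samples)]
    if layout == "stereo":
        left, right = stereo_gains(azimuth_deg)
        return [[left * s for s in samples], [right * s for s in samples]]
    raise ValueError(f"unsupported layout: {layout}")

# The same object, stored once, rendered for two different devices.
effect = [1.0, 0.5]
on_phone_speaker = render_object(effect, azimuth_deg=30.0, layout="mono")
on_stereo_system = render_object(effect, azimuth_deg=30.0, layout="stereo")
```

The content itself never changes; only the last rendering step adapts to the device.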
At the moment I still need to create content for the web, for TV and for cinema, and all of them have different resolutions and channel layouts. In the future there will hopefully be just one file that you can play on any device, which will be 3D and personalizable. This could be combined with the personalization mentioned above.
Let’s have all the buzzwords, like artificial intelligence or blockchain, haha. I hate myself for mentioning that because it sounds off-topic. But for sound it’s not, because you can use sound pretty much everywhere. You just have to find use cases! I have no idea how sound can be used with blockchain, but I’m sure it’s the same thing as with 3D audio. We have spatial audio, but where can you even use it? You have to find out for yourself!
As I said, get your hands dirty and try out use cases. This is what I am messing around with right now. For example, I just did a commercial as a sound experience for a live event. At the moment I am experimenting with Zoom a lot, because it supports stereo and so I can share 3D audio examples. So really, this is my mission: to use new audio technology creatively and to unleash its full potential. Most manufacturers throw hardware onto the market without really thinking about what people will do with it.
Since I’m the headphone guy, I like that hearables are currently disrupting the whole hearing aid industry. Finally, personalized sound is becoming more accessible for seniors and people with hearing loss. Also, the Metaverse is getting a lot of hype and could change how we communicate in the future.
Long story short: Spatial audio for XR is really exciting. If you want to know more about it, contact me now! I’m sure there is a way for you to benefit from three-dimensional sound.