Who hasn't made a video call, held a video conference, or attended an online meeting lately, given the global situation? Whatever you want to call it, over the last year communication via videotelephony has become part of everyday life for many people, for obvious pandemic reasons.
Skype and Co. may be more widespread than ever, but in the technical development of the audio, remarkably little has happened. Of course, there are already some efforts to develop the major services further auditorily. One might already suspect it: the most effective step to improve a video call auditorily is probably the introduction of spatial audio.
The following article will explain the advantages of immersive audio compared to the current mono standard in video conferencing.
Most users probably have more than one videotelephony service in use, so they already have an overview of each provider's strengths and weaknesses: some can show more people, some are clearer, others have that background feature, and so on.
In any case, the differences lie almost exclusively on the visual level. Auditorily, you usually have to make do with sparse audio settings and a mono format.
Zoom has already taken a step and offers a stereo or hi-fi function. This improves speech intelligibility thanks to the higher audio quality and makes it possible to stream music in stereo. The two-channel function is also very handy for streaming binaural audio samples, for example at corresponding events like this one.
So Zoom is already taking a step in the right direction. What added value spatial audio brings to video calls will be clarified in the following.
Only one month after this article was published, Apple presented an update at WWDC. With iOS 15, FaceTime calls can use 3D spatial audio: each voice is heard from the direction in which that person appears on the screen. In addition, noise is filtered out even better, which is particularly important for spatial sound, since each voice signal should be as isolated as possible.
I’m happy to read that spatial audio will be rolling out to iOS devices. Since there are usually quite a few people in a call (or let’s say: more than two), it will become easier to distinguish who is talking. All wired and wireless headphones will support the 3D audio feature; Bluetooth has a technical limitation, but the effect will still be applied.
Update 2022/04: In their blog post they also state that people are using stereo input for live music. They are aware that stereo is not as easy to integrate into spatial audio as a mono source is. Their workaround is to create two mono sources out of the left and right channels. This preserves the richness of the stereo width while gaining spatial depth.
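The workaround can be sketched in a few lines of Python (the function name and the ±30° spread are my own illustration, not High Fidelity's actual API):

```python
import numpy as np

# Toy sketch of the workaround described above: split a stereo
# stream into two independent mono sources and place them a little
# apart in the virtual scene. Names and the spread angle are
# illustrative only.

def split_stereo_for_spatialization(stereo, spread_deg=30.0):
    """stereo: array of shape (n_samples, 2).
    Returns two (mono_signal, azimuth_deg) pairs, ready to be fed
    into a spatial renderer as separate point sources."""
    left, right = stereo[:, 0], stereo[:, 1]
    return [(left, -spread_deg), (right, +spread_deg)]

# Example: a tiny fake stereo buffer (4 samples, 2 channels)
buf = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
sources = split_stereo_for_spatialization(buf)
```

Each channel keeps its own signal, but because the two sources now sit at different azimuths, the renderer adds the spatial depth on top of the preserved stereo width.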
Audio quality aside, what differentiates a direct conversation from a video call? On an auditory level, mainly the directional impression! In a video conference with multiple people, today's mono sound quickly becomes chaotic: as soon as more than one person is speaking, it gets confusing. Our brain has a hard time distinguishing the voices because they all come from the same direction.
From an audio engineer’s point of view, mono playback leads to so-called in-head localization: if you wear headphones, external voices seem to sound inside your head. That almost sounds like a case for a psychiatrist. The solution, however, is not a visit to the doctor – it can be solved technically via so-called externalization. More about this later.
If you distribute each person's voice in the 3D audio space, as we are used to in reality, it suddenly becomes easier to tell them apart – spatial sound makes it possible. In this way, a conversation situation can be spatially recreated, just as we know it from meetings, discussions, or sitting together in a cozy atmosphere. Strictly speaking, stereo is even sufficient for this, as this video shows:
The example by High Fidelity founder Philip Rosedale makes audible where this could go. However, the reverberation we experience in reality is missing here, and it is crucial for externalization. We only get a left-right localization, and the voices appear very close to the listener's ears – too close to be natural.
But there are more things that would elicit the potential of the technology. So let’s take a deeper dive!
For accurate sound localization, small head movements help. When we want to locate something more precisely with our ears in everyday life, we usually move our heads unconsciously. By changing the angle to the sound source – and with it the time and level differences between the ears – we can pinpoint sound sources even more precisely.
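How much those small angle changes matter can be estimated with the classic Woodworth approximation of the interaural time difference (a textbook formula, not tied to any particular product):

```python
import math

# Back-of-the-envelope sketch of the interaural time difference
# (ITD) via the Woodworth approximation: ITD = r/c * (sin θ + θ),
# with head radius r ≈ 0.0875 m and speed of sound c ≈ 343 m/s.

def itd_seconds(azimuth_deg, head_radius=0.0875, c=343.0):
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (math.sin(theta) + theta)

# A source only 5° off-center already yields a delay of tens of
# microseconds, which is why even tiny head movements sharpen
# localization; at 90° the ITD peaks at roughly 0.65 ms.
print(f"{itd_seconds(5):.6f} s")
print(f"{itd_seconds(90):.6f} s")
```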
Head tracking is needed to incorporate these movements into the video call. That sounds like a lot of technical effort, but this technology is also on the rise and already available today. There are even several ways to measure head movements:
Apple already has head-tracking-capable headphones on the market with the AirPods Pro and the AirPods Max, as well as Samsung with the Galaxy Buds Pro – to name two well-known representatives. Furthermore, there is the possibility to use external head trackers that are attached to the headphones. At this point, I would like to refer to this article for more information.
Now, not everyone owns such audio devices. The third option is head tracking via webcam, which is especially exciting for desktop applications. This is roughly what the face tracking looks like that is needed to adapt the 3D sound field to our head movements in real time:
The ability to detect head movements via webcam is a clear win-win situation for spatialized video calls. Because the communication requires a camera, no additional hardware purchases are necessary. And the necessary technology for audio spatialization can even be integrated into the browser. In this respect, the colleagues from atmoky are at the forefront and the right contact persons.
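To give an idea of the geometry involved, here is a toy sketch in Python: it assumes some face tracker already delivers 2D pixel coordinates for the eyes and the nose tip, and derives a rough head roll and yaw from them (all names and the yaw heuristic are purely illustrative, not any tracker's real API):

```python
import math

# Toy geometry sketch, no real tracker attached: given 2D pixel
# coordinates for both eyes and the nose tip, estimate head roll
# and a crude yaw. A spatial renderer would use these angles to
# counter-rotate the virtual sound field.

def head_roll_deg(left_eye, right_eye):
    """Tilt of the head: angle of the line connecting the eyes."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def head_yaw_deg(left_eye, right_eye, nose, max_yaw=45.0):
    """Crude yaw: horizontal nose offset from the eye midpoint,
    normalized by the eye distance (0 when facing the camera)."""
    mid_x = (left_eye[0] + right_eye[0]) / 2
    eye_dist = right_eye[0] - left_eye[0]
    offset = (nose[0] - mid_x) / eye_dist
    return max(-max_yaw, min(max_yaw, offset * 2 * max_yaw))

# Level eyes, nose centered → facing the camera head-on
print(head_roll_deg((100, 200), (180, 200)))                 # 0.0
print(head_yaw_deg((100, 200), (180, 200), (140, 230)))      # 0.0
```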
Thinking a bit further, the video aspect could even be neglected once the voices are spatialized – at least when it comes to telling the participants apart. A spatialized telephone conversation would also convey a completely different sense of the other person's presence, i.e. it would have an effect on a psychological level as well.
There are now several providers of such bird's-eye-view meeting places, e.g. Gather Town, SpatialChat, or the already mentioned High Fidelity. How this sounds can be heard here:
People who attend several video conferences per day probably know the feeling. You feel drained after the video call, even though you may not have been that active. This phenomenon is called Zoom Fatigue, which is a certain tiredness after web meetings. This doesn’t necessarily have anything to do with the content of the meeting, but with the audio not being thought through!
As mentioned before, mono is the default in video calls, and this demands more processing power from our brain. Why? Because all voices reach our ears without directional cues, the brain is busy differentiating and assigning them. In a real conversation this happens without any extra effort – because we can localize the sound sources in our environment.
This is confirmed, for example, by this scientific paper: spatial audio can simulate a realistic directional scene. Integrating this technology into video calls would therefore save our brains unnecessary work. Accordingly, meetings would become more pleasant and ultimately more efficient by increasing productivity.
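A minimal sketch of what such a directional scene means technically: seat each participant at a distinct azimuth and render each voice with constant-power stereo panning (function names are illustrative, not from any real conferencing API):

```python
import math

# Give each participant a distinct direction and pan the voice
# with constant-power stereo gains, so perceived loudness stays
# even across the arc. Purely illustrative, not a product API.

def pan_gains(azimuth_deg, max_deg=90.0):
    """Constant-power (left, right) gains for azimuths in ±90°."""
    pos = max(-1.0, min(1.0, azimuth_deg / max_deg))  # -1 left, +1 right
    angle = (pos + 1) * math.pi / 4                   # 0 .. pi/2
    return math.cos(angle), math.sin(angle)

def seat_participants(n, spread_deg=120.0):
    """Spread n voices evenly across the frontal arc."""
    if n == 1:
        return [0.0]
    step = spread_deg / (n - 1)
    return [-spread_deg / 2 + i * step for i in range(n)]

for az in seat_participants(3):
    left, right = pan_gains(az)
    print(f"{az:+6.1f}°  L={left:.2f}  R={right:.2f}")
```

Because cos² + sin² = 1, the total power per voice stays constant no matter where a participant is seated – only the direction changes, which is exactly the cue our brain uses to keep the talkers apart.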
This approach to true-to-life communication in virtual meetings is also being pursued by the atmoky team. Their demo shows how spatial audio can be used in web meetings. This creates a natural acoustic scene and increases speech intelligibility, so the potential of the cocktail party effect and so-called spatial unmasking can be fully exploited.
What you don’t need is an expensive, fancy 3D microphone – although there’s a nice overview here 😉. A normal mono microphone on the headset, or the one built into the laptop anyway, is enough. The spatialization itself is done in software: metadata is added to the audio stream, and the program calculates in real time how the virtual sound should be rendered.
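The metadata idea can be illustrated with a small sketch: a mono audio packet is bundled with a tiny position payload, and the receiver renders the 3D sound locally (field names and the packet layout are invented for this example, not any real protocol):

```python
import json
from dataclasses import dataclass, asdict

# Illustrative only: instead of sending pre-rendered 3D audio,
# each mono packet carries a small position payload, and the
# receiving client renders the spatial sound itself.

@dataclass
class SourcePosition:
    azimuth_deg: float    # left/right angle of the talker
    elevation_deg: float  # up/down angle
    distance_m: float     # used for level and reverb scaling

def tag_packet(audio_bytes, pos):
    """Bundle a mono audio packet with its position metadata:
    2-byte header length, JSON header, then the raw audio."""
    header = json.dumps(asdict(pos)).encode()
    return len(header).to_bytes(2, "big") + header + audio_bytes

packet = tag_packet(b"\x00\x01\x02", SourcePosition(-30.0, 0.0, 1.5))
```

The position payload is tiny compared to the audio itself, which is why this approach adds spatiality at almost no bandwidth cost.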
A remaining problem can be audible artifacts that arise from data reduction during transmission. In the long run, this can be solved as well: NVIDIA, for example, has found a way to display the video without transmitting it at all (see video below)! Artificial intelligence makes it possible.
Likewise, NVIDIA has developed outrageously good noise-cancellation software with its RTX Voice application. This allows voices to be transmitted much more clearly, without background noise. Of course, this is also an advantage for immersive audio: instead of a virtual sound source carrying laptop-fan hum and keyboard clatter, you hear beautifully clean speech, as we know it from reality.
Actually, it is surprising how little attention has been paid to audio in video calls. Even independently of Corona, it would have been time to take the next step here. But given the rapidly increasing hours spent on videotelephony "thanks" to the pandemic, an auditory evolution is long overdue.
So you can see that action is needed here. It is time to integrate three-dimensional sound into our everyday web, video, and audio-only meetings, conferences, and whatnot. This would have a positive impact on productivity, and at the end of the day we would probably have more energy and motivation left over. It also immensely improves the sense of presence of other people – to briefly mention the keyword embodiment.
That’s why I wanted to draw attention to this everyday topic, which really comes into its own with the right approach. Together with my colleagues from atmoky, I can offer solutions for exactly this kind of web meeting, virtual interaction, video call, and Co. So implement the mentioned features now – we will help you with the implementation. Contact