Wednesday, June 18, 2025

How AI Could Soon Take Human-Computer Interaction to New Levels

As AI models approach excellence in speech recognition and synthesis, text processing, and multimodality, the ultimate voice user interfaces…

Voice User Interface (VUI) for natural speech-based human-computer interaction as imagined by Dall-E 3 via ChatGPT.

It was a typical Friday afternoon at the end of a long week of work on our project developing a radically new concept and app for molecular graphics in augmented and virtual reality, when I found myself in a heated discussion with a friend and colleague. He is a "hardcore" engineer, web programmer, and designer who has been in the trenches of web development for over a decade. As someone who prides himself on efficiency and control over every line of code, and who always has the user and the user experience in mind, he scoffed at my idea that voice interfaces will soon become the norm…

"Speech interfaces? They’re immature, awkward, and frankly, a little creepy", he said not with these exact words but certainly meaning them, and voicing a sentiment that many in the tech community share. And this was already after having kind of convinced him, maybe by 30–50%, that our augmented / virtual reality tool for molecular graphics and modeling absolutely needs such kind of human-computer interaction because since the users’ hands are busy grabbing and manipulating molecules, there’s no other way for them to control the program, for example to run commands and such.

More broadly, speech-based interfaces (or Voice User Interfaces, VUIs) can be a game-changer in work or entertainment situations where the hands are busy, and they can facilitate accessibility: combined with regular GUIs, they make software inclusive even for visually, hearing- and motion-impaired users. All of this makes the topic important to discuss and evaluate from the viewpoint of technology and UX design, and we must revisit it often given how fast the technology evolves. Moreover, as I will discuss here, I think the technology is reaching a point where it can already be pushed for, contrary to my colleague’s viewpoint, which remains quite negative.
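To make the hands-busy scenario concrete, here is a minimal sketch, assuming a browser-based app and the standard Web Speech API, of how spoken commands could be routed to actions in a molecular viewer while the user’s hands stay on the molecules. The command table and the app functions (rotateMolecule, showHydrogenBonds, resetCamera) are hypothetical placeholders for illustration, not part of our actual tool.

```typescript
// Minimal sketch: routing spoken commands to app actions with the Web Speech API.
// The app functions below are hypothetical stubs, not a real molecular-graphics library.

// The Web Speech API is still vendor-prefixed in Chromium-based browsers.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

type CommandHandler = () => void;

// Simple keyword-to-action table; a real VUI would need far richer matching.
const commands: Record<string, CommandHandler> = {
  "rotate": () => rotateMolecule(),
  "hydrogen bonds": () => showHydrogenBonds(),
  "reset view": () => resetCamera(),
};

function startVoiceControl(): void {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = "en-US";
  recognition.continuous = true;      // keep listening while hands manipulate molecules
  recognition.interimResults = false; // only act on final transcripts

  recognition.onresult = (event: any) => {
    const transcript: string =
      event.results[event.results.length - 1][0].transcript.toLowerCase().trim();
    // Fire the first command whose keyword appears in the utterance.
    for (const [keyword, handler] of Object.entries(commands)) {
      if (transcript.includes(keyword)) {
        handler();
        break;
      }
    }
  };

  recognition.start();
}

// Hypothetical app actions, stubbed so the sketch is self-contained.
function rotateMolecule(): void { console.log("rotating molecule"); }
function showHydrogenBonds(): void { console.log("showing hydrogen bonds"); }
function resetCamera(): void { console.log("resetting camera"); }

startVoiceControl();
```

Even this naive keyword matching already frees the hands; the interesting leap, discussed further below, is replacing the rigid command table with a model that understands free-form speech.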

I do acknowledge, though, that my colleague’s concerns aren’t unfounded. He argues that speech interaction with computers is still plagued by inaccuracies, a frustrating need to repeat oneself, and a general lack of fluidity. And to an extent, I do know he’s right. (But… read on!)

A short but relevant detour: Voice User Interfaces as imagined by Star Trek

While I debated with my colleague about the current limitations of speech-based interfaces / VUIs, I couldn’t help but think of Daley Wilhelm’s articles exploring the future of UX, in particular her insightful piece titled Did Star Trek Predict the Future of UX?


(and by the way, I also recommend her article Designers: you need to read science fiction)

In her article on Star Trek predicting the future of UX, Daley Wilhelm discusses how the VUIs in Star Trek set user expectations for technology, shaping a big chunk of how we interact with our devices today. The seamless, intuitive voice commands that the crew of the Enterprise use to control their ship represent an ideal of what human-computer interaction could be… talking to the computer just like to another human. Star Trek got the iPads, the hand gestures, and even some aspects of multitouch displays right, so… did it also guess the future of VUIs right?

The series takes the same idea even further with Lt. Commander Data, a highly sophisticated android from Star Trek: The Next Generation, and the Emergency Medical Hologram Doctor from Star Trek: Voyager, both capable of sustaining very complex conversations, and even of using speech-based thinking themselves. (Further detour: Are human/artificial language models linked to human/artificial intelligence?)

Back to Daley Wilhelm: her key point is that while Star Trek’s vision of the future was ahead of its time, our real-world technology hasn’t quite caught up, at least not in the way the series imagined. In Star Trek, the crew interacts with the ship’s computer largely through voice commands, whether to access information, control ship functions, or even replicate food and beverages, yet with limitations, as she exemplifies.

This vision of a future where voice interfaces are the primary mode of human/robot/hologram-computer interaction is captivating and, for many like myself, an aspirational goal. And leaving my subjective opinion aside, there are all the advantages I outlined in the opening paragraphs.

In Star Trek, the ability to issue complex, context-rich commands and receive accurate, timely responses seems like a natural extension of technology’s potential. For example, Captain Picard could request a specific flavor of tea, at a specific temperature, and instantly receive exactly what he wanted: no fuss, no misunderstandings. But as Daley Wilhelm points out, modern voice assistants like Siri, Alexa, and Google Assistant struggle to meet these expectations, and by a wide margin. Today’s users often find these systems falling short of the conversational, context-aware interactions that Star Trek made us dream of. On the other hand, Daley Wilhelm presents an example of Star Trek’s computer not really understanding the user: when Geordi La Forge asks the computer for music with a "gentle Latin beat", the computer initially fails to deliver the exact type of music he had in mind, highlighting the challenge of ambiguity in natural language processing. I quote this specific example from her article because I will come back to it later on in the context of modern (real-world, 2024) technology.

But my point is that the limitations discussed by Daley Wilhelm resonate at first glance with many users and developers today, including my colleague. Unlike the seamless interactions depicted in Star Trek, our current VUIs often stumble over complex queries, struggle to understand context, and sometimes return irrelevant or incorrect responses. The reliance on recall, where users need to know exactly what they want to ask or command, contrasts sharply with the more natural recognition-based interaction that users typically expect. Thus, when using modern VUIs we often find ourselves needing to adapt to the technology, learning specific commands or phrasing questions in ways that the system can understand, rather than the technology adapting to us. But my point, developed below, is that current technology has much more to offer and probably isn’t that far behind.
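As a hint of what that could look like, here is a minimal sketch, assuming the official openai Node SDK and an entirely made-up intent schema, of how a modern large language model could turn Geordi’s ambiguous "gentle Latin beat" request into a structured intent, or into a clarifying question when the request is too vague, instead of simply failing like the assistants of a few years ago. The model name and JSON fields are illustrative choices, not a standard.

```typescript
// Minimal sketch (not any real assistant's internals): using an LLM to parse an
// ambiguous voice request into a structured intent. Assumes the official `openai`
// Node SDK; the intent schema and model name are illustrative assumptions.
import OpenAI from "openai";

interface MusicIntent {
  action: "play_music" | "clarify";
  genre?: string;
  tempo?: string;
  clarifying_question?: string; // filled in when the request is too ambiguous
}

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function parseRequest(utterance: string): Promise<MusicIntent> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // any recent chat model would do
    messages: [
      {
        role: "system",
        content:
          "You turn spoken music requests into JSON with fields " +
          "`action` ('play_music' or 'clarify'), `genre`, `tempo`, and " +
          "`clarifying_question`. If the request is ambiguous, ask one short question.",
      },
      { role: "user", content: utterance },
    ],
    response_format: { type: "json_object" }, // ask the model for strict JSON output
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as MusicIntent;
}

// Example: the Geordi La Forge request from the article.
parseRequest("Computer, play some music with a gentle Latin beat").then((intent) => {
  if (intent.action === "clarify") {
    console.log("VUI asks:", intent.clarifying_question);
  } else {
    console.log("VUI plays:", intent.genre, "at tempo", intent.tempo);
  }
});
```

The point of the sketch is not the specific API but the shift it illustrates: the burden of disambiguation moves from the user (recall of exact commands) to the model (recognition of intent, plus the ability to ask back).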

In particular, note that since Daley Wilhelm published her article, the technological landscape has evolved quite rapidly. When her article came out in January 2023, ChatGPT had launched only a couple of months earlier, and OpenAI’s first really large and "smart" language model, GPT-3, had already been available through the API for a while. I had tried it there, before ChatGPT came out, and was astonished at the possibilities it could open up for more fluid and natural VUIs:
