AI Voice Agents Beyond Transcripts: Rethinking Enterprise AI Listening with ELM

Rajashree Goswami, April 9, 2026 | 9 min read

AI and Tech Leadership This interview series is grounded in lived experience. It explores how technology leaders move AI from experimentation into day-to-day operations, where decisions carry real consequences for teams, customers, and the business. Through conversations with practitioners who have led transformations at scale, the series examines how AI reshapes execution, accountability, and outcomes.

For years, enterprise AI has been built on a simple assumption that if machines can read language, they can understand people.

That assumption is now being challenged.

As AI moves into real-world environments, the limitations of text-first systems are becoming increasingly clear. Human communication is not just words. It is tone, pauses, emotion, and intent. These signals are often lost when a voice is reduced to a transcript.

Mike Pappas, CEO and Co-Founder of Modulate, shares how the Company is rethinking AI systems to move beyond transcription and toward true listening.

A new AI architecture: Ensemble listening models

Modulate announced a fundamentally new AI architecture called the Ensemble Listening Model, or ELM. This represents a departure from LLM-based approaches that flatten voice into text, thereby missing critical context such as emotion, timing, and intent.

ELMs coordinate up to 100 specialized models in real time to analyze conversations as multidimensional signals. This is not just a product enhancement but a shift in how AI systems are designed for enterprise use.

Velma 2.0: Proven at enterprise scale

At the center of this architecture is Velma 2.0, Modulate’s voice-understanding ELM, which is already in production.

It powers hundreds of millions of conversations across Fortune 500 companies and major platforms such as Call of Duty and Grand Theft Auto Online. Unlike traditional systems, Velma 2.0 interprets not just words but also emotion, prosody, timbre, and behavioral context.

The result is higher accuracy, significantly lower cost, and more transparent outputs, addressing common enterprise AI challenges such as hallucinations and lack of explainability.

Industry leaders such as Jensen Huang have pointed toward coordinated multi-model systems as the next major shift in AI, and ELM demonstrates what that looks like in practice.

In conversation with Mike Pappas

To start, I’d love to learn more about your journey. How did it all begin for you, and how did Modulate come into the picture?

Mike Pappas:
Sure. So Modulate is a voice AI company that traces its roots back to my time as an undergraduate. I was a physics student at MIT walking down a hall and saw someone working on an interesting-looking physics problem at a whiteboard. So I mustered up all of my social courage and stood awkwardly behind him for a few minutes until I found the missing minus sign to solve the equation.

Subscribe to our bi-weekly newsletter

Get the latest trends, insights, and strategies delivered straight to your inbox.

That turned out to be Carter, who is now my co-founder and CTO. So we’ve known each other for a long time. And like many college kids, we had this idea that we should start a company and go change the world. But we took it a little further than most and actually prototyped a lot of different startup ideas through our time in college to try and understand what it means to do market research. What does it mean to actually deliver a product that people can use?

Didn’t find anything we were truly excited about in college, but we knew we wanted to do something about how people interact online. So we went off and got jobs in the real world, trying to hone our skills.

Cut to a few years later, and Carter reaches out to me. He’s been playing with AI and has developed some of the first models that Modulate used for voice processing. And so he’s reaching out to me and saying, hey, I think there’s something to understanding voice with AI. Maybe there’s an opportunity there. It resonated, we jumped in, and from there the company has just sort of grown.

Mike Pappas, CEO and Co-Founder at Modulate

When you look at how AI systems handle voice today, most of it still revolves around transcription. At what point did you realize that this approach was fundamentally missing something?

Mike Pappas:
Yeah, so when people think about AI listening to what we say, they immediately assume, well, we’ll just transcribe the audio. And that is the necessary way to use LLMs. LLMs are text-based models, so you have to transcribe the audio and then feed the text through the LLMs.

The gap that we saw is that that’s not how human beings work. You know, transcripts are only one part of this conversation that we’re having. The emotion, the pauses, even the timbre of my voice tell you really meaningful things. And especially in an enterprise application or a social safety application, the devil is really in the details, and you’re not going to be able to accurately understand the interaction if you don’t understand those hidden nuances of voice.

So that was sort of our excitement is saying we don’t need to rebuild the brains, so to speak, of these LLMs. We need to give them better ears so that they can actually understand human beings the way we like to communicate with each other.

There’s also a cost and scalability angle here. At what point did that start influencing how you were thinking about building these systems?

Mike Pappas:
Suddenly, there are very few companies in the world that can afford that annual invoice. So that created an extra level of pressure for us to really think about how to shed all of the extra stuff you don’t need and just build an AI system that is laser-focused on the intelligence you need it to have.

And that again led to this ELM design.

And when we talk about industries like gaming, what were some of the hardest technical challenges in analyzing real-time conversations?

Mike Pappas:
I think the biggest challenge in gaming is how thin the line can be between acceptable conversations and unacceptable ones.

Gamers love to trash-talk. They love to rag on each other. It is part of the experience, and you want them to have a good time.

When we first started, most moderation systems were just saying things like the F word is bad, never say it. But among adult friends, that word is fairly common, especially in a gaming context, and it can be used positively or negatively. You need to understand that context.

Then you get into more subtle situations where people may throw pretty harsh insults at their friends, but there is an understanding between them. The same line said to a stranger can cause real harm.

So you have to understand not just what is being said or even the emotion behind it, but how it is being received, what the relationship is between participants, and what the broader context is.

We often say gaming is hard mode because all these signals are so close to each other. Because we started there, we were forced to solve the hardest version of the problem. That gave us a much richer dataset. So now, when we move into industries like FinTech, we are solving that last 10 percent problem, which is often where the real risk lies.

So how do you actually differentiate between sarcasm, banter, and genuine abuse in voice interactions?

Mike Pappas:
It is ultimately an AI system. We have trained on a very large amount of conversational data, over half a billion hours across enterprise, social, and gaming use cases.

That allows us to make these systems very strong. Each component model is highly specialized. Our transcription model is extremely strong at noisy speech. Also, our deepfake detection model is highly effective at catching adversarial synthetic voices. Our emotion and interruption models are also highly refined.

At a certain point, it comes down to exposure. Humans can detect sarcasm because we have heard it enough times. AI can do the same if it is actually listening to the audio and has the data to learn from.

Do you see this as a shift beyond LLMs entirely, or do you think ELMs and LLMs will coexist?

Mike Pappas:
I think it is the latter. ELMs will allow us to deploy AI into environments where LLMs were not the right solution, whether that is because they are black boxes, too costly, or too difficult to train.

LLMs are still incredibly powerful, and they will continue to be widely used. But many enterprises have realized that they are not always the right tool for the problems they face. So I think we will continue to see LLMs used heavily, especially in the consumer space, while new architectures like ELM become more prominent in enterprise use cases.

Looking ahead five years from now, what are the trends or use cases you are most excited about when it comes to ELMs?

Mike Pappas:
It is a great question. I think what I am most excited about is the range of applications ELMs will be used for. They are designed to handle multimodal data. Audio is one example, where there is both content and delivery, such as tone and prosody.

Another example is self-driving cars, where you have the sensor fusion problem and need to combine multiple data streams. ELMs are well suited for that. There are also challenges with LLMs around long memory and context. ELMs can approach that differently. For example, in long meetings, they can model different participants separately and build a richer understanding.

So I am really excited to see how this concept expands beyond voice. That is where we started, but it is not the only use case. There is a lot more opportunity for people to build on this approach.

To explore more expert perspectives on AI voice agents, enterprise AI architecture, and emerging technologies shaping real-world applications, explore here!

About the Speaker Mike Pappas is the CEO and co-founder of Modulate, which builds voice native conversational intelligence for content moderation, fraud prevention, customer experience, and more. He leads Modulate’s commercial strategy, advises enterprises on social and financial safety, and contributes to industry initiatives on compliance and online safety, including serving on the board of the Family Online Safety Institute in Washington DC. Outside of work, he enjoys long distance running, creating themed cocktail lists, and playing Nintendo games.

In Conversation Q&A

Miller on Designing an Effective Cyber Defense Architecture

AI in the Industry, In Conversation Q&A

AI Leadership: How Vladimer Botsvadze Sees the Enterprise AI Transformation

Rajashree Goswami

Rajashree Goswami is a professional writer with extensive experience in the B2B SaaS industry. Over the years, she has honed her expertise in technical writing and research, blending precision with insightful analysis. With over a decade of hands-on experience, she brings knowledge of the SaaS ecosystem, including cloud infrastructure, cybersecurity, AI and ML integrations, and enterprise software. Her work is often enriched by in-depth interviews with technology leaders and subject matter experts.

Subscribe to the CTO Magazine Newsletter

AI Voice Agents Beyond Transcripts: Rethinking Enterprise AI Listening with ELM

A new AI architecture: Ensemble listening models

Velma 2.0: Proven at enterprise scale

In conversation with Mike Pappas

To start, I’d love to learn more about your journey. How did it all begin for you, and how did Modulate come into the picture?

Subscribe to our bi-weekly newsletter

When you look at how AI systems handle voice today, most of it still revolves around transcription. At what point did you realize that this approach was fundamentally missing something?

There’s also a cost and scalability angle here. At what point did that start influencing how you were thinking about building these systems?

And when we talk about industries like gaming, what were some of the hardest technical challenges in analyzing real-time conversations?

So how do you actually differentiate between sarcasm, banter, and genuine abuse in voice interactions?

Do you see this as a shift beyond LLMs entirely, or do you think ELMs and LLMs will coexist?

Looking ahead five years from now, what are the trends or use cases you are most excited about when it comes to ELMs?

Related

Rajashree Goswami

Related posts

Preparing People for AI: Rethinking Workforce Strategy in the Age of Intelligent Systems

AI Leadership: How Vladimer Botsvadze Sees the Enterprise AI Transformation

Miller on Designing an Effective Cyber Defense Architecture

Engineering Discipline for AI-Driven Software Development: Insights from Divyesh Patel

The Invisible Engine: How Operational Healthcare AI Is Redefining Hospital Efficiency

AI Roles in Enterprise, Localization and the Truth About Hype: Insights from Kirit Goyal

AI Strategy to Drive Impact by Chief AI Officer Burhan Sebin

Cloud Strategy for AI: Insights from EY’s Global Vice Chair

CTO Nick Kraft on Leading at Scale in the Age of AI

AI-Driven Healthcare: Turning Insights into Action with Nasim Afsar

CTO Jennifer Wimberly on Operationalizing AI in Aviation

Artificial Intelligence Governance in the Age of Exponential Technology with Dr. Andrea Bonime-Blanc

AI and Healthcare: Adrian Jennings, Chief Product Officer at Cognosos, on Scaling RTLS in the Post-AI Era

Operational Resilience is Not a Dashboard: Mahesh Paolini-Subramanya on DORA and Bank Architecture

Is AI-native Architecture Shifting the Focus? Denis Romanovskiy, Chief AI Officer Weighs In

Cybersecurity Leadership 2026 in the Age of AI Impersonation

What AI Readiness Framework Actually Demands: Sean Blanchfield Explains

Scaling Agentic AI: When AI Takes Action, the Real Challenge Begins

How Gabe Kopley and Felicia Curcuru are making AI work for people

Why Data Governance Frameworks Are No Longer Optional: Ananya Sundar on What’s Changed

What AI Leadership Gets Right, and What It Often Gets Wrong: Lessons from Deependra Chokkasamudra

CTO Sven Oehme on Building Scalable AI Infrastructure

AI Governance and Growth in Fintech — Lessons from FIS’s CTO

AI Deregulation Is Reshaping How AI Is Built: Cris Kinross Explains Why

Leadership in AI Era: Olson on Building SaaS Teams That Last

AI in Hiring: Insights from Edge CEO Iffi Wahla

Building Enterprise-Grade AI: Insights from PwC’s AI Factory Leader

Dr. Oliver Bastert on Digital Twin Technology Beyond the AI Hype

Cyber Resilience in Healthcare: Chao Cheng Shorland on Designing Systems That Never Stop

Trust, Transparency and AI in Healthcare: A Strategic Dialogue with Wolters Kluwer’s CTO

How CIO Peer Networks and AI Are Reshaping Enterprise Tech Strategy

Deloitte’s Managing Director Anjali Shaikh on the Quiet Power Shift in the C-Suite

Dr. Darren Williams on Shadow AI: The Hidden Threat Leaders Can No Longer Ignore

The Rise of Predictive Analytics: A Conversation with Dr. Zohar Bronfman

Building Safer Spaces: How AI Is Changing Threat Detection with Steve Roskowski

Vijay Guntur on How Human-AI Collaboration Can Drive Enterprise Success in 2026 and Beyond

Building Smart, AI-Powered Supply Chain Management with Beth Hendriks

Sam Allen on How AI Will Reshape Every Marketing Touchpoint

Winning the Holiday Season with AI: Majaz on Smarter, Real-time Inventory Management

Gregory Shein on the Rise of AI Models: DeepSeek and OpenAI

Intelligent, Space-Native Robots: Jamie Palmer on Redefining Automation Beyond Earth

Building a Responsible Digital Future: Nguyen Anh Tuan on AI Ethics, Innovation, and Global Leadership

Beyond the Sales Funnel: Curtis Sparrer on How AI Platforms Decide Which Brands Win

Building the Future of Payments: CTO Phillip Goericke on AI, Embedded Finance, and Frictionless Payments

Balancing Burnout and Breaches: Sterling Wilson on The New Face of Cybersecurity Leadership

AI in Cybersecurity Through the Lens of a CTO, Rajan Koo

Security Considerations for a Connected, Highly Volatile World: Expert Insights from Joel Thayer

Chris Gibson on the Skills of Leaders Who Turn Adversity into Advantage

From the CIA to the Boardroom: Andrew Bustamante on Leadership Style That Endures

The Human Factor in Cybersecurity: Dewayne Hart on What CTOs Overlook

AI-Powered ETFs: Exploring the Next Evolution in Investing with Jack Fu, CEO of Draco Evolution

AI and Sustainability: Insights from Futurism Speaker Birju Shah

AI and E-Commerce: Indranil Guha on What’s Next for Marketing Innovation

Edge AI and Everyday Trust: A Conversation with Dr. Rajesh Natarajan, CTO of Gorilla Technology

Geospatial Intelligence, Reinvented: Spexi CTO Peter Szymczak on AI’s Next Frontier

AI in Marketing Has a Trust Problem — Karla Wentworth Explains Why

Consumer Tech and Vertical SaaS: Inside Carle Stenmark’s Investment Playbook

From Strategy to Impact: Chris Whitaker’s Vision for Leadership and Innovation at Comcast

Cybersecurity Leadership with Jadee Hanson, CISO at Vanta

The Unbossing Revolution with Shane Elahi, Co-Founder and COO of Software Finder

Blending Search and Chat: Vikas Jha on the Next Era of Conversational Commerce

AI in Retail Ecommerce, with Benoit Jacquemont, CTO and co-founder of Akeneo

Innovative Leader Spotlight: Chris Soukup’s Vision for Communication Equity

Humanoid Robots: CTO Jarad Cannon Reveals the Big Breakthroughs