Beyond text: Why video AI is the future of enterprise intelligence

Large language models (LLMs) have captivated the world with their text-based capabilities. But while businesses rush to adopt these tools, they’re overlooking the largest and fastest-growing part of their data footprint: video. From sports broadcasts to manufacturing lines, video content is exploding, and almost none of it is usable or searchable by AI in its raw form.

For Jae Lee, co-founder and CEO of TwelveLabs, video is the next frontier of enterprise AI. “Language models learn from a text-projected world. What we want is video AI that actually learns from the world itself,” he says.

In this episode of MindMakers, delight.ai CEO John Kim sits down with Jae Lee to discuss why video AI is essential in a future powered by AI agents, and how enterprises can turn their unstructured footage into actionable intelligence.

Why text-only AI isn’t enough

Language models are powerful, but limited. They don’t experience the world, so they learn and understand it through text—reading descriptions as a proxy for reality. By contrast, humans learn through sight, sound, and time, perceiving the world years before we can speak.

Lee points out that if we want AI to truly understand and operate in human-like ways, these systems must be grounded in that same kind of perceptual understanding.

“No one doubts we live in a visual-first world. An intelligent system should be built to understand that world, not just read about it,” he says.

In Lee’s view, video is the ideal substrate for machine intelligence. It’s truly multimodal, combining visuals, audio, space, and time. It also reflects not just “what” is in a scene, but “how” the scene changes and how the entities within it interact and evolve.

By processing video effectively, AI can gain a context-rich understanding that more closely resembles human intelligence. Meanwhile, the world is generating more video than ever, more than any existing data center could reasonably process.

“We realized it wasn’t just us trying to make sense of insane amounts of video. This was a global problem waiting to be solved.”

— Jae Lee, Co-Founder and CEO of TwelveLabs

Building the full-stack video intelligence layer

Most of today’s AI infrastructure is built around text generation, not video understanding. The hardware and the optimizations are all tuned to feed tokens into an autoregressive model and get tokens back out. That works well for chatbots and AI concierges, but it’s not ideal for processing petabytes of video.

The challenge is that video is dense but also redundant: not much changes from frame to frame. Treating each frame as a “token” so a model can predict the next one works for short clips, but it breaks down when a system needs a deep understanding of lengthier content and massive archives. Instead of asking “How do we generate the next frame?”, Lee asked, “How do we turn this video into a compact representation of meaning?”

The solution is multimodal embeddings: numerical representations that encode what’s happening visually and auditorily over time. TwelveLabs built its own video-centric infrastructure, with custom preprocessing and indexing for different video types (a fast-cut sports broadcast is more demanding than a podcast), and tightly integrated that infrastructure with its models. This allows video to be ingested, indexed, searched, and described quickly, accurately, and cost-effectively.
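
To make the approach concrete, here is a minimal sketch of the general pattern, not TwelveLabs’ actual pipeline or API: split a video into time-stamped segments, compute one multimodal embedding per segment (the `embed_segment` function below is a stand-in for a trained model), and keep the unit-normalized vectors in an index so the archive can be searched by meaning rather than by filename or tag.

```python
import numpy as np

EMBED_DIM = 512  # dimensionality of the (hypothetical) shared embedding space


def embed_segment(frames: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Stand-in for a multimodal video embedding model.

    A real model would fuse visual and audio features over the segment's
    duration; here we derive a deterministic pseudo-embedding so the
    sketch runs end to end.
    """
    seed = int(abs(frames.sum()) + abs(audio.sum())) % (2**32)
    vector = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return vector / np.linalg.norm(vector)


def index_video(segments):
    """Build an in-memory index of (start_time, unit-norm embedding) pairs."""
    return [(start, embed_segment(frames, audio)) for start, frames, audio in segments]


def search(index, query_vector, top_k=3):
    """Rank indexed segments by cosine similarity to a query embedding."""
    query_vector = query_vector / np.linalg.norm(query_vector)
    scored = [(float(vec @ query_vector), start) for start, vec in index]
    return sorted(scored, reverse=True)[:top_k]


# Toy usage: two fake 10-second segments (random frames and audio).
rng = np.random.default_rng(0)
segments = [
    (0.0, rng.random((240, 64, 64, 3)), rng.random(160_000)),
    (10.0, rng.random((240, 64, 64, 3)), rng.random(160_000)),
]
index = index_video(segments)
print(search(index, embed_segment(*segments[0][1:])))  # the first segment should rank highest
```

In production the vectors would come from a trained video model and live in a vector database rather than a Python list, but the shape of the workflow (segment, embed, index, query) stays the same.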

“We wanted end-to-end control so large-scale semantic video search would be both fast and affordable,” says Lee.

The result is a full-stack video intelligence layer designed for real-world enterprise workloads. It transforms static video archives into something that both humans and AI agents can navigate semantically.

Use cases for AI video intelligence

Before TwelveLabs, most video “search” relied on manual notes, tags, and hours of scrubbing through footage. Even sophisticated organizations were operating with workflows closer to analog than AI.

TwelveLabs changes this by enabling semantic search and summarization across entire archives and metadata. Teams can look for ideas, actions, emotions, or specific moments using text, images, or reference clips.
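
As a hedged illustration of what that query interface might look like (the encoder names and index layout below are assumptions for the sketch, not a specific product API), a text prompt is mapped into the same embedding space as the video segments and ranked with a cosine-similarity search; an image or a reference clip would simply go through a different encoder into the same space.

```python
import hashlib

import numpy as np

EMBED_DIM = 512  # must match the dimensionality used when the archive was indexed


def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for a text encoder trained to share the video embedding space."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2**32)
    vector = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return vector / np.linalg.norm(vector)


def rank_segments(index, query, top_k=3):
    """index holds (start_time, unit-norm embedding) pairs; rank by cosine similarity."""
    query = query / np.linalg.norm(query)
    return sorted(((float(vec @ query), start) for start, vec in index), reverse=True)[:top_k]


# Toy archive of two segments; in practice this comes from the indexing step.
toy_index = [(0.0, np.eye(EMBED_DIM)[0]), (12.5, np.eye(EMBED_DIM)[1])]
print(rank_segments(toy_index, embed_text("goal celebration in the rain")))
print(rank_segments(toy_index, embed_text("referee shows a red card")))
```

The same ranking call works whether the query embedding came from text, an image, or a reference clip, which is what lets a single index serve all three query types described above.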

Across industries, AI-driven semantic video search can save serious time and money:

  • Sports broadcasting: Teams can now automatically generate highlight reels from their own footage.

  • Media and entertainment: Editors can assemble rough cuts in minutes, rather than hours.

  • Government and intelligence: Analysts can piece together multi-camera events with greater clarity than was previously possible.

Across these and other sectors, video AI can transform footage from a passive asset into a searchable, usable knowledge surface with real business value.

Misconceptions about video AI

When customers first approach Lee, they often assume a single model will handle everything. But he insists that different modalities and domains require different semantic layers. For example, the embeddings used for PDFs differ from those optimized for video.

In his experience, the best video systems are built around a stack of models and tools stitched together into cohesive workflows, rather than a single monolithic model.

“It’s almost never this model or that model. It’s usually this one plus that one, working together in a stack that fits your data and your use cases.”

— Jae Lee, Co-Founder and CEO of TwelveLabs

Ethical considerations

Beyond delivering results for customers, Lee must contend with various ethical concerns around video intelligence, especially surveillance. How do you prevent misuse? What do you do when ethically gray customers come knocking? How do you balance national security interests with civil liberties?

Lee says TwelveLabs has already declined to work with certain customers and content categories they weren’t comfortable enabling. Sometimes, the nature of the content itself is something they don’t want their team exposed to. Other times, the intended use case is misaligned with the kind of world they want to help build. Passing in these situations isn’t easy, but he considers it part of the responsibility of building the foundational infrastructure of tomorrow’s world. 

Takeaways on video AI

Many leaders still see AI as an efficiency tool, something that speeds up existing workflows. But as Lee points out, the current moment is about far more than marginal improvements. It’s about a structural shift in how organizations store knowledge, make decisions, and create value.

“This tech doesn’t just make you more efficient. It changes what’s possible,” says Lee.

In his view, leaders who benefit most from this shift will focus on three things:

  • Have a strong POV on AI direction: You don’t need to be a model architect to lead today, but you can’t outsource your understanding. The difference between seeing AI as a chatbot and seeing it as a new computing substrate determines whether you play defense or redefine your category.

  • Prepare data (especially video) for agentic AI systems: Video is the largest, richest, and least-used part of the enterprise data footprint, but it remains invisible to AI agents unless it is indexed and semantically searchable. Structuring this unstructured data will unlock entirely new workflows that text-only systems cannot handle.

  • Accept incumbency as fragile during architectural shifts: Just as Intel lost ground when computing moved beyond its core paradigm, today’s market leaders can fall behind if they treat AI as a bolt-on tool instead of a transformative one. New competitors with AI-native infrastructure can quickly leapfrog the laggards.

In the end, Lee and TwelveLabs consider video to be the best substrate for machine intelligence going forward, as it enables intelligent systems to see the world, better understand it, and perform accordingly.

Want to learn more about video AI with Jae Lee of TwelveLabs? Listen to the full episode of MindMakers on Spotify or Apple Podcasts.

Curious how delight.ai supports video AI for customer experience? Just contact sales.