Why multimodal data is the next enterprise AI advantage


In a world that increasingly runs on product photos, YouTube clips, and Zoom recordings, business systems built for structured data are already falling short. As a result, enterprises are drowning in multimodal content that their tools can neither understand nor leverage.

For Cody Coleman, CEO and Co-Founder of Coactive AI, this disconnect is one of the biggest blockers to AI transformation. “There is a massive ocean of unstructured content that current business tools weren’t built to handle,” he explains. 

Without multimodal AI—models or systems that can understand visual, audio, and video content—organizations are overlooking an entire class of business intelligence and leaving money on the table.

In this episode of MindMakers, delight.AI CEO John Kim sits down with Cody Coleman to discuss how organizations can move beyond structured data, unlock value from multimodal content, and set themselves up for success in the AI-driven future.

The multimodal shift vs. the content understanding gap

For decades, business systems were designed for structured inputs, such as tables, rows, and logs. But today, the majority of information flowing through many organizations is multimodal and unstructured: images, videos, audio, PDFs, livestreams, and IoT sensor feeds.

Coleman breaks the shift to rich content into three core drivers:

1. Work is increasingly visual-first. From Zoom meetings to product photography to UGC-filled social feeds, visual communication now dominates knowledge work.

2. Consumer behavior is multimodal. Product discovery now happens through visual search, TikTok video snippets, AR try-ons, and image-driven commerce.

3. Enterprises capture more visual content than ever. Cameras, sensors, robots, and customer uploads generate millions of rich media assets per day—far more than humans can manually review or tag.

However, none of this content fits neatly into structured systems.

“Classic database systems treat images and video as blobs. They’re not good at understanding the actual content,” he explains.

This has left vast amounts of high-value content sitting unanalyzed and unused, simply because legacy tools can’t interpret it. In an era where structured data is just a slice of the enterprise pie and data-hungry AI systems are becoming essential, organizations must embrace multimodal AI that can understand images, video, audio, and text together.

Multimodal AI models are just the start

Multimodal AI technology is advancing rapidly, so AI leaders might assume that simply adopting a foundation model will unlock value from their underutilized content. However, Coleman warns that models alone won’t suffice, even as the technology becomes more advanced.

“No single model will ‘just work’ for the wide variety of content, constraints, and edge cases enterprises deal with,” he says.

In his experience, model-only approaches fall short because:

  • Different content types require different models

  • Domain-specific signals still need customization

  • Business teams need usable interfaces—not raw model outputs

To move from “we have the data” to “we can use the data,” enterprises need more than multimodal models—they need the right AI infrastructure that can:

  • Orchestrate multiple specialized models

  • Extract context directly from visual content

  • Integrate outputs into real workflows

In other words, without the right underlying system to support their multimodal capabilities, even the best models won’t deliver value.
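
To make the orchestration idea concrete, here is a minimal sketch that routes each content type to a specialized open-source model (CLIP via sentence-transformers for images, Whisper for audio). The model choices and file handling are assumptions for illustration, not a description of Coactive’s internal stack.

```python
# Illustrative sketch only: route each content type to a specialized
# open-source model. CLIP and Whisper are stand-ins chosen for the example.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer
import whisper

image_model = SentenceTransformer("clip-ViT-B-32")  # visual embeddings
audio_model = whisper.load_model("base")            # speech-to-text

def extract_signal(path: str):
    """Return a machine-readable signal for one asset, chosen by media type."""
    suffix = Path(path).suffix.lower()
    if suffix in {".jpg", ".jpeg", ".png"}:
        # Images become dense embeddings that a search index can store.
        return image_model.encode(Image.open(path))
    if suffix in {".mp3", ".wav", ".m4a"}:
        # Audio becomes a transcript that text pipelines can consume.
        return audio_model.transcribe(path)["text"]
    raise ValueError(f"No specialized model registered for {suffix}")

# Example: signals = {p: extract_signal(p) for p in ["demo.jpg", "call.mp3"]}
```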

The solution: Turn unstructured content into a searchable signal

Coleman believes the missing ingredient in most enterprise AI strategies is a content intelligence layer that transforms raw multimodal content into machine-readable data.

This is what Coactive AI provides: a multimodal AI platform that turns images, video, and audio into structured, searchable, and analyzable signals. This way, enterprises can finally process and leverage massive rich datasets securely and at scale.

“We want to make it easy to search and analyze visual content by effectively turning images and video into machine-readable signals,” he says.

Here’s how the multimodal AI platform works:

  • Model-agnostic architecture uses the best-in-class model(s) for each use case and content type.

  • Automated context extraction pulls meaningful signals directly from images, video, and audio.

  • AI APIs for developers enable integration, customization, and large-scale orchestration.

  • Interfaces for business teams support search, tagging, moderation, and analytics—no machine learning expertise required.

  • Enterprise-scale and trust features support environments that require security, governance, and massive throughput.

Coactive’s fast dynamic tagging enables computers to quickly get a rich, contextual understanding of visual content.
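
As a hedged illustration of what dynamic tagging can look like, the sketch below runs zero-shot classification with an open CLIP checkpoint through Hugging Face’s pipeline API. The candidate labels and threshold are made up for the example and are not Coactive’s taxonomy.

```python
# Illustrative sketch: zero-shot tagging with an open CLIP model.
# The candidate tags and score threshold are placeholders.
from transformers import pipeline

tagger = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

CANDIDATE_TAGS = ["concert", "sports", "food", "landscape", "product shot"]

def tag_image(path: str, threshold: float = 0.3) -> list[str]:
    """Keep every candidate tag whose relative score clears the threshold."""
    scores = tagger(path, candidate_labels=CANDIDATE_TAGS)
    return [s["label"] for s in scores if s["score"] >= threshold]

# Example: tag_image("stadium_crowd.jpg") might return ["concert", "sports"]
```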

By abstracting the multimodal AI stack, Coactive enables enterprises to index, search, and analyze their rich content for widespread use. Companies like Pinterest and Meta have built this kind of infrastructure internally, Coleman notes, but at the cost of tens or hundreds of millions of dollars. Coactive makes multimodal AI accessible as a platform, so enterprises don’t have to rebuild their infrastructure from scratch to get value from rich content.

Real-world examples of a multimodal AI layer

Coactive’s approach to multimodal AI is already unlocking new value across industries. For example:

1. Better search and discovery for Thomson Reuters

The news organization relied on manual metadata entry for its content, which was slow and inconsistent because reporters didn’t always tag everything relevant. As a result, searching the Reuters database for “Kamala on SNL” could return just a single result.

With Coactive, semantic context is extracted directly from video and images, making content discovery far more complete and relevant. This helps reporters and newsrooms find more of what they need from visual content faster.
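
Here is a minimal sketch of that kind of semantic retrieval, assuming a shared CLIP embedding space via sentence-transformers. The file names and query are placeholders, and the production pipeline at Reuters is certainly more involved.

```python
# Illustrative sketch: free-text search over image embeddings in a shared
# CLIP vector space. File names and the query are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Index step: embed every still or keyframe once and keep the vectors.
library = ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]
image_vectors = model.encode([Image.open(p) for p in library])

# Query step: embed the text query and rank assets by similarity, so results
# no longer depend on whatever metadata a reporter happened to type in.
query_vector = model.encode("politician appearing on a comedy sketch show")
for hit in util.semantic_search(query_vector, image_vectors, top_k=3)[0]:
    print(library[hit["corpus_id"]], round(hit["score"], 3))
```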

2. Faster, safer content moderation for Fandom

Previously, Fandom manually reviewed roughly 60,000 user-uploaded images per day. With Coactive, review time fell from two days to 250 milliseconds, cutting moderation costs by 50% while increasing the accuracy of the company’s AI content moderation and making its communities safer at scale.
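
The numbers above come from the episode; the code below is only a hedged sketch of the general triage pattern (auto-approve or auto-reject confident cases, escalate the rest), using an open CLIP model and made-up thresholds rather than Fandom’s or Coactive’s actual policy.

```python
# Illustrative sketch: triage user uploads with a zero-shot safety check.
# Labels and thresholds are placeholders, not a real moderation policy.
from transformers import pipeline

moderator = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

def triage(path: str) -> str:
    """Auto-handle confident cases; send only uncertain uploads to humans."""
    scores = moderator(path, candidate_labels=["safe content", "unsafe content"])
    unsafe = next(s["score"] for s in scores if s["label"] == "unsafe content")
    if unsafe >= 0.9:
        return "reject"
    if unsafe <= 0.1:
        return "approve"
    return "human_review"  # a small slice of uploads still gets manual review
```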

3. Visual analytics & signal extraction

Some capabilities simply can’t be built on structured data alone. By turning the pixels in visual content into machine-readable features, companies can power a variety of AI workflows and systems, including:

  • Recommendation engines

  • Content understanding

  • Compliance workflows

  • Creative analytics

  • Risk detection

These are insights that structured data alone could never provide.
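
As one concrete instance of the pattern, here is a hedged sketch of visual features feeding a recommendation flow, again using an open CLIP model as a stand-in for whichever feature extractor a team actually deploys. The catalog file names are placeholders.

```python
# Illustrative sketch: recommend visually similar assets from image features.
# Catalog file names are placeholders.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

catalog = ["sku_001.jpg", "sku_002.jpg", "sku_003.jpg", "sku_004.jpg"]
features = model.encode(
    [Image.open(p) for p in catalog], normalize_embeddings=True
)

def recommend(seed_index: int, k: int = 2) -> list[str]:
    """Return the k catalog items whose features sit closest to the seed."""
    similarity = features @ features[seed_index]  # cosine, since normalized
    ranked = np.argsort(-similarity)
    return [catalog[i] for i in ranked if i != seed_index][:k]

# Example: recommend(0) suggests items that look like sku_001.jpg
```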

The multimodal opportunity: Where AI meets content understanding

While much of the AI world remains focused on text, Coleman and the Coactive team believe the next frontier lies in visual and sensor data. He says that many industries are already undergoing this transformation, including:

  • Healthcare: telemedicine, diagnostic imaging

  • Life sciences: microscopy, pathology, high-throughput imaging

  • Retail: visual search, UGC understanding

  • Industrial & robotics: camera + depth + spatial sensor fusion

  • Media & entertainment: content production, rights management

Across all these domains, the bottleneck isn’t the availability of data, but the lack of infrastructure to understand it. He gives an example from the life sciences:

“Microscopes are now high-throughput, generating thousands of images no human can review,” he notes. “As robots capture sensor data—vision, depth, spatial—multimodal AI becomes essential for making sense of it.”

Next steps for embracing multimodal AI

As multimodal content becomes the dominant form of enterprise data, organizations that invest in content intelligence infrastructure—not just models—will gain the greatest advantages. After all, structured data alone can’t power the next generation of AI applications.

Coleman stresses that enterprises will only benefit if their infrastructure keeps pace with the latest developments. This means implementing an abstraction layer: a system that unifies content, orchestration, and AI reasoning. This is his larger vision for Coactive:

“We need a new operating system to help humans and machines work together with a shared understanding of visual data,” he explains. “Just like CPUs needed Windows or macOS, foundation models need platforms that connect content and AI.”

What should leaders look for in a content intelligence layer? In Coleman’s view, a winning layer should abstract model complexity, allowing enterprises to use the best model for each task. It should also unify structured and unstructured data into a single semantic layer, orchestrate and integrate workflows across technical and non-technical teams, and evolve with foundation models.
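
Here is a minimal sketch of what unifying structured and unstructured data into one semantic layer can mean in practice: structured fields narrow the candidates, embeddings rank them. The asset records and the rights_cleared field are assumptions made for illustration, not a real schema.

```python
# Illustrative sketch: one query path over structured fields plus embeddings.
# The asset records and the "rights_cleared" field are placeholders.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Each record mixes structured metadata with an embedding of the visual content.
assets = [
    {"id": "a1", "rights_cleared": True,  "file": "clip_a1.jpg"},
    {"id": "a2", "rights_cleared": False, "file": "clip_a2.jpg"},
    {"id": "a3", "rights_cleared": True,  "file": "clip_a3.jpg"},
]
for asset in assets:
    asset["embedding"] = model.encode(Image.open(asset["file"]))

def search(query: str, top_k: int = 2) -> list[str]:
    """Structured fields filter the candidates; embeddings rank them."""
    candidates = [a for a in assets if a["rights_cleared"]]
    scores = util.cos_sim(
        model.encode(query),
        np.stack([a["embedding"] for a in candidates]),
    )[0]
    order = scores.argsort(descending=True)[:top_k]
    return [candidates[int(i)]["id"] for i in order]
```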

Ultimately, this will enable leaders to transform their unstructured content into a signal for both humans and intelligent systems, paving the way for reliable automation, innovation, and returns from AI.

Want to learn more about multimodal AI with Cody Coleman of Coactive AI? Listen to the full episode of MindMakers on Spotify or Apple Podcasts.

Curious how delight.ai enables multimodal AI for customer experience? Just contact sales.