Andy Beach
SMPTE’s archive is a goldmine of engineering history, but digging through S3 buckets to find it is no one's idea of fun. We call this "The GAP"—the disconnect between over a century of rich technical data and the engineers who need it.
We are bridging this gap by building a "Media AI Co-pilot." Executed during the MonteVIDEO Summer Projects 2026 with a focus on applying emerging AI technologies to our industry, this project leverages Retrieval-Augmented Generation (RAG) to create a modular, open-source conversational interface for the archives. Now, you can ask complex questions and get answers with direct citations, ensuring you are building on history rather than reinventing the wheel. We will cover the architecture and the real-world challenges of teaching an AI to respect user permissions while serving up a century of video tech know-how.
Retrieval-Augmented Generation (RAG) is the architectural standard for connecting Large Language Models (LLMs) to private, dynamic data. Instead of relying on a model's "frozen" training set, RAG retrieves relevant context at runtime to ground the response. This approach addresses the two main limitations of off-the-shelf LLMs: it drastically reduces hallucinations by grounding answers in citable sources, and it enables access to proprietary data without the high cost and rigidity of model fine-tuning.
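To make that loop concrete, here is a toy, fully runnable sketch. The two-document corpus and the word-overlap "retriever" are stand-ins for real embeddings and a vector database, and the final grounded prompt is what a real pipeline would hand to an LLM.

```python
# Toy retrieve-then-generate loop. The corpus and word-overlap scoring are
# stand-ins for an embedding model and a vector DB; a real pipeline would
# send the grounded prompt to an LLM instead of printing it.

CORPUS = {
    "st2110.txt": "SMPTE ST 2110 carries uncompressed video over managed IP networks.",
    "timecode.txt": "SMPTE timecode labels every frame with hours, minutes, seconds, and frames.",
}

def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    # Rank documents by word overlap with the question (toy retrieval).
    words = set(question.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def grounded_prompt(question: str) -> str:
    # Augment: inline the retrieved chunks, tagged with their sources, so the
    # model can cite them instead of answering from its frozen weights.
    context = "\n".join(f"[{src}] {text}" for src, text in retrieve(question))
    return f"Answer using ONLY this context, citing [sources].\n{context}\nQ: {question}"

print(grounded_prompt("How does SMPTE timecode label frames?"))
```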
Strategically, RAG competes with Long Context Windows and Fine-Tuning. While massive context windows (1M+ tokens) are powerful for analyzing single, large documents, RAG is the scalable solution for querying vast knowledge bases where latency and cost per query matter. Fine-tuning remains superior for adapting model behavior or style, but RAG is the correct choice for injecting factual knowledge.
Architecturally, RAG is a spectrum, not a fixed recipe. Naive RAG (simple vector search) is the starting point but often fails in production. Mature implementations utilize Hybrid Search (combining dense vector retrieval with sparse keyword search like BM25) and Re-ranking steps to improve precision. For complex reasoning, GraphRAG (using knowledge graphs) and Agentic RAG (where the LLM autonomously plans retrieval steps) represent the state-of-the-art for handling deep, multi-hop queries.
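To illustrate the hybrid piece: a common way to merge the dense and sparse result lists is Reciprocal Rank Fusion (RRF). A minimal sketch follows; the document IDs are made up, and k=60 is simply the conventional damping constant from the RRF literature.

```python
from collections import defaultdict

def reciprocal_rank_fusion(dense_hits: list[str], sparse_hits: list[str],
                           k: int = 60) -> list[str]:
    """Merge dense (vector) and sparse (BM25) result lists by RRF score.

    Each input is a list of document IDs ordered best-first; `k` dampens
    the influence of lower-ranked results.
    """
    scores = defaultdict(float)
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A document that ranks well in BOTH lists floats to the top of the fusion.
print(reciprocal_rank_fusion(["st2110-21", "st431-2"], ["st2110-21", "rp2077"]))
```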
The Open Source ecosystem offers a complete stack for these architectures. LangChain and LlamaIndex serve as the primary orchestration layers to manage pipeline complexity. For storage, dedicated vector databases like ChromaDB and Weaviate, or extensions like pgvector, handle high-dimensional data. Finally, serving layers like vLLM and Ollama enable the efficient deployment of local models (like Llama 3), ensuring high throughput and data sovereignty without relying on closed APIs.
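As a taste of how these pieces compose, here is a minimal sketch wiring two of the tools above together: ChromaDB as an embedded vector store and Ollama serving a local Llama 3. It assumes the `chromadb` and `ollama` packages are installed and that a local Ollama server has the llama3 model pulled; the collection name and document are illustrative.

```python
# Minimal open-source stack: ChromaDB (embedded, default embedding function)
# plus a local Llama 3 served by Ollama. Names and content are illustrative.
import chromadb
import ollama

store = chromadb.Client()
docs = store.create_collection("smpte-demo")
docs.add(
    ids=["st2110-overview"],
    documents=["SMPTE ST 2110 describes professional media over managed IP networks."],
)

question = "What is ST 2110 about?"
hits = docs.query(query_texts=[question], n_results=1)
context = hits["documents"][0][0]  # best-matching chunk for the first query

reply = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])
```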
We moved beyond simple chatbots to build a robust RAG pipeline. The system was designed to be modular, allowing individual components (parsers, vector DBs, LLMs) to be swapped as technology evolves.
Key System Components:
Ingestion & Parsing: Utilizing Docling for complex PDF structures (tables/formulas) and OpenAI Whisper for video transcription (see the transcription sketch after this list).
Enrichment: Using Vision LLMs to generate dense textual descriptions for diagrams and images within technical papers.
Vector Store: Migrated from Chroma to Qdrant to support complex list-based metadata for access control tags.
Retrieval: Implemented Hybrid Search and Re-ranking steps to ensure the most relevant chunks appear first, correcting early issues where correct answers were buried.
Interface: Open WebUI connected via an OpenAPI-compatible backend.
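To give a feel for the video side of ingestion, here is a minimal Whisper sketch; the file name and model size are illustrative. We keep the per-segment timestamps so that video answers can later cite the exact minute, as discussed under citations below.

```python
# Video ingestion sketch: Whisper emits timestamped segments; we index each
# segment with its start/end times. File name and model size are illustrative.
import whisper

model = whisper.load_model("base")
result = model.transcribe("montevideo-demo-day.mp4")

for seg in result["segments"]:
    # Each segment carries start/end seconds plus the recognized text.
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```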
We quickly discovered that standard text extraction tools (like pdfminer) destroy the context of complex engineering documents, often turning flattened tables into nonsense and completely ignoring vital flowcharts. To resolve this, we pivoted to a multimodal ingestion pipeline. We implemented Docling to preserve the structural integrity of complex tables and utilized Vision LLMs with custom prompts to generate dense textual descriptions of images and diagrams before indexing them, ensuring the system "sees" the visual data.
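A condensed sketch of that pivot is below. The Docling call is the library's standard conversion API; the vision step is shown with an OpenAI-style call purely for illustration, so the model name, prompt, and file names are assumptions rather than our production values.

```python
# Multimodal ingestion sketch: Docling preserves table structure that plain
# extractors flatten, and a vision LLM turns each figure into indexable text.
# File names, model name, and prompt are illustrative.
import base64
from docling.document_converter import DocumentConverter
from openai import OpenAI

# Structure-aware PDF parsing: tables survive as tables, not word soup.
converter = DocumentConverter()
result = converter.convert("st2110-10.pdf")
structured_text = result.document.export_to_markdown()

# Dense description of a diagram, embedded later like any text chunk.
client = OpenAI()
with open("figure-3-signal-flow.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Describe this engineering diagram for a search index: list "
                "every labeled component, the signal flow between them, and "
                "any standards referenced."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
diagram_description = response.choices[0].message.content
```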
We learned that you cannot rely on the "obedience" of an LLM to protect confidential data; instructing a model to "not reveal private info" is insecure and prone to jailbreaks. We solved this by implementing Role-Based Access Control (RBAC) at the database level. We tagged every data chunk with permission metadata (e.g., public, member, staff) and configured the retrieval engine (Vector DB) to physically filter out restricted data before the LLM ever accesses the context, preventing data leakage by design.
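In Qdrant terms, the enforcement looks roughly like this (the collection and field names are illustrative): the user's roles become a mandatory payload filter, so restricted chunks never leave the database.

```python
# RBAC at retrieval time: the vector DB filters on permission tags before
# any chunk can reach the LLM. Collection and field names are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def retrieve_for_user(query_vector: list[float], roles: list[str],
                      limit: int = 10):
    """Return only chunks whose `access` tag matches one of the user's roles.

    A chunk tagged ["member", "staff"] is invisible to a public user; the
    filter is enforced inside Qdrant, not by prompt instructions.
    """
    return client.search(
        collection_name="smpte_archive",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(
                key="access",
                match=models.MatchAny(any=roles),
            )]
        ),
        limit=limit,
    )
```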
While vector search is incredibly fast, it often retrieves chunks that are "semantically close" but factually incorrect for specific technical queries, such as retrieving outdated versions of a standard. To address this, we added a Re-ranking step (using Cross-Encoders) immediately after the initial retrieval. This forces the system to re-evaluate and reorder the results based on actual relevance to the query rather than just vector proximity, drastically improving answer precision.
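A minimal version of that step, using the sentence-transformers CrossEncoder class; the checkpoint named here is a common public one, not necessarily the one we deployed.

```python
# Re-ranking sketch: score each (query, chunk) pair jointly with a
# cross-encoder, then reorder. The checkpoint is a common public model.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Unlike cosine similarity on separate embeddings, the cross-encoder
    # reads query and chunk together, which makes it far better at
    # separating the current revision of a standard from a superseded one.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```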
A significant portion of the archive consists of older, scanned standards containing complex mathematical formulas that standard OCR and current vision models struggle to convert accurately into machine-readable text (like LaTeX). While our image description pipeline helped provide context, the precise digitization of these complex equations remains a pending challenge slated for future work, potentially requiring specialized fine-tuning of vision models.
We found that users are hesitant to trust AI-generated technical answers without immediate proof of provenance. To solve this, we modified our Qdrant metadata schema to store presigned URLs and precise timestamps for every chunk. This allows the system to provide a direct link to the exact page of a PDF or the specific minute of a video where the information was found, although refining the UI to display these citations seamlessly is an ongoing task.
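As an illustration of that schema, the per-chunk payload could be assembled with boto3 along these lines; the bucket, key, and helper function are hypothetical.

```python
# Citation payload sketch: enough provenance on each chunk to deep-link the
# exact page of a PDF or the exact second of a video. Bucket, key, and the
# helper are hypothetical; ExpiresIn bounds how long the presigned link lives.
import boto3

s3 = boto3.client("s3")

def citation_payload(bucket: str, key: str,
                     page: int | None = None,
                     start_sec: float | None = None) -> dict:
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,
    )
    payload: dict = {"source_url": url}
    if page is not None:
        # Browser PDF viewers honor #page=N fragments; fragments are not
        # sent to S3, so they do not break the presigned signature.
        payload["source_url"] = f"{url}#page={page}"
        payload["page"] = page
    if start_sec is not None:
        payload["start_sec"] = start_sec  # the UI seeks the video player here
    return payload
```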
Check out the final demo of the SMPTE Copilot MonteVIDEO Summer Project 2026 and learn about the challenges we faced, the lessons we learned, and the decisions we made to build this open-source AI Copilot.