At Hamkee, we’ve developed srt_dual_rag_mt — a Python-based subtitle translation engine that applies Retrieval-Augmented Generation (RAG) to the localization problem. Rather than sending each subtitle to a language model in isolation, the engine first searches a corpus of professionally translated subtitle pairs for semantically similar examples, then grounds the model’s output in those references. The result is machine translation that maintains your own terminology, style, and voice — not a generic model’s best guess.
The Problem with Standard Machine Translation for Subtitles
Imagine a streaming platform that has spent years building a Spanish-language content library. Dozens of titles, hundreds of hours of professionally translated subtitles. Each subtitle was crafted by human translators who understood the franchise, the characters, the genre conventions. A medical drama uses specific clinical vocabulary. An animated series has running jokes that require consistent phrasing. A documentary series about a historical figure uses a particular formal register.
Now that library needs to expand — new episodes arrive, spin-offs are greenlit, international licensing requires additional language directions. At scale, relying solely on human translation is slow and expensive. But feeding subtitles into a generic MT system throws away everything those human translators already decided. Character names get rendered inconsistently. Domain-specific terms change between episodes. The voice shifts. Localization teams spend as much time correcting machine output as they would have translating from scratch.
The core issue is that standard machine translation has no memory. Every subtitle is a blank-slate input. The model does not know that in this franchise, the main character’s catchphrase was translated this way, or that this studio prefers formal address forms. That institutional knowledge sits in the existing subtitle corpus — and conventional MT systems ignore it entirely.
What RAG Brings to Translation
Retrieval-Augmented Generation is an architectural pattern that addresses this problem directly. Before calling the language model, you retrieve relevant context from a knowledge base and include it in the prompt. The model is no longer translating in a vacuum — it is translating with reference to examples that are semantically close to the current input.
For subtitle translation, “semantically close” has a precise and useful meaning: if the current Spanish subtitle is “No me queda otra opción,” and your corpus contains the translation pair “No tengo otra salida” → “I have no other choice,” that is a strong signal about both vocabulary and register. The model should produce output consistent with that existing choice, not introduce a competing rendering.
The pipeline in srt_dual_rag_mt implements this in three sequential phases: ingest, index, and translate.
Phase 1: Ingesting Bilingual SRT Pairs
The foundation of the system is a corpus of aligned bilingual subtitle segments stored in SQLite. Building that corpus from raw SRT files is handled by ingest_srt_pair.py.
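Before alignment can happen, each SRT file has to be parsed into timed segments. A minimal stdlib sketch of that step is below; the segment field names (start_ms, end_ms, text) are assumptions matching what the aligner consumes, not the actual ingest_srt_pair.py implementation:

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500"
TS = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)

def _ms(h, m, s, ms):
    """Convert hours/minutes/seconds/millis strings to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def parse_srt(text):
    """Parse SRT text into segment dicts the aligner can consume."""
    segs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TS.search(line)
            if m:
                g = m.groups()
                segs.append({
                    "start_ms": _ms(*g[:4]),
                    "end_ms": _ms(*g[4:]),
                    "text": " ".join(lines[i + 1:]).strip(),
                })
                break
    return segs
```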
Bilingual SRT files for the same piece of content are rarely frame-perfect mirrors of each other. Different languages expand and contract at different rates. A Spanish segment covering three seconds might correspond to two English segments or one. The aligner handles this reality with a time-overlap strategy:
def align_by_time(es_segs, en_segs,
                  min_overlap_ms=300,
                  min_ratio=0.5):
    aligned = []
    j = 0
    for es in es_segs:
        es_start, es_end = es["start_ms"], es["end_ms"]
        es_dur = es_end - es_start
        bucket = []  # English segments matched to this Spanish segment
        while j < len(en_segs):
            en = en_segs[j]
            en_start, en_end = en["start_ms"], en["end_ms"]
            en_dur = en_end - en_start
            # Length of the time interval the two segments share
            overlap = max(
                0,
                min(es_end, en_end) - max(es_start, en_start)
            )
            if overlap >= min_overlap_ms and overlap / min(es_dur, en_dur) >= min_ratio:
                bucket.append(en)  # strong overlap: collect and advance
                j += 1
            elif en_end < es_start:
                j += 1             # English segment ended before ours began
            elif en_start > es_end:
                break              # English segment starts after ours ends
            else:
                j += 1             # weak partial overlap: skip it
        if bucket:
            aligned.append((
                es["text"],
                " ".join(b["text"] for b in bucket),  # many-to-one join
                min(es_start, bucket[0]["start_ms"]),
                max(es_end, bucket[-1]["end_ms"]),
            ))
    return aligned
For each Spanish segment, the aligner scans forward through the English segments, collecting any whose overlap is at least 300 milliseconds (the min_overlap_ms default) and covers at least 50% of the shorter segment’s duration (min_ratio). Multiple matching English segments are joined into a single string. This handles the many-to-one case cleanly without requiring the subtitle files to be manually pre-processed.
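The gating logic can be checked in isolation. This small sketch applies the same overlap formula and thresholds to a one-to-two case, with segments reduced to (start_ms, end_ms) tuples for brevity:

```python
def overlaps(es, en, min_overlap_ms=300, min_ratio=0.5):
    """Same gate as the aligner: enough absolute overlap, and the
    overlap must cover at least half of the shorter segment."""
    overlap = max(0, min(es[1], en[1]) - max(es[0], en[0]))
    if overlap < min_overlap_ms:
        return False
    shorter = min(es[1] - es[0], en[1] - en[0])
    return overlap / shorter >= min_ratio

# One 3-second Spanish segment spanning two English segments
es = (1000, 4000)
en_a, en_b = (1000, 2400), (2500, 3900)
print(overlaps(es, en_a), overlaps(es, en_b))  # True True: both collected

# A 200 ms brush against a later segment fails the absolute threshold
print(overlaps(es, (3800, 5800)))              # False
```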
The resulting aligned pairs are written to a SQLite segments table with columns for the source file identifier, segment index, timing in milliseconds, and text in both languages. The schema is minimal and fast:
CREATE TABLE IF NOT EXISTS segments (
    file_id  TEXT,
    seg_idx  INTEGER,
    start_ms INTEGER,
    end_ms   INTEGER,
    text_es  TEXT,
    text_en  TEXT,
    PRIMARY KEY (file_id, seg_idx)
);
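Writing aligned pairs into that table is a straightforward bulk insert. A sketch of one plausible write path (the actual ingest_srt_pair.py may organize this differently):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS segments (
    file_id  TEXT,
    seg_idx  INTEGER,
    start_ms INTEGER,
    end_ms   INTEGER,
    text_es  TEXT,
    text_en  TEXT,
    PRIMARY KEY (file_id, seg_idx)
);
"""

def write_pairs(db_path, file_id, aligned):
    """aligned: (text_es, text_en, start_ms, end_ms) tuples,
    as returned by align_by_time."""
    con = sqlite3.connect(db_path)
    con.execute(SCHEMA)
    con.executemany(
        "INSERT OR REPLACE INTO segments VALUES (?, ?, ?, ?, ?, ?)",
        [
            (file_id, i, start, end, es, en)
            for i, (es, en, start, end) in enumerate(aligned)
        ],
    )
    con.commit()
    con.close()
```

INSERT OR REPLACE makes re-ingesting a corrected subtitle file idempotent: the (file_id, seg_idx) primary key guarantees each segment slot holds only the latest version.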
Running ingestion across your full back-catalog builds the translation memory that the rest of the pipeline draws on.
Phase 2: Building Dual FAISS Indexes
With the corpus in SQLite, build_indexes.py encodes every segment into a vector space using a multilingual sentence embedding model, then writes two FAISS indexes — one for Spanish source text, one for English source text.
embedder = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# rows: (text_es, text_en) tuples read from the segments table
es_texts = [r[0] for r in rows]
en_texts = [r[1] for r in rows]

# L2-normalize at encode time so inner product equals cosine similarity
es_emb = embedder.encode(es_texts, normalize_embeddings=True)
en_emb = embedder.encode(en_texts, normalize_embeddings=True)

dim = es_emb.shape[1]
idx_es = faiss.IndexFlatIP(dim)  # exact inner-product index per language
idx_en = faiss.IndexFlatIP(dim)
idx_es.add(es_emb.astype("float32"))
idx_en.add(en_emb.astype("float32"))
The choice of paraphrase-multilingual-MiniLM-L12-v2 is deliberate. It is a compact but capable multilingual model trained specifically for semantic similarity across languages. Critically, it maps semantically equivalent text in different languages into nearby regions of the same embedding space — so a Spanish subtitle and its English translation will have similar vector representations. This enables cross-lingual retrieval: when you embed a Spanish query, you are searching a space where meaning, not surface form, determines proximity.
IndexFlatIP is FAISS’s inner product index. Because the embeddings are L2-normalized at encode time (normalize_embeddings=True), inner product is equivalent to cosine similarity. Flat indexes are exact rather than approximate — no ANN approximation error — which is appropriate when the corpus fits comfortably in memory (millions of short subtitle segments require only a few gigabytes).
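The equivalence is easy to verify: after L2 normalization, the inner product of two vectors is exactly their cosine similarity. A quick NumPy check, no FAISS required:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # 384 = MiniLM-L12-v2 dim

# Cosine similarity on the raw vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after L2 normalization (what IndexFlatIP computes
# when embeddings are encoded with normalize_embeddings=True)
ip = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

assert np.isclose(cos, ip)
```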
The metadata (the actual text pairs) is written to a meta.jsonl file in the same order as the index entries, providing the lookup layer for retrieved results.
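Ordering is the whole contract here: row i of meta.jsonl must describe vector i of the index. A sketch of the write and read sides (the es/en field names are assumptions, not the file's documented schema):

```python
import json

def write_meta(path, rows):
    """rows: (text_es, text_en) tuples in the exact order
    the segments were added to the FAISS indexes."""
    with open(path, "w", encoding="utf-8") as f:
        for es, en in rows:
            f.write(json.dumps({"es": es, "en": en}, ensure_ascii=False) + "\n")

def load_meta(path):
    """Returns a list whose position i matches FAISS index entry i."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```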
Phase 3: Translation with Retrieved Context
The translation step ties everything together. For each subtitle in the input SRT file, translate_srt.py embeds the source text, retrieves the top-k most similar corpus entries from the appropriate language index, and constructs a prompt that includes those entries as reference examples.
# Embed the source subtitle and retrieve the k nearest corpus entries
q = embedder.encode([src], normalize_embeddings=True).astype("float32")
_, I = index.search(q, args.k)
examples = [meta[i] for i in I[0]]

# Ground the model's output in the retrieved reference pairs
prompt = build_prompt(src, examples, args.direction)
resp = client.chat.completions.create(
    model=args.model,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # low temperature: favor consistent renderings
)
The default retrieval depth is k=6 — six reference pairs per subtitle. The prompt template enforces subtitle-specific output constraints alongside the reference examples:
You are a professional subtitle translator.
Rules:
- Translate Spanish → English
- Be concise, subtitle-appropriate
- Max 2 lines, ~42 chars/line
- Preserve meaning and tone
Reference translations:
Example 1 (Spanish): No me queda otra opción.
Example 1 (English): I have no other choice.
...
Translate this subtitle (Spanish):
No tengo salida.
Return ONLY the translated subtitle.
The constraint Return ONLY the translated subtitle prevents the model from producing explanatory text or multiple alternatives — important when the output is written directly into SRT subtitle objects and composed back into a valid SRT file.
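A build_prompt along these lines reproduces the template above. This is a sketch of one plausible implementation, not the actual translate_srt.py code:

```python
def build_prompt(src, examples, direction="es2en"):
    """examples: dicts with 'es' and 'en' keys, as retrieved from meta.jsonl."""
    if direction == "es2en":
        src_lang, tgt_lang, src_key, tgt_key = "Spanish", "English", "es", "en"
    else:
        src_lang, tgt_lang, src_key, tgt_key = "English", "Spanish", "en", "es"

    lines = [
        "You are a professional subtitle translator.",
        "Rules:",
        f"- Translate {src_lang} → {tgt_lang}",
        "- Be concise, subtitle-appropriate",
        "- Max 2 lines, ~42 chars/line",
        "- Preserve meaning and tone",
        "Reference translations:",
    ]
    for i, ex in enumerate(examples, 1):
        lines.append(f"Example {i} ({src_lang}): {ex[src_key]}")
        lines.append(f"Example {i} ({tgt_lang}): {ex[tgt_key]}")
    lines += [
        f"Translate this subtitle ({src_lang}):",
        src,
        "Return ONLY the translated subtitle.",
    ]
    return "\n".join(lines)
```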
Temperature is set to 0.2, leaning toward deterministic output. For translation where consistency is the goal — producing the same rendering of recurring phrases across thousands of subtitles — low temperature is the right default.
Bidirectional Translation
The architecture is symmetrical. Depending on the --direction flag (es2en or en2es), the translation script selects either index_es.faiss or index_en.faiss for retrieval. The prompt builder adjusts its language labels accordingly. The same corpus, the same embedding model, and the same LLM serve both translation directions without any architecture changes.
Extending the system to additional language pairs requires building the appropriate bilingual corpus, encoding into new FAISS indexes, and adding a direction variant to the prompt builder. The core pipeline does not change.
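Direction handling can be as small as a lookup table. A sketch of how the --direction flag might map to retrieval artifacts; the file paths and keys here mirror the article but are assumptions about the actual script:

```python
# Hypothetical mapping from the --direction flag to retrieval artifacts.
DIRECTIONS = {
    "es2en": {"index": "index_es.faiss", "src_key": "es", "tgt_key": "en"},
    "en2es": {"index": "index_en.faiss", "src_key": "en", "tgt_key": "es"},
}

def retrieval_config(direction):
    """Resolve a direction flag to its index file and metadata keys."""
    try:
        return DIRECTIONS[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}")
```

Adding a new language pair then amounts to one more entry in the table plus the corresponding corpus and index.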
Practical Applications
Streaming and media platforms with back-catalogs of professionally translated content are the primary fit. Each title added to the knowledge base improves the system’s ability to handle similar content. A platform that translates a second season of a series will benefit from every aligned pair ingested from the first.
Corporate learning and training content often has strict terminology requirements — product names, process labels, technical jargon. Traditional MT systems ignore these constraints. An organization that has translated 50 hours of compliance training has implicitly encoded all its terminology choices. srt_dual_rag_mt makes that encoding explicit and searchable.
Localization teams managing ongoing projects gain a force multiplier. Rather than replacing translators, the system reduces the volume of subtitles requiring full human effort — routing familiar phrasing through automated retrieval-backed translation while flagging low-confidence segments for review.
Documentary and specialized content benefits most when domain-specific vocabulary is consistent. Medical, legal, technical, or regional content is exactly where generic MT fails most visibly and where a domain-specific retrieval corpus adds the most value.
Architecture Trade-offs and Design Choices
Several decisions in the current implementation are worth noting explicitly.
SQLite for corpus storage keeps the ingestion pipeline dependency-free and portable. For corpora in the range of millions of short segments, SQLite performs well for bulk insert and sequential reads. The pipeline needs neither concurrent write access nor complex query patterns, so a heavier database would add nothing.
FAISS IndexFlatIP for retrieval prioritizes correctness over speed. Flat indexes perform exact nearest-neighbor search with no approximation. For most use cases — even large corpora — this is fast enough, and it avoids the precision trade-off that approximate indexes introduce. Organizations with very large corpora (tens of millions of segments) could substitute IndexIVFFlat or IndexHNSW without changing the surrounding code.
Low temperature for the LLM trades diversity for consistency. In translation, consistency is usually the goal. A small non-zero temperature (0.2) still allows for natural language variation while strongly biasing toward the most probable output given the retrieved context.
OpenAI’s gpt-4o-mini as the default model balances capability and cost. The model is capable of high-quality translation when given good reference examples. Swapping to a different model — a local model, a different provider, or a larger OpenAI model — requires only changing the --model flag and ensuring the client is configured appropriately.
Conclusion
srt_dual_rag_mt demonstrates that the gap between fast, cheap machine translation and high-quality, consistent localization can be substantially narrowed with the right architectural approach. By treating an existing corpus of professional translations as a searchable knowledge base — rather than a static archive — the system converts past translation effort into a continuously compounding asset.
The pipeline is intentionally straightforward: ingest, index, translate. Each phase is a single Python script with minimal dependencies. The output is standard SRT format, making integration into existing subtitle workflows direct.
srt_dual_rag_mt was developed by the engineering team at Hamkee, where we build tools for intelligent content processing and localization at scale. The source code is available at github.com/hamkee-dev-group/hse.