Diego Sanz · AI Content Marketing Engine

Portfolio · Case Study

Diego S.

AI Engineering · RAG Systems · Content Automation · Agentic Pipelines

Education

Florida International University

BS · Computer Science

Florida International University

MBA · Marketing & E-Commerce

Pennsylvania State University

Master of Applied Statistics

Available now

Expert-Vetted

Top Rated

100% Job Success

Case Study

AI Content Marketing Engine

Agentic Platform Build

Content Marketing

5

Content source types

4

Output channels

3

AI models in pipeline

Project overview

Key project facts

Platform

AI Content Marketing Engine

Domain

Content Marketing · AI Engineering

Engagement type

Agentic Platform Build

Status

In production

Scope

RAG · Generation · Review · Publish

Technical stack

Python · FastAPI · React · Vite · ChromaDB · OpenAI Embeddings · Claude Sonnet 4.6 · Claude Haiku · gpt-image-2 · ElevenLabs · WordPress REST API · DigitalOcean Spaces · PostgreSQL · shadcn/ui

Engagement summary

An end-to-end AI content pipeline that generates grounded, multi-channel content from a curated knowledge base. Source material (YouTube transcripts, PDFs, eBooks, web articles, and podcast episodes) is ingested, chunked, and embedded into ChromaDB. A podcast topic extraction pipeline uses Claude to identify article-worthy topics in each episode, each paired with a verbatim source segment as raw material. At generation time, RAG retrieval supplies source-attributed context from the knowledge base rather than relying on the model's training data. Claude Sonnet 4.6 renders that context through a configurable voice template calibrated against reference writers or publications chosen by the author, so the output reads as a consistent human voice rather than generic AI copy. A standalone Claude Haiku review agent audits every draft before publish. Feature images are generated with gpt-image-2 against a branded studio style guide. Final output publishes to WordPress with a companion ElevenLabs audio file.

5

Content Source Types

YouTube · PDF · eBook · Web Articles · Podcast feeds

4

Output Channels

Blog · LinkedIn · Twitter · Newsletter — one knowledge base, multiple formats

3

AI Models in Pipeline

Claude Sonnet 4.6 · text-embedding-3-small · gpt-image-2

Technical Design · AI Content Marketing Engine

End-to-End Content Generation

RAG · Claude Sonnet · gpt-image-2 · WordPress

1

Knowledge Base

Multi-Source Ingestion

YouTube transcripts · PDF · eBook highlights · web articles · automated podcast feed scraping on schedule

5 source types

Extractors

youtube.py · pdf.py · ebook.py · articles.py · podcast feeds

2

Topic Discovery

Claude Sonnet · Topic Extractor

Per-episode: Claude identifies N article-worthy topics, each with a 300–600 word verbatim segment · lands as pending_review · generalizes to any long-form source

N topics / episode

3

Vector Index

ChromaDB · text-embedding-3-small

1,000-char chunks · 200-char overlap · paragraph-boundary splitting · OpenAI text-embedding-3-small · ChromaDB 'sources' collection

ChromaDB

4

Grounded Generation

Claude Sonnet 4.6 · Voice Template

Semantic query returns chunks grouped by source with attribution headings · source metadata injected separately · rendered against configurable voice template · author-defined reference styles

claude-sonnet-4-6

Context format

source-attributed chunks · canonical metadata block · style_notes

5

QA Gate

blog-post-reviewer · Claude Haiku

Standalone agent · read-only · structured pass/fail audit: title format, char count, source attribution, metric citations, visual integrity, mobile rendering

claude-haiku

6

Asset Generation

Data Viz + Feature Image

Inline visuals: Mermaid · Plotly · SVG · HTML (10 canonical archetypes) · Featured image: gpt-image-2 · 9 variants generated (3 styles × 3 framings) for human selection

gpt-image-2

7

Distribution

WordPress + Audio + CDN

Markdown → HTML → WordPress REST API · ElevenLabs TTS companion audio → DigitalOcean Spaces CDN · post meta updated with audio URL

WP REST · ElevenLabs

What Makes It Grounded

Content is generated from the author's actual knowledge base — not the model's training data. RAG retrieval pulls only from ingested sources; chunks are grouped by source title so Claude cites real authors and titles, not hallucinated ones. Source metadata (author, date, URL) is injected in a separate block to prevent fabrication of bibliographic details.

Key Decisions

Child topic mining — Claude extracts N topics per podcast episode, each with a verbatim segment as source material. One episode becomes multiple grounded article stubs automatically.

Voice templates — configurable against any reference writers or publications the author wants to emulate. Output stays stylistically consistent across topics, contributors, and channels.

Standalone review agent — Claude Haiku audits the draft before publish as a separate read-only agent. Never fixes; only reports. Failures block publish.
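The report-only contract of the review agent can be sketched as follows. This is a minimal, deterministic stand-in and not the Claude Haiku agent itself: the `CheckResult` shape, the two checks shown, the draft dictionary keys, and the 70-character title limit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def audit(draft: dict, max_title_chars: int = 70) -> list[CheckResult]:
    """Run every check and report the outcome; the draft is never modified."""
    title = draft.get("title", "").strip()
    sources = draft.get("sources", [])
    return [
        # Hypothetical title rule: non-empty and within an assumed length limit.
        CheckResult(
            "title_format",
            0 < len(title) <= max_title_chars,
            f"title is {len(title)} chars",
        ),
        # Hypothetical attribution rule: at least one source must be listed.
        CheckResult(
            "source_attribution",
            len(sources) > 0,
            f"{len(sources)} sources cited",
        ),
    ]


def can_publish(results: list[CheckResult]) -> bool:
    """Any single failed check blocks publish."""
    return all(r.passed for r in results)
```

Keeping `audit` free of side effects mirrors the "never fixes; only reports" decision: the gate can be rerun on the same draft and will always give the same verdict.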

RAG Knowledge Architecture

ChromaDB · Source Attribution · Voice System

Sources

YouTube

transcript via yt-dlp

·

PDF

text extraction

·

eBook

highlights export

·

Web Article

scraper

·

Podcast Feed

scheduled pull

Processing

Text Extraction

per-type extractor

→

Chunking

1,000 char · 200 overlap

→

Embedding

text-embedding-3-small

→

ChromaDB

'sources' collection
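The chunking stage above (1,000-character chunks, 200-character overlap, paragraph-boundary splitting) could look roughly like this. It is a sketch under assumptions, not the project's code: the function name `chunk_text` and the exact rule for choosing a boundary are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk ends on the last paragraph break inside its window when one
    exists; the next chunk starts `overlap` characters before that end.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks: list[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Only accept a break past the overlap region so every
            # iteration moves the window forward.
            brk = text.rfind("\n\n", start + overlap + 1, end)
            if brk != -1:
                end = brk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break
        start = end - overlap
    return chunks
```

The overlap means a sentence that straddles a boundary is retrievable from either neighbor, at the cost of embedding some text twice.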

Generation

RAG Query Layer

semantic search

→

source grouping

→

attribution headers

ChromaDB · n_results=20 · source_ids filter · metadata injection

Voice & Template

Template selection

default · default-voiced · simple

Voice blend

Author-defined reference voices

Style enforcement

audience · structure · anti-patterns

QA + Visuals

Review Agent

Claude Haiku · 10+ checks

·

Data Viz

Mermaid · Plotly · SVG · HTML

·

Feature Image

gpt-image-2 · 9 variants · studio style guide

·

Metric Check

every stat traced to source
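One way to approximate the metric check is a plain string comparison between the numbers in a draft and the numbers in its sources. The sketch below is an assumption about how such a check might work; the function name `untraced_metrics` and the regular expression are hypothetical, and the real check may rely on the review agent's judgment instead.

```python
import re

# Matches integers, thousands-separated numbers, decimals, and percentages,
# e.g. "42", "1,200", "3.5", "7%".
METRIC_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def untraced_metrics(draft: str, source_texts: list[str]) -> list[str]:
    """Return metrics in the draft that appear in none of the source texts."""
    known = set(METRIC_RE.findall(" ".join(source_texts)))
    # dict.fromkeys de-duplicates while keeping first-seen order.
    in_draft = dict.fromkeys(METRIC_RE.findall(draft))
    return [metric for metric in in_draft if metric not in known]
```

An exact-match check like this is deliberately strict: a draft that rounds "41.7%" to "42%" is flagged, which leaves the decision to a human reviewer.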

Publish

WordPress REST

markdown → HTML · ACF fields

→

ElevenLabs TTS

article → audio companion

→

DO Spaces CDN

audio storage · CDN URL

→

Post Meta

audio_url · genesis_source · sources block
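The final publish call can be sketched as a request to the WordPress REST API posts endpoint. The snippet below only builds the request and does not send it. The function name, the `draft` default status, and the assumption that `audio_url` and `genesis_source` are registered as REST-visible post meta are illustrative; authentication is omitted.

```python
import json
from urllib.request import Request


def build_post_request(
    site_url: str,
    title: str,
    html: str,
    audio_url: str,
    genesis_source: str,
    status: str = "draft",
) -> Request:
    """Build (but do not send) a POST to /wp-json/wp/v2/posts."""
    payload = {
        "title": title,
        "content": html,
        "status": status,
        # Custom meta is only accepted if the keys are registered
        # for the REST API on the WordPress side.
        "meta": {"audio_url": audio_url, "genesis_source": genesis_source},
    }
    return Request(
        url=f"{site_url.rstrip('/')}/wp-json/wp/v2/posts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Separating request construction from sending makes the payload easy to inspect in tests before anything reaches a live site.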

Attribution Architecture

When Claude generates, it sees chunks grouped under labeled source headings such as '[YouTube] Author Name · Title' or '[PDF] Book Title by Author'. The model can cite from these labels instead of inventing sources. Canonical bibliographic metadata (author, date, URL) is injected in a separate block after the context, so the real values are always in the prompt and the model has no reason to fabricate them.
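The context layout described here might be assembled along these lines. This is a sketch, not the production prompt builder: the dictionary keys, the `SOURCE METADATA` label, and the single heading format used for every source type are assumptions.

```python
def build_context(chunks: list[dict], sources: dict[str, dict]) -> str:
    """Group retrieved chunks under source headings, then append metadata."""
    # Plain dicts keep insertion order, so sources appear in first-hit order.
    grouped: dict[str, list[str]] = {}
    for chunk in chunks:
        grouped.setdefault(chunk["source_id"], []).append(chunk["text"])

    sections = []
    for source_id, texts in grouped.items():
        meta = sources[source_id]
        heading = f"[{meta['type']}] {meta['author']} · {meta['title']}"
        sections.append(heading + "\n" + "\n\n".join(texts))

    # Bibliographic details go in their own block after the context.
    metadata_lines = [
        f"- {sources[sid]['title']} | author: {sources[sid]['author']} "
        f"| date: {sources[sid]['date']} | url: {sources[sid]['url']}"
        for sid in grouped
    ]
    return "\n\n".join(sections) + "\n\nSOURCE METADATA\n" + "\n".join(metadata_lines)
```

Grouping by source keeps each heading in the prompt once, so every quoted passage sits directly under the label the model is expected to cite.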

Key Decisions

text-embedding-3-small over larger models — faster and cheaper for a knowledge base that re-embeds frequently as sources are added. Quality is sufficient for the retrieval task.

n_results=20 with source_ids filter — per-topic filtering means the model only sees sources the author explicitly curated, not the full corpus.

Genesis source required — every article records what seeded the topic (episode, article, conversation). Checked by the review agent before publish.
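The filtered retrieval call can be sketched as below. It assumes each chunk was stored with a `source_id` metadata field (a hypothetical key name) and that the caller has already embedded the query with text-embedding-3-small. `collection` is expected to be a ChromaDB collection, though any object with a compatible `query` method works.

```python
from typing import Any


def retrieve(
    collection: Any,
    query_embedding: list[float],
    source_ids: list[str],
    n_results: int = 20,
) -> list[dict]:
    """Query only the sources curated for this topic."""
    result = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        # Restrict the search to the curated sources for this topic.
        where={"source_id": {"$in": source_ids}},
        include=["documents", "metadatas"],
    )
    # Results are nested per query; a single query was sent, so take index 0.
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    return [
        {"text": text, "metadata": meta}
        for text, meta in zip(documents, metadatas)
    ]
```

Filtering inside the query, rather than after it, means all 20 results come from curated sources instead of being diluted by the rest of the corpus.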

04 · 08

Content Writer dashboard — knowledge base overview with source counts

01 / 04 AI Content Marketing Engine · Deliverable

Knowledge Base Dashboard

A single view of the curated knowledge base — what's in it, what's been processed, and how to add more.

What this screen does

Multi-modal source counts

Books, Videos, Web Articles, PDFs, and Podcasts tracked per source type — the breadth of the knowledge base at a glance.

Pipeline health

Total sources ingested, extracted, and embedded — every ingestion is staged so you can see what's ready for retrieval.

Quick actions

Add a source (any modality) or generate content directly from the dashboard.

Python · FastAPI · ChromaDB · Anthropic Claude · OpenAI · WordPress REST · ElevenLabs

02 / 04 AI Content Marketing Engine · Deliverable

Source Library

Every source ingested into the knowledge base — searchable, filterable, and traceable from any generated article back to its origin.

What this screen does

Five ingestion modalities

YouTube transcripts, PDFs, eBooks, web articles, and podcast feeds — one schema across all.

Extraction + embedding status

Every source shows its processing state so it's clear what's available to the RAG retrieval layer.

Source attribution

Every generated article cites the exact sources used; click through to verify any claim.

Topics — discovered and curated article ideas with podcast-mined child topics

03 / 04 AI Content Marketing Engine · Deliverable

Topic Discovery

The pipeline of article ideas — sourced from the curated knowledge base and automatically mined from new podcast episodes.

What this screen does

Curated topic ideas

Each topic carries a title, description, source list, and status (pending review → ready to generate → article published).

Podcast child-topic mining

Claude extracts multiple article-worthy topics from each new podcast episode, with verbatim source segments as raw material.

Source-anchored

Every topic links back to the specific sources it draws from, so no article is generated without an evidence base.

Ready-to-generate queue

Promote a topic from candidate to writing-ready in a click; the generation pipeline picks it up.
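The topic lifecycle on this screen (pending review → ready to generate → article published) can be modeled as a small state machine. Only `pending_review` appears as an identifier in the pipeline description; the other status values, the `promote` function, and the rule that stages cannot be skipped are assumptions for illustration.

```python
from enum import Enum


class TopicStatus(Enum):
    PENDING_REVIEW = "pending_review"
    READY_TO_GENERATE = "ready_to_generate"  # assumed identifier
    ARTICLE_PUBLISHED = "article_published"  # assumed identifier


# Each status lists the statuses it may move to next.
ALLOWED_TRANSITIONS = {
    TopicStatus.PENDING_REVIEW: {TopicStatus.READY_TO_GENERATE},
    TopicStatus.READY_TO_GENERATE: {TopicStatus.ARTICLE_PUBLISHED},
    TopicStatus.ARTICLE_PUBLISHED: set(),
}


def promote(current: TopicStatus, target: TopicStatus) -> TopicStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Making the transitions explicit keeps an unreviewed podcast-mined topic from reaching the generation queue by accident.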

Generated articles — full content library with metadata per article

04 / 04 AI Content Marketing Engine · Deliverable

Generated Content

Every article the engine has produced — long-form, voice-aligned, source-cited, and ready for publishing across channels.

What this screen does

Article catalog

Every generated piece by title, with creation date, source count, and the model that wrote it.

Source-cited by default

Each article carries its source list inline; the review agent blocks publish on uncited metrics.

Voice-consistent output

All articles are generated against the same configurable voice template — same brand voice across topics and contributors.

Channel-ready

Long-form articles repurposed into LinkedIn, Twitter, and email variants from the same source-grounded draft.
