Portfolio · Case Study

Diego S.
AI Engineering · RAG Systems · Content Automation · Agentic Pipelines
Education
Florida International University
BS · Computer Science
Florida International University
MBA · Marketing & E-Commerce
Pennsylvania State University
Master of Applied Statistics
Available now
Expert-Vetted
Top Rated
100% Job Success
Case Study
AI Content Marketing Engine
Agentic Platform Build
Content Marketing
5
Content source types
4
Output channels
3
AI models in pipeline
Project overview
Key project facts
Platform
AI Content Marketing Engine
Domain
Content Marketing · AI Engineering
Engagement type
Agentic Platform Build
Status
In production
Scope
RAG · Generation · Review · Publish
Technical stack
Python · FastAPI · React · Vite · ChromaDB · OpenAI Embeddings · Claude Sonnet 4.6 · Claude Haiku · gpt-image-2 · ElevenLabs · WordPress REST API · DigitalOcean Spaces · PostgreSQL · shadcn/ui
Engagement summary

An end-to-end AI content pipeline that generates grounded, multi-channel content from a curated knowledge base. Source material (YouTube transcripts, PDFs, eBooks, web articles, and podcast episodes) is ingested, chunked, and embedded into ChromaDB. A podcast topic extraction pipeline uses Claude to identify article-worthy topics in each episode, each paired with a verbatim source segment as raw material. At generation time, RAG retrieval supplies source-attributed context, so drafts draw on the knowledge base rather than the model's training data. That context is rendered through a configurable voice template calibrated against reference writers or publications chosen by the author, so output reads as one consistent, human voice rather than generic AI. A standalone Claude Haiku review agent audits every draft before publish. Feature images are generated with gpt-image-2 against a branded studio style guide. Final output publishes to WordPress with a companion ElevenLabs audio file.

5
Content Source Types
YouTube · PDF · eBook · Web Articles · Podcast feeds
4
Output Channels
Blog · LinkedIn · Twitter · Newsletter — one knowledge base, multiple formats
3
AI Models in Pipeline
Claude Sonnet 4.6 · text-embedding-3-small · gpt-image-2
Technical Design · AI Content Marketing Engine
End-to-End Content Generation
RAG · Claude Sonnet · gpt-image-2 · WordPress
1
Knowledge Base
Multi-Source Ingestion
YouTube transcripts · PDF · eBook highlights · web articles · automated podcast feed scraping on schedule
5 source types
Extractors
youtube.py · pdf.py · ebook.py · articles.py · podcast feeds
2
Topic Discovery
Claude Sonnet · Topic Extractor
Per-episode: Claude identifies N article-worthy topics, each with a 300–600 word verbatim segment · lands as pending_review · generalizes to any long-form source
N topics / episode
3
Vector Index
ChromaDB · text-embedding-3-small
1,000-char chunks · 200-char overlap · paragraph-boundary splitting · OpenAI text-embedding-3-small · ChromaDB 'sources' collection
ChromaDB
4
Grounded Generation
Claude Sonnet 4.6 · Voice Template
Semantic query returns chunks grouped by source with attribution headings · source metadata injected separately · rendered against configurable voice template · author-defined reference styles
claude-sonnet-4-6
5
QA Gate
blog-post-reviewer · Claude Haiku
Standalone agent · read-only · structured pass/fail audit: title format, char count, source attribution, metric citations, visual integrity, mobile rendering
claude-haiku
6
Asset Generation
Data Viz + Feature Image
Inline visuals: Mermaid · Plotly · SVG · HTML (10 canonical archetypes) · Featured image: gpt-image-2 · 9 variants generated (3 styles × 3 framings) for human selection
gpt-image-2
7
Distribution
WordPress + Audio + CDN
Markdown → HTML → WordPress REST API · ElevenLabs TTS companion audio → DigitalOcean Spaces CDN · post meta updated with audio URL
WP REST · ElevenLabs
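As a rough illustration of the Vector Index step, the sketch below splits text into chunks of at most 1,000 characters with a 200-character overlap, preferring to cut at a paragraph break. The function name chunk_text and the exact cut rule are assumptions made for illustration; the engine's own splitter may behave differently.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Each chunk after the first starts `overlap` characters before the end of
    the previous one. Cuts land on a paragraph break (blank line) when one
    exists inside the current window.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    text = text.strip()
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Look for the last paragraph break in the window. Searching only
            # past the overlap region guarantees the next start moves forward.
            cut = text.rfind("\n\n", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = end - overlap
    return chunks
```

Each returned chunk would then be embedded and written to the 'sources' collection along with its source metadata.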
What Makes It Grounded

Content is generated from the author's actual knowledge base — not the model's training data. RAG retrieval pulls only from ingested sources; chunks are grouped by source title so Claude cites real authors and titles, not hallucinated ones. Source metadata (author, date, URL) is injected in a separate block to prevent fabrication of bibliographic details.

Key Decisions
Child topic mining: Claude extracts N topics per podcast episode, each with a verbatim segment as source material, so one episode automatically becomes multiple grounded article stubs.
Voice templates: configurable against any reference writers or publications the author wants to emulate, so output stays stylistically consistent across topics, contributors, and channels.
Standalone review agent: Claude Haiku audits each draft before publish as a separate read-only agent. It never fixes, only reports, and any failure blocks publish.
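The review agent's contract (read-only, structured pass/fail, failures block publish) can be sketched without the model call. In the sketch below, deterministic checks stand in for the Claude Haiku audit; the check names, the 70-character title limit, and the draft dictionary shape are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


@dataclass(frozen=True)
class ReviewReport:
    results: tuple[CheckResult, ...]

    @property
    def passed(self) -> bool:
        # The draft passes only if every individual check passes.
        return all(r.passed for r in self.results)

    @property
    def failures(self) -> list[str]:
        return [r.name for r in self.results if not r.passed]


def review_draft(draft: dict) -> ReviewReport:
    """Audit a draft and report findings. The draft is read, never modified."""
    title = draft.get("title", "")
    results = (
        # Hypothetical limit: the real title-format rule is not specified here.
        CheckResult("title_length", 0 < len(title) <= 70, f"{len(title)} chars"),
        CheckResult("source_attribution", bool(draft.get("sources"))),
        CheckResult("genesis_source", bool(draft.get("genesis_source"))),
    )
    return ReviewReport(results)


def can_publish(report: ReviewReport) -> bool:
    """Publishing is blocked on any failed check."""
    return report.passed
```

Keeping the report separate from any fixing step means the publish gate is a single boolean derived from the audit, and the draft that is published is the draft that was reviewed.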
RAG Knowledge Architecture
ChromaDB · Source Attribution · Voice System
Sources
YouTube
transcript via yt-dlp
PDF
text extraction
eBook
highlights export
Web Article
scraper
Podcast Feed
scheduled pull
Processing
Text Extraction
per-type extractor
Chunking
1,000 char · 200 overlap
Embedding
text-embedding-3-small
ChromaDB
'sources' collection
Generation
RAG Query Layer
semantic search
source grouping
attribution headers
ChromaDB · n_results=20 · source_ids filter · metadata injection
Voice & Template
Template selection
default · default-voiced · simple
Voice blend
Author-defined reference voices
Style enforcement
audience · structure · anti-patterns
QA + Visuals
Review Agent
Claude Haiku · 10+ checks
Data Viz
Mermaid · Plotly · SVG · HTML
Feature Image
gpt-image-2 · 9 variants · studio style guide
Metric Check
every stat traced to source
Publish
WordPress REST
markdown → HTML · ACF fields
ElevenLabs TTS
article → audio companion
DO Spaces CDN
audio storage · CDN URL
Post Meta
audio_url · genesis_source · sources block
Attribution Architecture

When Claude generates, it sees chunks grouped under labeled source headings such as '[YouTube] Author Name · Title' or '[PDF] Book Title by Author'. The model cites from these labels rather than inventing sources. Canonical bibliographic metadata (author, date, URL) is injected in a separate block after the context, so the model has the real values on hand and does not need to guess them.
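A minimal sketch of that layout, assuming each retrieved chunk arrives as a dictionary with a text field and a metadata dictionary. The key names (type, author, title, date, url), the single uniform heading format, and the CONTEXT and SOURCE METADATA labels are illustrative assumptions, not the engine's actual prompt.

```python
def build_prompt_context(chunks: list[dict]) -> str:
    """Group chunk text under one heading per source, then append metadata."""
    grouped: dict[str, list[str]] = {}
    sources: dict[str, dict] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        heading = f"[{meta['type']}] {meta['author']} · {meta['title']}"
        # dicts keep insertion order, so sources appear in first-seen order.
        grouped.setdefault(heading, []).append(chunk["text"])
        sources.setdefault(heading, meta)

    context = "\n\n".join(
        f"### {heading}\n" + "\n".join(texts) for heading, texts in grouped.items()
    )
    # Bibliographic details live in their own block, after the context.
    metadata = "\n".join(
        f"- {m['title']} | {m['author']} | {m['date']} | {m['url']}"
        for m in sources.values()
    )
    return f"CONTEXT\n{context}\n\nSOURCE METADATA\n{metadata}"
```

Because every chunk sits under a heading that names its source, a citation in the draft can be checked against a heading the model actually saw.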

Key Decisions
text-embedding-3-small over larger models — faster and cheaper for a knowledge base that re-embeds frequently as sources are added. Quality is sufficient for the retrieval task.
n_results=20 with source_ids filter — per-topic filtering means the model only sees sources the author explicitly curated, not the full corpus.
Genesis source required — every article records what seeded the topic (episode, article, conversation). Checked by the review agent before publish.
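The per-topic filter can be sketched as a thin wrapper around a ChromaDB collection query. The metadata key source_id and the function name retrieve are assumptions; query_texts, n_results, and a where filter with the $in operator are part of ChromaDB's query interface. The collection is assumed to have been created with an embedding function that matches the stored vectors.

```python
from typing import Any


def retrieve(
    collection: Any,
    query_text: str,
    source_ids: list[str],
    n_results: int = 20,
) -> list[dict]:
    """Return up to n_results chunks, restricted to the curated source IDs."""
    if not source_ids:
        # No curated sources for this topic means no context at all,
        # rather than a fallback to the full corpus.
        return []

    result = collection.query(
        query_texts=[query_text],
        n_results=n_results,
        # "source_id" is a hypothetical metadata key written at ingestion time.
        where={"source_id": {"$in": list(source_ids)}},
    )
    # Results are nested one list per query; a single query was sent.
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    return [{"text": d, "metadata": m} for d, m in zip(documents, metadatas)]
```

Returning an empty list for an empty source list keeps the curation guarantee explicit: a topic with no approved sources produces no context.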
Content Writer dashboard — knowledge base overview with source counts
01 / 04 AI Content Marketing Engine · Deliverable
Knowledge Base Dashboard

A single view of the curated knowledge base — what's in it, what's been processed, and how to add more.

What this screen does
Multi-modal source counts
Books, Videos, Web Articles, PDFs, and Podcasts tracked per source type — the breadth of the knowledge base at a glance.
Pipeline health
Total sources ingested, extracted, and embedded — every ingestion is staged so you can see what's ready for retrieval.
Quick actions
Add a source (any modality) or generate content directly from the dashboard.
Python · FastAPI · ChromaDB · Anthropic Claude · OpenAI · WordPress REST · ElevenLabs
Source library — ingested books, videos, PDFs, web articles, and podcast episodes
02 / 04 AI Content Marketing Engine · Deliverable
Source Library

Every source ingested into the knowledge base — searchable, filterable, and traceable from any generated article back to its origin.

What this screen does
Five ingestion modalities
YouTube transcripts, PDFs, eBooks, web articles, and podcast feeds — one schema across all.
Extraction + embedding status
Every source shows its processing state so it's clear what's available to the RAG retrieval layer.
Source attribution
Every generated article cites the exact sources used; click through to verify any claim.
Topics — discovered and curated article ideas with podcast-mined child topics
03 / 04 AI Content Marketing Engine · Deliverable
Topic Discovery

The pipeline of article ideas — sourced from the curated knowledge base and automatically mined from new podcast episodes.

What this screen does
Curated topic ideas
Each topic carries a title, description, source list, and status (pending review → ready to generate → article published).
Podcast child-topic mining
Claude extracts multiple article-worthy topics from each new podcast episode, with verbatim source segments as raw material.
Source-anchored
Every topic links back to the specific sources it draws from, so no article is generated without an evidence base.
Ready-to-generate queue
Promote a topic from candidate to writing-ready in a click; the generation pipeline picks it up.
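The topic lifecycle described above (pending review, ready to generate, article published) can be modeled as a small one-way state machine. Only pending_review appears as an identifier in the pipeline description; the other two status values and the promote function are hypothetical names for illustration.

```python
from enum import Enum


class TopicStatus(str, Enum):
    PENDING_REVIEW = "pending_review"
    READY_TO_GENERATE = "ready_to_generate"  # hypothetical identifier
    ARTICLE_PUBLISHED = "article_published"  # hypothetical identifier


# Each status has at most one successor; published is terminal.
_NEXT_STATUS = {
    TopicStatus.PENDING_REVIEW: TopicStatus.READY_TO_GENERATE,
    TopicStatus.READY_TO_GENERATE: TopicStatus.ARTICLE_PUBLISHED,
}


def promote(status: TopicStatus) -> TopicStatus:
    """Move a topic one step forward, rejecting moves past the final state."""
    if status not in _NEXT_STATUS:
        raise ValueError(f"{status.value} is a terminal status")
    return _NEXT_STATUS[status]
```

With transitions defined in one table, the generation pipeline only needs to pick up topics whose status is ready to generate.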
Generated articles — full content library with metadata per article
04 / 04 AI Content Marketing Engine · Deliverable
Generated Content

Every article the engine has produced — long-form, voice-aligned, source-cited, and ready for publishing across channels.

What this screen does
Article catalog
Every generated piece by title, with creation date, source count, and the model that wrote it.
Source-cited by default
Each article carries its source list inline; the review agent blocks publish on uncited metrics.
Voice-consistent output
All articles are generated against the same configurable voice template — same brand voice across topics and contributors.
Channel-ready
Long-form articles repurposed into LinkedIn, Twitter, and email variants from the same source-grounded draft.