Portfolio · Case Study

Diego S.
Data Science · Customer Segmentation · Predictive Modeling · Marketing Analytics
Education
Florida International University
BS · Computer Science
Florida International University
MBA · Marketing & E-Commerce
Pennsylvania State University
Master of Applied Statistics
Available now
Expert-Vetted
Top Rated
100% Job Success
Case Study
Customer Segmentation
& LTV Analytics
Bilingual DTC Brand
Health Supplements
E-Commerce
83K
Customers analyzed
4
Behavioral segments
2
Storefronts (EN + ES)
Project overview
Key project facts
Client
Bilingual DTC Health-Supplements Brand
Industry
E-Commerce · Health & Wellness
Engagement type
Data Science · Consulting
Platform
Databricks · PySpark
Scope
Segmentation · LTV · Migration
Technical stack
Databricks · PySpark · Python · scikit-learn · XGBoost · KMeans · statsmodels · Delta Lake · Shopify API · Google Analytics API · Tableau
Engagement summary

A customer analytics platform for a bilingual DTC health-supplements brand running separate English and Spanish Shopify storefronts (~83K combined customers). Order, customer, and web-analytics data from both storefronts were ingested through PySpark pipelines on Databricks into a medallion (Bronze/Silver/Gold) Delta Lake architecture, then unified into a single customer table keyed on email. Customers were segmented with RFM — recency, frequency, and monetary value — using KMeans on each dimension, producing four behavioral segments: Window-Shoppers, One-Timers, Emerging Loyalists, and Loyalists. A cohort-based XGBoost model then predicted which lifetime-value tier a customer would reach from only their first four months of activity, and a migration analysis tracked how cohorts moved between segments over time. The findings drove a segment-level marketing strategy spanning discounting, channel investment, and reactivation.

20% → 50%
Loyalist Concentration
A cohort's Loyalists (~20% of customers) placed over 50% of all orders
81%
One-Timer Stickiness
Share of One-Timers still in the segment a year later; pinpoints the reactivation opportunity
$163
Highest-LTV Channel
Email had the highest per-customer LTV despite a small order share
Technical Design · Customer Analytics
Data Pipeline
Databricks · PySpark · Medallion Architecture
Sources
Shopify EN
English storefront · orders + customers
Shopify ES
Spanish storefront · orders + customers
Google Analytics
web traffic · channel attribution
Call Center
support + order data
Ingestion
PySpark Pipelines
on Databricks
Shopify API
orders_en/es · customers_en/es
GA Reporting API
sessions · sources
Scheduled refresh
batch
Storage
Bronze
raw ingested
Silver
cleaned + conformed
Gold
Delta Lake · analytics-ready
Features
Unified customer table
keyed on email · EN + ES merged
RFM features
recency · frequency · monetary
Order metrics
AOV · units/order · days-between
Cohort assignment
monthly acquisition cohorts
Scope
~83K customers
EN + ES combined
One-year window
13 monthly cohorts
Delta tables
gold layer
Bilingual
English + Spanish
Technical Approach

Two separate Shopify storefronts — English and Spanish — plus Google Analytics and call-center data were ingested through PySpark pipelines on Databricks into a medallion (Bronze/Silver/Gold) Delta Lake architecture. Both customer bases were unified into a single table keyed on email, with storefront kept as a dimension, then enriched with RFM features and monthly cohort assignments.
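
As a rough illustration of the email-keyed unification described above, the sketch below merges per-storefront customer rows into one record per address in plain Python. The row fields (`email`, `storefront`, `orders`, `revenue`), the `UnifiedCustomer` class, and the lower-case/trim normalization are assumptions made for the example. The actual pipeline runs in PySpark on Databricks, and its matching rules are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedCustomer:
    email: str                                     # normalized email, the join key
    storefronts: set = field(default_factory=set)  # storefront kept as a dimension
    order_count: int = 0
    revenue: float = 0.0


def normalize_email(raw: str) -> str:
    """Lower-case and trim so the same address matches across storefronts (assumed rule)."""
    return raw.strip().lower()


def unify_customers(rows):
    """Merge per-storefront customer rows into one record per normalized email."""
    unified = {}
    for row in rows:
        key = normalize_email(row["email"])
        customer = unified.setdefault(key, UnifiedCustomer(email=key))
        customer.storefronts.add(row["storefront"])
        customer.order_count += row["orders"]
        customer.revenue += row["revenue"]
    return unified


# Made-up rows: the first two are the same person on both storefronts.
rows = [
    {"email": "Ana@Example.com ", "storefront": "ES", "orders": 2, "revenue": 80.0},
    {"email": "ana@example.com", "storefront": "EN", "orders": 1, "revenue": 45.0},
    {"email": "bo@example.com", "storefront": "EN", "orders": 1, "revenue": 30.0},
]
customers = unify_customers(rows)
for customer in customers.values():
    print(customer)
```

Keeping the storefront as a set on the unified record is one way to retain it as a dimension. A per-order storefront column would work equally well.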

Key Decisions
Unify EN + ES on email — one segmentation model across both storefronts, with language/storefront retained as a dimension rather than split into two models.
Medallion architecture — Bronze/Silver/Gold Delta tables keep raw, cleaned, and analytics-ready data separate and reproducible.
Cohort assignment at feature time — every customer is tagged to a monthly acquisition cohort so retention and migration can be measured over time.
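
The cohort tagging in the last decision can be sketched in a few lines. This version assumes the cohort is the calendar month of a customer's first order, labeled `YYYY-MM`. The function names and the `(email, order_date)` input shape are hypothetical.

```python
from datetime import date


def cohort_month(first_order: date) -> str:
    """Label a cohort by the calendar month of the first order, e.g. '2023-01'."""
    return f"{first_order.year:04d}-{first_order.month:02d}"


def months_since_acquisition(first_order: date, order_date: date) -> int:
    """Whole calendar months between the acquisition month and a later order."""
    return (order_date.year - first_order.year) * 12 + (order_date.month - first_order.month)


def assign_cohorts(orders):
    """orders: iterable of (email, order_date). Returns {email: cohort label}."""
    first_seen = {}
    for email, order_date in orders:
        if email not in first_seen or order_date < first_seen[email]:
            first_seen[email] = order_date
    return {email: cohort_month(first) for email, first in first_seen.items()}


# Made-up orders, deliberately out of date order.
orders = [
    ("ana@example.com", date(2023, 3, 14)),
    ("ana@example.com", date(2023, 1, 20)),
    ("bo@example.com", date(2023, 2, 2)),
]
cohorts = assign_cohorts(orders)
print(cohorts)
```

With a cohort label and a months-since-acquisition offset on every order, retention and migration can be compared across cohorts on the same relative timeline.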
Technical Design · Customer Analytics
Segmentation & LTV Modeling
RFM · KMeans · XGBoost
1
Inputs
Recency · Frequency · Monetary
Per customer: days since last order, order count, total revenue since acquisition · computed on the unified customer table
per-customer RFM
2
Clustering
KMeans · R / F / M
Separate KMeans clustering on each RFM dimension produces a recency cluster, frequency cluster, and revenue cluster — each ranked low to high
KMeans × 3
3
Scoring
Overall Score → 4 Segments
The three cluster ranks sum into an overall RFM score, mapped to four behavioral segments
4 segments
Segments
Window-Shoppers · One-Timers · Emerging Loyalists · Loyalists
4
Temporal Setup
First-4-Months Cohort Scoring
Each monthly cohort is scored on its first 4 months of activity; segments are re-scored through the full window to measure migration — a clean train/predict time separation
13 monthly cohorts
5
Prediction
XGBoost Multiclass
XGBoost (multi:softprob) with balanced class weights predicts which LTV tier a customer will reach from early behavior · evaluated on a held-out test set
multi:softprob
6
Outputs
Migration + Targeting
Segment migration matrices, discount-sensitivity by segment, and channel-by-segment analysis feed a segment-level marketing strategy
strategy inputs
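
To make the temporal setup in step 4 concrete, here is a minimal sketch of the feature/target split for one customer. It assumes the four-month window starts on the first day of the acquisition month and that the target is derived from revenue over the full window. The feature names are placeholders, and the rule that turns lifetime revenue into an LTV tier is not given in the case study, so it is left out.

```python
from datetime import date

FEATURE_WINDOW_MONTHS = 4  # from the case study: first four months of activity


def window_end(cohort_start: date, months: int) -> date:
    """First day of the month `months` after cohort_start (exclusive upper bound)."""
    total = cohort_start.year * 12 + (cohort_start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def split_features_and_target(cohort_start: date, orders):
    """orders: list of (order_date, amount) for one customer.

    Features see only orders inside the early window; the target basis uses all orders.
    """
    cutoff = window_end(cohort_start, FEATURE_WINDOW_MONTHS)
    early = [(d, amount) for d, amount in orders if cohort_start <= d < cutoff]
    features = {
        "early_orders": len(early),
        "early_revenue": sum(amount for _, amount in early),
    }
    lifetime_revenue = sum(amount for _, amount in orders)
    return features, lifetime_revenue


# Made-up customer in the January 2023 cohort.
orders = [(date(2023, 1, 10), 40.0), (date(2023, 3, 2), 35.0), (date(2023, 9, 15), 60.0)]
features, lifetime_revenue = split_features_and_target(date(2023, 1, 1), orders)
print(features, lifetime_revenue)
```

The September order contributes to the target but never to the features, which is the separation the step describes.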
Approach

Customers are segmented with classic RFM, but each dimension is clustered independently with KMeans rather than bucketed by fixed thresholds — so the cut points adapt to the actual distribution. Segments feed a cohort-based LTV model that predicts a customer's eventual value tier from only their first four months of behavior.
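
The per-dimension clustering and score summing can be sketched as follows. To stay dependency-free, this example swaps in a small one-dimensional Lloyd's k-means written in plain Python instead of scikit-learn's `KMeans`. The choice of `k=3`, the flipping of the recency rank so that recent buyers score higher, and the score bands that map to segment names are all assumptions, since the case study does not publish them.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on one dimension; returns centroids sorted ascending."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else centroids[j] for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def cluster_rank(value, centroids):
    """Index of the nearest centroid; centroids are sorted, so this is a low-to-high rank."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


# Hypothetical (minimum score, segment) bands for k=3, where scores run 0..6.
SCORE_BANDS = [(5, "Loyalists"), (4, "Emerging Loyalists"), (2, "One-Timers"), (0, "Window-Shoppers")]


def segment_for(score):
    for minimum, name in SCORE_BANDS:
        if score >= minimum:
            return name
    raise ValueError(f"score below all bands: {score}")


def score_customers(rfm, k=3):
    """rfm: {customer_id: (recency_days, frequency, monetary)}. Returns {customer_id: score}."""
    ids = list(rfm)
    scores = {cid: 0 for cid in ids}
    for dim in range(3):
        column = [rfm[cid][dim] for cid in ids]
        centroids = kmeans_1d(column, k)
        for cid, value in zip(ids, column):
            rank = cluster_rank(value, centroids)
            # Recency is days since last order, so a low value is good: flip its rank.
            scores[cid] += (k - 1 - rank) if dim == 0 else rank
    return scores


# Made-up customers: (recency in days, order count, total revenue).
rfm = {
    "a": (400, 1, 50.0), "b": (380, 1, 60.0),
    "c": (200, 3, 150.0), "d": (180, 3, 180.0),
    "e": (30, 6, 450.0), "f": (20, 5, 420.0),
}
scores = score_customers(rfm)
segments = {cid: segment_for(score) for cid, score in scores.items()}
print(scores)
print(segments)
```

Because each dimension is clustered on its own, the cut points come from where the values actually bunch up, which is the property the approach relies on.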

Key Decisions
KMeans per RFM dimension — data-driven cut points instead of arbitrary quantile thresholds; the clusters reflect how this customer base actually distributes.
Cohort time-separation — features come from the first 4 months, the target from the full lifetime. Early behavior predicts eventual tier without seeing the future.
Balanced class weights — the LTV tiers are imbalanced (few high-value customers), so class weighting keeps the model from collapsing to the majority tier.
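
One common way to get balanced class weights is the heuristic `n_samples / (n_classes * class_count)`, which is what scikit-learn's "balanced" mode computes. The sketch below applies it to made-up tier labels and expands the result into one weight per row. Passing those per-row weights to XGBoost through `sample_weight` at fit time is an assumption about how the weighting was applied, not something the case study states.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up, imbalanced LTV tiers: few high-value customers.
tiers = ["low"] * 6 + ["mid"] * 3 + ["high"] * 1
weights = balanced_class_weights(tiers)

# One weight per training row, in the same order as the labels.
sample_weight = [weights[tier] for tier in tiers]

print(weights)
```

Under this heuristic every class contributes the same total weight, so the single high-tier customer counts as much in the loss as the six low-tier customers combined.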
Analysis findings
A loyal core drives the revenue
A small, loyal core drives most of the revenue — and each segment responds to discounts differently.
Across monthly cohorts, the Loyalist segment (roughly a fifth of customers) consistently placed more than half of all orders. The One-Timer segment, by contrast, is highly static: 81% are still One-Timers a year later, so most never advance to a higher-value segment on their own. One-Timers are also the most discount-sensitive segment, which makes targeted second-order offers the clearest lever for reactivation. A flash-sale analysis confirmed a loss-leader effect: the sale produced a record month for repeat-customer revenue, with roughly half the products discounted and half sold at full price.
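
The concentration claim is a comparison of two shares per segment: share of customers and share of orders. A minimal sketch of that calculation, using made-up toy counts chosen only so the output resembles the pattern described (the function name and input shape are hypothetical):

```python
from collections import defaultdict


def concentration_by_segment(customers):
    """customers: list of (segment, order_count) pairs, one per customer."""
    customer_counts = defaultdict(int)
    order_counts = defaultdict(int)
    for segment, orders in customers:
        customer_counts[segment] += 1
        order_counts[segment] += orders
    total_customers = len(customers)
    total_orders = sum(order_counts.values())
    return {
        segment: {
            "customer_share": customer_counts[segment] / total_customers,
            "order_share": order_counts[segment] / total_orders,
        }
        for segment in customer_counts
    }


# Toy cohort of ten customers; not the client's data.
toy_cohort = (
    [("Loyalists", 6), ("Loyalists", 6)]
    + [("Emerging Loyalists", 2)] * 3
    + [("One-Timers", 1)] * 5
)
shares = concentration_by_segment(toy_cohort)
for segment, row in shares.items():
    print(f"{segment}: {row['customer_share']:.0%} of customers, {row['order_share']:.0%} of orders")
```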
Before
One blended customer base with no view of who drives value, how segments migrate, or which customers respond to discounts.
After
Four behavioral segments with measured revenue concentration, migration likelihood, and discount sensitivity — each mapped to a specific marketing action.
20% → 50%
Loyalist concentration
~20% of a cohort's customers placed over half of all orders
81%
One-Timer stickiness
Stay in segment — the reactivation target, and the most discount-responsive group
$380K
Flash-sale record month
Highest repeat-customer revenue month — validated a loss-leader discounting strategy
RFM segment overview
01 / 04 · Customer Segmentation & LTV · Output
RFM Segment Overview

The full base scored on recency, frequency, and monetary value and split into four behavioral segments, each profiled by AOV, lifetime value, units-per-order, and subscription rate.

What this shows
Four behavioral segments
Window-Shoppers, One-Timers, Emerging Loyalists, and Loyalists, sized by share of the customer base.
Value per segment
AOV, LTV, and units-per-order quantified for each segment, so spend can follow value.
Subscription penetration
Subscription rate climbs sharply from One-Timers toward Loyalists.
Income distribution
Each segment's income mix surfaced to sharpen targeting.
Databricks · PySpark · K-Means · XGBoost · Figures illustrative
Acquisition source mix and value
02 / 04 · Customer Segmentation & LTV · Output
Acquisition Source & Lifetime Value

Where customers come from paired with what each channel is actually worth: traffic mix alongside AOV, LTV, and units-per-order by source.

What this shows
Channel mix
Direct drives 53% of customers; paid channels are a deliberate minority.
Value by source
Email and Direct customers carry the highest LTV; Paid Social the lowest.
LTV, not just volume
Reframes acquisition around lifetime value rather than raw traffic share.
Budget implication
Pinpoints which channels deserve more spend per acquired customer.
GA4 · Shopify · Databricks · Figures illustrative
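
Reading value by source means putting per-customer LTV next to each channel's share of orders. The sketch below shows one way to compute both, plus AOV, from per-customer rows. The `(channel, orders, revenue)` row shape, the function name, and all numbers are made up for illustration.

```python
from collections import defaultdict


def channel_summary(customers):
    """customers: list of (channel, orders, revenue), one row per customer."""
    totals = defaultdict(lambda: {"customers": 0, "orders": 0, "revenue": 0.0})
    for channel, orders, revenue in customers:
        totals[channel]["customers"] += 1
        totals[channel]["orders"] += orders
        totals[channel]["revenue"] += revenue
    all_orders = sum(t["orders"] for t in totals.values())
    return {
        channel: {
            "ltv": t["revenue"] / t["customers"],                   # revenue per customer
            "aov": t["revenue"] / t["orders"] if t["orders"] else 0.0,
            "order_share": t["orders"] / all_orders,
        }
        for channel, t in totals.items()
    }


# Toy rows; not the client's data.
toy_customers = (
    [("Direct", 2, 100.0), ("Direct", 1, 40.0), ("Direct", 3, 160.0)]
    + [("Email", 3, 180.0)]
    + [("Paid Social", 1, 30.0)] * 4
)
summary = channel_summary(toy_customers)
for channel, row in sorted(summary.items(), key=lambda item: -item[1]["ltv"]):
    print(f"{channel}: LTV ${row['ltv']:.0f}, AOV ${row['aov']:.0f}, {row['order_share']:.0%} of orders")
```

In the toy data a channel can rank first on LTV while ranking last on order share, which is the kind of gap a volume-only view hides.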
Segment economics and cohort migration
03 / 04 · Customer Segmentation & LTV · Output
Segment Economics & Cohort Migration

How the segments compare on RFM and revenue, and how a single monthly cohort migrates toward Loyalist status over a full year.

What this shows
RFM characteristics
Loyalists average 5.3 orders and $437 revenue at 96-day recency, versus 1.0 orders, $63 revenue, and 387-day recency for One-Timers.
Revenue concentration
Loyalists are a small slice of the cohort but the majority of its revenue.
Cohort migration
Order share shifts from One-Timer-heavy to roughly 97% Loyalist across twelve months.
Retention signal
Quantifies how quickly the base consolidates into repeat buyers.
Databricks · PySpark · cohort analysis · Figures illustrative
Segment migration dynamics
04 / 04 · Customer Segmentation & LTV · Output
Segment Migration Dynamics

Transition probabilities between segments a year on, showing which customers advance, hold, or slip back, and where intervention pays off.

What this shows
One-Timers are sticky
81% remain One-Timers a year later; only 12% advance to Emerging Loyalist, 7% to Loyalist.
Emerging Loyalists split
59% hold, 17% advance to Loyalist, and 24% slip back to One-Timer.
Loyalist retention
55% stay Loyalist; 44% soften to Emerging and only 1% lapse fully.
Highest-leverage nudge
The One-Timer to Emerging jump is where targeted offers move the most value.
Databricks · PySpark · transition modeling · Figures illustrative
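
Transition probabilities of this kind are row-normalized counts of (starting segment, segment a year on) pairs. A minimal sketch follows. The pair counts are constructed at 100 customers per starting segment purely so the output reproduces the illustrative One-Timer and Emerging Loyalist percentages quoted above. They are not real counts, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(pairs):
    """pairs: iterable of (start_segment, end_segment), one per customer.

    Returns {start: {end: probability}} with each row summing to 1.
    """
    counts = defaultdict(Counter)
    for start, end in pairs:
        counts[start][end] += 1
    matrix = {}
    for start, row in counts.items():
        total = sum(row.values())
        matrix[start] = {end: n / total for end, n in row.items()}
    return matrix


# Constructed counts (100 per starting segment), matching the quoted percentages.
pairs = (
    [("One-Timers", "One-Timers")] * 81
    + [("One-Timers", "Emerging Loyalists")] * 12
    + [("One-Timers", "Loyalists")] * 7
    + [("Emerging Loyalists", "Emerging Loyalists")] * 59
    + [("Emerging Loyalists", "Loyalists")] * 17
    + [("Emerging Loyalists", "One-Timers")] * 24
)
matrix = transition_matrix(pairs)
for start, row in matrix.items():
    print(start, {end: round(p, 2) for end, p in row.items()})
```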