Real‑Time Sentiment Analysis API


💬 Overview

This post walks through a real-time sentiment analysis API I built to demonstrate ML engineering techniques, API design, caching strategies, and basic observability in a small code base. The service exposes REST endpoints for single and batch sentiment predictions, uses a Hugging Face transformer, caches results using Redis to reduce inference load, and surfaces operational metrics via Prometheus for visualization in Grafana.

🔗 GitHub - Real‑Time Sentiment API

Real‑World Applications

A few illustrative use cases for real-time sentiment analysis:

  • Entertainment: Monitor live audience reactions to TV moments like plot twists
  • Politics: Track sentiment of candidate mentions, baselined against opponents
  • Commerce: Detect shifts in customer sentiment after a new product version release

Features

  • 🚀 Low‑latency ML inference behind an API (FastAPI + Uvicorn).
  • ⚡ Caching for cost + performance benefits (Redis).
  • 📊 Instrumentation (Prometheus metrics, Grafana dashboards).
  • 📦 Reproducible deployment (Docker).

Architecture Diagram



🧠 What this Service Does

At its core, the API accepts text input and returns a sentiment label (POSITIVE, NEGATIVE, or NEUTRAL) plus a confidence score. Under the hood, it loads a transformer model (default: cardiffnlp/twitter-roberta-base, configurable) through the Hugging Face pipeline() abstraction so you don’t have to manage preprocessing or tokenization yourself.

Because transformer inference is relatively expensive compared with simple lookup work, repeated texts are cached in Redis. When an incoming request hits an already‑seen string (within a configurable TTL), the response is served from memory, drastically reducing latency and compute load.
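
Conceptually, the inference path boils down to a few lines. The sketch below is standalone rather than the repo's actual wrapper code; I'm using the sentiment-tuned cardiffnlp checkpoint here so the example runs on its own, whereas the service loads whatever MODEL_NAME is configured.

from transformers import pipeline

# pipeline() bundles tokenization, the forward pass, and label decoding.
# The model name is an assumption for this sketch; the service reads MODEL_NAME.
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

result = classifier("I really enjoy using this product")[0]
print(result)  # e.g. {'label': 'positive', 'score': 0.98}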


🏗 Architecture at a Glance

Component | Role
FastAPI app | Hosts /predict, /batch_predict, /health, /metrics, and /docs.
Hugging Face model | Transformer‑based sentiment classifier loaded via pipeline.
Redis | In‑memory cache keyed by normalized text to avoid repeat inference.
Prometheus | Scrapes /metrics for request counts, latency histograms, cache hits, etc.
Grafana | Dashboards for API performance & system health using Prometheus data.
Docker Compose | Spins up API, Redis, and monitoring stack locally with one command.

Project Structure

realtime-sentiment-api/
├─ src/
│  ├─ api/          # FastAPI routes & schemas
│  ├─ model/        # HF pipeline wrapper, load/warm logic
│  ├─ services/     # Caching, batching, business rules
│  └─ monitoring/   # Metrics helpers, Prometheus exports
├─ tests/
│  ├─ unit/
│  └─ integration/
├─ monitoring/      # Prometheus config, Grafana dashboards
├─ docker-compose.yml
└─ Dockerfile

🚀 Quick Start

Goal: Clone the repo, build the containers, and hit the API locally in under 5 minutes.

Before starting, install:

  • Docker 20.10+
  • Docker Compose 2.4+

1. Clone & Enter

git clone https://github.com/znimon/realtime-sentiment-api.git
cd realtime-sentiment-api

2. Launch the Stack

docker compose up -d --build

Docker Compose reads the docker-compose.yml, builds the API image, pulls Redis + monitoring images, and starts the networked services in dependency order.

3. Health Check

curl http://localhost:8000/health

You should get a JSON payload confirming service readiness (model load + Redis connectivity).
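
As a rough idea of what such a handler can look like (the route shape and response fields here are illustrative, not the repo's exact schema):

import redis.asyncio as redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.from_url("redis://redis:6379")  # "redis" resolves inside Compose

@app.get("/health")
async def health():
    # Readiness means the model object exists and Redis answers a PING.
    try:
        redis_ok = await cache.ping()
    except Exception:
        redis_ok = False
    model_ok = getattr(app.state, "classifier", None) is not None
    return {
        "status": "ok" if (redis_ok and model_ok) else "degraded",
        "redis": redis_ok,
        "model_loaded": model_ok,
    }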


Try the API

Interactive Docs

FastAPI auto‑generates interactive API docs at /docs (Swagger UI) and /redoc (ReDoc) from the OpenAPI schema it derives from the app’s type‑annotated Python code, which is super handy for quick testing.

Single Prediction

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "I really enjoy using this product"}'

The service returns the original text, sentiment label, confidence, processing time, and whether the response was served from cache.
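
In Pydantic terms, the response could be modeled roughly like the sketch below; the actual field names in the repo may differ.

from pydantic import BaseModel

class PredictionResponse(BaseModel):
    text: str                  # original input
    sentiment: str             # POSITIVE / NEGATIVE / NEUTRAL
    confidence: float          # model score for the predicted label
    processing_time_ms: float  # server-side inference + cache lookup time
    cached: bool               # True when served from Redis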

Batch Prediction

curl -X POST http://localhost:8000/batch_predict \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Love it", "This is awful", "Meh"]}'

Batching lets you amortize model overhead across multiple inputs.
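
Under the hood this maps naturally onto the Hugging Face pipeline, which accepts a list of strings and chunks it through the model, so per-call overhead is paid once per batch rather than once per text. A standalone sketch (again assuming the sentiment-tuned checkpoint rather than the repo's configured model):

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

texts = ["Love it", "This is awful", "Meh"]
results = classifier(texts, batch_size=32)  # mirrors the BATCH_SIZE setting
for text, result in zip(texts, results):
    print(f"{text!r} -> {result['label']} ({result['score']:.3f})")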


⚙️ Configuration

The service exposes several environment variables:

Variable | Default | What It Controls
REDIS_URL | redis://redis:6379 | Where to connect for caching.
MODEL_NAME | cardiffnlp/twitter-roberta-base | Hugging Face model to load via pipeline. Swap for domain‑specific models.
BATCH_SIZE | 32 | Max batch size for inference loops; throughput vs. latency tradeoffs.
CACHE_TTL | 3600 | Expiration (seconds) for cached predictions; tune freshness vs. cost.
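
The repo's settings handling isn't shown here, but a common pattern is to read these variables once at startup with the defaults above; a minimal sketch:

import os

REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379")
MODEL_NAME = os.getenv("MODEL_NAME", "cardiffnlp/twitter-roberta-base")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))  # seconds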

🧠 How Sentiment Inference Works (Under the Hood)

The API loads a transformer text‑classification pipeline from Hugging Face on startup, warming the model into memory so the first request isn’t cold. The pipeline() helper bundles tokenization, model forward pass, and decoding of label probabilities into simple callable usage.

Model warm‑up at process start is important in Uvicorn/Gunicorn multiprocess deployments; each worker loads its own model instance, so container memory sizing matters. Production deployments often pin worker count based on model RAM footprint.
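
One way to express that warm-up, sketched with FastAPI's lifespan hook (the repo may wire it differently, e.g. a startup event, and the model name is again an assumption):

from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import pipeline

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load once per worker process and run a dummy input so the first
    # real request doesn't pay the cold-start cost.
    app.state.classifier = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )
    app.state.classifier("warm-up")
    yield

app = FastAPI(lifespan=lifespan)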


⚡ Redis Caching Strategy

Why Cache?

Transformer inference is compute‑heavy (CPU or GPU) compared with a simple memory lookup. If traffic includes repeated texts (UI forms, common phrases, A/B test prompts), caching reduces latency and infra cost. Redis is an in‑memory key‑value store widely used for application caching.

TTL & Invalidation

The project exposes a CACHE_TTL so cached sentiments eventually expire. This is important when the underlying model changes.
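
Put together, the cache-aside flow looks roughly like this; the key scheme and function names are illustrative, not the repo's exact code.

import hashlib
import json

import redis
from transformers import pipeline

cache = redis.Redis.from_url("redis://localhost:6379")  # REDIS_URL in practice
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
CACHE_TTL = 3600  # seconds

def cached_predict(text: str) -> dict:
    # Key on a normalized form of the text so trivially different inputs still hit.
    key = "sentiment:" + hashlib.sha256(text.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = classifier(text)[0]
    cache.setex(key, CACHE_TTL, json.dumps(result))  # expire after the TTL
    return result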


📊 Observability: Prometheus + Grafana

Metrics Endpoint

The API publishes Prometheus‑formatted metrics at /metrics. Prometheus scrapes that endpoint on an interval you configure, storing time‑series data for requests, latencies, error counts, and cache effectiveness.
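
The instrumentation behind that endpoint can be as small as a few prometheus_client objects; the metric names below are illustrative rather than the repo's exact ones.

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter("sentiment_requests_total",
                        "Prediction requests", ["endpoint"])
CACHE_HITS = Counter("sentiment_cache_hits_total",
                     "Predictions served from the Redis cache")
LATENCY = Histogram("sentiment_request_latency_seconds",
                    "End-to-end prediction latency")

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # the path Prometheus scrapes

@app.post("/predict")
async def predict(payload: dict):
    REQUEST_COUNT.labels(endpoint="predict").inc()
    with LATENCY.time():
        result = {"label": "POSITIVE", "score": 0.99}  # stand-in for real inference
    return result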

Dashboards

Grafana connects to Prometheus as a data source and lets you build interactive dashboards: latency percentile panels, cache hit ratios, error rates, CPU/memory graphs for the containers, plus alert rules that notify you when service level objectives (SLOs) are threatened.


📈 Sample Performance Numbers

I profiled three scenarios locally using the default model and container settings. Results will vary by hardware, but the relationship between cached vs. uncached paths is instructive:

Scenario | Throughput | P95 Latency | Cache Hit Rate
Single (cached) | 1200 req/s | 15 ms | 100%
Single (uncached) | 200 req/s | 450 ms | 0%
Batch (10 items) | 80 req/s | 600 ms | 40%

The step‑function jump when hitting Redis demonstrates why layering a cache in front of heavy ML models is standard practice in production systems.


📚 Learning Takeaways

System Design

  • Designed a stateless API layer backed by a stateful Redis cache to decouple inference cost from request volume.
  • Selected FastAPI for performance, async support, self‑documenting endpoints, and ease of demoing.
  • Modeled both single and batch endpoints to illustrate latency vs. throughput tradeoffs.
  • Exposed Prometheus metrics to make performance measurable and observable.

ML Engineering

  • Wrapped a Hugging Face transformer with a service boundary instead of embedding inference into each app caller; centralizes model versioning.
  • Demonstrated model warm‑up & caching patterns to reduce tail latency.

Operations

  • Containerized everything and used Docker Compose for reproducible builds.
  • Provided Grafana dashboards to visualize SLOs and error signals.

📦 Closing Thoughts

This project delivers a real‑time sentiment analysis API that’s both practical and modular:

  • It wraps a Hugging Face transformer in a FastAPI web service.
  • By layering in Redis caching, it minimizes inference overhead and accelerates repeat request responses.
  • It integrates Prometheus and Grafana for observability: tracking latency, cache performance, and errors.
  • It's fully containerized with Docker for reproducible deployment.

If you found this project useful and want to see more like it, your support keeps the ideas brewing:

Buy me a coffee

This post is licensed under CC BY 4.0 by the author.