Writing & Thoughts
The Blog
Deep-dives into AI, ML systems, data engineering, and full-stack development — written from the trenches.
A deep dive into the unstructured data archiving pipeline I built — covering content extraction with Tika, full-text indexing with Solr, and S3-compatible storage with Apache Ozone.
Most enterprise archiving tools are either too rigid (proprietary formats) or too loose (dump-and-forget cold storage). The goal was a pipeline that could extract content, index it for full-text search, and archive the original file — without locking anything into a proprietary format:
```
Files / APIs / Databases
          │
          ▼
    Apache NiFi
(Orchestration & routing)
          │
          ├──► Apache Tika
          │    (Content extraction)
          │
          ├──► Apache Solr
          │    (Full-text index)
          │
          └──► Apache Ozone
               (S3-compatible storage)
```
NiFi's visual dataflow makes it ideal for this kind of multi-sink pipeline. Key processors used:
One underappreciated NiFi feature: back-pressure. When Solr is slow, NiFi automatically throttles ingestion rather than dropping records.
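In NiFi this is configured per connection (back-pressure object-count and size thresholds); conceptually it is a bounded queue between producer and consumer. A minimal Python sketch of the idea — the queue size and record counts here are illustrative, not NiFi defaults:

```python
import queue
import threading

# A bounded queue models a NiFi connection with a back-pressure threshold:
# when the downstream (think: a slow Solr) lags, the queue fills and put()
# blocks the upstream instead of dropping records.
buffer = queue.Queue(maxsize=3)  # back-pressure threshold
processed = []

def consumer():
    while True:
        item = buffer.get()
        if item is None:        # poison pill: shut down cleanly
            break
        processed.append(item)  # stand-in for the slow Solr write
        buffer.task_done()

t = threading.Thread(target=consumer)
t.start()

for record in range(10):
    buffer.put(record)  # blocks once 3 items are queued -> throttled ingest
buffer.put(None)
t.join()
```

Every record arrives, just later — which is exactly the trade you want in an archive pipeline.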
Apache Tika handles 1,400+ file formats. Running it as a server:
```shell
java -jar tika-server-standard-2.9.0.jar --port 9998
```
Extract text and metadata via REST:
```python
import requests

TIKA = "http://localhost:9998"

def extract(file_path: str) -> dict:
    # Metadata: multipart upload to /meta/form, requesting JSON back
    # (without the Accept header, Tika defaults to CSV).
    with open(file_path, "rb") as f:
        meta = requests.put(
            f"{TIKA}/meta/form",
            files={"upload": f},
            headers={"Accept": "application/json"},
        ).json()
    # Plain-text body: raw PUT to /tika
    with open(file_path, "rb") as f:
        text = requests.put(
            f"{TIKA}/tika",
            data=f,
            headers={"Accept": "text/plain"},
        ).text
    return {"metadata": meta, "content": text}
```
For archiving, a flat schema with dynamic fields works well:
```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="title" type="string" indexed="true" stored="true"/>
<field name="created_at" type="pdate" indexed="true" stored="true"/>
<field name="file_type" type="string" indexed="true" stored="true"/>
<dynamicField name="meta_*" type="string" indexed="true" stored="true"/>
```
The content field is indexed but not stored — this saves significant disk space since the original file lives in Ozone.
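Mapping Tika's output onto this schema is a small transformation: known keys get first-class fields, everything else lands in the `meta_*` dynamic fields. A sketch — the helper name and the sample Tika metadata keys are illustrative:

```python
import hashlib

def to_solr_doc(file_path: str, meta: dict, content: str) -> dict:
    """Flatten Tika output into the Solr schema shown above."""
    doc = {
        # Stable id derived from the path; a content hash would also work.
        "id": hashlib.sha256(file_path.encode()).hexdigest(),
        "content": content,                      # indexed, not stored
        "title": meta.get("dc:title", file_path),
        "file_type": meta.get("Content-Type", "application/octet-stream"),
    }
    for key, value in meta.items():
        if key in ("dc:title", "Content-Type"):
            continue
        # Everything else goes to a meta_* dynamic field; sanitize the key
        # so it is a valid Solr field name.
        safe = key.replace(":", "_").replace("-", "_").lower()
        doc["meta_" + safe] = str(value)
    return doc
```

The resulting dict can be posted to Solr's `/update/json/docs` endpoint with a plain `requests.post`.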
Apache Ozone is the Hadoop ecosystem's next-generation object store, positioned as HDFS's successor for this role, with an S3-compatible API. Lifecycle rules can transition objects to cheaper storage tiers after 90 days:
```shell
ozone sh bucket setlifecycle /vol/archive-bucket \
  --lifecycle '{"Rules":[{"Status":"Enabled","Transitions":[{"Days":90,"StorageClass":"GLACIER"}]}]}'
```
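Because Ozone's S3 Gateway speaks the S3 wire protocol (on port 9878 by default), the upload step can use a stock S3 client such as boto3. A sketch — the endpoint, credentials, bucket name, and key layout below are conventions chosen for this example, not anything the pipeline mandates:

```python
import hashlib
from datetime import datetime, timezone

def archive_key(file_path: str, data: bytes) -> str:
    """Object key: date prefix + short content hash + original filename.

    The date prefix keeps listings manageable; the hash makes re-uploads
    of identical content land on the same key.
    """
    digest = hashlib.sha256(data).hexdigest()[:16]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    name = file_path.rsplit("/", 1)[-1]
    return f"{day}/{digest}/{name}"

if __name__ == "__main__":
    import boto3  # talks to Ozone's S3 Gateway like any S3 endpoint

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9878",  # Ozone S3 Gateway (assumed local)
        aws_access_key_id="ozone",             # placeholder credentials
        aws_secret_access_key="ozone",
    )
    data = open("report.pdf", "rb").read()
    s3.put_object(
        Bucket="archive-bucket",
        Key=archive_key("report.pdf", data),
        Body=data,
    )
```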