Writing & Thoughts
The Blog
Deep-dives into AI, ML systems, data engineering, and full-stack development — written from the trenches.
A deep dive into the unstructured data archiving pipeline I built — covering content extraction with Tika, full-text indexing with Solr, and S3-compatible storage with Apache Ozone.
Most enterprise archiving tools are either too rigid (proprietary formats) or too loose (dump-and-forget cold storage). The goal was a pipeline that could extract content, index it for full-text search, and archive the original file — without locking anything into a proprietary format:
```
Files / APIs / Databases
          │
          ▼
    Apache NiFi
(Orchestration & routing)
          │
          ├──► Apache Tika
          │    (Content extraction)
          │
          ├──► Apache Solr
          │    (Full-text index)
          │
          └──► Apache Ozone
               (S3-compatible storage)
```
NiFi's visual dataflow makes it ideal for this kind of multi-sink pipeline. Key processors used:
One underappreciated NiFi feature: back-pressure. When Solr is slow, NiFi automatically throttles ingestion rather than dropping records.
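In NiFi this is configured per connection (back-pressure object-count and size thresholds); conceptually it is a bounded queue between producer and consumer. A minimal Python sketch of the idea — the queue size and record counts here are illustrative, not NiFi defaults:

```python
import queue
import threading

# A bounded queue models a NiFi connection with a back-pressure threshold:
# when the downstream (think: a slow Solr) lags, the queue fills and put()
# blocks the upstream instead of dropping records.
buffer = queue.Queue(maxsize=3)  # back-pressure threshold
processed = []

def consumer():
    while True:
        item = buffer.get()
        if item is None:        # poison pill: shut down cleanly
            break
        processed.append(item)  # stand-in for the slow Solr write
        buffer.task_done()

t = threading.Thread(target=consumer)
t.start()

for record in range(10):
    buffer.put(record)  # blocks once 3 items are queued -> throttled ingest
buffer.put(None)
t.join()
```

Every record arrives, just later — which is exactly the trade you want in an archive pipeline.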
Apache Tika handles 1,400+ file formats. Running it as a server:
```shell
java -jar tika-server-standard-2.9.0.jar --port 9998
```
Extract text and metadata via REST:
```python
import requests

TIKA = "http://localhost:9998"

def extract(file_path: str) -> dict:
    # Metadata: multipart upload to /meta/form, requesting JSON back
    # (without the Accept header, Tika defaults to CSV).
    with open(file_path, "rb") as f:
        meta = requests.put(
            f"{TIKA}/meta/form",
            files={"upload": f},
            headers={"Accept": "application/json"},
        ).json()
    # Plain-text body: raw PUT to /tika
    with open(file_path, "rb") as f:
        text = requests.put(
            f"{TIKA}/tika",
            data=f,
            headers={"Accept": "text/plain"},
        ).text
    return {"metadata": meta, "content": text}
```
For archiving, a flat schema with dynamic fields works well:
```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="title" type="string" indexed="true" stored="true"/>
<field name="created_at" type="pdate" indexed="true" stored="true"/>
<field name="file_type" type="string" indexed="true" stored="true"/>
<dynamicField name="meta_*" type="string" indexed="true" stored="true"/>
```
The content field is indexed but not stored — this saves significant disk space since the original file lives in Ozone.
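Mapping Tika's output onto this schema is a small transformation: known keys get first-class fields, everything else lands in the `meta_*` dynamic fields. A sketch — the helper name and the sample Tika metadata keys are illustrative:

```python
import hashlib

def to_solr_doc(file_path: str, meta: dict, content: str) -> dict:
    """Flatten Tika output into the Solr schema shown above."""
    doc = {
        # Stable id derived from the path; a content hash would also work.
        "id": hashlib.sha256(file_path.encode()).hexdigest(),
        "content": content,                      # indexed, not stored
        "title": meta.get("dc:title", file_path),
        "file_type": meta.get("Content-Type", "application/octet-stream"),
    }
    for key, value in meta.items():
        if key in ("dc:title", "Content-Type"):
            continue
        # Everything else goes to a meta_* dynamic field; sanitize the key
        # so it is a valid Solr field name.
        safe = key.replace(":", "_").replace("-", "_").lower()
        doc["meta_" + safe] = str(value)
    return doc
```

The resulting dict can be posted to Solr's `/update/json/docs` endpoint with a plain `requests.post`.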
Apache Ozone is the Hadoop ecosystem's next-generation object store, positioned as HDFS's successor for this role, with an S3-compatible API. Lifecycle rules can transition objects to cheaper storage tiers after 90 days:
```shell
ozone sh bucket setlifecycle /vol/archive-bucket \
  --lifecycle '{"Rules":[{"Status":"Enabled","Transitions":[{"Days":90,"StorageClass":"GLACIER"}]}]}'
```
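Because Ozone's S3 Gateway speaks the S3 wire protocol (on port 9878 by default), the upload step can use a stock S3 client such as boto3. A sketch — the endpoint, credentials, bucket name, and key layout below are conventions chosen for this example, not anything the pipeline mandates:

```python
import hashlib
from datetime import datetime, timezone

def archive_key(file_path: str, data: bytes) -> str:
    """Object key: date prefix + short content hash + original filename.

    The date prefix keeps listings manageable; the hash makes re-uploads
    of identical content land on the same key.
    """
    digest = hashlib.sha256(data).hexdigest()[:16]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    name = file_path.rsplit("/", 1)[-1]
    return f"{day}/{digest}/{name}"

if __name__ == "__main__":
    import boto3  # talks to Ozone's S3 Gateway like any S3 endpoint

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9878",  # Ozone S3 Gateway (assumed local)
        aws_access_key_id="ozone",             # placeholder credentials
        aws_secret_access_key="ozone",
    )
    data = open("report.pdf", "rb").read()
    s3.put_object(
        Bucket="archive-bucket",
        Key=archive_key("report.pdf", data),
        Body=data,
    )
```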