What is Fichero? It is a Python-based pipeline for large-scale document processing—think thousands of scanned pages, each requiring cropping, enhancement, OCR, and transcription.
I’ve had versions of Fichero working for six months, but they were slow: Fichero processed one step at a time, then the next. The challenge was taking advantage of the available computing power, multiple processors, without getting bogged down.
This week, I got fichero_director.py working, which uses Celery to run Fichero “workers” on multiple processes, each doing different steps at the same time.
Imagine one cook making 100 pizzas, one after the other, versus 8 cooks making 8 pizzas at a time.
How does it work? fichero_director.py breaks workflows down into CPU-intensive and I/O-intensive steps. For example, image tasks are CPU-heavy, while transcription using language models or converting to Word documents is mostly I/O-bound, waiting on disk or LLM inference. Tasks are sent to Celery queues based on script type.
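As a rough sketch of the idea (the step names and queue names here are illustrative, not Fichero’s actual API), routing a step to the right queue might look like:

```python
# Illustrative sketch: classify each pipeline step as CPU- or I/O-bound
# and pick a Celery queue name accordingly. Step names are hypothetical.
CPU_STEPS = {"crop", "enhance", "ocr"}   # image work: CPU-heavy
IO_STEPS = {"transcribe", "to_docx"}     # LLM calls, disk writes: I/O-bound

def queue_for(step: str) -> str:
    """Return the Celery queue name for a given pipeline step."""
    if step in CPU_STEPS:
        return "cpu"
    if step in IO_STEPS:
        return "io"
    raise ValueError(f"unknown step: {step}")
```

With Celery, a task could then be dispatched to its queue with something like `task.apply_async(queue=queue_for(step))`.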
Fichero_director tries to tune for the host system. On M1 Macs, for example, the hope is that CPU workers use the performance cores, while the more numerous I/O workers handle slower operations on the efficiency cores.
On my machine, I have 8 cores, and fichero_director.py uses them all.
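A minimal sketch of that tuning, assuming a rough half-and-half split between CPU and I/O workers (the split and the oversubscription factor are my illustrative choices, not Fichero’s actual numbers):

```python
import os

def plan_workers(total_cores=None):
    """Split available cores between CPU and I/O Celery workers.

    Illustrative heuristic: half the cores run CPU-bound workers,
    the rest host I/O workers, which get extra concurrency because
    they spend most of their time waiting on disk or an LLM.
    """
    cores = total_cores or os.cpu_count() or 4
    cpu_workers = max(1, cores // 2)   # pin heavy work to ~half the cores
    io_workers = cores - cpu_workers   # remaining cores host I/O workers
    io_concurrency = io_workers * 4    # I/O tasks block, so oversubscribe
    return {"cpu": cpu_workers, "io": io_workers,
            "io_concurrency": io_concurrency}
```

The resulting counts would then feed the `-c`/`--concurrency` option when starting each Celery worker.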
To follow along, there is a simple dashboard that tracks real-time progress, showing each folder’s status and current step. Each folder is processed independently, with logs written per folder and per step.
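Since each folder is processed independently, the per-folder, per-step log layout can be as simple as one file per step. A sketch, with a hypothetical path layout (Fichero’s actual log paths may differ):

```python
from pathlib import Path

def log_path(root: Path, folder: str, step: str) -> Path:
    """Per-folder, per-step log file, e.g. logs/folder_001/ocr.log.

    The directory layout here is illustrative, not Fichero's actual one.
    """
    path = root / "logs" / folder / f"{step}.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```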
