Fichero

Fichero transcribes and auto-catalogues historical archives using vision large language models and artificial intelligence, running locally or in the cloud. It is in early development. To learn more about Fichero, please read the Fichero development blog and the frequently asked questions.

Releases

0.0.3 – June 5, 2025

  • Added support for multi-step LLM processing, using llm_process.py, llm_utils.py, and a config file. Tested with OpenAI models. Config files set the prompts and batch sizes, decide whether pages are sent individually or in groups up to a chunk size based on max tokens, and allow combining results and passing along previous results.
  • Added a config file for auto-cataloguing with LLMs. It generates named entity recognition for people, places, organizations, and events, as well as specific things we are interested in (e.g. mines, animals, plants, injuries); a timeline of events; the most important people; ten tags or keywords; and a 150-word summary.
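
To illustrate the idea of sending pages in groups up to a chunk size based on max tokens, here is a minimal sketch. It is not the code in llm_process.py: the function names `estimate_tokens` and `chunk_pages` are hypothetical, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption): about four characters per token."""
    return max(1, len(text) // 4)


def chunk_pages(pages: List[str], max_tokens: int) -> List[List[str]]:
    """Group consecutive pages so each chunk's estimated size stays within max_tokens.

    A single page that is larger than the budget is still sent, on its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        # Close the current chunk if adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Setting `max_tokens` very low makes every page its own chunk, which corresponds to sending pages individually.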

0.0.2 – May 25, 2025

  • Added support for running on multiple processors in parallel and for making asynchronous transcription calls, via Fichero_director.py.
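
As a rough picture of what asynchronous transcription calls can look like, the sketch below runs several calls concurrently with a cap on how many are in flight at once. It is an assumption-laden illustration, not the contents of Fichero_director.py: `transcribe_page` is a hypothetical stand-in that only sleeps, and `max_concurrent` is an invented parameter.

```python
import asyncio
from typing import List


async def transcribe_page(page_id: str) -> str:
    """Hypothetical stand-in for a real transcription call (hosted API or local model)."""
    await asyncio.sleep(0.01)  # simulates waiting on the model
    return f"transcript of {page_id}"


async def transcribe_all(page_ids: List[str], max_concurrent: int = 4) -> List[str]:
    """Transcribe pages concurrently, never running more than max_concurrent calls at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(page_id: str) -> str:
        async with semaphore:
            return await transcribe_page(page_id)

    # gather() returns results in the same order as the input page_ids.
    return list(await asyncio.gather(*(worker(p) for p in page_ids)))
```

It could be called with `asyncio.run(transcribe_all(["page_001", "page_002"]))`. The cap matters because hosted services usually limit request rates, and a local model can only handle so many pages at once.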

0.0.1 – May 9, 2025

  • Initial development release, which crops, splits, rotates, enhances, removes background, and transcribes.

Frequently Asked Questions

What is ‘Fichero’?

It’s a digital tool being developed by me, Daniel Tubb, and Andy Janco, with support from many others. I am an anthropologist who has published work on the Chocó, in Colombia, and I have some experience with the programming language Python. Andy Janco is a systems engineer and also holds a PhD in history. Daniel is based at the University of New Brunswick in Canada, and Andy is at Princeton. Fichero is still very much a work in progress! The group testing it and providing input to Andy and Daniel consists of about six people.

What does Fichero do?

Fichero can process digitized documents: it first transcribes them and then generates automated cataloguing. This allows it to produce basic metadata so we can continue working with documents.

What has our collaborative process been like?

Andy and I began working with a group of researchers and young people working on or from the Chocó in 2023, in the context of a British Library project (EAP1477), where we held a workshop on how to produce metadata and develop a catalogue. With the Semillero de Jóvenes de Muntú Bantú, a social foundation focused on the African Diaspora, we began cataloguing with the goal of producing basic metadata for all the case files we had digitized. As part of that project, beyond cataloguing, each of us selected a case and developed a historical analysis, publishing our essays in a short book [Link].

Faced with the Istmina Circuit Court Archive, which contains more than 60,000 images we had digitized through EAP1477, we realized we wouldn’t be able to finish the cataloguing. That’s when we began experimenting with artificial intelligence. With Andy, our group tried Kraken, eScriptorium, and Transkribus. However, the segmentation was inconsistent, missing text was common, and the transcriptions of different styles of handwritten script were so inaccurate as to be unusable, no matter how much fine-tuning we did.

Those tools were fragile when dealing with mixed formats. Even after fine-tuning Kraken’s models, we made no real progress. For our group, it meant months of not seeing anything useful from AI.

During that time, the young people were reading and cataloguing manually.

Unlike the approaches we’d tried since 2023, Fichero now produces structured and reproducible transcriptions and catalogues of Colombian legal and historical texts. It’s still in the early stages of development, with Andy and me actively building it. But since September 2024, results from the open-source AI model Qwen-VL, provided by Alibaba, have been impressive.

So you’re sending everything to a server in China?

At the moment, Fichero uses Alibaba for transcription and OpenAI for auto-cataloguing. But the tool is designed so the user can choose between running a local model and sending documents to a model hosted on a server.

If you can choose, why send it to Qwen and ChatGPT, rather than process locally?

The results are better and arrive much faster, because we can process multiple documents at the same time. What would take weeks on a laptop takes days.

How exactly does Fichero work?

Fichero processes and transcribes large collections (hundreds or tens of thousands) of scanned or photographed documents. It crops, splits notebooks, cleans, segments, and runs OCR (Optical Character Recognition) or HTR (Handwritten Text Recognition) using AI, then outputs the results in formats like Word, automatically and at scale.

You can customize each step or run it end-to-end with a single command. Fichero is designed for researchers and archivists who need to transform thousands of archival pages into clean, searchable text.
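
The idea of running every step end-to-end, or only selected steps, can be sketched as a small pipeline. This is an illustrative toy, not Fichero’s actual command-line interface: the step names come from the description above, but `make_step`, `STEPS`, and `run_pipeline` are hypothetical, and each step here only records that it ran instead of touching an image.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, object]
Step = Callable[[Document], Document]


def make_step(name: str) -> Step:
    """Build a placeholder step that records its name instead of doing real work."""
    def step(doc: Document) -> Document:
        done = list(doc.get("done", []))
        done.append(name)
        return {**doc, "done": done}
    return step


# Step order follows the description above; the real implementations are not shown.
STEPS: Dict[str, Step] = {
    name: make_step(name)
    for name in ["crop", "split", "clean", "segment", "transcribe", "export"]
}


def run_pipeline(doc: Document, only: Optional[List[str]] = None) -> Document:
    """Run all steps in order, or just the named ones when `only` is given."""
    names = only if only is not None else list(STEPS)
    for name in names:
        doc = STEPS[name](doc)
    return doc
```

Calling `run_pipeline({"path": "page_001.jpg"})` runs all six placeholder steps, while passing `only=["crop", "transcribe"]` runs just those two.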

How do you use Fichero?

Carefully. Fichero is still in development on GitHub. If you’re comfortable using the terminal, installing Python apps, setting up environments, running LM Studio, or signing up for an API key from Alibaba, OpenAI, etc., then follow the instructions on GitHub (https://github.com/dtubb/fichero).

Otherwise, wait. We’re working on improving Fichero and on making it easier to use by packaging it as an app with Briefcase and Toga.

How are you making Fichero?

I am an anthropologist, an ethnographer, a writer. I’m also a closet, long-time Mac aficionado and a computer geek, and as a teenager in high school I dabbled in programming in Pascal, C, and C++.

But for two decades (2003 to 2022), I focused on learning Spanish, on writing, and on anthropological work. I wrote books and articles, got tenure, edited book reviews, etc. Then, in 2023, I visited Colombia on a project with Ann Farnsworth-Alvear and UPenn digital librarians Cynthia Heider (Public Digital Scholarship Librarian) and Andrew Janco (now Digital Scholarship Librarian at Haverford College). We got to talking about programming, AI, LLMs, etc. Cynthia recommended Al Sweigart’s Automate the Boring Stuff with Python.

Since 2023, I’ve been learning Python and working out how to make Fichero. Since then, what has become known as vibe coding has become a thing, although it was only named in February 2025. Vibe coding means code written in tandem with an AI. My contribution has been vibe coding with AI assistants, currently in Cursor, and getting crucial technical advice, code, and new and best practices from Andrew Janco. I ask dumb, obvious questions, and he gives wise answers. I’ve come to think of Fichero as a product of artisanal Artificial Intelligence.

Who can use Fichero?

It scratches our itch to catalogue an archive in Colombia that we are all interested in. But we expect Fichero to be useful beyond our small area. Its approaches could be useful for cataloguing any archive, and also for transforming any document or collection of images into structured research data.

How can I use Fichero? How can I help?

Follow along on our [development notes] page, or stay tuned here. If you can code and are interested in helping out, get in touch. See below.

How can I get in touch?

I’m happy to hear your thoughts. Email daniel@tubb.ca.