Fichero Update and Working Towards 0.1.0-dev

Where am I in Fichero?

I spent the two weeks at the end of June doing a pretty major rewrite. So what did that involve?

I had a director.py module that handled all the parallel scripts. It was a mammoth: 1200 lines of disorganized spaghetti code. It ran on Celery/Redis, as a way of doing multiprocessing on separate processes. But how to share data? As a hack, it just duplicated the folders to process and ran the tools on multiple sets of folders simultaneously. I liked it because it worked well enough. The problem was, director.py was super complex, very interconnected, and hard to maintain. So, I spent a week in June refactoring it.

Now, there are various modules of a director backend that each document window can talk to. “Take this folder,” and it processes it. The director.py sends files to various modules, and keeps track of various things.

It prepares the folders, copying them into place.

Then a workflow executor loads the plan files, which are stored either in the application’s resources folder, or in the default file locations for Linux, macOS, or Windows settings. The plan is a modified YAML version of the Weasel format, but with the Weasel-specific parts removed and simplified for the workflow executor. The executor communicates with a backend for multiprocessing, which then calls the tools.
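A minimal sketch of how that plan lookup could work, assuming a settings folder named "fichero" with a "plans" subfolder (both names are hypothetical, as is the choice to check the user's folder before the bundled resources):

```python
import os
import sys
from pathlib import Path
from typing import Optional

APP_NAME = "fichero"  # hypothetical folder name, not taken from the app


def user_config_dir(platform: str = sys.platform, home: Optional[Path] = None) -> Path:
    """Return a conventional per-user settings folder for the platform."""
    home = home or Path.home()
    if platform == "darwin":
        return home / "Library" / "Application Support" / APP_NAME
    if platform.startswith("win"):
        base = os.environ.get("APPDATA")
        return (Path(base) if base else home / "AppData" / "Roaming") / APP_NAME
    # Linux and other Unix-likes: follow the XDG convention
    base = os.environ.get("XDG_CONFIG_HOME")
    return (Path(base) if base else home / ".config") / APP_NAME


def find_plan(name: str, resources: Path, platform: str = sys.platform) -> Optional[Path]:
    """Look in the user's settings folder first, then in the app's resources."""
    for folder in (user_config_dir(platform) / "plans", resources / "plans"):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None
```

Checking the user's folder first would let someone override a bundled plan without touching the application itself.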

The multiprocessing backend is either Celery/Redis, or a Python-native backend using concurrent.futures and threads.

I’ve got the Python-native backend working well, but haven’t tested Celery/Redis recently. So, I turned it off.

For both, we have two kinds of workers:
• CPU workers, for CPU-intensive tasks (like image manipulation), and
• I/O workers, which are much less intensive and handle tasks that involve waiting—e.g., writing to disk or waiting on network access (especially inference from the big language models).

At present, the Python backend only works with the CPU workers. You can set how many of them to run.

The way it will work is: we spin up four CPU workers to handle image processing, and once that’s complete, the images get passed to the I/O workers. Ideally, the four CPU workers can keep things moving efficiently. This works well enough. But, I need to get the I/O workers working again for the Python backend.
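The two-pool idea can be sketched with concurrent.futures alone. This is an illustration, not Fichero's code: process_image and transcribe are hypothetical stand-ins, and the I/O worker count of eight is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_image(name: str) -> str:
    """Stand-in for a CPU-heavy step such as cropping or enhancement."""
    return name.replace(".jpg", ".processed.jpg")


def transcribe(name: str) -> str:
    """Stand-in for an I/O-bound step such as waiting on a model or the disk."""
    return name.replace(".processed.jpg", ".txt")


def run_pipeline(images: list[str], cpu_workers: int = 4, io_workers: int = 8) -> list[str]:
    with ThreadPoolExecutor(max_workers=cpu_workers) as cpu_pool, \
         ThreadPoolExecutor(max_workers=io_workers) as io_pool:
        # Stage 1: image processing on the small CPU pool
        processed = list(cpu_pool.map(process_image, images))
        # Stage 2: once that is complete, hand the results to the I/O pool
        return list(io_pool.map(transcribe, processed))
```

For example, run_pipeline(["a.jpg", "b.jpg"]) returns ["a.txt", "b.txt"], in the same order as the input.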

The tools can also run their own multithreading, e.g. for LLM calls, or when they are only running on one folder. This setup works well, for example, when sending requests to AI models—doing so using threads and concurrent.futures.
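As a rough sketch of a tool doing its own threading for model requests: ask_model is a hypothetical placeholder for the network call, and the limit of four threads is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ask_model(page: str) -> str:
    """Hypothetical stand-in for a network call to a language model."""
    return f"transcript of {page}"


def transcribe_folder(pages: list[str], max_threads: int = 4) -> dict[str, str]:
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Map each future back to its page so results can be matched up
        futures = {pool.submit(ask_model, page): page for page in pages}
        for future in as_completed(futures):  # results arrive as each call returns
            results[futures[future]] = future.result()
    return results
```

Because the threads spend most of their time waiting on the network, a handful of them can keep several requests in flight at once.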

Why concurrent.futures and threads (and not async or subprocesses)? Because Toga uses async, and I was running into issues trying to do async communication with the Python backend. Subprocesses were not happy in a macOS build.

In short, refactoring involved breaking out all the utilities of director into smaller files. It’s now much clearer how it works, and it’s easier to fix bugs. But it ends up being way more code. I suspect Cursor.ai has over-engineered it, and I’m slowly going through each module to make sure it’s minimal. That can wait for a later revision. I’m sure there are some things that are stupid or don’t make much sense, but as time allows, I’ll refactor each module to get it tighter with less code.

Speaking of other changes, I have put together a Toga GUI, synced as much as possible with the CLI. It now builds on a Mac, and is set up to run in Briefcase dev mode on Linux and Windows, which I haven’t gotten around to testing yet, as I don’t have a PC.

Big issues to address in the future. One is the backend data model. Currently, each step creates its own manifest.jsonl file; we generate a new one each time. I wonder about revising this so each folder has a single manifest file, which gets updated and edited in place. The nice thing about the current process is that it’s reliable with multiple threads, but I think we could do better. Still, troubleshooting tools in the current setup is easy. I can go into the Finder and make changes to the files directly. If it’s abstracted too much, it might be easier to code, but harder to understand.
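For illustration, a per-step manifest in JSON Lines format could be written and read like this. The record fields are whatever the step wants to store; nothing here reflects the actual schema.

```python
import json
from pathlib import Path


def write_manifest(step_dir: Path, records: list[dict]) -> Path:
    """Write one JSON object per line to a fresh manifest.jsonl in the step's folder."""
    step_dir.mkdir(parents=True, exist_ok=True)
    manifest = step_dir / "manifest.jsonl"
    with manifest.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest


def read_manifest(manifest: Path) -> list[dict]:
    """Read the manifest back, skipping blank lines."""
    with manifest.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Since each step writes its own file, no two threads ever write to the same manifest, and each line can be inspected or fixed by hand in a text editor.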

Fichero is for researchers; it doesn’t have to be super slick. In fact, the clearer it is to the people using it how it works, the less “magic” it will feel. I hope the transparency makes it more useful, and it gives users the ability to use it as a tool, not a replacement for their work.

I also added more reliable error checking. It used to be that if we hit an error in one process, it would just keep going. Now it stops.
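One way to get stop-on-first-error behaviour with concurrent.futures is sketched below. It is an assumption about how this could be done, not a description of the app's own error handling, and run_all is a hypothetical name.

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def run_all(tasks, max_workers: int = 4):
    """Run callables; on the first failure, cancel pending ones and re-raise."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Returns as soon as any task raises, or when all have finished
        done, pending = wait(futures, return_when=FIRST_EXCEPTION)
        for future in pending:
            future.cancel()  # only stops tasks that have not started yet
        for future in done:
            future.result()  # re-raises the exception from a failed task
        return [future.result() for future in futures]
```

Tasks that are already running cannot be interrupted this way; they finish before the error propagates, but nothing new is started.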

Along with a GUI, inspired by Hyperspace and BBEdit, I spent some time working on a task monitoring interface. That means having a clean display that shows what the processors are doing, what errors have occurred, etc.

Anyway, what’s left? Lots. But, I’m hoping to get a build anyone can run for 0.1.0-dev.

This requires:

  • Building the macOS version, and testing it and the CLI version: That is, making sure everything runs smoothly on macOS, both the GUI and the CLI.
  • Testing the Windows version and CLI version: Run the build on Windows. Make sure the GUI launches, folders can be selected, and things process properly. Same for the CLI—it should run clean.

  • Testing the Linux version and CLI version: Try it out on a couple of Linux distros. Make sure GTK stuff works, files save in the right place, and multiprocessing behaves. Both UI and CLI.

  • Double-check the whole UI is translated. I added some internationalization. Go through the interface and make sure all the text is showing up in the right language. Add any missing translations. I think we can add a function to figure out the language from the OS at launch.

  • Build: Get builds working and ready to share. Signed if possible. Test install and run on all three platforms.

  • Update the documentation: Rewrite or clean up the README, dev notes, and any how-to guides. They should reflect where the app is now—not where it was a month ago.

  • Disable unused UI stuff. I hid buttons and features that aren’t wired up yet—Plans, Prompts, Commands like Save, Open, etc. No need to confuse people with stuff that doesn’t work yet. I need to double check that.

  • Create a workflow that goes straight to transcription: Not everyone wants their images changed. If I can pull it off, I’d like users to be able to edit workflows directly in the app. Doesn’t have to be fancy, just a way to open, tweak, and save a plan.
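On the translation item in the list above, a function to guess the interface language from the OS might look something like this sketch. The SUPPORTED set is hypothetical, and the fallback to English is an assumption.

```python
import locale
from typing import Optional

SUPPORTED = {"en", "es", "fr"}  # hypothetical set of translated languages


def pick_language(locale_name: Optional[str], default: str = "en") -> str:
    """Reduce a locale name such as 'es_CO' or 'fr-CA' to a supported language code."""
    if not locale_name:
        return default
    code = locale_name.replace("-", "_").split("_")[0].lower()
    # Names that are not plain codes (e.g. 'English_United States') fall back to default
    return code if code in SUPPORTED else default


def detect_language() -> str:
    # locale.getlocale() returns (language, encoding); either may be None
    language, _encoding = locale.getlocale()
    return pick_language(language)
```

Keeping pick_language separate from the OS call makes it easy to check against a few locale strings without changing the machine's settings.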

Writer’s Diary #58: Getting Back to It

I’ve not done much book writing in the last month. Too many theses to read, proposals to edit, self-evaluations to write, and emails to send off. That, and I have been hard at work getting Fichero working with some reliability. I had to rewrite the multiprocessing back end in pure Python, with the help of Cursor. I’m not totally convinced it’s as efficient as the other option, which is Celery and Redis. But the goal is a Mac, Windows, and Linux app without dependencies, in which case having Redis as an external dependency isn’t ideal. In any case, we have a Mac app, using BeeWare’s totally brilliant Briefcase packaging tool and Toga to make a Mac/Windows/Linux app. I’m quite proud of it. Not perfect, lots of bugs, but it runs, and does its magic of taking old documents and cataloguing them.

Today, I want to think about getting back to writing. I have lots of projects, lots of text, and no finished book. I’m going to try to be more regular again, and one way to do that is to have achievable goals. The goals are flexible though. What am I working on? I am coding, organizing, revising old text, writing new text, and polishing. I propose therefore:

A win is:

Daydream: 50 minutes
Code: 25,000 words
Organize: 5,000 words
Revise: 1,000 words
Write: 300 words
Polish: 3,000 words

Totally arbitrary, I know. But, sometimes achievable goals are important.

Today, a win is coding 25,000 words. (By coding, I mean tagging text with themes and topics using Structur.)

Update: 11:43 am. Today I coded 25,000 words!

fichero_director.py: Running Fichero on Multiple Processors

What is Fichero? It is a Python-based pipeline for large-scale document processing—think thousands of scanned pages, each requiring cropping, enhancement, OCR, and transcription.

I’ve had versions of Fichero working for six months. But it was slow: Fichero processed one step at a time, then the next. The challenge was taking advantage of computing power and multiple processors without getting bogged down.

This week, I got fichero_director.py working, which uses Celery to run Fichero “workers” on multiple processes, each doing different steps at the same time.

Imagine one cook making 100 pizzas, one after the other. Versus 8 cooks making 8 pizzas at a time.

How does it work? fichero_director.py breaks down workflows into CPU-intensive and I/O-intensive steps. For example, image tasks are CPU-heavy, while transcription using language models or converting to Word documents is mostly I/O-bound, limited by disk or LLM inference. Tasks are sent to Celery queues based on script type.
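The routing can be pictured as a simple lookup from step type to queue name. The step and queue names below are hypothetical, since the real ones are not listed here; with Celery, the chosen name could then be passed as the queue argument to a task's apply_async call.

```python
CPU_QUEUE = "cpu"  # hypothetical queue names
IO_QUEUE = "io"

# Hypothetical step names, used only to illustrate the routing
STEP_TYPES = {
    "crop": CPU_QUEUE,
    "enhance": CPU_QUEUE,
    "transcribe": IO_QUEUE,
    "export_word": IO_QUEUE,
}


def queue_for(step: str) -> str:
    """Pick a queue by step type; unknown steps default to the I/O queue."""
    return STEP_TYPES.get(step, IO_QUEUE)
```

Defaulting unknown steps to the I/O queue is a design choice in this sketch: it keeps the scarcer CPU workers free for the steps known to need them.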

fichero_director.py tries to tune for the host system. On M1 Macs, for example, the hope is that CPU workers use the performance cores, while the more numerous I/O workers handle slower operations on the efficiency cores.

On my machine, I have 8 cores, and fichero_director.py uses them all.
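A sketch of sizing the worker pools from the core count follows. The ratios are assumptions chosen for illustration, and the standard library cannot tell performance cores from efficiency cores, so this only looks at the total.

```python
import os
from typing import Optional


def plan_workers(total_cores: Optional[int] = None) -> dict[str, int]:
    """Choose worker counts from the number of cores the OS reports."""
    cores = total_cores or os.cpu_count() or 1
    cpu_workers = max(1, cores // 2)  # assumption: about half the cores for CPU-bound work
    io_workers = max(2, cores * 2)    # assumption: I/O workers mostly wait, so oversubscribe
    return {"cpu": cpu_workers, "io": io_workers}
```

On an 8-core machine, plan_workers(8) gives 4 CPU workers and 16 I/O workers.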

To follow along, there is a simple dashboard that tracks real-time progress, showing each folder’s status and current step. Each folder is processed independently, with logs written per folder and per step.

Morning Pages

It’s been a long time since I have written what Julia Cameron calls morning pages. Free writing, first thing. But with a summer of writing stretching out before me, I’m going to get back to the habit of writing a little, first thing. Writing any long piece is an exercise in sustained returning to the words. But, I am busy with academic chores, and the morning can be a moment to get back to the work of putting the words together, which I want to complete this summer. A moment to think on the page, with a pencil, as a way to think on paper through the coming day, not just a day of writing, but also a day of writing.

What to work on today?

A chapter that needs an edit. Rather than continuing from where I left off yesterday, which was revising, I’m going to go back to where I was a few weeks ago, to work on rewriting the whole thing into shape. Then I can polish. Forward momentum is more important than polish, at least in the early stage.

pdf_splitr.py

One problem with scanning books for academic purposes is that one often ends up with two pages side by side in a single PDF. This makes it hard to OCR, read, annotate, or process.

pdf_splitr is a (very) simple Python tool, which I ‘wrote’ with Cursor.ai and Claude. It splits each page into left and right halves, while preserving annotations and handling different page sizes. It uses the MediaBox, so as not to change the resulting file size.

It runs from the command line or as a drag-and-drop macOS app using Automator, making it easy to turn scans of two-page spreads into single-page PDFs.
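The geometry of the split is simple enough to sketch without a PDF library. This shows only the arithmetic on a page box, given as (left, bottom, right, top) in PDF points; it is not the tool's code, and split_box is a hypothetical name.

```python
Box = tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def split_box(box: Box) -> tuple[Box, Box]:
    """Divide a page box down the middle into left and right halves."""
    left, bottom, right, top = box
    middle = (left + right) / 2
    return (left, bottom, middle, top), (middle, bottom, right, top)
```

A spread 1000 points wide and 700 points tall splits into two boxes that are each 500 points wide. Because only the visible box changes, the page content itself does not need to be re-encoded.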

[Images: a scan with 2 pages, and the resulting 1 page]

wordwright.py

When I was a graduate student, my supervisor sat me down one day and confessed to me that he used WordPerfect’s spell-checking window because it helped him find passive voice. He, like me, overused it.

From that moment, over the years, I’ve found many ways to automate the flagging of passive voice in my writing. I’ve written scripts to find it in Tinderbox and BBEdit. I’ve written scripts to find words I don’t want or that are redundant. But with those scripts, I have to read and remove the words by hand. Sometimes, this forces me to think to find a better way of saying something. Sometimes, they can be deleted without much care.
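A flagging script of this kind can be very small. The sketch below is a rough heuristic, not the method used in any of those scripts: it catches a form of "to be" followed by a word ending in -ed or -en, so it misses irregular participles and will sometimes flag innocent phrases. The UNWANTED list is a hypothetical example.

```python
import re

BE_VERBS = r"(?:am|is|are|was|were|be|been|being)"
# Rough heuristic: a form of "to be" followed by a word ending in -ed or -en
PASSIVE = re.compile(rf"\b{BE_VERBS}\s+(\w+(?:ed|en))\b", re.IGNORECASE)
UNWANTED = ("very", "really", "in order to")  # hypothetical personal list


def flag(text: str) -> list[str]:
    """Return likely passive constructions, then any unwanted words or phrases."""
    hits = [m.group(0) for m in PASSIVE.finditer(text)]
    for phrase in UNWANTED:
        hits += re.findall(rf"\b{re.escape(phrase)}\b", text, flags=re.IGNORECASE)
    return hits
```

For example, flag("The notes were sorted very quickly.") returns ["were sorted", "very"]. The script only points; deciding what to do with each hit is still the writer's job.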

WordWright.py is a collection of Python scripts that automates these editing steps, the ones I use on a regular basis. It’s a variation on what I described a few years ago in Writer’s Diary #09: On Freewriting a First Draft.

Simply. I free-wrote > used ChatGPT to fix typos > used DeepL to make minor changes > used ProWritingAid to remove adverbs and redundant expressions and make minor stylistic fixes.

WordWright automates this process, except for the ProWritingAid step. With a keyboard shortcut, I can write a paragraph, then use wordwright to grammar check the text and remove stylistic bugaboos.

It’s not so different from what John McPhee describes in his New Yorker article on Structure, or in the book Draft No. 4. He uses tools to find duplicate expressions.

Writing is not one step; it is many steps. Hundreds. WordWright helps make a few of those steps easier, but it won’t help you figure out what you think, make your ideas your own, rework them, or make them sing. That takes time, at a desk, doing the work.

I don’t find that WordWright fundamentally changes this process.

But, it is a little easier to get into a state of flow, because I don’t have to stop and go to different apps to fix typos, grammar, adverbs, or overused expressions.

In fact, prolific writers often have wonderful editors: sometimes it’s a spouse, an assistant, or a publisher.

What WordWright offers is not so much a first reader, but a first editor. Never the last, mind.

Using AI this way is, in my mind, not so different from a spell checker or WordPerfect’s passive voice checker. Just more powerful.

WordWright GitHub page.

Writer’s Diary #57: A Room, A Purse, and No Phone?

I seem to have spent a few days reading and reacting to Virginia Woolf’s A Room of One’s Own. It’s my third reading now. First thing. Four fifty-minute chunks. As I re-read it and my notes, I respond to it, and my notes, and I (re)write.

What to make of her description of a rather unsuccessful morning in the British Library, where all she finds is men writing about women? Obsessed it seems. She’s angry.

But then she goes for lunch.

A nice lunch. She has coffee too. And finds a newspaper. To pay, she reaches into her purse. Five shillings and ninepence. Her purse produces ten-shilling notes. Her aunt died, giving her 500 pounds a year. In perpetuity. (It was only 2500 pounds, in total, in reality.)

That’s 39.10 CAD for a lunch! A good lunch. Expensive, I imagine? How can she afford it?

Her aunt’s money. She doesn’t have to work.

It hits me in a flash.

When I was a student, I had no money, but I did have some. Not like my students today. I had scholarships and loans and easy student jobs.

First year, I lived with friends.

Later, I parlayed them into cheap rent and cheap food through global arbitrage. (I was a librarian, making minimum wage, in Ontario, working in Ecuador.)

I lived frugally, but rent and wine and food in Spain 2002 and Ecuador in 2005 were far cheaper than in Ontario.

Rent in Colombia as a graduate student doing ethnographic fieldwork, when I spent my days walking, thinking, eating good food, and reading, was something else. Less still, when I went to the gold mines.

That fieldwork was funded by the Canadian taxpayer and Carleton University. At great expense. It was decent money. And my accidental exercise in global arbitrage made my purchasing power much higher.

Not deliberately, but accidentally. I moved where my money went further, and the scholarship and grants allowed me to spend two years wandering around Bogotá and the Chocó and letting my mind wander and writing.

(When the purse ended, I experienced the opposite. Moving to Yale, with a shrinking scholarship as the Canadian dollar collapsed.) I walked, and wrote, but was far more stressed about money. Anxious. I spent those three years trying to find a job. Which I did. Then, a decade worrying about money as one salary only goes so far.

But then, and maybe now again as Mercedes works, I have the equivalent for lunches.

(I used to go to the library, then walk and spend 40,000 pesos on a lunch in Bogotá now. That is one day’s minimum wage.)

But, for the last decade, my phone would have been in the way. Making the mind wandering impossible.

But, not as a student. As a student, I had no phone. So, many times, I did what Woolf describes: the thinking and daydreaming and writing and letting the mind wander and making connections. It’s this that Woolf’s famous essay is an exercise in. (It does it by showing, not telling.)

It’s fiction, to be sure. But there is an element of autobiography. What about calling it a fictionalized auto-ethnographic account of writing? The day dreaming by the river, the flash of insight lost by walking on turf, the lunch, and a walking into the evening thinking about the gold that went into the college, and a walk to the library and then the next day at the British Museum and then lunch paid for by her Aunt’s inheritance.

I think it is.

But, of course, she had no iPhone, computer, or Internet. Has all of this connection robbed us of our ability to let our mind wander and make connections?

Yes.

But, need it?

No.

Zadie Smith, another famous English novelist, essayist, and short-story writer, doesn’t have a phone.

These two facts might be connected.

Writer’s Diary #56: A room of One’s Own

In the draft, I had an aside about the importance of a room of one’s own with a lock. My thought? A room of one’s own, with a lock, and without a computer, phone, or interruptions. But, yesterday morning, and again this morning, I re-read Virginia Woolf’s famous essay. Not only is it a feminist critique of the materiality of artistic creation and the ways in which women have been excluded for centuries. But in Woolf’s words, and in the story she weaves, you can also begin to see the glimmers of a method to fiction.

The walking and daydreaming, the trespassing on lawns, the lunches, the attempts to go to libraries, the walks before dinner and the remembering of snippets of ideas, of poems. The walks and strolls across Oxbridge.

But it’s also the next day, and other days, of being in the room and taking books down and putting them back on the shelves, of going to the library and reading with a notebook, and of misremembered lines and lost quotations, and the concentration that goes into the work.

Even as I was doing that, I had my daughter behind me, asleep on the couch at 4 o’clock in the morning because she couldn’t sleep. She woke up early. I cuddled her. She fell back into bed.

And as I was writing, there were four or five messages. Running partners. Dentist appointments. Concentration.

But there’s also a bricolage in there.

Pulling down books. Looking at shelves. Going to the library for ideas.

I read in the introduction to the 2000 Penguin edition that on the day she gave the lecture she wrote:

“My ambition is, from this very moment—eight minutes to six, on Saturday evening—to attain complete concentration again.”

Total concentration! It takes a room and money (CAD$70,000 in Canadian money, I’m guessing), and I know it helps to be white and a man.

But a walk, and lunch, and time, and concentration are best achieved without the technology in my pocket. Which is its own difficulty.

Writer’s Diary #55: Just do it

Today’s update:

I met a drummer friend yesterday, along with another artist, a goldsmith. My drummer friend has been trying to work steadily every day. Four hours. He writes it down. He inspired me to try again. If he can drum for four hours a day, maybe I can find time to write? Just do it. “Do it,” he said.

He works at night; I am a morning person. So, I did my four hours in four 50-minute chunks this morning. It’s nice to be done by 9:30.

The other friend, the goldsmith, said, “I only let myself start on something new once I’ve finished something.”

That is my challenge: always starting something new. Perhaps I can finish something before starting on another big project.

Anyway, a nice morning on the bricolage chapter. Re-read the whole draft. Lots to do.

I ended on a side tangent re-reading Woolf’s A Room of One’s Own. She’s talking about women and fiction; I think I’m talking about distraction and anthropology.

But anyway, I have to finish reading it again tomorrow.

For now, I’m done and off to a maple syrup sugar shack with the kids.

In Praise of Makeshift Finishing

Tubb, Daniel. “In Praise of Makeshift Finishing.” Anthropologica 66, no. 2 (2025): 1–8. DOI / Mirror

This article reflects on the challenges of writing and finishing. Using the experience of sorting ethnographic field notes, I explore how the desire for a perfect structure and method hinders progress. It is an argument for imperfection in writing. An argument for finishing, even imperfectly, as essential to transforming ideas into tangible work. It advocates an iterative, hands-on approach to writing.