Building a (jank) hOCR Editor in Rust
Motivation
A month or so ago, I collected game manuals for the console games in my backlog and noticed that OCR output on them was…iffy, to say the least. Surely I could edit and fix this output, right? I learned several things while trying this on PDFs:
- OCR software usually writes an invisible text layer to a PDF over the original image
- Using Adobe Acrobat (gasp!) to edit this text layer is not very helpful – when I was OCR’ing within the app, I could only correct poorly scanned words, but I couldn’t add a whole new region the scanner had missed or add in extra words, and although I could see the text layer added by other apps, trying to edit it had me opening things in…Inkscape? Maybe I missed something, but it seemed like a dead end.
- Online tips included using LaTeX + the output of Google Cloud Vision (see https://tex.stackexchange.com/questions/562338/manually-add-text-layer-ocr-over-a-scanned-image) and creating a new PDF, manually adding the text where it appeared, and importing it as a layer on the original image.
Understandably I did not want to do that. But the fact that Google Cloud Vision output a “box” giving coordinates of each area intrigued me, and after more poking around, I found an open-source tool called tesseract that could do something similar. Instead of a text layer directly in a PDF, you can ask tesseract to output a separate file (like the GCV JSON) in the hOCR format. This is an HTML file that lists all the areas the scanner finds plus the recognized text, and it stores the OCR information for each element (such as its bounding box) in that element’s title attribute. This seemed perfect – I could directly correct the scan output via the file, then use something like hocr-tools to combine the images and hOCR files back into a searchable PDF. So now my workflow would look something like:
- Use magick to convert my PDF into a bunch of PNGs: magick -density 300 Dark\ Half\ \(Japan\).pdf -quality 100 dark_half.png
- Use tesseract to OCR each image and output an hOCR file: tesseract /Users/serinahu/Downloads/test_dark_half/dark_half-1.png - -l jpn hocr > dark_half-1.html
- Edit the hOCR file somehow
- Use hocr-pdf to get back a correctly scanned PDF: hocr-pdf --savefile dark_half.pdf test_dark_half/
I just needed a convenient way to do step 3! I found a repo on GitHub for a TypeScript-based editor at https://github.com/GeReV/hocr-editor-ts but it hadn’t been updated in 2 years and still described itself as a work in progress. Since I was interested in writing more Rust because of my ParadeDB work, I thought it would be a fun project to roll my own editor in Rust.
Picking a GUI
Actually, deciding to use Rust was kind of a bad idea because the GUI ecosystem isn’t super developed. The main choices seemed like egui, iced, and Tauri (though as I’m writing this I see there are also gtk4 bindings), and I had never written a GUI app before, so I went with the one that advertised itself as the easiest, which was egui. The demo at egui.rs was pretty cool and had what I wanted: I could render images, create buttons, have right-click menus, drag shapes around, collapse headers, etc.
Parsing the hOCR file and rendering it
As a proof of concept, I wanted to just have an interactive view of the hOCR file via a bunch of collapsible headers within headers that represented the hierarchy of the different OCR elements. This meant figuring out how to parse the file and print it out.
Rust has several crates for parsing HTML, but the most “maintained” seemed to be scraper. Following the egui example for opening a file dialog and the scraper example for parsing HTML from a file, I parsed my file and wrote a recursive function that would render any element with a given OCR class into a collapsing header containing the children of that element. At this stage I made an OCRClass struct for future use, though right now it had no function except converting the hOCR class (e.g. ocr_carea) to a label for my tree (Area). Yay! Egui lived up to its promise of a quick start – to create the panel for the tree, I called egui::SidePanel::right("name").show(ctx, |ui| { egui::ScrollArea::vertical().show(ui, |ui| { /* collapsible headers */ }); }).
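Here is a minimal sketch of that class-to-label mapping. It is not the editor’s actual code: it uses an enum rather than a struct, and the variant names, the from_hocr and label functions, and the choice of the five classes tesseract typically emits are all assumptions made for illustration.

```rust
/// hOCR element classes this sketch knows about (an assumed subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OcrClass {
    Page,
    Area,
    Paragraph,
    Line,
    Word,
}

impl OcrClass {
    /// Map the value of an element's class attribute to a variant, if recognized.
    fn from_hocr(class: &str) -> Option<Self> {
        match class {
            "ocr_page" => Some(OcrClass::Page),
            "ocr_carea" => Some(OcrClass::Area),
            "ocr_par" => Some(OcrClass::Paragraph),
            "ocr_line" => Some(OcrClass::Line),
            "ocrx_word" => Some(OcrClass::Word),
            _ => None,
        }
    }

    /// Human-readable label to show on the collapsing header in the tree.
    fn label(self) -> &'static str {
        match self {
            OcrClass::Page => "Page",
            OcrClass::Area => "Area",
            OcrClass::Paragraph => "Paragraph",
            OcrClass::Line => "Line",
            OcrClass::Word => "Word",
        }
    }
}

fn main() {
    for class in ["ocr_page", "ocr_carea", "ocrx_word", "div"].iter() {
        match OcrClass::from_hocr(class) {
            Some(c) => println!("{} -> {}", class, c.label()),
            None => println!("{} -> (not an OCR element, skipped)", class),
        }
    }
}
```

Returning an Option from from_hocr lets the recursive renderer skip any element that is not an OCR element.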
Along the way, I wanted to turn my various CSS selectors into global variables. Rust told me that it didn’t really like statics, and I wasn’t allowed to make those selectors into global variables directly because they were function outputs. I used the lazy_static crate to get those working.
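For comparison, here is a sketch of the same idea using only the standard library. This is a swapped-in technique, not what the editor does: it uses std::sync::OnceLock (Rust 1.70+) instead of lazy_static, and plain selector strings stand in for scraper’s parsed selectors. The names OCR_CLASSES, build_selectors, and selectors are hypothetical.

```rust
use std::sync::OnceLock;

/// hOCR classes to build selectors for (an assumed list).
const OCR_CLASSES: [&str; 5] = ["ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word"];

/// Build one CSS class selector string per hOCR class.
/// This is an ordinary (non-const) function, so a plain `static` cannot be
/// initialized with its result directly.
fn build_selectors() -> Vec<String> {
    OCR_CLASSES.iter().map(|c| format!(".{}", c)).collect()
}

/// Global selectors, built on first use and then shared for the rest of the program.
fn selectors() -> &'static [String] {
    static SELECTORS: OnceLock<Vec<String>> = OnceLock::new();
    SELECTORS.get_or_init(build_selectors)
}

fn main() {
    for s in selectors() {
        println!("{}", s);
    }
}
```

The global is computed at most once, the first time it is needed, which is also what lazy_static provides.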
The next step was rendering the image associated with the hOCR file. That part was actually pretty easy – get the file path from the title attribute of the ocr_page element, then add the image via ui.image(path). I also prevented my code from re-parsing the file every frame by setting a “file path changed” flag. (This was probably the first clue that I wanted something reactive rather than immediate.)
Since I wasn’t making anything mutable just yet, I didn’t have to worry about the borrow checker yelling at me. Also, getting previews for the text of each OCR element was easy at this stage because scraper’s Elements had a text() function.
At some point, I also created a left panel showing the properties for the selected element.
Mutable state!
With something visual on my screen, I started thinking about the mutable parts of the app. For one, the user should be able to select an OCR element and edit it, so step 1 was selecting something from the tree and recording that in the struct representing app state. Conveniently, egui already has a function for that: selectable_value creates a widget that, when you select it, sets a mutable reference to a preset value. If the reference’s value is the value of the widget, then the widget gets highlighted.
In the first iteration of this, I used each element’s HTML id to select it and displayed the selected ID in a different part of the screen. However, Rust’s borrow checker started yelling at me. My global state looked like
struct HOCREditor {
… // Other fields
selected_id: String
}
And when I tried to create each selectable_value widget, I needed to both borrow the global state for rendering the tree + borrow it mutably specifically for selected_id. If I wanted to mutably borrow something, I wasn’t allowed to borrow it in any other way at the same time. What I needed was more fine-grained control – I wanted to say that I wouldn’t mutate any part of the global state besides the ID. I found that I could use std::cell::RefCell to wrap selected_id. Then I could immutably borrow my global state while using RefCell::borrow_mut() to mutate selected_id. The main drawback of this approach was that the borrow rules for selected_id were now checked at run time instead of compile time: the compiler could no longer catch a mutable borrow overlapping with another borrow, and my code would panic instead if I violated the borrowing constraints! Oops. But so far, it was easy to understand who was borrowing what.
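The pattern can be sketched without any GUI code. The struct below is a stripped-down stand-in for the real app state. The element_ids field, the new and render_tree functions, and the clicked parameter (pretending to be a click coming from the UI) are hypothetical.

```rust
use std::cell::RefCell;

/// Stand-in for the app state: only the selection is wrapped in a RefCell.
struct HocrEditor {
    element_ids: Vec<String>,
    selected_id: RefCell<String>,
}

impl HocrEditor {
    fn new(ids: &[&str]) -> Self {
        HocrEditor {
            element_ids: ids.iter().map(|s| s.to_string()).collect(),
            selected_id: RefCell::new(String::new()),
        }
    }

    /// Takes `&self` (an immutable borrow of the whole state) yet can still
    /// update the selection through the RefCell.
    fn render_tree(&self, clicked: Option<&str>) {
        for id in &self.element_ids {
            if clicked == Some(id.as_str()) {
                *self.selected_id.borrow_mut() = id.clone();
            }
        }
    }
}

fn main() {
    let editor = HocrEditor::new(&["line_1", "word_1", "word_2"]);
    editor.render_tree(Some("word_2"));
    println!("selected: {}", editor.selected_id.borrow());

    // The borrow rules are now enforced at run time: while `reader` is alive,
    // a mutable borrow is refused (borrow_mut() would panic here).
    let reader = editor.selected_id.borrow();
    assert!(editor.selected_id.try_borrow_mut().is_err());
    drop(reader);
    assert!(editor.selected_id.try_borrow_mut().is_ok());
}
```

try_borrow_mut() is the non-panicking version of borrow_mut(), which makes the run-time check visible without crashing the program.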
Bounding Boxes
The next step was visually representing the bounding box of the selected OCR element on the displayed image. I needed to transform something of the form bbox 0 0 120 140 into a rectangle; on my first pass, I made a custom BBox type and read it in there, then wrote a function to paint a bbox. I was worried I’d have to do some funny transformations with the coordinates, but it was pretty simple – the response returned by my call to ui.add(egui::Image(…)) included where the top left corner of the image was relative to the visible window, and I just had to place the rectangle relative to that.
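A rough sketch of those two steps, assuming the image is drawn unscaled and the title attribute separates properties with semicolons (as in "bbox 0 0 120 140; x_wconf 93"). The field names and the parse_bbox and to_screen functions are made up for this example. The real editor gets the image origin from the egui response.

```rust
/// A bounding box in image pixel coordinates.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Pull the `bbox x0 y0 x1 y1` property out of an hOCR title attribute.
/// Returns None if the property is missing or a coordinate is not a number.
fn parse_bbox(title: &str) -> Option<BBox> {
    let prop = title
        .split(';')
        .map(str::trim)
        .find(|p| p.starts_with("bbox "))?;
    let mut nums = prop.split_whitespace().skip(1).map(|n| n.parse::<f32>().ok());
    Some(BBox {
        x0: nums.next()??,
        y0: nums.next()??,
        x1: nums.next()??,
        y1: nums.next()??,
    })
}

/// Shift an image-space box by the on-screen position of the image's top-left corner.
fn to_screen(b: BBox, image_origin: (f32, f32)) -> BBox {
    BBox {
        x0: b.x0 + image_origin.0,
        y0: b.y0 + image_origin.1,
        x1: b.x1 + image_origin.0,
        y1: b.y1 + image_origin.1,
    }
}

fn main() {
    let bbox = parse_bbox("bbox 0 0 120 140; x_wconf 93").expect("valid bbox");
    // Pretend the GUI reported that the image starts at (200, 50) in the window.
    println!("{:?}", to_screen(bbox, (200.0, 50.0)));
}
```

If the image were scaled to fit the panel, the coordinates would also need to be multiplied by the scale factor before the shift.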
I also wanted to draw the bounding boxes of siblings of the selected element. Scraper has tools for working with the HTML tree, and I could get the siblings using ElementRef::prev_siblings() and ElementRef::next_siblings(). Painting the sibling bboxes a different color was also a breeze.
The next step was making the boxes into “selectable boxes” – it’s like a selectable value widget, but for the painted rectangle rather than the text. I mimicked how egui designed selectable value: first I made a SelectableRect struct that implemented egui::Widget by copying SelectableLabel, then a selectable_rect function that would make a SelectableRect and change the specified value when the rectangle was clicked. I also made the rectangle’s color change on hover.
Now I had a somewhat-interactive way to view the OCR tree! I could click around various boxes and see them all over my image.
Trees in Rust
Now that I could view the file, I wanted to be able to edit it. Scraper didn’t support most editing operations – I could add siblings and doctype nodes and comments and stuff, but no deletion, no editing of HTML element text, etc. I decided it would be a good Rust exercise to translate the relevant parts of the HTML tree into an internal tree that I could then edit at will.
Unsurprisingly, Rust’s borrow checker makes it difficult to implement recursive data structures, because that would mean multiple mutable borrows! Instead, I separated my tree into two parts: one was the structure of the tree, consisting of a graph of nodes, and one was a hash map with the content of each node, which mapped node IDs to actual data. A node consisted of an ID, the ID of its parent, and the IDs of its children. If I wanted to edit a node, I would actually edit the hash map. If I wanted to change the structure of the tree, e.g. inserting a node, I would edit the tree, then insert a new entry into the hash map, etc. Apparently this is called an arena allocated tree.
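A minimal sketch of that layout, with the structure and the content stored in separate maps keyed by node ID. The type and method names (Tree, Node, insert, remove) are assumptions for illustration, and the real editor’s tree may differ, for example in how it stores the node graph.

```rust
use std::collections::HashMap;

type NodeId = usize;

/// Structure only: who is my parent, who are my children.
#[derive(Debug)]
struct Node {
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Tree split into structure (`nodes`) and content (`data`), both keyed by ID.
struct Tree<T> {
    nodes: HashMap<NodeId, Node>,
    data: HashMap<NodeId, T>,
    next_id: NodeId,
}

impl<T> Tree<T> {
    fn new() -> Self {
        Tree { nodes: HashMap::new(), data: HashMap::new(), next_id: 0 }
    }

    /// Add a node under `parent` (or as a root) and return its new ID.
    fn insert(&mut self, parent: Option<NodeId>, value: T) -> NodeId {
        let id = self.next_id;
        self.next_id += 1;
        self.nodes.insert(id, Node { parent, children: Vec::new() });
        self.data.insert(id, value);
        if let Some(p) = parent.and_then(|p| self.nodes.get_mut(&p)) {
            p.children.push(id);
        }
        id
    }

    /// Remove a node and its whole subtree from both maps.
    fn remove(&mut self, id: NodeId) {
        let node = match self.nodes.remove(&id) {
            Some(n) => n,
            None => return,
        };
        self.data.remove(&id);
        if let Some(p) = node.parent.and_then(|p| self.nodes.get_mut(&p)) {
            p.children.retain(|c| *c != id);
        }
        for child in node.children {
            self.remove(child);
        }
    }
}

fn main() {
    let mut tree: Tree<String> = Tree::new();
    let page = tree.insert(None, "page".to_string());
    let area = tree.insert(Some(page), "area".to_string());
    let word = tree.insert(Some(area), "wrod".to_string());

    // Editing content touches only the data map, never the structure.
    if let Some(text) = tree.data.get_mut(&word) {
        *text = "word".to_string();
    }
    tree.remove(area);
    println!("nodes left: {}", tree.nodes.len());
}
```

Because nodes refer to each other by ID instead of by reference, there is only ever one mutable borrow (of the tree itself), which keeps the borrow checker happy.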
With my basic tree set up, I needed to translate the scraper Html tree into my tree. I created an OCRElement struct and a function to parse a specific element into my custom struct. At this stage I ignored error handling and just assumed everything was well-formed. Then I called this repeatedly on the top-level ocr_page elements and their children using ElementRef::children() to navigate the scraper tree. I decided that I wouldn’t store the id of each OCR element, though: if I inserted new elements, I would be changing around IDs, and that didn’t seem too appealing. I would instead create those ids on the fly when I wrote my custom tree back to an HTML file.
With my internal tree created, I had to convert my GUI code to use that instead of the HTML tree. Thankfully not much changed: I was still recursively traversing it, I’d just have slightly different ways of accessing the data of each element.
Saving to a file
Now that I was interacting with my internal tree, I’d eventually have to save it to a file after editing. I first wrote the code to translate the internal tree into part of the body of an HTML document. That part was easy: I had already stored most of the data I needed, and I could generate IDs for HTML elements as I went with a counter. But what about the head of the HTML document? Trying to grab that wasn’t well-documented. I ended up doing the following:
I created a new HTML document that would store all the non-body parts of the original hOCR file. The HTML tree started with a document node (Html::get_document()); this would have maybe doctypes, processing instructions (like which version of XML), comments, and a single HTML node (which you can get via html_tree.root_element()) as children. I could get the head of the document by using the CSS selector “head” on the HTML node, copy over its attributes, and add it as a child to the HTML node of my own document. Doing this recursively, I could copy over everything in <head>…</head>.
For the doctype, etc., I used a match statement to copy over the relevant parts and pass them into scraper’s functions for adding a doctype, processing instruction, or comment to a document. I did run into a problem here – the public and system IDs of the doctype node were passed in, but they never got written to the final file when I wrote my custom HTML document to a file. I still haven’t figured that out, but for the most part, I had a functioning and mostly idempotent way of opening and saving hOCR files.
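The body-writing step (walking the internal tree and handing out IDs from a counter) can be sketched as follows. To keep it short, a simple owned struct stands in for the arena tree. The Elem type, the write_elem and escape functions, and the class_counter ID scheme are hypothetical and are not the IDs tesseract itself writes.

```rust
/// Simplified stand-in for one OCR element of the internal tree.
struct Elem {
    class: &'static str,
    bbox: [u32; 4],
    text: Option<String>,
    children: Vec<Elem>,
}

/// Escape the characters that would otherwise break the HTML text content.
fn escape(text: &str) -> String {
    text.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Append `e` and its subtree to `out`, generating a fresh id from `counter`.
fn write_elem(e: &Elem, counter: &mut usize, out: &mut String) {
    *counter += 1;
    // Tag choice follows what tesseract's hOCR output typically uses.
    let tag = match e.class {
        "ocr_par" => "p",
        "ocr_line" | "ocrx_word" => "span",
        _ => "div",
    };
    let [x0, y0, x1, y1] = e.bbox;
    out.push_str(&format!(
        "<{} class='{}' id='{}_{}' title='bbox {} {} {} {}'>",
        tag, e.class, e.class, *counter, x0, y0, x1, y1
    ));
    if let Some(t) = &e.text {
        out.push_str(&escape(t));
    }
    for child in &e.children {
        write_elem(child, counter, out);
    }
    out.push_str(&format!("</{}>", tag));
}

fn main() {
    let word = Elem { class: "ocrx_word", bbox: [10, 10, 60, 30], text: Some("Dark".to_string()), children: vec![] };
    let page = Elem { class: "ocr_page", bbox: [0, 0, 800, 600], text: None, children: vec![word] };
    let mut out = String::new();
    let mut counter = 0;
    write_elem(&page, &mut counter, &mut out);
    println!("{}", out);
}
```

Since the counter is threaded through the recursion, every element gets a unique ID no matter how many insertions or deletions happened during editing.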
Editing the tree
Implementing basic editing functions for the structure of the tree was not too bad – I mapped a keypress (for deletion) and clicking buttons on a right-click context menu (for insertions). I did have to decide what should be inserted when a user requests a new element; for convenience, I settled on cloning the data from the parent (if making a child) or the sibling (if making a sibling) and having the user manually set it to what they wanted afterwards. I still haven’t figured out what the best flow for that is.
I created an edit “mode” that showed only the selected bbox and allowed users to manipulate it. That was the hardest part. The egui.rs site had a convenient demo of a widget where you could drag the nodes of a bezier curve, so I modeled my code after that. For each corner and side of my rectangle, I created a rectangle around it that would serve as the “draggable area”, assigned that rectangle an ID using the response, and used ui.interact(rect, id, sense) to indicate I wanted this rectangle to be draggable. This returned a Response, and calling drag_delta() on it told me how far the user had dragged that corner or side between frames, allowing me to edit the coordinates of the bounding box.
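The bookkeeping after drag_delta() is plain arithmetic, so it can be sketched without egui. The Handle enum and apply_drag function below are made up for this example. Clamping so that an edge cannot cross the opposite edge is an extra assumption, not something the editor is known to do.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Which draggable area of the rectangle the user grabbed.
#[derive(Debug, Clone, Copy)]
enum Handle {
    Left,
    Right,
    Top,
    Bottom,
    TopLeft,
    TopRight,
    BottomLeft,
    BottomRight,
}

/// Apply one frame's drag delta (dx, dy) to the edges that `handle` controls.
fn apply_drag(b: BBox, handle: Handle, delta: (f32, f32)) -> BBox {
    use Handle::*;
    let (dx, dy) = delta;
    let mut out = b;
    // Horizontal component: move the left or right edge, never past the other one.
    match handle {
        Left | TopLeft | BottomLeft => out.x0 = (b.x0 + dx).min(b.x1),
        Right | TopRight | BottomRight => out.x1 = (b.x1 + dx).max(b.x0),
        Top | Bottom => {}
    }
    // Vertical component: move the top or bottom edge, never past the other one.
    match handle {
        Top | TopLeft | TopRight => out.y0 = (b.y0 + dy).min(b.y1),
        Bottom | BottomLeft | BottomRight => out.y1 = (b.y1 + dy).max(b.y0),
        Left | Right => {}
    }
    out
}

fn main() {
    let bbox = BBox { x0: 0.0, y0: 0.0, x1: 120.0, y1: 140.0 };
    // Dragging the bottom-right corner 10 px right and 5 px down.
    println!("{:?}", apply_drag(bbox, Handle::BottomRight, (10.0, 5.0)));
}
```

A side handle ignores the component of the drag that runs along its own edge, while a corner handle uses both components.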
In edit mode, I had the left panel display all the properties as editable – numbers turned into slider buttons and text into a text area. This gave me the basic tools I needed for my editor! I could change the text of an incorrectly scanned word, add new areas, delete extraneous areas, and change bounding boxes, then save to a file!
Better error handling
Now that I had a prototype, I had to replace all those unwrap()s with actual error handling. First, I learned (from ChatGPT) that Results were better when things could fail and the user should know why they failed – using an Option obscures why something went wrong. I also learned (from reading the documentation) that instead of all those if let Some(x) = y { … } clauses, I could use map() or and_then() to do operations only on Ok or Some values.
I changed my OCR properties parser to return a Result – failure occurred when an element had no bounding box – and handled Errs when parsing specific properties. My HTML element to OCR element function also returned a Result now; it returned an error if its class was incorrect or if its properties failed to parse. When I converted the whole tree, I skipped any elements that failed to convert. This also meant skipping their children.
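A sketch of what that can look like, working on (class, title) string pairs rather than scraper elements. The ParseError variants, the simplified OcrElement struct, and the three functions are hypothetical. The real parser handles more properties than the bounding box.

```rust
/// Why an element could not be converted.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingBBox,
    BadNumber(String),
    UnknownClass(String),
}

#[derive(Debug, PartialEq)]
struct OcrElement {
    class: String,
    bbox: [u32; 4],
}

/// Parse `bbox x0 y0 x1 y1` out of a title attribute, reporting why it failed.
fn parse_bbox(title: &str) -> Result<[u32; 4], ParseError> {
    let prop = title
        .split(';')
        .map(str::trim)
        .find(|p| p.starts_with("bbox "))
        .ok_or(ParseError::MissingBBox)?;
    let nums = prop
        .split_whitespace()
        .skip(1)
        .map(|n| n.parse::<u32>().map_err(|_| ParseError::BadNumber(n.to_string())))
        .collect::<Result<Vec<u32>, ParseError>>()?;
    match nums.as_slice() {
        [x0, y0, x1, y1] => Ok([*x0, *y0, *x1, *y1]),
        _ => Err(ParseError::MissingBBox),
    }
}

/// Convert one element; fails on a non-OCR class or unparsable properties.
fn parse_element(class: &str, title: &str) -> Result<OcrElement, ParseError> {
    if !class.starts_with("ocr") {
        return Err(ParseError::UnknownClass(class.to_string()));
    }
    let bbox = parse_bbox(title)?;
    Ok(OcrElement { class: class.to_string(), bbox })
}

/// Convert everything that parses; log and skip the rest.
fn parse_all(raw: &[(&str, &str)]) -> Vec<OcrElement> {
    raw.iter()
        .filter_map(|(class, title)| match parse_element(class, title) {
            Ok(elem) => Some(elem),
            Err(err) => {
                eprintln!("skipping {}: {:?}", class, err);
                None
            }
        })
        .collect()
}

fn main() {
    let raw = [("ocr_line", "bbox 0 0 120 140"), ("ocrx_word", "x_wconf 93")];
    println!("{:?}", parse_all(&raw));
}
```

The ? operator passes each error up to the caller unchanged, so the skip-and-log decision is made in one place. In a tree conversion, the same early return on a failed parent is what causes its children to be skipped as well.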
Smoother UI
Now that my code wasn’t going to crash when it hit some malformed file, I focused on making it easier to use.
First, I implemented navigation of the tree via arrow keys: up/down for parent/child, and left/right for siblings. Consuming input in egui involved calling ui.input_mut(|i| i.consume_key(…)) and just doing the function, which was very convenient.
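The selection logic behind those keys is independent of egui and can be sketched on its own. The Key enum, the Vec-indexed Node struct, and the navigate and sample_tree functions are hypothetical. The sketch also assumes that “down” goes to the first child and that a key with no target leaves the selection unchanged.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Key {
    Up,
    Down,
    Left,
    Right,
}

/// Structure-only node; IDs are indices into the slice of nodes.
struct Node {
    parent: Option<usize>,
    children: Vec<usize>,
}

/// Return the node that should be selected after pressing `key`.
fn navigate(nodes: &[Node], selected: usize, key: Key) -> usize {
    // Look up the sibling `offset` places away among the parent's children.
    let sibling = |offset: isize| -> Option<usize> {
        let parent = nodes[selected].parent?;
        let siblings = &nodes[parent].children;
        let pos = siblings.iter().position(|&c| c == selected)? as isize + offset;
        if pos < 0 {
            None
        } else {
            siblings.get(pos as usize).copied()
        }
    };
    let target = match key {
        Key::Up => nodes[selected].parent,
        Key::Down => nodes[selected].children.first().copied(),
        Key::Left => sibling(-1),
        Key::Right => sibling(1),
    };
    target.unwrap_or(selected)
}

/// Page (0) with two areas (1, 2); area 1 has one line (3).
fn sample_tree() -> Vec<Node> {
    vec![
        Node { parent: None, children: vec![1, 2] },
        Node { parent: Some(0), children: vec![3] },
        Node { parent: Some(0), children: vec![] },
        Node { parent: Some(1), children: vec![] },
    ]
}

fn main() {
    let nodes = sample_tree();
    let mut selected = 0;
    for key in [Key::Down, Key::Right, Key::Left, Key::Down].iter() {
        selected = navigate(&nodes, selected, *key);
        println!("{:?} -> node {}", key, selected);
    }
}
```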
I haven't done anything else yet (more updates forthcoming!)...
Fixing hocr-tools
With a jank but functioning hOCR editor, I was ready to test out my workflow by using hocr-tools to combine the hOCR with my picture into a PDF. Obviously nothing in life is easy, so…
- Hocr-tools refused to work when installed from pip because of incompatible Python versions or something. I installed from source and changed to /usr/bin/python3 and at least it ran.
- Hocr-tools only accepts .hocr and .jpg files. I had .html and .png. I changed my file extensions, but I should probably edit it so all file extensions are OK.
- Scraper saved the html with meta tags like <meta> which is OK for browsers and stuff but hocr-tools used etree, which yelled at me. So I manually changed my tags to <meta/>, but I also need to fix that (I’m not even sure if I can just with scraper).
- Python library documentation is much more scattered than Rust documentation. Boo. Trying to figure out bidi and reportlab was really annoying.
- Fixing those issues, I managed to actually create the PDF, but copy-pasting revealed the text had…random glyphs. Oh no! The issue was that hocr-tools registers an “invisible font” (which I think is just a very transparent English font) that doesn’t support Japanese characters. I found out how to change the font to a Japanese one in the reportlab docs: pdfmetrics.registerFont(UnicodeCIDFont('HeiseiMin-W3'))
- Then in addition to the setTextRenderMode call, I tried setting the fill/stroke color to a transparent color while writing the text layer.
With all that done, I created my PDF…and it worked! The main issue is that when I tried to put it into OmegaT, it inserted spaces between lines and some of the words. The spaces between words didn’t appear when copy-pasting, but the spaces between lines are unavoidable, I think. Yay! (Though the spaces between the words do make OmegaT unhappy because it treats A B differently than AB, understandably.)