A Stack of Paper – And the Need for Easy Access

I recently took on a self-imposed project: digitizing and securely publishing large documents such as public contracts. This post is partly for my own reference – a way to remember the steps and tools involved – but I hope it also sparks some curiosity in others who want to do something similar.

Digitization

Scanning with Epson

The first step was simple: scanning. I used my Epson scanner to scan each page of the contract and saved the results as PDF files. Scan quality matters here: a clean, high-resolution scan keeps the text readable and reduces OCR errors later in the process.

Converting to Markdown

Using Marker for OCR

Next, I needed to convert those PDFs into a format I could edit. Enter Marker, an open-source tool that uses OCR (Optical Character Recognition) to extract text from PDFs and convert it to Markdown. It’s not perfect, and OCR always has its quirks, but it gave me a solid starting point and cut the manual effort considerably.
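
For my own reference, here is a minimal sketch of batch-converting a folder of scans. It assumes Marker is installed and exposes the marker_single command with an --output_dir option, which can differ between versions (check marker_single --help). The function names and folder layout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def marker_command(pdf_path: Path, output_dir: Path) -> list[str]:
    """Build the argument list for converting one PDF with Marker."""
    # Assumption: the `marker_single` entry point and the `--output_dir`
    # flag match the installed Marker version.
    return ["marker_single", str(pdf_path), "--output_dir", str(output_dir)]


def convert_all(scan_dir: Path, output_dir: Path) -> list[Path]:
    """Run Marker on every PDF in scan_dir and return the PDFs processed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single was not found on PATH")
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf in sorted(scan_dir.glob("*.pdf")):
        # check=True stops the batch on the first failed conversion
        subprocess.run(marker_command(pdf, output_dir), check=True)
        processed.append(pdf)
    return processed
```

Calling convert_all(Path("scans"), Path("markdown")) would process the PDFs in alphabetical order, so numbering the scan files keeps the output in page order.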

Structuring and Cleaning the Markdown

Initial OCR Refinement with Ollama and Gemma

Before opening the files in Obsidian, I ran the Marker output through the Gemma 3 12B model (gemma3:12b), running locally under Ollama, for a first cleanup pass. This improved the readability and accuracy of the raw Markdown and reduced the workload in the editing phase that followed.

What are Ollama and Gemma?

  • Ollama: Ollama is an open-source tool that makes it easy to run large language models (LLMs) locally on your computer. It handles downloading, running, and managing models, so you can experiment with them without relying on cloud services.
  • Gemma: Gemma is a family of open-weight language models from Google DeepMind, built from the same research and technology used to create the Gemini models. Gemma models are designed to be efficient, accessible, and adaptable for a wide range of applications, including text generation, code completion, and more. The 12-billion-parameter Gemma 3 variant I used (gemma3:12b in Ollama) is a mid-sized model that can run locally on a machine with enough memory.
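
The cleanup pass can be sketched roughly as follows. This is an illustration, not the exact script I ran. It assumes Ollama is listening on its default local address (http://localhost:11434) and that gemma3:12b has already been pulled. The prompt wording, the chunk size, and all function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma3:12b"

# Hypothetical prompt; adjust the wording to the document being cleaned.
CLEANUP_PROMPT = (
    "Fix obvious OCR errors in the following Markdown. "
    "Do not add, remove, or reorder content. Return only the corrected Markdown.\n\n"
)


def split_into_chunks(markdown: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of at most roughly max_chars characters."""
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_payload(chunk: str) -> dict:
    """Request body for a single, non-streaming generation call."""
    return {"model": MODEL, "prompt": CLEANUP_PROMPT + chunk, "stream": False}


def clean_chunk(chunk: str) -> str:
    """Send one chunk to the local Ollama server and return the model's text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(chunk)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def clean_document(markdown: str) -> str:
    """Clean a whole document chunk by chunk and stitch the results together."""
    return "\n\n".join(clean_chunk(c) for c in split_into_chunks(markdown))
```

Splitting on blank lines keeps each request small enough for the model's context window and makes it easier to spot a chunk where the model changed more than it should have. The output still needs a human read-through.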

Working with Obsidian

Obsidian is a powerful knowledge base and note-taking app that handles Markdown beautifully. I used it to:

  • Clean up the Markdown: Correct OCR errors, remove unwanted characters, and ensure consistent formatting.
  • Add Semantic Headings: Structure the contract logically with clear headings (H1, H2, H3, etc.) for easy navigation.
  • Create Tables: Many sections contained tabular data. I reconstructed these tables within Markdown for clarity.
  • Add Internal Links: Create links between sections of the contract for quick reference.

Obsidian's live preview, which shows the document's structure as you edit, made the process much more efficient.
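
Some of the mechanical cleanup can be scripted before the manual pass. The sketch below is a hypothetical helper, not something built into Obsidian. It rejoins words split across line breaks, strips trailing whitespace, and collapses extra blank lines. One caveat: the hyphen rule will also join genuine compounds that happen to break at the hyphen, so the result should be reviewed.

```python
import re


def tidy_markdown(text: str) -> str:
    """Apply a few conservative, mechanical fixes to OCR-derived Markdown."""
    # Rejoin words hyphenated across a line break: "agree-\nment" -> "agreement"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip trailing spaces and tabs at the end of every line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the file with exactly one newline
    return text.strip() + "\n"
```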

Building the Static Website with Jekyll

Before getting to the build itself, let’s talk about why I chose a static website. Unlike a dynamic site, a static site doesn’t rely on a database or server-side processing. That brings several advantages, especially when dealing with important documents:

  • Enhanced Security: Less server-side code means fewer potential vulnerabilities to exploit.
  • Increased Reliability: Static files are simple to serve, with no application backend or database that can fail, so there is less to cause downtime.
  • Faster Loading Times: Static files load quickly, providing a better user experience.
  • Simplified Hosting: Easier and often cheaper to host.

With the Markdown files cleaned and structured, it was time to build the website. I chose Jekyll, a static site generator built with Ruby. Jekyll takes the Markdown files and transforms them into static HTML, CSS, and JavaScript files – a fast, secure, and easily deployable website. I used a Jekyll theme called Just the Docs to display the content clearly and ensure a consistent look and feel.
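
Jekyll only processes pages that start with YAML front matter, and Just the Docs builds its sidebar from the title and nav_order keys. Here is a minimal sketch for adding that header to each cleaned Markdown file. It assumes one file per section, file names that sort in reading order, and titles derived from the file name. The function names are hypothetical.

```python
from pathlib import Path


def with_front_matter(body: str, title: str, nav_order: int) -> str:
    """Prepend Just the Docs front matter unless the page already has some."""
    if body.startswith("---\n"):
        return body  # leave pages that already carry front matter untouched
    safe_title = title.replace('"', '\\"')  # escape quotes for the YAML string
    header = (
        "---\n"
        "layout: default\n"
        f'title: "{safe_title}"\n'
        f"nav_order: {nav_order}\n"
        "---\n\n"
    )
    return header + body


def add_front_matter(docs_dir: Path) -> None:
    """Rewrite every Markdown file in docs_dir with a generated header."""
    for order, path in enumerate(sorted(docs_dir.glob("*.md")), start=1):
        # Assumption: "01-scope-of-work.md" becomes the title "01 Scope Of Work"
        title = path.stem.replace("-", " ").title()
        body = path.read_text(encoding="utf-8")
        path.write_text(with_front_matter(body, title, order), encoding="utf-8")
```

Because pages that already start with front matter are skipped, the script can be re-run safely after new sections are added.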

Hosting and Security – A Homelab Stack

My homelab setup allows for a surprisingly robust hosting solution. Here's the stack:

  • Hardware: A Mac Studio – plenty of power for running the build process and serving the website.
  • Containerization: Docker allows me to package the Jekyll build environment and the website files into a container, ensuring consistency and easy deployment.
  • Secure Exposure with Cloudflare Tunnel: I wanted to make the website accessible online without exposing my home network to potential security risks. Cloudflare Tunnel was the perfect solution. It creates an outbound-only connection from my Mac Studio to Cloudflare’s network, allowing users to access the website securely without opening any ports on my router.
  • Authentication with Cloudflare Zero Trust: To restrict access to the contract, I put the site behind Cloudflare Access, the single sign-on (SSO) layer of Cloudflare Zero Trust. Users must authenticate before they can view the contract, which further protects its confidentiality.
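
What makes this setup safe is that the web server only needs to listen on the loopback interface; the tunnel reaches it locally and nothing is exposed to the LAN or the internet directly. In my stack a Docker container does the serving. The sketch below is a stand-in using Python's standard library to show the same loopback-only idea. The port number and function name are arbitrary choices for illustration.

```python
import functools
import http.server
import threading
from pathlib import Path


def serve_site(site_dir: Path, port: int = 4000) -> http.server.ThreadingHTTPServer:
    """Serve a built static site on 127.0.0.1 only, in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(site_dir)
    )
    # Binding to 127.0.0.1 (not 0.0.0.0) keeps the server off the local network;
    # only processes on this machine, such as the tunnel client, can reach it.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With Jekyll's default output folder, serve_site(Path("_site")) would serve the build at http://127.0.0.1:4000, which is the kind of local address a tunnel is pointed at. Call shutdown() on the returned server to stop it.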

Lessons Learned and Future Plans

This project was a great learning experience. I was impressed by the power of open-source tools and the flexibility of a homelab setup. Future improvements might include:

  • Improved OCR: While Marker did a good job, better OCR techniques could further improve the initial Markdown extraction.
  • Version Control: Implementing a version control system (like Git) would allow for easy tracking of changes and collaboration.
  • Scanning Optimization: Instead of scanning directly to PDF, a better approach for future projects would be to scan each page to a 300 DPI image file (TIFF or PNG) and then use macOS’s built-in PDF export. This often results in a higher-quality PDF, which in turn improves the accuracy of Marker's OCR.
  • Exploring Alternative OCR Tools: Marker served well overall, but macOS’s built-in text selection in images (Live Text) sometimes produced better results.
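
To get a feel for what 300 DPI means in practice, here is a quick back-of-the-envelope calculation. It assumes US Letter pages (8.5 × 11 inches) and 24-bit color; adjust the numbers for other paper sizes or for grayscale scans.

```python
def pixels_at_dpi(width_in: float, height_in: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_megabytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> float:
    """Raw image size in MB (10^6 bytes) before any TIFF or PNG compression."""
    return width_px * height_px * bytes_per_pixel / 1_000_000


width, height = pixels_at_dpi(8.5, 11)        # (2550, 3300)
size_mb = uncompressed_megabytes(width, height)  # 25.245
```

So a Letter page at 300 DPI is 2550 × 3300 pixels, roughly 25 MB per page uncompressed, which is worth keeping in mind for storage when a contract runs to hundreds of pages.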