From Attic to Archive - A Guide to OCR Correction with Generative AI
Introduction
Many years ago—back in 1999—my late uncle gave me a printed autobiography written by August Anton (1830-1911), my great-great-grandfather. It was a fascinating 30-page document detailing his childhood in Germany, his involvement in the 1848 revolution, and his journey to America. I read it and shared it with my kids, treating it as a treasured link to a side of the family I barely knew.
But as the years passed, I started worrying about preservation. The document appeared to be a photocopy of a photocopy. The paper was aging, the ink was fading in places, and the whole thing had that characteristic look of documents that have been through too many copy machines. As far as I knew, only a handful of copies existed. I wanted to digitize it, but I kept procrastinating.
When I reached the point where I was experimenting with natural language processing and multimodal models, I decided to use this manuscript as a practical test case. I didn’t have a professional scanner, so I used my iPhone—a “worst-case scenario” input that would force me to write better code.

An example page showing the typical challenges: aged paper, faded ink, and photocopying artifacts that the OCR system needed to handle.
What started as a personal project to preserve family history turned into a deep dive into production-ready OCR systems. I ended up building a complete pipeline that combines traditional computer vision techniques with modern AI correction.
In this article, I’ll walk you through what I built and what I learned:
- Intelligent Preprocessing - How to optimize aged document images for OCR accuracy
- Region-Based Extraction - A technique that maintains document structure and reading order
- AI-Powered Correction - Using GPT-5 to fix OCR errors while preserving original meaning
- Interactive Viewer - A Streamlit app for validating results and catching errors
- Performance Benchmarking - Measuring accuracy and understanding trade-offs
The system handles batch processing, exports to multiple formats (TXT, Markdown, PDF), and achieves 91.8% accuracy on typed historical documents. Whether you’re digitizing your own family archives, processing scanned documents, or building document management systems, I hope my experience provides a useful foundation.
Measured Performance
Introducing OCR Accuracy Metrics
Before diving into the technical details, I need to explain how I measured whether the OCR was actually working. You can’t improve what you don’t measure, so I used two standard metrics:
Character Error Rate (CER) measures accuracy at the character level. It’s calculated as the minimum number of character-level edits (substitutions, deletions, and insertions) needed to transform the OCR output into the ground truth, divided by the total number of characters in the reference text:
CER = (substitutions + deletions + insertions) / total characters in reference
CER is ideal for measuring OCR quality because it captures all types of errors—misspelled words, missing characters, extra characters, and punctuation mistakes.
Word Error Rate (WER) measures accuracy at the word level:
WER = (word substitutions + word deletions + word insertions) / total words in reference
WER is useful for understanding readability and practical usability.
Both metrics use the Levenshtein distance algorithm to find the minimum edit distance between the OCR output and ground truth text.
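To make this concrete, here's a quick sanity check on a single word, using the python-Levenshtein package (the same library the benchmark script in Part III relies on):

import Levenshtein

# OCR read "Gernany" where the page says "Germany":
# one substitution ("m" misread as "n") out of 7 reference characters
reference = "Germany"
hypothesis = "Gernany"
cer = Levenshtein.distance(reference, hypothesis) / len(reference)
print(f"CER = {cer:.3f}")  # CER = 0.143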
Here’s what I measured on 5 pages of the August Anton documents:
| Approach | Character Error Rate (CER) | Processing Time | API Cost |
|---|---|---|---|
| Pytesseract alone | 0.082 (91.8% accuracy) | 3.28s/page | $0 |
| Pytesseract + GPT-5 (improved prompt) | 0.079* | 259.67s/page | ~$0.01/page |
| No preprocessing | Higher error rate | Similar | $0 |
*The improved prompt was critical. My first attempt at GPT-5 correction actually made things worse (CER >1.0) because the prompt was too vague and the model over-edited the text. I’ll explain the prompt design later.
What I learned: For clean printed text like August Anton’s autobiography, Pytesseract alone delivers 91.8% accuracy, better than I expected. Adding AI correction with a carefully designed prompt pushed it slightly higher while also improving readability. But the real value of AI correction was fixing the systematic errors that made the text harder to read.
Part I: Traditional OCR for Document Digitization
When I started this project, I assumed the hard part would be the OCR itself. I was wrong. The hard part was preparing the images so the OCR could succeed. Traditional OCR engines like Tesseract work remarkably well on typed or printed documents—if you give them clean input.
Prerequisites
System Requirements
- Python 3.11+ and familiarity with OpenCV/Pillow
- OpenAI API access for GPT-5 correction (optional but recommended)
- System dependencies: Tesseract OCR, Poppler (for PDF handling)
- Basic computer vision knowledge - understanding of image processing helps
Understanding the Input: The Challenge of Historical Documents
The August Anton autobiography presented several challenges that are typical of historical documents:
- Aged paper with yellowing and texture that confused color-based algorithms
- Faded or inconsistent ink from multiple generations of photocopying
- Artifacts from scanner noise and iPhone camera limitations
- Occasional multi-column layouts that needed proper reading order
- Varying font sizes between titles and body text
I needed preprocessing that could handle all of this without losing the text itself. The solution I settled on addresses these challenges systematically.
Image Preprocessing: The Foundation of Accuracy
The quality of OCR output depends on preprocessing. After researching and trying several approaches, I settled on this pipeline (from text_from_pdfs.py):
import cv2

def preprocess_image(img):
    """
    Preprocess image for better OCR results
    Steps:
    1. Convert to grayscale
    2. Apply median blur to reduce noise
    3. Use Otsu's thresholding for binarization
    """
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)

    # Apply median blur to remove noise while preserving edges
    # Kernel size of 5 works well for most scanned documents
    blurred = cv2.medianBlur(gray, 5)

    # Otsu's thresholding automatically determines the optimal threshold
    # THRESH_BINARY_INV inverts colors to create white text on black background
    # (Tesseract works better with light text on dark backgrounds)
    _, thresh = cv2.threshold(
        blurred,
        0,
        255,
        cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    return thresh
Why I chose these specific techniques
Grayscale conversion - Dropping color eliminates the yellowed-paper variation while preserving the text contrast that matters for OCR.
Median blur - Unlike Gaussian blur, median blur preserves edges while removing the salt-and-pepper noise from photocopying. The kernel size of 5 (creating a 5×5 neighborhood) was large enough to remove noise but small enough to preserve detail.
Otsu’s thresholding - Rather than manually tuning a threshold value for each image, Otsu’s method automatically finds the optimal threshold by analyzing the image histogram. The THRESH_BINARY_INV flag inverts colors because Tesseract works better with light text on dark backgrounds.
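Note that preprocess_image expects an RGB array (hence cv2.COLOR_RGB2GRAY), which is what you get when pages arrive via PIL or pdf2image rather than cv2.imread (which returns BGR). Here's a minimal usage sketch; the filenames are placeholders:

import cv2
import numpy as np
from PIL import Image

# Load a page photo as RGB via PIL, convert to a numpy array, preprocess, save
img = np.array(Image.open("images/page_01.jpg").convert("RGB"))
thresh = preprocess_image(img)
cv2.imwrite("output/page_01_proc.jpg", thresh)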
Region-Based Text Extraction: Maintaining Document Structure
My first attempt used whole-page OCR, which worked poorly. Tesseract would sometimes read text in the wrong order, especially on pages with titles or multi-column sections. The extracted text would jump around randomly, making it unreadable.
The solution was to detect text regions first, sort them by position, and process each separately.
import cv2
import pytesseract

def extract_text(img):
    """
    Extract text using region-based approach
    This method:
    1. Identifies text regions using morphological operations
    2. Sorts regions by Y-coordinate (top to bottom)
    3. Detects paragraph breaks based on vertical gaps
    """
    # Create rectangular kernel for dilation
    # Size (50, 40) connects nearby text into regions
    rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 40))

    # Dilation connects nearby characters into text blocks
    dilation = cv2.dilate(img, rect_kernel, iterations=1)

    # Find contours (text regions)
    contours, _ = cv2.findContours(
        dilation,
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_NONE
    )

    # Extract text from each region
    cnt_list = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        cropped = img[y:y + h, x:x + w]
        text = pytesseract.image_to_string(cropped)
        text = text.strip()
        if text:
            cnt_list.append((x, y, text))

    # Sort by y then x position to keep text in correct order
    sorted_list = sorted(cnt_list, key=lambda c: (c[1], c[0]))

    # Build final text with paragraph detection
    all_text = []
    last_y = 0
    for x, y, txt in sorted_list:
        gap = y - last_y
        if gap > 30:  # Large gap indicates new paragraph
            all_text.append("\n\n")
        elif gap > 1:
            all_text.append("\n")
        else:
            all_text.append(" ")
        all_text.append(txt)
        last_y = y
    return ''.join(all_text)
What I learned about region detection:
- Morphological dilation connects nearby characters into coherent regions.
- Y-coordinate sorting preserves reading order.
- Paragraph detection based on vertical gaps maintains paragraph structure surprisingly well.
Note on Complex Layouts: While sorting by Y-coordinate works for single-column text or simple layouts, it can fail on complex multi-column documents (like newspapers) where text should be read down one column before moving to the top of the next.
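If I ever need multi-column support, one approach would be to cluster regions into columns by X-coordinate before sorting within each column. Here's an untested sketch of that idea; the 200-pixel column threshold is a guess that would need tuning per document:

def sort_regions_by_column(regions, col_threshold=200):
    """Group (x, y, text) regions into columns, then read each column top to bottom."""
    columns = []  # each column is a list of regions with similar x positions
    for region in sorted(regions, key=lambda r: r[0]):
        for col in columns:
            if abs(col[0][0] - region[0]) < col_threshold:
                col.append(region)
                break
        else:
            columns.append([region])  # no nearby column found; start a new one
    # Read columns left to right, regions within a column top to bottom
    ordered = []
    for col in sorted(columns, key=lambda c: c[0][0]):
        ordered.extend(sorted(col, key=lambda r: r[1]))
    return ordered

This would slot in as a replacement for the simple (y, x) sort in extract_text above.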
Batch Processing: Production-Scale Document Handling
Even when processing a small batch of documents, building a robust pipeline is useful. I also wanted traceability across intermediate artifacts for each page for comparison and debugging purposes.
import os
import pandas as pd

def main():
    """Process multiple images in batch"""
    output_dir = "output"
    os.makedirs(output_dir, exist_ok=True)

    # Read list of input files
    with open("input_file_list.txt") as f:
        files = [line.strip() for line in f if line.strip()]

    results = []
    extracted_texts = []
    for image_path in files:
        print(f"\nProcessing {image_path}...")
        try:
            # process_image chains preprocess_image and extract_text,
            # saving the preprocessed image for later inspection
            text = process_image(image_path, output_dir)
            extracted_texts.append(text)
            results.append({
                'image_path': image_path,
                'extracted': text,
                'status': 'success'
            })
        except Exception as e:
            print(f"Error processing {image_path}: {e}")
            results.append({
                'image_path': image_path,
                'status': 'failed',
                'error': str(e)
            })

    # Save results to CSV for traceability
    df = pd.DataFrame(results)
    df.to_csv(os.path.join(output_dir, 'results.csv'), index=False)

    # Save combined extracted text
    with open(os.path.join(output_dir, 'extracted.txt'), 'w') as f:
        f.write('\n\n'.join(extracted_texts))
This approach creates an audit trail that is useful during development:
- results.csv - Shows which pages succeeded or failed, with error details
- extracted.txt - Combined text output for easy reading
- Preprocessed images - Saved for manual inspection when OCR results looked wrong
AI-Powered OCR Correction: Fixing Common Errors
Even after careful preprocessing, Tesseract made predictable mistakes:
- “rn” would be misread as “m”
- “l” (lowercase L) confused with “I” (capital i)
- Missing or extra spaces where the text was faded
- Broken words at line endings that should have been hyphenated
I realized GPT-5 could fix these errors by understanding context. A human reading “Gernany”—or even the mangled “Germ any”—knows the word is “Germany”. Why couldn’t an LLM do the same?
The challenge was getting GPT-5 to fix errors without “improving” the text. My first attempts failed spectacularly—the model would rewrite entire sentences to sound better, destroying the original meaning.
import os
from dotenv import load_dotenv
from openai import OpenAI

def ask_the_english_prof(client, text):
    """
    Use GPT-5 to correct OCR errors
    The prompt emphasizes:
    - Fixing ONLY OCR errors (misspellings, character misrecognitions)
    - Preserving EXACT original structure, formatting, and meaning
    - NOT rewriting, reformatting, or improving the text
    - Responding only with corrected text
    """
    system_prompt = """You are an expert at correcting OCR errors in scanned documents.
Your task is to fix OCR mistakes while preserving the original text structure,
formatting, and meaning exactly as written."""

    user_prompt = f"""The following text was extracted from a scanned document using OCR.
It contains OCR errors that need to be corrected.
IMPORTANT INSTRUCTIONS:
- Fix ONLY OCR errors (misspellings, character misrecognitions, punctuation mistakes)
- Preserve the EXACT original structure, line breaks, spacing, and formatting
- Do NOT rewrite, reformat, or improve the text
- Do NOT add explanations, suggestions, or commentary
- Do NOT change the writing style or voice
- Return ONLY the corrected text, nothing else
OCR text to correct:
{text}"""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    completion = client.chat.completions.create(
        model="gpt-5",
        messages=messages
    )
    return completion.choices[0].message.content

# Process all pages with correction
load_dotenv()
client = OpenAI()
corrected_texts = []
for text in extracted_texts:
    corrected = ask_the_english_prof(client, text)
    corrected_texts.append(corrected)

# Save corrected output
with open(os.path.join(output_dir, 'corrected.txt'), 'w') as f:
    f.write('\n\n'.join(corrected_texts))
Cost considerations
- GPT-5 API pricing: $1.25 per 1M input tokens, $10.00 per 1M output tokens (OpenAI Pricing)
- Actual cost: ~$0.01 per page (measured across all 30 pages)
- Total for the August Anton project: ~$0.30
The real trade-off was time—AI correction took about 80× longer than raw OCR. For a 30-page document, that’s the difference between a few minutes and a few hours. But since it ran unattended, I didn’t care.
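If you want to verify per-page cost yourself, each API response carries token counts. A small sketch, assuming the completion object from ask_the_english_prof above and the per-1M-token prices quoted (verify against the pricing page):

# Rough cost check using the token usage reported on each response
INPUT_PRICE_PER_TOKEN = 1.25 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 10.00 / 1_000_000

usage = completion.usage
page_cost = (usage.prompt_tokens * INPUT_PRICE_PER_TOKEN
             + usage.completion_tokens * OUTPUT_PRICE_PER_TOKEN)
print(f"~${page_cost:.4f} for this page")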
⚠️ Important: Prompt Sensitivity
My initial prompt was something like: “Correct any typos in this text using common sense.” GPT-5 took this as license to “improve” the text:
- Rewrite awkward phrasings
- “Fix” old-fashioned word choices
- Restructure sentences for clarity
The result? A CER of 1.209—worse than doing nothing.
After testing dozens of variations, the prompt shown above emerged as the winner. It achieves CER 0.079 by explicitly telling the model to:
- Fix ONLY OCR errors, not improve the writing
- Preserve EXACT structure and formatting
- NOT rewrite, reformulate, or modernize anything
- Return ONLY the corrected text with no commentary
The difference between these prompts represents a 93% reduction in errors. Prompt design matters enormously for this use case.
Running the Document OCR Pipeline
Setup:
# Install system dependencies
brew install tesseract poppler # macOS
# apt-get install tesseract-ocr poppler-utils # Ubuntu
# Create virtual environment and install dependencies
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Set API key
echo "OPENAI_API_KEY=your-key-here" > .env
Processing documents:
# Process all images listed in input_file_list.txt
python text_from_pdfs.py
# Or limit to first N images
python text_from_pdfs.py --max 5
Outputs:
- output/results.csv - Processing results and metadata
- output/extracted.txt - Raw OCR text
- output/corrected.txt - AI-corrected text
- output/*_proc.jpg - Preprocessed images
Markdown Formatting with AI: Creating Structured Documents
After OCR correction, I had accurate plain text—but it was still just plain text. For sharing with family members, I wanted something more presentable. That’s where the Markdown formatting step came in.
Why a separate formatting step?
I initially tried to have GPT-5 do both correction and formatting in one pass, but it confused the model and led to over-editing. Separating the concerns worked much better.
The make_md.py script handles this second-stage processing:
def gen_markdown(client, text):
    """
    Convert plain text to structured Markdown
    Uses GPT-5 for intelligent formatting
    """
    messages = [
        {
            "role": "system",
            "content": """You are a helpful AI text processing assistant.
You take plain text and process it intelligently into markdown formatting
for structure, without altering the contents.
Look at the structure and introduce appropriate formatting.
Avoid adding headings unless they appear in the text.
Do not change the text in any other way.
Output raw markdown and do not include any explanation or commentary."""
        },
        {
            "role": "user",
            "content": str(text)
        }
    ]
    completion = client.chat.completions.create(
        model="gpt-5",
        messages=messages
    )
    return completion.choices[0].message.content
Running markdown generation:
# Process all pages
python make_md.py --file output/results.csv
# Limit to first N pages
python make_md.py --file output/results.csv --max 10
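The PDF step below expects a combined output/pages.md. I won't reproduce all of make_md.py here, but stitching the per-page markdown together is just concatenation. A minimal sketch, assuming the gen_markdown outputs have been collected in a list (markdown_pages is a hypothetical name):

# Combine per-page markdown into one file for pandoc,
# with a horizontal rule between pages
with open("output/pages.md", "w") as f:
    f.write("\n\n---\n\n".join(markdown_pages))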
Optional: PDF Generation
# Install Pandoc
brew install pandoc # macOS
# sudo apt-get install pandoc # Ubuntu
# Generate PDF
pandoc output/pages.md -o output/document.pdf
Part II: Building the Interactive Viewer
Early in the project, I realized I needed a way to validate results without manually opening CSV files and cross-referencing image files. When you’re processing 30 pages, you want to quickly spot problems and verify the OCR worked correctly. That’s why I built an interactive viewer using Streamlit.
The viewer lets you inspect preprocessing results, compare raw OCR with corrected text, and navigate through the document page by page.
Viewer Architecture
Here’s the core of the viewer (viewer_app.py):
import os
import streamlit as st
import pandas as pd
from PIL import Image
from common import get_preproc_path

st.set_page_config(
    page_title="August OCR",
    page_icon="📖",
    layout="wide",
)

def main():
    st.title("OCR Comparison App")
    st.write("""This shows traditional OCR using PyTesseract, Pillow, and opencv-python.
It performs preprocessing steps to improve results, then uses OpenAI's GPT-5 to correct the OCR output.
This works best for typed or printed documents.""")

    results_file = "output/results.csv"
    if not os.path.exists(results_file):
        st.warning(f"Results file not found: {results_file}")
        st.info("Run `python text_from_pdfs.py` to generate document OCR results.")
        return

    df = pd.read_csv(results_file)
    n_pages = len(df)
    if n_pages == 0:
        st.write("No pages to show")
        return

    # Page navigation via slider (simplified for this article)
    page = st.slider('Select Page', 1, n_pages, 1)

    # Display current page content
    image_path = df.loc[page - 1, 'image_path']
    extracted_text = df.loc[page - 1, 'extracted']
    corrected_text = df.loc[page - 1, 'corrected']
    output_dir = "output"

    image = Image.open(image_path)
    pre_path = get_preproc_path(image_path, output_dir)
    pre_image = Image.open(pre_path) if os.path.exists(pre_path) else image

    col1, col2 = st.columns(2)
    with col1:
        st.image(image, caption=f'Original Page {page}', use_container_width=True)
    with col2:
        st.image(pre_image, caption=f'Preprocessed Page {page}', use_container_width=True)

    col1, col2 = st.columns(2)
    with col1:
        st.subheader("Extracted Text")
        st.write(extracted_text)
    with col2:
        st.subheader("Corrected Text")
        st.write(corrected_text)

    if corrected_text and isinstance(corrected_text, str):
        char_count = len(corrected_text)
        word_count = len(corrected_text.split())
        st.caption(f"{word_count} words, {char_count} characters")

# Streamlit executes the script top to bottom, so just call main()
main()

This side-by-side comparison was crucial for debugging:
- Original - Shows the input quality
- Preprocessed - Lets me verify that preprocessing actually improved clarity
- Extracted - Raw Pytesseract output
- Corrected - GPT-5 corrected text where I can verify it fixed errors without rewriting
Example: OCR Errors and GPT-5 Corrections
Before (Raw OCR):
Approached from many sides to write down my life's memories as well as the events
Of the year '48, as far as | was personally touched by them, and to publish these
Memories, | will herewith fulfill the wish of my friends and only ask for your kind
indulgence, if my descriptions fail to be elegant. Well then, | will do the best / can.
Once upon a time, many, many years ago, in the old city of Zerbst, in the beautiful
'and of Anhalt, located in the German homeland, a strong boy was born to an honest
Oraper by his mistress.
After (GPT-5 Corrected):
I was approached from many sides to write down my life's memories as well as the events
of the year '48, as far as I was personally touched by them, and to publish these
memories. I will herewith fulfill the wish of my friends and only ask for your kind
indulgence, if my descriptions fail to be elegant. Well then, I will do the best I can.
Once upon a time, many, many years ago, in the old city of Zerbst, in the beautiful
land of Anhalt, located in the German homeland, a strong boy was born to an honest
draper by his mistress.
These are exactly the kinds of systematic OCR errors that are tedious to fix manually but trivial for an LLM with proper context.
Running the Viewer
streamlit run viewer_app.py
The viewer opens at http://localhost:8501 with:
- Page navigation (slider)
- Image comparison (original vs. preprocessed)
- Text comparison (extracted vs. corrected)
Part III: Performance Analysis and Best Practices
Benchmarking OCR Accuracy
I couldn’t improve the system without measuring it objectively. The benchmark.py script compares different approaches using standard metrics:
from Levenshtein import distance as levenshtein_distance

def calculate_cer(reference, hypothesis):
    """
    Character Error Rate (CER)
    CER = (substitutions + deletions + insertions) / total characters
    Lower is better. 0.0 = perfect; values above 1.0 are possible
    when fixing the output needs more edits than the reference has characters.
    """
    if not reference:
        return 1.0 if hypothesis else 0.0
    distance = levenshtein_distance(reference, hypothesis)
    return distance / len(reference)

def calculate_wer(reference, hypothesis):
    """
    Word Error Rate (WER)
    WER = (substitutions + deletions + insertions) / total words
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if hyp_words else 0.0
    distance = levenshtein_distance(ref_words, hyp_words)
    return distance / len(ref_words)
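And a sketch of how these functions get applied across pages, assuming the ground_truth/*_ref.txt layout created by the template step below (the matching *_ocr.txt naming is a hypothetical example):

import glob
import os

# Compare each ground-truth file against the OCR output for the same page
for ref_path in sorted(glob.glob("ground_truth/*_ref.txt")):
    hyp_path = ref_path.replace("_ref.txt", "_ocr.txt")  # hypothetical naming
    with open(ref_path) as f:
        reference = f.read()
    with open(hyp_path) as f:
        hypothesis = f.read()
    print(os.path.basename(ref_path),
          f"CER={calculate_cer(reference, hypothesis):.3f}",
          f"WER={calculate_wer(reference, hypothesis):.3f}")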
Running benchmarks:
# Create ground truth template
python benchmark.py --input images/ --create-template
# Edit ground_truth/*_ref.txt files to add correct text for each image
# Run benchmark comparing different methods
python benchmark.py \
--input images/ \
--methods pytesseract pytesseract_no_preprocess pytesseract_gpt5 \
--output benchmark_results.csv \
--report benchmark_report.md
Performance Comparison
After manually creating ground truth for 5 pages, here’s what I measured:
| Method | CER (avg) | WER (avg) | Speed (CPU) | Cost |
|---|---|---|---|---|
| Pytesseract | 0.082 | 0.196 | 3.28s/page | Free |
| Pytesseract + GPT-5 (improved prompt) | 0.079* | 0.177* | 259.67s/page | ~$0.01/page |
*Results shown use the improved prompt from this article. A vague prompt achieved CER 1.209 (worse than raw OCR). See “Prompt Sensitivity” section above.
What surprised me:
- Pytesseract was better than expected: CER 0.082 (91.8% accuracy) right out of the box.
- GPT-5 correction helped, but marginally: CER improved from 0.082 to 0.079. The real benefit was readability.
- Preprocessing mattered more than I expected: The “no preprocessing” baseline had noticeably higher error rates.
When to Use Which Approach
For printed historical documents like mine:
- Start with Pytesseract alone and see if the accuracy is good enough
- If the text is readable but has annoying systematic errors, add LLM correction
- Test your correction prompt thoroughly
- Build a viewer for quality assurance
Part IV: Advanced Extensions
The system I built was tailored for the August Anton project, but there are several extensions that would make it more generally useful.
Multi-Language Support
The August Anton autobiography was in English, but I could imagine needing German language support for other family documents. Tesseract makes this straightforward:
Installing language packs:
# Install additional languages
brew install tesseract-lang # Installs all languages
# Or specific languages
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
# Use in code
text = pytesseract.image_to_string(image, lang='deu') # German
text = pytesseract.image_to_string(image, lang='fra') # French
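Tesseract can also combine languages in a single pass, which helps with documents that mix German and English (as family papers often do):

# Combine language models with '+' (listed in priority order)
text = pytesseract.image_to_string(image, lang='deu+eng')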
Vision-Assisted Correction with GPT-5
One extension I considered but didn’t implement was using GPT-5’s vision capabilities. Instead of giving the model only the OCR text to correct, you’d also pass the page image for reference:
import base64

def correct_with_vision(client, image_path, extracted_text):
    """Use GPT-5 vision input for context-aware correction (check model capabilities)"""
    # Encode image as base64 for the data URL
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"""Using the image for context, correct any OCR errors
in this text. Respond only with corrected text:
{extracted_text}"""
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages
    )
    return response.choices[0].message.content
Conclusion
What started as a personal project to preserve my great-great-grandfather’s autobiography turned into a deep exploration of OCR systems. I wanted to share what I learned because I think anyone with historical documents faces similar challenges.
Key Takeaways
Here’s what I learned from building this OCR system:
Traditional OCR is better than you think - I went into this expecting to need cutting-edge ML models. It turns out Tesseract, an engine whose roots reach back to the 1980s, delivers 91.8% accuracy on typed historical documents. The real work is in preprocessing and quality assurance, not the OCR engine itself.
Preprocessing matters enormously - The difference between raw images and preprocessed images was substantial. Grayscale conversion, noise reduction, and thresholding aren’t exciting, but they’re essential for good results on aged documents.
LLM correction is powerful but fragile - GPT-5 can fix OCR errors beautifully if you prompt it correctly. But a vague prompt will make things worse. I learned this the expensive way by watching GPT-5 rewrite entire paragraphs when all I wanted was OCR error correction. Test your prompts thoroughly.
Build a viewer for quality assurance - The viewer caught dozens of problems I would have missed and made debugging infinitely easier.
Know your use case - I needed high accuracy because this was family history that would be preserved for generations. If you’re just trying to make documents searchable, you can accept lower accuracy.
The human element still matters - Even with 91.8% accuracy, I still read through the final output to catch the errors that automated systems missed.
If You’re Tackling a Similar Project
Here’s my advice if you’re digitizing your own historical documents:
- Start simple - Don’t overcomplicate things. Try Pytesseract first and see if it’s good enough.
- Measure objectively - Create ground truth for a few pages and calculate CER/WER.
- Build validation tools - A simple viewer will save you hours of debugging.
- Test LLM prompts thoroughly - If you go the AI correction route, test extensively before processing your entire collection.
- Keep the originals - Save preprocessed images and intermediate outputs.
- Document your process - Future you (or future family members) will thank you for writing down what you did and why.
- Accept imperfection - Even at 91.8% accuracy, you’ll have errors. Plan for human review of important sections.
Resources and References
Academic Papers
- OpenAI. (2025). GPT-5 System Card. https://openai.com/index/gpt-5-system-card/
- Smith, R. (2007). An Overview of the Tesseract OCR Engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR '07), pp. 629-633. IEEE Computer Society. The foundation of open-source OCR technology.
Documentation and Tools
Official Documentation:
- Tesseract OCR Documentation
- OpenAI API Documentation (Pricing Page)
- OpenCV Documentation
- Streamlit Documentation
About This Project
I built this system to preserve my great-great-grandfather’s autobiography, but I hope the techniques I developed help others preserve their own family histories. Too many historical documents sit in closets, slowly deteriorating, because digitization seems too complex or expensive.
It doesn’t have to be. With open-source tools like Tesseract and modern AI services like GPT-5, you can build a system that delivers production-quality results for less than a dollar per document. The real investment is your time learning how to do it right—which is why I wrote this article.
Acknowledgments
This project built on the work of countless open-source contributors and researchers. Special thanks to:
- The Tesseract OCR team for building and maintaining an incredibly capable open-source OCR engine
- OpenAI for GPT-5, which made intelligent error correction accessible
- The OpenCV community for image processing tools
- The Streamlit team for making it trivially easy to build interactive apps
And most importantly, thank you to my late uncle for preserving August Anton’s autobiography and sharing it with the family. This project exists because he saw value in keeping family history alive.