How I wrote a PDF editor that really edits a PDF — Part 2
In part 1, CrabPDF was barely an editor.
It could render a PDF, put invisible spans over the text, let me double-click a word, whiteout the old text, and draw the new text with pdf-lib.
That was enough to prove the idea.
But the moment I started using it on real PDFs, the cracks showed up immediately.
Some PDFs had scanned pages. Some text boxes were too big. Some words were grouped badly. Some pages were not white. Some edits needed to be repeated. Some text needed to be redacted for real, not just covered with a black rectangle.
So part 2 became less about:
Can I edit a PDF?
and more about:
Can I make the ugly cases usable?
OCR: making scanned PDFs editable
The first version only worked with PDFs that already had a text layer.
But many PDFs are just images inside a PDF container. For those, pdf.js renders the page fine, but getTextContent() returns nothing.
So I added OCR with Tesseract.js.
The flow is simple:
- Render each PDF page to an offscreen canvas.
- Send the canvas to Tesseract.
- Read the recognized words.
- Convert every OCR word into a normal textItem.
- Reuse the same editor UI.
result.data.words.forEach(function(word) {
if (!word.text || word.text.trim() === '') return;
if (word.confidence < 15) return;
var b = word.bbox;
var h = b.y1 - b.y0;
var w = b.x1 - b.x0;
textItems.push({
str: word.text,
x: b.x0,
y: b.y0,
w: w,
h: h,
fontSize: Math.max(h * 0.80, 6),
pageNum: pageNum,
originalFont: 'OCR',
fontFamily: 'Helvetica',
fromOcr: true,
confidence: Math.round(word.confidence)
});
});
This does not magically turn the scanned PDF into a real structured document.
It creates an editable overlay model.
When I edit OCR text, CrabPDF whites out the matching image area and draws new text over it. It is destructive, but for small corrections it works surprisingly well.
OCR correction sidebar
Once OCR existed, the next problem was obvious: OCR is wrong a lot.
So I added an OCR correction sidebar.
Each OCR word has a confidence score. Low-confidence words are collected into a review panel. Clicking an item scrolls to the word and opens the editor.
The workflow becomes:
- Run OCR.
- Open OCR fixes.
- Review suspicious words.
- Correct them one by one.
This made scanned PDFs much more practical. Instead of hunting for tiny OCR mistakes manually, the app can guide you to the risky parts.
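The collection step behind the review panel can be sketched as a plain filter over the existing items. The helper name and the threshold are illustrative, not the exact CrabPDF code; the field names follow the textItem shape used throughout.

```javascript
// Hypothetical sketch: gather low-confidence OCR words for the review panel,
// most suspicious first. Threshold is illustrative.
function collectOcrFixes(textItems, threshold) {
  return textItems
    .filter(function (item) {
      return item.fromOcr &&
        typeof item.confidence === 'number' &&
        item.confidence < threshold;
    })
    .sort(function (a, b) {
      return a.confidence - b.confidence; // lowest confidence first
    });
}
```

Clicking an entry in the resulting list would then scroll to that item's span and open the normal editor.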
The important part is that OCR words are not a special editing path. They are still just textItems.
{
str: 'Invoice',
x: 100,
y: 200,
w: 70,
h: 20,
fromOcr: true,
confidence: 91
}
That decision made the rest of the editor simpler.
Native PDF text and OCR text use the same drag/edit/replace pipeline.
Hiding OCR boxes
At first, OCR boxes were always visible.
That was useful for debugging, but terrible for actually reading the document. The page looked like it had been attacked by green, yellow, and red rectangles.
So I added:
state.showOcrBoxes = true;
and a toggle.
Now OCR confidence boxes can be shown while reviewing and hidden while editing normally.
It is a tiny feature, but it changes the feel of the app a lot. Debug UI should not always be product UI.
Smart whiteout background color
The original editor always covered old text with a white rectangle.
page.drawRectangle({
x: pdfX,
y: pdfY,
width: w,
height: h,
color: PDFLib.rgb(1, 1, 1)
});
That is fine on clean white PDFs.
It is bad on scanned pages, yellowed paper, gray backgrounds, and screenshots. You remove the text, but you leave a very obvious white patch.
So I added background sampling.
Since the page is already rendered to a canvas, the editor can sample pixels around the text box and estimate the local background color. That value is stored on the textItem.
item.backgroundColor = sampleBackgroundColorFromCanvas(canvas, item);
Then whiteoutItem() can use that color instead of pure white.
This is still approximate. It will not reconstruct complex images or gradients. But for slightly off-white scanned documents, it is much better.
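The sampling idea can be sketched as a pure function over the page's ImageData (as returned by ctx.getImageData): average a thin ring of pixels around the box and skip the box interior, since the interior contains the text itself. The function name, padding, and averaging strategy here are my assumptions, not the exact CrabPDF code.

```javascript
// Hypothetical sketch: estimate the local background color by averaging a
// border of `pad` pixels around the text box, excluding the box interior.
function sampleBackgroundColor(imageData, box, pad) {
  var d = imageData.data, W = imageData.width, H = imageData.height;
  var r = 0, g = 0, b = 0, n = 0;
  for (var y = box.y - pad; y < box.y + box.h + pad; y++) {
    for (var x = box.x - pad; x < box.x + box.w + pad; x++) {
      if (x < 0 || y < 0 || x >= W || y >= H) continue;
      // Skip pixels inside the box itself: they contain the old text.
      var inside = x >= box.x && x < box.x + box.w &&
                   y >= box.y && y < box.y + box.h;
      if (inside) continue;
      var i = (y * W + x) * 4;
      r += d[i]; g += d[i + 1]; b += d[i + 2]; n++;
    }
  }
  if (!n) return { r: 255, g: 255, b: 255 }; // fall back to white
  return { r: Math.round(r / n), g: Math.round(g / n), b: Math.round(b / n) };
}
```

An average is crude but cheap; a median would resist stray dark pixels better at slightly more cost.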
One bug I hit here was funny:
var r = parseInt(hex.substring(0, 2), 16) || 255;
This breaks black.
parseInt('00', 16) returns 0, and 0 || 255 becomes 255.
So black became white.
The fix is to check NaN explicitly.
var r = parseInt(hex.substring(0, 2), 16);
if (Number.isNaN(r)) r = 255;
Tiny bug, very annoying result.
True redaction
Covering sensitive text with a black rectangle is not redaction.
If the original text is still inside the PDF, someone can select it, copy it, extract it, or remove the rectangle.
So I added a separate “true redaction” export.
The workflow:
- Enable redact mode.
- Draw redaction boxes on the page.
- Export redacted PDF.
- Render every page to canvas.
- Burn black rectangles into the canvas.
- Create a new PDF where each page is an image.
This means the exported redacted PDF is no longer text-selectable.
That is intentional.
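The burn step can be sketched as a pure pixel operation. In the real export the blacked-out canvas would then be re-encoded as an image (for example via canvas.toDataURL and pdf-lib's embedPng) and become the new page; the names below are illustrative.

```javascript
// Hypothetical sketch of the burn step: force every pixel inside each
// redaction box to opaque black, directly in the page's ImageData. After
// this, the canvas is exported as an image and becomes the new page, so the
// original text bytes never reach the output PDF.
function burnRedactions(imageData, boxes) {
  var d = imageData.data, W = imageData.width, H = imageData.height;
  boxes.forEach(function (box) {
    var x0 = Math.max(0, Math.floor(box.x));
    var y0 = Math.max(0, Math.floor(box.y));
    var x1 = Math.min(W, Math.ceil(box.x + box.w));
    var y1 = Math.min(H, Math.ceil(box.y + box.h));
    for (var y = y0; y < y1; y++) {
      for (var x = x0; x < x1; x++) {
        var i = (y * W + x) * 4;
        d[i] = 0; d[i + 1] = 0; d[i + 2] = 0; d[i + 3] = 255;
      }
    }
  });
  return imageData;
}
```

Because the redaction happens in pixel space before the new PDF is built, there is no hidden text object left to recover.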
Normal PDF export: editable, text may remain recoverable.
Redacted PDF export: flattened, text is not recoverable.
This is one of those features where the “worse” technical result is the safer product result.
A flattened PDF is less flexible, but that is exactly why it works for redaction.
Find and replace
Once text editing worked, the next obvious tool was find and replace.
At first I thought of adding a small floating box, but that felt too limited. So I built it more like VS Code: a sidebar with occurrences grouped by page.
The search runs over state.textItems.
state.textItems.forEach(function(item) {
if (!item.str || !item.str.trim()) return;
var haystack = normalizeText(item.str);
var pos = haystack.indexOf(query);
if (pos !== -1) {
matches.push({
item: item,
pageNum: item.pageNum,
preview: item.str,
indexInText: pos
});
}
});
The result list shows every occurrence. Clicking a result scrolls to the page and highlights the word.
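normalizeText is not shown above; a plausible minimal version (an assumption about its behavior, not the actual code) lowercases and collapses whitespace, so that spacing and case differences do not break matches.

```javascript
// Hypothetical sketch of normalizeText: lowercase, collapse runs of
// whitespace into single spaces, and trim.
function normalizeText(s) {
  return s.toLowerCase().replace(/\s+/g, ' ').trim();
}
```

The query would need to be normalized the same way before calling indexOf, otherwise the positions and comparisons would not line up.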
Replace current is simple: edit one textItem.
Replace all is intentionally dumb for now: it loops through every match and calls the same applyEdit() function used by manual editing.
// this loop lives inside an async function, so await is valid here
for (var i = 0; i < matches.length; i++) {
var item = matches[i].item;
var newText = replaceInString(item.str, query, replacement);
await applyEdit(
item,
newText,
item.fontFamily || 'Helvetica',
item.fontSize / state.scale,
item.color || '#000000',
item.rotation || 0
);
}
This is not fast on huge documents. It reloads and saves the PDF too many times.
But it has one big advantage: it reuses the exact same editing path, so the behavior is predictable.
The future version should batch edits and save once.
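One way to structure that batched version is to compute the whole edit plan up front, so the document can be loaded once, mutated for every step, and saved once at the end. buildReplacePlan is a hypothetical helper, and the replaceInString shown here is just one plausible implementation of the function referenced above.

```javascript
// One plausible implementation of replaceInString: replace every
// occurrence, not only the first.
function replaceInString(str, query, replacement) {
  return str.split(query).join(replacement);
}

// Hypothetical sketch: precompute every edit so a batched replaceAll can
// load the PDF once, apply each step in memory, and save a single time.
function buildReplacePlan(matches, query, replacement) {
  return matches.map(function (m) {
    return {
      item: m.item,
      newText: replaceInString(m.item.str, query, replacement)
    };
  });
}
```

With the plan in hand, the slow part (pdf-lib load and save) happens once instead of once per match.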
A right sidebar for tools
At some point the top bar became ridiculous.
It had upload, download, OCR, hide OCR boxes, fix OCR, find and replace, redact, download redacted, clear redactions.
It looked less like a PDF editor and more like a debug panel.
So I moved toward a VS Code style activity bar on the right.
The top bar should only contain global actions:
Upload PDF
Download PDF
Feature-specific tools live in the right sidebar:
Search
OCR
Redact
This made the UI feel much more scalable.
Each tool can have its own panel, and adding a new feature no longer means squeezing another colored button into the top bar.
Internally it is still simple. The first implementation just orchestrates existing panels.
state.rightToolPanel = null; // 'find' | 'ocr' | 'redact' | null
Eventually I want every tool to behave like a mini plugin:
registerRightTool({
id: 'find',
icon: '🔍',
title: 'Find & Replace',
mount: mountFindReplacePanel
});
But for now, plain functions are enough.
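A minimal registry behind that registration call could look like this; the storage and lookup details are my assumptions, since only the registration API is sketched above.

```javascript
// Hypothetical sketch of the registry behind registerRightTool. Each tool
// contributes an id, icon, title, and a mount function; the sidebar only
// knows about the registry, never about individual tools.
var rightTools = [];

function registerRightTool(tool) {
  rightTools.push(tool); // { id, icon, title, mount }
}

function openRightTool(id, panelEl) {
  var tool = rightTools.find(function (t) { return t.id === id; });
  if (!tool) return false;
  tool.mount(panelEl); // the tool renders its own panel content
  return true;
}
```

Adding a new feature then means registering one object instead of wiring another button into the top bar.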
Formatting: underline and highlight
The first version could change:
font
size
color
rotation
But it still did not feel like a text editor.
So I added underline and highlight.
A textItem can now carry formatting:
{
underline: true,
highlightColor: '#fff3a3'
}
The inline editor previews these styles with CSS:
input.style.textDecoration = currentUnderline ? 'underline' : 'none';
input.style.background = currentHighlight || 'white';
When writing to the PDF, CrabPDF draws the highlight rectangle first, then the text, then the underline.
if (highlightColor) {
page.drawRectangle({
x: x - 1,
y: y - fontSize * 0.20,
width: width + 2,
height: fontSize * 1.15,
color: hexToPdfRgb(highlightColor)
});
}
page.drawText(text, { x, y, size: fontSize, font, color });
if (underline) {
page.drawLine({
start: { x: x, y: y - fontSize * 0.12 },
end: { x: x + width, y: y - fontSize * 0.12 },
thickness: Math.max(0.5, fontSize * 0.06),
color: color
});
}
This is not advanced typography, but it makes the editor more useful.
It also forced the data model to become more explicit. A text item is no longer just position and string. It is slowly becoming a tiny editable text object.
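hexToPdfRgb itself is not shown above; the core conversion (an assumed sketch, with a different name to make that clear) just scales each hex channel into the 0..1 range that pdf-lib's rgb() expects, with the NaN guard from the whiteout bug applied.

```javascript
// Hypothetical sketch of the conversion inside hexToPdfRgb: CSS hex colors
// are 0..255 per channel, pdf-lib's rgb() wants 0..1. Guard against NaN
// explicitly, since `|| fallback` would also swallow a legitimate 0.
function hexToRgb01(hex) {
  var h = hex.replace('#', '');
  function channel(offset) {
    var v = parseInt(h.substring(offset, offset + 2), 16);
    return Number.isNaN(v) ? 1 : v / 255;
  }
  return { r: channel(0), g: channel(2), b: channel(4) };
}
// In CrabPDF this result would feed PDFLib.rgb(c.r, c.g, c.b).
```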
Zoom
A fixed scale was fine while debugging, but bad for actual editing.
Small text is hard to click. OCR boxes are easier to review when zoomed in. Dragging is more precise when the page is bigger.
So I added zoom controls:
Zoom out
Zoom in
Reset
Fit width
The annoying part is that textItems are stored in canvas coordinates, not PDF coordinates.
So when state.scale changes, the editor must rescale the text items too.
var ratio = newScale / oldScale;
item.x *= ratio;
item.y *= ratio;
item.w *= ratio;
item.h *= ratio;
item.fontSize *= ratio;
The same has to happen for groups and redaction boxes.
This works, but it is not the perfect architecture. The better long-term model would store canonical PDF coordinates and derive canvas coordinates from the current scale.
For now, this implementation is good enough and makes the editor much nicer to use.
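That long-term model can be sketched as a pair of pure conversions: items live in PDF points, and canvas pixels are derived on demand from the current scale, so repeated zooms never compound rounding error. The field names follow the textItem shape used above; the function names are mine.

```javascript
// Hypothetical sketch of the canonical-coordinates model: store items in
// PDF points and derive canvas pixels from the current scale, instead of
// repeatedly rescaling stored canvas values.
function toCanvas(itemPdf, scale) {
  return {
    x: itemPdf.x * scale,
    y: itemPdf.y * scale,
    w: itemPdf.w * scale,
    h: itemPdf.h * scale,
    fontSize: itemPdf.fontSize * scale
  };
}

function toPdf(itemCanvas, scale) {
  return {
    x: itemCanvas.x / scale,
    y: itemCanvas.y / scale,
    w: itemCanvas.w / scale,
    h: itemCanvas.h / scale,
    fontSize: itemCanvas.fontSize / scale
  };
}
```

With this split, a zoom change is just a re-render at the new scale; no stored state mutates, so no drift accumulates.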
Surgical split: when PDF.js picks the wrong text
This was one of the most useful small features.
Sometimes the PDF text layer is weird.
You click what looks like one word, but the underlying text item is actually:
pippo: ciao pippo
or worse, the active box starts from the colon:
: ciao pippo
No automatic algorithm can guess the user’s intent perfectly.
So I added a manual escape hatch: Split selection.
The workflow:
- Double-click the problematic text item.
- The inline input opens.
- Select the substring you want to isolate, for example pippo:.
- Click Split selection.
- CrabPDF replaces the original textItem with multiple smaller items.
[pippo: ciao pippo]
becomes approximately:
[pippo:] [ ciao pippo]
The algorithm is intentionally simple. It splits the string and assigns widths proportionally to character count.
function addPart(str) {
if (!str) return;
var partW = item.w * (str.length / totalLen);
parts.push(Object.assign({}, item, {
str: str,
x: cursorX,
w: partW,
_span: null
}));
cursorX += partW;
}
This does not edit the PDF yet.
It edits the local interactive model.
After the split, the user can click and edit the isolated piece normally.
This is the kind of feature that feels small, but solves a real annoyance. Instead of pretending the picker can always be perfect, the tool lets the user intervene surgically.
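The addPart helper above runs inside a larger function, with item, totalLen, cursorX, and parts in scope. A self-contained version of the whole split, under the same proportional-width assumption, might look like this (splitTextItem is a hypothetical name, not the actual CrabPDF function).

```javascript
// Hypothetical self-contained split: given a textItem and a selected
// substring [selStart, selEnd), produce up to three smaller items whose
// widths are proportional to character count.
function splitTextItem(item, selStart, selEnd) {
  var totalLen = item.str.length;
  var cursorX = item.x;
  var parts = [];

  function addPart(str) {
    if (!str) return; // skip empty segments at the edges
    var partW = item.w * (str.length / totalLen);
    parts.push(Object.assign({}, item, { str: str, x: cursorX, w: partW }));
    cursorX += partW;
  }

  addPart(item.str.slice(0, selStart));      // text before the selection
  addPart(item.str.slice(selStart, selEnd)); // the isolated piece
  addPart(item.str.slice(selEnd));           // text after the selection
  return parts;
}
```

The selection offsets would come straight from the inline input's selectionStart and selectionEnd.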
The current architecture
At this point the project is no longer a single file.
The code is split roughly like this:
src/
  pdf/
    read.js          pdf.js setup and loading
    render.js        page rendering and text layer
    write.js         pdf-lib editing
    redact.js        true redaction export
    zoom.js          zoom logic
    fonts.js         font loading
  ocr/
    tesseract.js     OCR pipeline
  interaction/
    drag.js          move text and groups
    selection.js     selection rectangle
    grouping.js      line/proximity grouping
    editors.js       inline input and textarea editors
    split.js         split selected substring into textItems
  ui/
    topbar.js
    editToolbar.js
    rightSidebar.js
    findReplacePanel.js
    ocrSidebar.js
    zoomControls.js
    banners.js
    sidebar.js
    footer.js
It is still vanilla JavaScript. No framework. No build-time CSS system. Mostly plain DOM creation.
That is not elegant, but it is easy to inspect and easy to ship.
What still does not work well
There are still many rough edges.
Font matching is approximate. If the original PDF uses a custom embedded font, replacement text will not match perfectly unless the user uploads a similar font.
Whiteout is still approximate. Background sampling helps, but it cannot reconstruct complex images.
OCR editing is visually destructive. It covers part of the scanned image and draws new text over it.
Find and replace is item-based. If a sentence is split across multiple textItems, phrase search may not find it.
Replace all is slow because it applies one edit at a time.
Zoom rescales current UI coordinates, which can accumulate small floating point errors.
Split selection uses proportional character widths, which is not perfect for proportional fonts.
But the editor is now much more usable than the first version.
What I learned
The hard part is not drawing text into a PDF.
That part is easy.
The hard part is building an editable model on top of a format that does not want to be edited.
A PDF page is not a DOM tree. There are no paragraphs. There are no semantic words. Sometimes there is not even text, just pixels.
So CrabPDF is really a set of compromises:
- use pdf.js to read/render
- use pdf-lib to write
- build an approximate editable text layer
- let the user manually fix bad segmentation
- use OCR when there is no text
- flatten when security matters
- keep everything local
The most important lesson from part 2 is that trying to be fully automatic is a trap.
For real PDFs, the better product is often the one that gives the user a precise manual tool when the automatic model fails.
You can try it at crabpdf.com, free, no uploads, no backend, everything runs in the browser.
