Automating EPO Bulletin PDF Extraction

Automating EPO Bulletin PDF extraction is the process of turning the weekly European Patent Bulletin — a PDF whose internal structure swings between clean vector text, rasterized legacy scans, and multi-column tables — into validated, deadline-bearing bibliographic records a docketing engine can trust.

The Bulletin is a deadline-critical source: the date a mention is published in it starts statutory clocks (opposition, translation, and renewal windows). A parser that treats every issue as homogeneous will silently drop bibliographic fields, misclassify a publication kind code, or shift a publication date by a day across a daylight-saving boundary — and each of those becomes a mis-computed statutory deadline. This page defines the exact branching pipeline and a single fail-closed ingestion contract, and sits under the EPO Register Headless Browser Fallback layer that owns recovery when the PDF alone cannot yield a complete record.

Technical Specification: What the European Patent Bulletin Is

The European Patent Bulletin is the EPO’s official periodical, published every Wednesday under Article 129 EPC. Its mandatory content — the entries that must appear for each application and patent — is fixed by Rule 143 EPC, which enumerates the bibliographic data the Register (and therefore the Bulletin) carries: publication and grant numbers, dates, applicant and inventor data, IPC/CPC classification, and priority declarations. The mention of grant that starts the nine-month opposition period is itself a Bulletin publication under Article 97(3) EPC.

Three format facts drive the whole pipeline:

Concern	What the Bulletin gives you	Why it breaks naive parsers
Kind codes	WIPO ST.16 codes: `A1`/`A2`/`A3` (application), `B1`/`B2`/`B3` (grant, opposition-amended, limitation-amended)	Each code routes to a different statutory clock; misreading `B1` as `A1` suppresses the opposition deadline
Text layer	Modern issues carry a vector text layer; older/scanned inserts are raster images	The same PDF can mix both, so a single extraction method fails on part of the document
Time reference	Publication dates are the EPO calendar date (Munich, `Europe/Berlin`, CET/CEST)	Deadline math done in the server’s local time or naive UTC drifts by a day near midnight and across DST

Publication dates in the Bulletin are calendar dates in the EPO’s own time reference, so period computation must apply Rule 134 EPC (extension when a period expires on a day the EPO is not open for receipt of documents) rather than a generic weekend rule. The extraction layer’s only job is to recover the raw fields accurately and stamp their provenance; the deadline engine downstream owns the Rule 134 roll-forward.
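To make that split of responsibilities concrete, here is a minimal sketch of what the downstream engine might do with a recovered mention-of-grant date. It assumes month arithmetic per Rule 131(4) EPC (same day number in the target month, or the last day of that month if the day does not exist), followed by the Rule 134 roll-forward. The names add_months, roll_forward, opposition_deadline and the closed_days set are hypothetical; a real engine would load the EPO's published closing days rather than treating only weekends as closed.

```python
from __future__ import annotations

import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Same day number in the target month, clamped to that month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def roll_forward(day: date, closed_days: frozenset[date] = frozenset()) -> date:
    """Move to the next day the office is open (assumption: weekends closed)."""
    while day.weekday() >= 5 or day in closed_days:
        day += timedelta(days=1)
    return day


def opposition_deadline(
    mention_of_grant: date, closed_days: frozenset[date] = frozenset()
) -> date:
    """Nine months from the mention of grant, rolled forward if needed."""
    return roll_forward(add_months(mention_of_grant, 9), closed_days)
```

With these definitions, a mention of grant on 17.05.2023 gives a raw expiry of Saturday 17.02.2024, which rolls to Monday 19.02.2024; a mention on 29.05.2024 clamps to 28.02.2025 because February 2025 has no 29th.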

Minimal Reproducible Implementation

The pipeline hinges on one decision (does this page have a trustworthy vector text layer, or must it be rasterized and OCR’d?), followed by validation that fails closed. The extract_page function below makes that branch explicit and filters running headers/footers by coordinate. The BulletinRecord model normalizes the publication date to a timezone-aware value in the EPO reference zone using the standard-library zoneinfo module (never pytz, whose zones must be attached with localize() and yield a local-mean-time offset when passed to replace(tzinfo=...)), and refuses any record it cannot verify.

from __future__ import annotations

import hashlib
from datetime import datetime
from zoneinfo import ZoneInfo

import fitz  # PyMuPDF
from pydantic import BaseModel, StrictStr, field_validator

# EPO calendar reference for Bulletin dates and Rule 134 period computation.
EPO_TZ = ZoneInfo("Europe/Berlin")  # Munich; CET/CEST

# WIPO ST.16 kind codes that carry a deadline-bearing publication event.
GRANT_KIND_CODES: frozenset[str] = frozenset({"B1", "B2", "B3"})
MIN_VECTOR_CHARS = 50      # below this, treat the page as scanned
MIN_OCR_CONFIDENCE = 75.0  # Tesseract mean confidence floor


class BulletinRecord(BaseModel):
    """One validated Bulletin entry (Rule 143 EPC bibliographic data)."""

    publication_number: StrictStr          # e.g. "EP3765432"
    kind_code: StrictStr                    # WIPO ST.16, e.g. "B1"
    publication_date: datetime              # tz-aware, EPO reference zone
    source_pdf_sha256: StrictStr            # integrity anchor for audit
    extraction_method: StrictStr            # "vector" | "ocr"

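    @field_validator("publication_date", mode="before")
    @classmethod
    def to_epo_zone(cls, v: object) -> datetime:
        # Re-validated records arrive as datetimes; keep them only if tz-aware.
        if isinstance(v, datetime) and v.tzinfo is not None:
            return v.astimezone(EPO_TZ)
        # Raise ValueError (not AttributeError) so pydantic reports a
        # validation failure and the record fails closed.
        if not isinstance(v, str):
            raise ValueError("publication_date must be a DD.MM.YYYY string")
        # Bulletin prints DD.MM.YYYY; parse strictly, never regex-guess,
        # and attach the EPO zone so downstream Rule 134 math is deterministic.
        day = datetime.strptime(v.strip(), "%d.%m.%Y")
        return day.replace(tzinfo=EPO_TZ)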

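    @field_validator("kind_code")
    @classmethod
    def known_kind(cls, v: str) -> str:
        # The codes handled here are one letter plus one digit, so OCR
        # artifacts such as "Bl", "81", or "B12" are rejected.
        if not (len(v) == 2 and v[0] in {"A", "B"} and v[1].isdigit()):
            raise ValueError(f"Unrecognized ST.16 kind code: {v!r}")
        return v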


class FallbackRequired(Exception):
    """Vector extraction is not trustworthy for this page."""


def extract_page(page: fitz.Page, pdf_bytes: bytes) -> str:
    """Return page text via the vector layer, or signal an OCR fallback.

    Raises FallbackRequired when the vector layer is missing/too sparse or
    the page lists no fonts; the caller then routes to OCR.
    """
    vector = page.get_text("text").strip()
    fonts = page.get_fonts(full=True)
    if len(vector) >= MIN_VECTOR_CHARS and fonts:
        # Drop running headers/footers by y-coordinate. The 80/780 pt band
        # assumes an A4 page (595 x 842 pt).
        lines = [
            span["text"]
            for block in page.get_text("dict")["blocks"]
            for line in block.get("lines", [])  # image blocks have no lines
            for span in line["spans"]
            if 80 < span["bbox"][1] < 780
        ]
        return "\n".join(lines)
    raise FallbackRequired(page.number)  # -> OCR branch (see related page)


def sha256_of(pdf_bytes: bytes) -> str:
    # Hash the immutable source once; every record carries it for replay.
    return hashlib.sha256(pdf_bytes).hexdigest()

Pair the extractor with a version-pinned kind-code map, so that the docketing trigger each code resolves to is auditable rather than buried in branch logic:

# epo_bulletin_kind_codes.yaml
# Source: WIPO Standard ST.16 (kind-of-document codes) + EPC Art. 97/101/105b
# https://www.wipo.int/standards/en/part_03_standards.html
B1:                       # first grant publication
  event: MENTION_OF_GRANT
  starts_clock: OPPOSITION_9_MONTHS   # EPC Art. 99(1)
B2:                       # amended after opposition
  event: OPPOSITION_AMENDMENT
  starts_clock: TRANSLATION_FILING    # national validation windows
A1:                       # application + search report
  event: APPLICATION_PUBLISHED
  starts_clock: NONE
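One way to consume that map while keeping the fail-closed contract is sketched below. It assumes the YAML has already been loaded into a plain dict (mirrored here as KIND_CODE_MAP so the example runs without a YAML library); resolve_clock and UnmappedKindCode are hypothetical names, not part of the pipeline above.

```python
from __future__ import annotations

# Hypothetical in-memory form of epo_bulletin_kind_codes.yaml after loading.
KIND_CODE_MAP: dict[str, dict[str, str]] = {
    "B1": {"event": "MENTION_OF_GRANT", "starts_clock": "OPPOSITION_9_MONTHS"},
    "B2": {"event": "OPPOSITION_AMENDMENT", "starts_clock": "TRANSLATION_FILING"},
    "A1": {"event": "APPLICATION_PUBLISHED", "starts_clock": "NONE"},
}


class UnmappedKindCode(LookupError):
    """The code is absent from the pinned map; quarantine the record."""


def resolve_clock(
    kind_code: str, kind_map: dict[str, dict[str, str]] = KIND_CODE_MAP
) -> str | None:
    """Return the clock a kind code starts, or None if it starts none.

    An unmapped code raises instead of defaulting to "no clock", so a
    misread grant code cannot silently suppress a deadline.
    """
    try:
        entry = kind_map[kind_code]
    except KeyError:
        raise UnmappedKindCode(kind_code) from None
    clock = entry["starts_clock"]
    return None if clock == "NONE" else clock
```

The design choice worth noting is that "starts no clock" and "unknown code" are kept as two distinct outcomes: the first is a valid answer for A1, the second is an error that should route the record to review.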

Known Gotchas & Compliance Traps

Mixed vector/raster within one issue. A single weekly PDF often carries clean vector pages plus a scanned insert or a rasterized correction. Running one method over the whole file drops the scanned portion silently. Mitigation: branch per page, not per document — the extract_page guard above rejects sparse or fontless pages and routes only those to OCR, and every page records which extraction_method produced it.
Kind-code misclassification suppresses the opposition clock. OCR and column bleed can turn B1 into 81, Bl, or A1. Because only the B1 mention of grant starts the nine-month opposition window under Article 99(1) EPC, a misread B1 means no opposition deadline is ever docketed, which makes it the most dangerous silent failure here. A pattern check catches 81 and Bl but not A1, which is itself a valid code. Mitigation: validate against the ST.16 pattern, cross-check the number/date against the Register, and quarantine any code that fails either check.
DST and midnight drift on the publication date. The Bulletin date is an EPO calendar date, but a Wednesday publication reduced to naive UTC can land on Tuesday, shifting a computed deadline by a day. Mitigation: attach Europe/Berlin at parse time (as the validator does), run all period arithmetic against the EPO calendar, and add regression tests around the March/October DST weekends.
Multi-column reading order. PyMuPDF returns spans in the order the PDF’s content stream stores them, which is not guaranteed to match reading order and can interleave adjacent columns, corrupting field boundaries. Mitigation: assign each span a column bucket from its x-coordinate, sort by (column bucket, y, x) before joining, and treat any record whose fields fail the strict schema as a manual_review case rather than accepting fragments.
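The (column bucket, y, x) sort from the last item can be sketched as follows. The input is assumed to be a list of span dicts shaped like PyMuPDF's "dict" output (each with a "bbox" of x0, y0, x1, y1); reading_order and column_edges are hypothetical names, and the gutter positions would have to be measured from real Bulletin pages rather than taken from this example.

```python
from __future__ import annotations

import bisect


def reading_order(spans: list[dict], column_edges: list[float]) -> list[dict]:
    """Sort spans column by column, then top to bottom, then left to right.

    column_edges holds the x-coordinates of the gutters in ascending order,
    e.g. [297.5] for two equal columns on a 595 pt wide page (an assumption).
    """

    def key(span: dict) -> tuple[int, float, float]:
        x0, y0 = span["bbox"][0], span["bbox"][1]
        column = bisect.bisect_right(column_edges, x0)  # 0 = leftmost column
        return (column, y0, x0)

    return sorted(spans, key=key)


spans = [
    {"text": "R1", "bbox": (310.0, 100.0, 400.0, 110.0)},
    {"text": "L1", "bbox": (40.0, 100.0, 130.0, 110.0)},
    {"text": "L2", "bbox": (40.0, 120.0, 130.0, 130.0)},
    {"text": "R2", "bbox": (310.0, 120.0, 400.0, 130.0)},
]
ordered = [s["text"] for s in reading_order(spans, column_edges=[297.5])]
```

This sketch compares raw y0 values, so spans on the same visual line with slightly different baselines may need their y-coordinates snapped to a tolerance before sorting.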

Integration Point

This extractor is the ingestion edge for EPO PDF data inside the broader Patent Office Portal Sync & Data Ingestion pipeline. When a page cannot be read from the vector layer, it hands off to the OCR path described in OCR for Legacy Patent Documents; when even OCR yields an incomplete record, the EPO Register Headless Browser Fallback recovers the missing bibliographic fields from the Register itself. Every raw payload should pass through disciplined Schema Validation & Error Categorization before a kind code is trusted enough to move a date.

The source_pdf_sha256 on each record is the integrity anchor: it makes a computed deadline reproducible from the exact bytes that produced it, and it lets the pipeline deduplicate on worker restart. Because the Bulletin is issued weekly, pair the fetch step with the retry discipline from Implementing Exponential Backoff for Patent APIs, and reconcile the extracted numbers against the canonical EPO Register Sync Architecture so the PDF is a corroborating source rather than the sole one.

Frequently Asked Questions

Where does the EPO publish the Bulletin and how often?

The European Patent Bulletin is published every Wednesday under Article 129 EPC, both on the EPO website and as a downloadable PDF. Fetch it on a weekly schedule, hash the bytes on retrieval, and store the immutable file before parsing so any computed deadline can be replayed from its exact source.
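A minimal sketch of the hash-then-store step, assuming a local content-addressed directory; store_immutable and the on-disk layout are hypothetical, and the fetch itself is out of scope here.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def store_immutable(pdf_bytes: bytes, root: Path) -> tuple[str, Path]:
    """Write the PDF under its SHA-256 digest and never overwrite it.

    Returns (digest, path). Storing the same bytes twice is a no-op, which
    also makes the step safe to repeat after a worker restart.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    target = root / f"{digest}.pdf"
    if not target.exists():
        root.mkdir(parents=True, exist_ok=True)
        tmp = root / f"{digest}.tmp"
        tmp.write_bytes(pdf_bytes)
        tmp.replace(target)  # rename so a partial write is never visible
    return digest, target
```

Because the file name is the digest, the value stored in source_pdf_sha256 is enough to locate the exact bytes a record was parsed from.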

Which kind code starts the opposition deadline?

The mention of grant is published with a B1 kind code, and that Bulletin publication starts the nine-month opposition period under Article 99(1) EPC. A B2 reflects a version amended after opposition and B3 a version amended after limitation; A-codes are application publications and do not start the opposition clock. Validate the code against WIPO ST.16 and fail closed on anything unrecognized.

How do I handle a page that has no extractable text?

Treat missing or sub-threshold vector text (and corrupt embedded fonts) as a signal to rasterize that specific page and run OCR, then reject any OCR output below a mean confidence floor of about 75% and route it to manual review. Branch per page rather than per document, since a single weekly issue frequently mixes vector and scanned pages.
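The confidence gate might look like the sketch below. It assumes per-word confidences on a 0 to 100 scale with -1 marking boxes that contain no recognized text, which is how Tesseract's word-level output is commonly exposed; mean_confidence and route_ocr_page are hypothetical names, and the OCR call itself is deliberately left out.

```python
from __future__ import annotations

from statistics import fmean

MIN_OCR_CONFIDENCE = 75.0  # same floor as the extractor's constant


def mean_confidence(word_confidences: list[float]) -> float:
    """Mean over scored words only; a page with no scored words scores 0."""
    scored = [c for c in word_confidences if c >= 0]
    return fmean(scored) if scored else 0.0


def route_ocr_page(word_confidences: list[float]) -> str:
    """Return "accept" or "manual_review" for one OCR'd page."""
    if mean_confidence(word_confidences) >= MIN_OCR_CONFIDENCE:
        return "accept"
    return "manual_review"
```

Treating an empty page as confidence 0 keeps the gate fail-closed: a page where OCR found nothing goes to review instead of producing an empty but "valid" record.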

What timezone should the publication date be stored in?

Parse the printed DD.MM.YYYY date and attach the EPO reference zone (Europe/Berlin, CET/CEST) using the standard-library zoneinfo module. Run all period arithmetic against the EPO calendar and apply the Rule 134 EPC extension when a period would expire on a day the office is not open, rather than a generic weekend rule.
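The drift is easy to reproduce, and the same few lines make a reasonable regression test. The sketch below assumes the system has IANA time zone data available to zoneinfo; parse_bulletin_date and epo_calendar_date are hypothetical helper names.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo

EPO_TZ = ZoneInfo("Europe/Berlin")


def parse_bulletin_date(printed: str) -> datetime:
    """Parse DD.MM.YYYY as midnight in the EPO reference zone."""
    return datetime.strptime(printed.strip(), "%d.%m.%Y").replace(tzinfo=EPO_TZ)


def epo_calendar_date(stamp: datetime) -> date:
    """Recover the EPO calendar date from any tz-aware timestamp."""
    return stamp.astimezone(EPO_TZ).date()


winter = parse_bulletin_date("27.03.2024")  # Wednesday, CET (UTC+1)
summer = parse_bulletin_date("03.04.2024")  # Wednesday, CEST (UTC+2)

# Reducing to UTC before taking the date lands on the previous day.
utc_day = winter.astimezone(timezone.utc).date()
```

Here utc_day is 26.03.2024, a Tuesday, while epo_calendar_date(winter) stays on 27.03.2024; the two sample dates sit on either side of the 2024 spring clock change.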