WIPO Sequence Listing Format Parsing Guide

A WIPO ST.26 sequence listing is a DTD-based XML file — with an ST26SequenceListing root element and no XML namespace — whose ApplicationIdentification and EarliestPriorityApplicationIdentification blocks carry the filing and priority dates a docketing engine must extract before it can anchor any statutory deadline.

Parsing this format correctly is a narrow but high-consequence task: get the priority date wrong and every downstream PCT and national-phase calculation inherits the error. This guide gives the exact structure ST.26 mandates, a single hardened parser that extracts the docketing anchors safely, and the specific ways naive XML code silently produces the wrong date. It is the sequence-listing-specific counterpart to the WIPO PATENTSCOPE Integration ingestion layer.

Technical Specification: What ST.26 Actually Is

WIPO Standard ST.26 defines the presentation of nucleotide and amino acid sequence listings in patent applications as a single XML instance. It became mandatory on 1 July 2022 for all international and most national/regional applications, replacing the plain-text ST.25 standard. The authoritative source is the WIPO Standard ST.26 handbook and its annexed DTD; sequence content follows the INSDSeq element model shared with the international nucleotide sequence databases.

Two structural facts drive everything below:

The file is DTD-based, not namespace-based. An ST.26 instance opens with an XML declaration and a DOCTYPE that names an external DTD (for example ST26SequenceListing_V1_3.dtd; the exact version string tracks the DTD release). Element names are unqualified PascalCase (ApplicationIdentification, FilingDate, SequenceData), and there is no default or prefixed XML namespace. Any lookup written with a namespace prefix or a {uri}-qualified tag will match nothing.
Dates are calendar dates, not timestamps. FilingDate and the priority FilingDate are YYYY-MM-DD values with no time and no zone. They denote a legal calendar day at the receiving office, not an instant on the UTC line.
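
The first fact is easy to confirm on a trimmed fragment. The sketch below is illustrative only: it uses the standard library's xml.etree.ElementTree to stay dependency-free rather than the lxml parser this guide recommends, and the namespace URI is invented for the demonstration (ST.26 defines none).

```python
import xml.etree.ElementTree as ET

# Trimmed ST.26-style fragment; element names are unqualified, as in the standard.
SAMPLE = (
    "<ST26SequenceListing>"
    "<ApplicationIdentification>"
    "<FilingDate>2026-01-10</FilingDate>"
    "</ApplicationIdentification>"
    "</ST26SequenceListing>"
)

root = ET.fromstring(SAMPLE)

# Plain path: matches, because the elements carry no namespace.
plain = root.find("ApplicationIdentification/FilingDate")

# Prefixed path: the URI is a made-up placeholder, so nothing matches.
ns = {"st": "http://example.invalid/st26"}
qualified = root.find("st:ApplicationIdentification/st:FilingDate", ns)

print(plain.text)  # 2026-01-10
print(qualified)   # None
```

The None from the qualified lookup is indistinguishable from a genuinely absent element, which is why this mistake surfaces as a "missing" filing date rather than as an error.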

The docketing-relevant subtree is small:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ST26SequenceListing SYSTEM "ST26SequenceListing_V1_3.dtd">
<ST26SequenceListing originalFreeTextLanguageCode="en" dtdVersion="V1_3"
                     softwareName="WIPO Sequence" softwareVersion="2.3.0"
                     productionDate="2026-01-15">
  <ApplicationIdentification>
    <IPOfficeCode>US</IPOfficeCode>
    <ApplicationNumberText>17/123456</ApplicationNumberText>
    <FilingDate>2026-01-10</FilingDate>
  </ApplicationIdentification>
  <EarliestPriorityApplicationIdentification>
    <IPOfficeCode>US</IPOfficeCode>
    <ApplicationNumberText>63/000000</ApplicationNumberText>
    <FilingDate>2025-01-10</FilingDate>
  </EarliestPriorityApplicationIdentification>
  <SequenceTotalQuantity>2</SequenceTotalQuantity>
  <SequenceData sequenceIDNumber="1"><INSDSeq><!-- ... --></INSDSeq></SequenceData>
  <SequenceData sequenceIDNumber="2"><INSDSeq><!-- ... --></INSDSeq></SequenceData>
</ST26SequenceListing>

Minimal Reproducible Implementation

Extract only the fields the deadline engine needs, harden the parser against the DOCTYPE line, and carry a SHA-256 of the source bytes so any docketed anchor is traceable back to the exact file. defusedxml.lxml is deprecated, so this uses lxml directly with an explicitly locked-down parser rather than a wrapper.

from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import date

from lxml import etree

# ST.26 files carry a DOCTYPE that references an external DTD. Harden the parser
# so it never fetches that DTD or expands entities (XXE guard), while still
# tolerating the DOCTYPE line itself. Spec: WIPO Standard ST.26.
# https://www.wipo.int/standards/en/st26/
_ST26_PARSER = etree.XMLParser(
    resolve_entities=False,  # block entity expansion (XXE)
    no_network=True,         # never fetch the external DTD over the network
    load_dtd=False,          # do not load or parse the referenced DTD
    dtd_validation=False,    # schema validation is a separate, explicit step
    huge_tree=False,         # cap tree size; SequenceData can be very large
)


@dataclass(slots=True)
class St26Anchor:
    """The minimal set of ST.26 fields a docketing engine needs."""

    ip_office_code: str | None = None
    application_number: str | None = None
    filing_date: date | None = None
    priority_office_code: str | None = None
    priority_filing_date: date | None = None
    declared_sequence_count: int | None = None
    actual_sequence_count: int = 0
    source_sha256: str = ""
    warnings: list[str] = field(default_factory=list)


def _text(root: etree._Element, path: str) -> str | None:
    # No namespace map: ST.26 elements are unqualified PascalCase names.
    el = root.find(path)
    return el.text.strip() if el is not None and el.text else None


def _iso_date(value: str | None) -> date | None:
    # ST.26 dates are calendar dates (YYYY-MM-DD) with NO time and NO zone.
    # Parse to a naive date; never convert to UTC (that can shift the day).
    return date.fromisoformat(value) if value else None


def parse_st26_anchor(xml_bytes: bytes) -> St26Anchor:
    """Extract the docketing anchors from a raw ST.26 XML instance."""
    anchor = St26Anchor(source_sha256=hashlib.sha256(xml_bytes).hexdigest())
    root = etree.fromstring(xml_bytes, parser=_ST26_PARSER)

    if root.tag != "ST26SequenceListing":
        anchor.warnings.append(
            f"UNEXPECTED_ROOT: got <{root.tag}>, expected <ST26SequenceListing>."
        )

    anchor.ip_office_code = _text(root, "ApplicationIdentification/IPOfficeCode")
    anchor.application_number = _text(root, "ApplicationIdentification/ApplicationNumberText")
    anchor.filing_date = _iso_date(_text(root, "ApplicationIdentification/FilingDate"))

    anchor.priority_office_code = _text(root, "EarliestPriorityApplicationIdentification/IPOfficeCode")
    anchor.priority_filing_date = _iso_date(_text(root, "EarliestPriorityApplicationIdentification/FilingDate"))

    declared = _text(root, "SequenceTotalQuantity")
    anchor.declared_sequence_count = int(declared) if declared and declared.isdigit() else None
    anchor.actual_sequence_count = len(root.findall("SequenceData"))

    # Integrity guard: the declared total must match the SequenceData elements.
    if anchor.declared_sequence_count != anchor.actual_sequence_count:
        anchor.warnings.append(
            f"COUNT_MISMATCH: SequenceTotalQuantity={anchor.declared_sequence_count} "
            f"but found {anchor.actual_sequence_count} <SequenceData> elements."
        )

    # The docketing anchor is the priority date if present, else the filing date.
    if anchor.priority_filing_date is None and anchor.filing_date is None:
        anchor.warnings.append(
            "NO_ANCHOR_DATE: neither priority nor filing date present; route to review."
        )

    return anchor

The function returns data, never deadlines: it extracts the filing and priority dates as written, and resolving the priority date into a PCT or national-phase deadline is the job of a downstream rule engine. Keeping extraction separate from calculation is what lets the same parser feed every jurisdiction.
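
One hardening step worth considering: _iso_date above delegates to date.fromisoformat, which raises ValueError on a malformed value and, from Python 3.11 onward, also accepts other ISO 8601 forms such as compact YYYYMMDD. The sketch below is one possible stricter variant, not part of the parser above; strict_iso_date and the BAD_DATE_FORMAT and BAD_DATE_VALUE codes are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import re
from datetime import date

# Only the literal YYYY-MM-DD shape is accepted (ASCII digits only).
_YMD = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def strict_iso_date(value: str | None) -> tuple[date | None, str | None]:
    """Return (date, None) on success, or (None, warning) for a bad value.

    A missing value yields (None, None) so the caller can decide whether
    absence is itself a problem.
    """
    if not value:
        return None, None
    if not _YMD.fullmatch(value):
        return None, f"BAD_DATE_FORMAT: {value!r} is not YYYY-MM-DD."
    try:
        return date.fromisoformat(value), None
    except ValueError:
        # Right shape, impossible day (for example 2026-02-30).
        return None, f"BAD_DATE_VALUE: {value!r} is not a real calendar date."
```

A caller could append the returned warning to St26Anchor.warnings instead of letting the exception abort the whole parse, so that a malformed date reaches review with the rest of the extracted fields intact.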

Known Gotchas & Compliance Traps

Four failure modes account for almost every wrong anchor pulled from an ST.26 file.

Assuming an XML namespace. ST.26 is DTD-based and its elements are unqualified. Code copied from a namespaced format, whether it uses prefixed paths backed by a namespaces= map passed to find() or {http://...}FilingDate-style qualified tags, matches nothing and returns None, which then reads as a missing anchor. Use plain element paths as shown; never invent a namespace URI for ST.26.
The DOCTYPE as an XXE and network vector. Because the file declares an external DTD, a default parser may try to fetch it or expand entities, exposing an XML External Entity attack surface on untrusted third-party filings and stalling on network calls. Lock the parser with resolve_entities=False, no_network=True, and load_dtd=False. This is the same discipline enforced when validating XML patent filings against XSD schemas.
Converting a date-only value to UTC. FilingDate is a calendar day with no time. Wrapping it as midnight and normalizing to UTC can roll 2026-01-10 back to 2026-01-09 for offices east of Greenwich, silently shifting the priority date by a day and every derived deadline with it. Keep ST.26 dates as naive date objects and let the office’s local statute decide day boundaries.
Declared vs. actual sequence mismatch, and the correction deadline it implies. A SequenceTotalQuantity that disagrees with the number of SequenceData elements signals a truncated or malformed listing. A defective or late sequence listing does not just fail parsing — it triggers an invitation to correct with its own response window: PCT Rule 13ter.1(a) at the search stage, USPTO 37 CFR 1.831–1.834, and Rule 30 EPC for European filings. That response deadline must itself be captured and docketed, not just logged.
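
The UTC trap in the third item can be reproduced in a few lines. This is a minimal sketch assuming an office at a fixed UTC+9 offset; the offset is chosen only for the demonstration and does not model any particular office's rules.

```python
from datetime import date, datetime, time, timedelta, timezone

filing_day = date(2026, 1, 10)

# Assumption for the demo: an office nine hours ahead of UTC.
office_tz = timezone(timedelta(hours=9))

# The mistake: treat the calendar day as local midnight, then normalize to UTC.
as_midnight = datetime.combine(filing_day, time.min, tzinfo=office_tz)
shifted_day = as_midnight.astimezone(timezone.utc).date()

print(filing_day)   # 2026-01-10
print(shifted_day)  # 2026-01-09
```

Local midnight at UTC+9 is 15:00 on the previous day in UTC, so taking the date after normalization loses a day. Keeping the value as a plain date object never creates the intermediate timestamp in the first place.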

Integration Point

This parser is one ingestion node, not a deadline calculator. It runs downstream of portal retrieval — the raw ST.26 instance arrives through the WIPO PATENTSCOPE Integration layer or a national-office feed — and its St26Anchor output flows into the rule engine that applies the PCT National Phase Entry Rules framework, where the extracted priority date becomes the 30/31-month clock’s origin. The source_sha256 value is written to the immutable audit trail so any docketed anchor can be replayed against the exact bytes that produced it, and who may override a parsed anchor is governed by the Security & Access Control Boundaries module. Any file that raises a COUNT_MISMATCH or NO_ANCHOR_DATE warning must be quarantined and routed to paralegal review before it reaches active docket state.
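
The hand-off described above might be sketched as follows. AnchorView is a hypothetical stand-in for the three St26Anchor fields involved, and anchor_date and route are illustrative names rather than part of any existing module; the only rules encoded are the ones stated in this guide (priority date if present, else filing date; quarantine on COUNT_MISMATCH or NO_ANCHOR_DATE).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date

# Warning codes that this guide says must block active docket state.
QUARANTINE_CODES = frozenset({"COUNT_MISMATCH", "NO_ANCHOR_DATE"})


@dataclass
class AnchorView:
    """Hypothetical stand-in for the St26Anchor fields used in routing."""

    filing_date: date | None = None
    priority_filing_date: date | None = None
    warnings: list[str] = field(default_factory=list)


def anchor_date(anchor: AnchorView) -> date | None:
    # Priority date if present, else the filing date.
    if anchor.priority_filing_date is not None:
        return anchor.priority_filing_date
    return anchor.filing_date


def route(anchor: AnchorView) -> str:
    # Warnings are "CODE: message" strings; compare on the code prefix only.
    codes = {w.split(":", 1)[0] for w in anchor.warnings}
    return "quarantine" if codes & QUARANTINE_CODES else "docket"


clean = AnchorView(date(2026, 1, 10), date(2025, 1, 10))
broken = AnchorView(warnings=["COUNT_MISMATCH: SequenceTotalQuantity=2 but found 1."])
print(anchor_date(clean), route(clean))  # 2025-01-10 docket
print(route(broken))                     # quarantine
```

How other warning codes such as UNEXPECTED_ROOT are routed is left to the caller here, since the guide only mandates quarantine for the two codes listed.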

Frequently Asked Questions

Does WIPO ST.26 XML use an XML namespace?

No. ST.26 instances are DTD-based with unqualified PascalCase element names and no default or prefixed namespace. Use plain element paths such as ApplicationIdentification/FilingDate; namespace-qualified paths will match nothing and produce false "missing anchor" errors.

Since when is ST.26 mandatory, and can I still file an ST.25 listing?

ST.26 has been mandatory since 1 July 2022 for international applications and most national/regional filings with a filing date on or after that date. Plain-text ST.25 listings are no longer acceptable for those applications; a legacy ST.25 file submitted for a post-transition application will draw an invitation to refile in ST.26 format.

Should I convert the ST.26 FilingDate to UTC before docketing?

No. FilingDate and the priority FilingDate are calendar dates with no time component. Treating them as midnight and converting to UTC can shift the day for eastern offices and mis-anchor the priority date. Keep them as naive date values and let the jurisdiction's statute define the day boundary.

What deadline does a defective or missing sequence listing create?

The office issues an invitation to correct with a response period — PCT Rule 13ter.1(a), USPTO 37 CFR 1.831–1.834, or Rule 30 EPC depending on the venue. That response deadline is a docketable event in its own right. A COUNT_MISMATCH between SequenceTotalQuantity and the number of SequenceData elements is an early signal that such an invitation is likely.

For authoritative references, consult the WIPO Standard ST.26 handbook and DTD, the WIPO Sequence software and validator, and USPTO MPEP § 2412 on ST.26 sequence-listing requirements; Python implementations should rely on the lxml.etree hardened parser and the standard-library datetime.date type. This guide sits under the parent WIPO PATENTSCOPE Integration framework within the broader Core Docketing Architecture & Deadline Types schema; practitioners parsing filings will also want the sibling guide to validating XML patent filings against XSD schemas.