Methods & coverage

How HealthArchive.ca is being built

This page outlines how HealthArchive.ca captures, preserves, indexes, and replays snapshots of public health web content. The project is in development and coverage is still expanding, but the core archive pipeline is already in place.

Scope of the archive (early phase)

The initial focus is on federal Canadian public health sites whose content directly underpins clinical guidance, surveillance, or high-impact public communication. Examples include:

  • Public Health Agency of Canada (e.g., disease pages, surveillance reports, immunization guidance).
  • Health Canada (e.g., vaccine and drug pages, environmental and product safety information).

Future iterations may consider provincial/territorial sources or selected international comparators where appropriate, but the backbone will remain Canadian public health information.

Scope is intentionally narrow. Each source has explicit inclusion and exclusion rules, so the project can prioritize reliable provenance over breadth.
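
As an illustration of what per-source inclusion and exclusion rules could look like, here is a minimal Python sketch. The `SourceScope` class, its field names, and the example host and paths are hypothetical and are not the project's actual configuration; the sketch assumes rules are simple path prefixes and that exclusions take priority over inclusions.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SourceScope:
    """Hypothetical per-source scope: one host plus path-prefix rules."""
    host: str
    include_prefixes: tuple[str, ...]
    exclude_prefixes: tuple[str, ...] = ()

    def in_scope(self, url: str) -> bool:
        parts = urlsplit(url)
        if (parts.hostname or "").lower() != self.host:
            return False
        path = parts.path or "/"
        # Assumption: exclusions win over inclusions, which keeps rules easy to audit.
        if any(path.startswith(p) for p in self.exclude_prefixes):
            return False
        return any(path.startswith(p) for p in self.include_prefixes)


# Illustrative values only; not the project's real rules.
EXAMPLE_SCOPE = SourceScope(
    host="www.example.org",
    include_prefixes=("/en/public-health/",),
    exclude_prefixes=("/en/public-health/search",),
)
```

With these example rules, a disease page under `/en/public-health/` is in scope, while the search endpoint and any other host are not.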

The default capture cadence is an annual “edition” captured on January 1 (UTC) for each source, with occasional ad-hoc captures when major events or operational needs justify them. Ad-hoc captures are explicitly labeled so readers can distinguish them from the annual edition.
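
The distinction between annual and ad-hoc captures can be sketched as a small labeling function. This is a simplification under stated assumptions: the function name and label strings are hypothetical, and the rule here looks only at the UTC calendar date, whereas a real pipeline would more likely record the label explicitly on each crawl job.

```python
from datetime import datetime, timedelta, timezone


def capture_label(captured_at: datetime) -> str:
    """Return 'annual-YYYY' for captures dated Jan 1 (UTC), otherwise 'ad-hoc'."""
    if captured_at.tzinfo is None:
        raise ValueError("capture time must be timezone-aware")
    utc = captured_at.astimezone(timezone.utc)
    if utc.month == 1 and utc.day == 1:
        return f"annual-{utc.year}"
    return "ad-hoc"


# Example: 8 p.m. on Dec 31 in UTC-5 is 01:00 on Jan 1 in UTC.
late_local = datetime(2024, 12, 31, 20, 0, tzinfo=timezone(timedelta(hours=-5)))
```

Normalizing to UTC before checking the date means the label does not depend on where the crawler happens to run.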

Capture methods

HealthArchive.ca uses browser-based crawling and the standard WARC (Web ARChive) file format. At a high level, each capture works as follows:

  1. Seed URLs are defined for each target domain and path, including rules about what to include or exclude.
  2. A browser-based crawler visits in-scope pages, executes JavaScript where needed, and records the responses it receives.
  3. Responses are written to WARC files alongside metadata such as capture time, HTTP status, and content type.
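
To make the metadata in step 3 concrete, the sketch below pulls the capture time, HTTP status, and content type out of a single uncompressed WARC response record using only the Python standard library. The sample record is hand-written for illustration, and the function names are hypothetical. It ignores `Content-Length`, compression, and multi-record files, so it is not a substitute for a real WARC library.

```python
from datetime import datetime

# Hand-written sample record for illustration; real WARCs come from the crawler.
HTTP_BLOCK = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
SAMPLE_RECORD = (
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: https://www.example.org/en/page.html\r\n"
    b"WARC-Date: 2025-01-01T00:00:05Z\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(HTTP_BLOCK)).encode() + b"\r\n"
    b"\r\n" + HTTP_BLOCK + b"\r\n\r\n"
)


def parse_headers(block: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split a block into (first line, lowercased header dict, remainder)."""
    head, _, rest = block.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, rest


def capture_metadata(record: bytes) -> dict:
    """Extract URL, capture time, HTTP status, and content type from one record."""
    _, warc_headers, payload = parse_headers(record)
    status_line, http_headers, _ = parse_headers(payload)
    return {
        "url": warc_headers["warc-target-uri"],
        # WARC-Date is UTC; swap the trailing "Z" so older Pythons can parse it.
        "captured_at": datetime.fromisoformat(
            warc_headers["warc-date"].replace("Z", "+00:00")
        ),
        "status": int(status_line.split()[1]),
        "content_type": http_headers.get("content-type", ""),
    }


meta = capture_metadata(SAMPLE_RECORD)
```

Note that a response record carries two layers of headers: the WARC headers describe the capture, and the HTTP headers inside the payload describe the original response.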

Captures are stored in WARCs and indexed into a searchable database. The public site replays archived HTML via the backend and, when available, can also offer higher-fidelity browsing through a replay service. Replay fidelity varies by site and content type.

Date range filters in the archive explorer use UTC capture dates.
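
The practical effect of filtering on UTC capture dates can be sketched as follows. The function name is hypothetical, and treating both ends of the range as inclusive is an assumption rather than documented behaviour.

```python
from datetime import date, datetime, timedelta, timezone


def in_date_range(captured_at: datetime, start: date, end: date) -> bool:
    """True if the capture's UTC calendar date falls within [start, end]."""
    if captured_at.tzinfo is None:
        raise ValueError("capture time must be timezone-aware")
    utc_date = captured_at.astimezone(timezone.utc).date()
    return start <= utc_date <= end


# 9 p.m. on Dec 31 in UTC-5 is already Jan 1 in UTC.
local_evening = datetime(2024, 12, 31, 21, 0, tzinfo=timezone(timedelta(hours=-5)))
```

In this example, a capture made on the evening of December 31 local time matches a January filter and not a December one, because the comparison uses the UTC date.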

Storage & replay

The archive relies on dedicated storage for WARC files. When replay is enabled, a replay engine such as pywb can render higher-fidelity historical snapshots in a browser. The goals for replay are:

  • Preserve the original URL structure where possible.
  • Clearly label capture timestamps and make it obvious that the view is archival, not live.
  • Maintain interactive elements (e.g., dashboards) as faithfully as technical constraints allow.
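
The first two goals above are reflected in the replay URL convention that pywb and other Wayback-style tools use, where the collection name, a 14-digit UTC timestamp, and the original URL all appear in the path. The sketch below builds such a URL. The base address and collection name are placeholders, and whether HealthArchive.ca exposes replay URLs in exactly this form is an assumption.

```python
from datetime import datetime, timezone


def replay_url(base: str, collection: str, captured_at: datetime, original_url: str) -> str:
    """Build a Wayback-style replay URL: <base>/<collection>/<YYYYMMDDhhmmss>/<url>."""
    if captured_at.tzinfo is None:
        raise ValueError("capture time must be timezone-aware")
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # The original URL is kept intact so readers can see exactly what was captured.
    return f"{base.rstrip('/')}/{collection}/{timestamp}/{original_url}"


example = replay_url(
    "https://replay.example.org/",  # placeholder host
    "demo",                         # placeholder collection name
    datetime(2025, 1, 1, 0, 0, 5, tzinfo=timezone.utc),
    "https://www.example.org/en/page.html",
)
```

Because the timestamp and original URL are both visible in the address, a replay link doubles as a human-readable citation of what was captured and when.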

The interface is intentionally conservative: its priority is making clear that you are viewing archived content. Some interactive dashboards, embedded visualizations, or third-party assets may not replay perfectly because of JavaScript, API, or hosting constraints.

Change tracking

HealthArchive.ca compares archived captures to highlight text changes between editions. This is designed for auditability and citation, not interpretation.

  • Changes are computed from archived HTML captures.
  • Comparisons are descriptive only (for example: sections added, removed, or updated) and do not provide guidance.
  • Change feeds are edition-aware by default, reflecting the archive’s annual capture cadence.
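
A descriptive comparison of this kind can be sketched with Python's standard `html.parser` and `difflib` modules. This is an illustrative approach, not the project's actual change-tracking code: the class and function names are hypothetical, and splitting text on a fixed set of block-level tags is an assumption.

```python
import difflib
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect visible text, one entry per block-level element."""
    BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "td", "th"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            self.blocks.append(text)
        self._buf = []


def text_blocks(html: str) -> list[str]:
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def describe_changes(old_html: str, new_html: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs where kind is 'added', 'removed', or 'updated'."""
    old, new = text_blocks(old_html), text_blocks(new_html)
    changes = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            changes += [("removed", t) for t in old[i1:i2]]
        elif op == "insert":
            changes += [("added", t) for t in new[j1:j2]]
        elif op == "replace":
            n = min(i2 - i1, j2 - j1)
            changes += [("updated", t) for t in new[j1:j1 + n]]
            changes += [("removed", t) for t in old[i1 + n:i2]]
            changes += [("added", t) for t in new[j1 + n:j2]]
    return changes


OLD = "<h2>Measles</h2><p>Two doses are recommended.</p><p>Old note.</p>"
NEW = ("<h2>Measles</h2><p>Two doses are recommended for children.</p>"
       "<p>Old note.</p><p>New section.</p>")
result = describe_changes(OLD, NEW)
```

The output only labels what changed between two captures; it makes no attempt to judge whether a change is significant, which matches the descriptive intent stated above.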

Limitations and interpretation

  • Not official guidance: Archived content reflects what public sites showed at the time of capture. It may be incomplete, outdated, or superseded, and should not be treated as current clinical guidance or medical advice.
  • Sampling and coverage: Early phases focus on specific high-value domains and paths. Coverage gaps and known limitations will be documented so that “absence” of an archived page is not misinterpreted.
  • Technical artefacts: Dashboards, embedded visualizations, and third-party assets that depend on JavaScript, APIs, or external hosting may replay incompletely or not at all.