Methods & coverage
How HealthArchive.ca is being built
This page outlines the emerging approach for capturing, preserving, and replaying public health web content. It reflects an early-stage design and will be expanded as the infrastructure matures.
Scope of the archive (early phase)
The initial focus is on Canadian federal public health sites whose content directly underpins clinical guidance, surveillance, or high-impact public communication. Examples include:
- Public Health Agency of Canada (e.g., disease pages, surveillance reports, immunization guidance).
- Health Canada (e.g., vaccine and drug pages, environmental and product safety information).
Future iterations may consider provincial/territorial sources or selected international comparators where appropriate, but the backbone will remain Canadian public health information.
Capture methods
The live system is intended to use browser-based crawlers and standards-based web archive formats. Conceptually, each capture works as follows:
- Seed URLs are defined for each target domain and path, including rules about what to include or exclude.
- A browser-based crawler (for example, one that writes WARC output) visits each page, executes its JavaScript, and records the resulting HTTP responses.
- Responses are stored in archive files alongside metadata such as capture time, HTTP status, and content type.
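The scoping and metadata steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual configuration: the SeedRule and CaptureRecord names, the prefix-matching approach, and the health.example domain and paths are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass
class SeedRule:
    """Hypothetical scope rule for one target domain (not a real crawler config)."""
    domain: str
    include_prefixes: list = field(default_factory=list)
    exclude_prefixes: list = field(default_factory=list)

    def in_scope(self, url: str) -> bool:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Accept the domain itself and any of its subdomains.
        if host != self.domain and not host.endswith("." + self.domain):
            return False
        path = parts.path or "/"
        # Exclusions take priority over inclusions.
        if any(path.startswith(p) for p in self.exclude_prefixes):
            return False
        # An empty include list is treated as "the whole domain".
        if not self.include_prefixes:
            return True
        return any(path.startswith(p) for p in self.include_prefixes)


@dataclass(frozen=True)
class CaptureRecord:
    """Metadata kept alongside each stored response (illustrative fields only)."""
    url: str
    captured_at: datetime
    http_status: int
    content_type: str


# Placeholder domain and paths, chosen only for illustration.
rule = SeedRule(
    domain="health.example",
    include_prefixes=["/en/immunization/"],
    exclude_prefixes=["/en/immunization/search"],
)

record = CaptureRecord(
    url="https://www.health.example/en/immunization/measles.html",
    captured_at=datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
    http_status=200,
    content_type="text/html",
)
```

Giving exclusions priority over inclusions is one possible design choice; it keeps a broad include path from accidentally pulling in search results or similar unbounded pages.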
For the demo, this pipeline is simulated with a small hand-curated dataset and static HTML snapshots served from the public/demo-archive directory.
Storage & replay
In a full deployment, the archive would rely on dedicated storage for WARC files and a replay engine such as pywb to render historical snapshots in a browser. The goals for replay are:
- Preserve the original URL structure where possible.
- Clearly label capture timestamps and make it obvious that the view is archival, not live.
- Maintain interactive elements (e.g., dashboards) as faithfully as technical constraints allow.
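The first two goals can be illustrated with a small sketch. It follows the collection, timestamp, and original-URL pattern used by Wayback-style replay tools such as pywb, but the function names, the "demo" collection name, the banner wording, and the placeholder URL are assumptions; the routing of the eventual deployment may differ.

```python
from datetime import datetime, timezone


def _require_aware(captured_at: datetime) -> datetime:
    """Reject naive datetimes so timestamps are never silently read as local time."""
    if captured_at.tzinfo is None:
        raise ValueError("captured_at must be timezone-aware")
    return captured_at.astimezone(timezone.utc)


def replay_path(collection: str, captured_at: datetime, original_url: str) -> str:
    """Build a replay path that keeps the original URL intact after a 14-digit UTC timestamp."""
    ts = _require_aware(captured_at).strftime("%Y%m%d%H%M%S")
    return f"/{collection}/{ts}/{original_url}"


def banner_text(captured_at: datetime) -> str:
    """Human-readable label making clear that the view is archival, not live."""
    stamp = _require_aware(captured_at).strftime("%Y-%m-%d %H:%M UTC")
    return f"Archived snapshot captured {stamp}. This is not the live page."


captured = datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc)
print(replay_path("demo", captured, "https://www.health.example/en/page.html"))
print(banner_text(captured))
```

Embedding the original URL unchanged in the path means a reader can always see which live address a snapshot corresponds to, while the timestamp segment identifies which capture is being shown.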
The demo interface is intentionally conservative: it shows that snapshot-based replay is possible while acknowledging that the underlying infrastructure is still being built.
Limitations and interpretation
- Not official guidance: Archived content reflects what public sites showed at the time of capture. It may be incomplete, outdated, or superseded, and should not be treated as current clinical guidance.
- Sampling and coverage: Early phases focus on specific high-value domains and paths. Coverage gaps and known limitations will be documented so that the absence of an archived page is not misread as evidence about the live site.
- Technical artefacts: Some interactive dashboards, embedded visualizations, or third-party assets may not replay perfectly because of JavaScript, API, or hosting constraints.
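One way to keep a missing snapshot from being misinterpreted is to report why nothing was found. The sketch below is a hypothetical illustration of that idea, assuming scope is expressed as URL prefixes and captures as a set of URLs; the names and the placeholder URLs are not part of the current system.

```python
from enum import Enum


class Coverage(Enum):
    CAPTURED = "captured"
    IN_SCOPE_NOT_CAPTURED = "in scope, not captured"
    OUT_OF_SCOPE = "out of scope"


def coverage_status(url: str, scope_prefixes: list, captured_urls: set) -> Coverage:
    """Classify a URL so that a missing snapshot comes with an explanation."""
    if url in captured_urls:
        return Coverage.CAPTURED
    if any(url.startswith(prefix) for prefix in scope_prefixes):
        # The archive targets this path but holds no snapshot: a known coverage gap.
        return Coverage.IN_SCOPE_NOT_CAPTURED
    # The archive never attempted this URL, so its absence says nothing about the page.
    return Coverage.OUT_OF_SCOPE


# Placeholder values, for illustration only.
scope = ["https://www.health.example/en/immunization/"]
captured = {"https://www.health.example/en/immunization/measles.html"}

for candidate in (
    "https://www.health.example/en/immunization/measles.html",
    "https://www.health.example/en/immunization/polio.html",
    "https://www.health.example/en/recalls/index.html",
):
    print(candidate, "->", coverage_status(candidate, scope, captured).value)
```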