# Website Archiver
Archive entire websites as static snapshots before shutdown.

- Browse archived sites: https://archive.helsingborg.io/archive/
- Live archive site: https://archive.helsingborg.io/
## How It Works
- Provide the website's domain manually.
- The GitHub Action fetches the sitemap (`/sitemap.xml`).
- Every page listed in the sitemap is downloaded using `wget` (see the sketch after this list):
  - All HTML, images, CSS, JS, and assets are saved.
  - Links are converted to relative URLs.
  - External media (e.g. CDN images) are included if listed in `EXTRA_DOMAINS`.
- The archive is stored as `/domain/YYYY-MM-DD/`.
- The workflow commits the archived files to the repository.
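
The download step boils down to a `wget` mirror driven by the sitemap URLs. The invocation below is a minimal sketch of that step, not the exact command in the workflow; `example.com`, `media.example.com`, and `urls.txt` are placeholder names:

```bash
# urls.txt holds the page URLs extracted from /sitemap.xml, one per line.
# --page-requisites also fetches the CSS, JS, and images each page needs.
# --convert-links rewrites links so the snapshot works with relative URLs.
# --span-hosts/--domains allow assets from the site plus EXTRA_DOMAINS hosts.
wget \
  --input-file=urls.txt \
  --page-requisites \
  --convert-links \
  --adjust-extension \
  --span-hosts \
  --domains="example.com,media.example.com" \
  --directory-prefix="example.com/$(date +%F)" \
  --no-verbose
```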
## Requirements
- GitHub repository for the site archive.
- The site must have a valid `sitemap.xml` (a quick check is shown below).
- (Optional) Enable GitHub Pages to serve the archive.
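
To confirm the sitemap exists and lists pages before running the workflow, a quick check with `curl` and `grep` is enough; `example.com` is a placeholder:

```bash
# Fail loudly if the sitemap is missing or unreachable.
curl -sSf "https://example.com/sitemap.xml" -o sitemap.xml

# Count how many page URLs (<loc> entries) it contains.
grep -o "<loc>" sitemap.xml | wc -l
```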
## Usage
- Go to Actions → Archive Website (or dispatch the workflow from the command line, as sketched below).
- Enter the URL to archive.
- Wait for the workflow to finish.
- Find the snapshot under `/domain/YYYY-MM-DD/`.
- View the result in the archive browser.
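
The workflow can also be dispatched with the GitHub CLI. The workflow file name (`archive.yml`) and input name (`site_url`) below are assumptions; check `.github/workflows/` for the actual names:

```bash
# Trigger a manual run of the archive workflow with the site URL as input.
gh workflow run archive.yml -f site_url="https://example.com"

# Follow the latest run of that workflow until it completes.
gh run watch "$(gh run list --workflow=archive.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
```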
## Technical Details
- Uses `wget` to mirror sites.
- URLs are rewritten as relative links (`--convert-links`).
- External assets from domains defined in `EXTRA_DOMAINS` are included.
- Only URLs listed in `sitemap.xml` are processed (see the extraction sketch below).
- Commits results to `/domain/YYYY-MM-DD/`.
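
Extracting the page list from the sitemap needs only standard shell tools. This is a minimal sketch that assumes a flat `sitemap.xml` rather than a sitemap index:

```bash
# Pull every <loc>...</loc> value out of the sitemap, one URL per line.
curl -s "https://example.com/sitemap.xml" \
  | grep -o "<loc>[^<]*</loc>" \
  | sed -e 's|<loc>||' -e 's|</loc>||' > urls.txt

wc -l urls.txt   # number of pages that would be archived
```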
## Typical Use Case
Ideal for municipal or organizational website decommissioning.
Run once to permanently preserve a static version for archival or legal purposes.
## Limitations
- Only pages in the sitemap are archived.
- Dynamic content (forms, search, JS-rendered pages) is not captured.
- Sites requiring authentication are not supported.
## Run Locally
```bash
SITE_URL="https://example.com" \
EXTRA_DOMAINS=("media.example.com" "cdn.example.com") \
MAX_DEPTH=1 \
bash download.sh
```
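
To preview a snapshot produced by the script, any static file server works; the path below assumes the default `domain/YYYY-MM-DD/` layout with `example.com` as the domain:

```bash
# Serve the freshly created snapshot at http://localhost:8080
cd "example.com/$(date +%F)" && python3 -m http.server 8080
```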
## License
MIT © Helsingborg Stad