Broken Links Checker (Documentation)

📖 Documentation page

This page documents a workflow, system, feature, tool, or editorial practice used by The Sunil Abraham Project (TSAP). It describes how the project operates and is not itself a primary content article.

The Broken Links Checker is a maintenance and quality-assurance tool used by The Sunil Abraham Project (TSAP) to identify internal links that point to pages which no longer exist. As the project expanded beyond a thousand pages and continued to grow, manually checking links became increasingly impractical. Pages are regularly created, expanded, reorganised, renamed, merged, and occasionally removed. Even careful editorial work can leave behind outdated references, particularly when a page has been moved to a different location or when a URL structure has changed over time.

The Broken Links Checker automates this process by comparing links found within the repository against the URLs currently published on the live website. Rather than attempting to reconstruct Jekyll’s URL-generation logic from source files, the checker uses the site’s live sitemap as the authoritative source of truth. This design decision dramatically reduces false positives and ensures that the checker validates links against what visitors can actually access rather than what the repository appears to contain.

The implementation was developed with support from ChatGPT. All design decisions, testing, debugging, editorial judgement, and final implementation choices were made by the project maintainer.

Background

The need for a broken-link detection system emerged naturally as TSAP expanded. The project contains articles, publications, media mentions, events, biographies, documentation, research materials, and a growing collection of specialised project pages. Internal linking plays an important role throughout the site because it helps readers navigate between related content and improves discoverability.

As the repository grew, several types of link problems began to appear. Some pages were renamed while other pages still linked to their older URLs. Some links were created before the final permalink structure had been decided. Others referred to planned pages that were never created. Occasionally, typographical errors or incorrect capitalisation produced links that looked correct but failed when accessed.

Examples of common issues include:

Links pointing to pages that have been deleted.
Links pointing to pages that have been renamed.
Legacy URLs retained after restructures.
Draft URLs that were never implemented.
Incorrect capitalisation in paths.
Typographical errors in links.
References to pages that were planned but never created.

A maintenance system capable of identifying these problems became increasingly desirable as the repository continued to grow.

Why Earlier Approaches Failed

The first implementation attempted to determine valid URLs by analysing repository files directly. The script scanned Markdown files, extracted explicit permalinks, inferred URLs from directory structures, and attempted to reproduce Jekyll’s routing behaviour.

Although this approach appeared reasonable in theory, it quickly revealed practical limitations. Jekyll generates URLs using a combination of directory structure, index pages, front matter, and build-time logic. Replicating those rules outside Jekyll proved more difficult than expected.

The first implementation reported more than one thousand alleged broken links:

Found 1027 broken links

Manual inspection immediately revealed that many reported URLs were valid and actively published. Additional refinements reduced the number of reported issues:

Found 462 broken links

While this represented a significant improvement, investigation still uncovered many false positives. Valid pages were being reported because the script’s understanding of URL generation did not perfectly match the behaviour of the live site.

This experience produced an important lesson. The problem could not be solved simply by improving the script, because the underlying architecture was flawed. The checker was attempting to predict which URLs should exist rather than verifying which URLs actually existed.

The solution was therefore not additional complexity but a different source of truth.

Architecture

The final implementation is based on a straightforward principle: validate against the live website rather than attempting to recreate Jekyll’s internal behaviour.

The checker operates in four stages.

First, the script downloads the live sitemap:

https://sunilabraham.in/sitemap.xml

The sitemap is generated by the website itself and represents the URLs that are currently published and accessible. During initial testing on 2 June 2026, the sitemap contained more than two thousand URLs:

Loaded 2160 URLs from sitemap
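The first stage can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual script: the function names `parse_sitemap` and `download_sitemap` are hypothetical, and the sketch assumes the sitemap is a single standard `urlset` document rather than a sitemap index.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_URL = "https://sunilabraham.in/sitemap.xml"
# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the set of URL paths listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    paths = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:
            # Keep only the path so it can be compared with site-relative links.
            paths.add(urlsplit(loc.text.strip()).path)
    return paths


def download_sitemap(url=SITEMAP_URL, timeout=30):
    """Fetch the sitemap over HTTP and return its parsed URL paths."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_sitemap(response.read())
```

Reducing each sitemap entry to its path lets the absolute URLs in the sitemap be compared with the site-relative links found in the repository. Keeping the parsing separate from the download also allows the parser to be exercised without network access.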

Second, the script scans Markdown files throughout the repository and extracts internal links. Only site-relative URLs are considered. External websites are ignored because they are outside the scope of the maintenance report.
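A simplified sketch of this extraction step is shown below. It is an assumption about how such a scan could work, not a copy of the project's script: the names `extract_internal_links` and `scan_repository` are hypothetical, and the regular expression covers only inline Markdown links of the form `[text](/path/)`. Image links match the same pattern, while reference-style links, raw HTML `href` attributes, and Liquid-generated URLs are not handled.

```python
import re
from pathlib import Path

# Matches inline Markdown links and captures the target: [text](target)
INLINE_LINK = re.compile(r"\[[^\]]*\]\(\s*([^)\s]+)")


def extract_internal_links(markdown_text):
    """Return site-relative link targets found in a Markdown string."""
    targets = []
    for match in INLINE_LINK.finditer(markdown_text):
        target = match.group(1)
        # Site-relative links start with a single slash; "//host" is protocol-relative.
        if target.startswith("/") and not target.startswith("//"):
            targets.append(target)
    return targets


def scan_repository(root):
    """Yield (relative file path, link target) pairs for Markdown files under root."""
    root = Path(root)
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for target in extract_internal_links(text):
            yield path.relative_to(root).as_posix(), target
```

Recording the source file alongside each target is what later allows the report to point editors to the file that needs editing. A real scan would probably also need to skip build output and vendored directories.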

Third, extracted URLs are compared against the URL set obtained from the sitemap. The checker normalises common variations such as trailing slashes in order to avoid reporting the same page under multiple forms.

For example:

/example-page
/example-page/

are treated as equivalent.
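One way to express this equivalence is a small normalisation function. The sketch below is illustrative only; the name `normalise` is hypothetical, and it handles trailing slashes alone, leaving any other variations the real checker may normalise out of scope.

```python
def normalise(url):
    """Reduce a site-relative URL to a canonical form for comparison."""
    # Treat "/example-page" and "/example-page/" as the same page.
    path = url.rstrip("/")
    # The site root would otherwise collapse to an empty string.
    return path or "/"
```

Applying the same function to both the sitemap paths and the extracted links means the two forms always compare equal, whichever one the author happened to write.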

Finally, any URL that cannot be found within the sitemap is recorded as a potential broken link and written to a structured YAML file.

This architecture proved substantially more reliable than the earlier repository-based approach because it validates against the published site rather than inferred behaviour.
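The comparison stage reduces to a set-membership test. The following sketch assumes the links arrive as (source file, target) pairs and the sitemap as a collection of URL paths; the function name `find_broken_links` and the duplicate handling are illustrative assumptions rather than a description of the actual script.

```python
def find_broken_links(links, published_paths):
    """Return report entries for links whose target is absent from the sitemap.

    links: iterable of (source_file, target) pairs.
    published_paths: iterable of URL paths taken from the sitemap.
    """
    def key(url):
        # Ignore trailing slashes so both forms of a URL compare equal.
        return url.rstrip("/") or "/"

    published = {key(path) for path in published_paths}
    broken = []
    seen = set()
    for source_file, target in links:
        # Skip valid targets and repeats of an already reported pair.
        if key(target) in published or (source_file, target) in seen:
            continue
        seen.add((source_file, target))
        broken.append({"file": source_file, "target": target})
    return broken
```

Because the published paths are held in a set, each lookup takes roughly constant time, so a sitemap of a few thousand URLs adds negligible cost to the run.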

Script Location and Generated Files

The Broken Links Checker consists of a single Python script and a generated YAML data file.

The script is located at:

scripts/check_broken_links.py

The generated report is written to:

_data/broken_links.yml

A typical report looks like:

broken_links:
  - file: articles/example.md
    target: /missing-page/

  - file: media/example.md
    target: /old-url/

Each entry records the source file containing the problematic link and the target URL that could not be found in the sitemap.

The report is intentionally simple, human-readable, and easy to inspect manually.
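A report in this shape could be produced along the following lines. This is a hedged sketch: `build_report` and `write_report` are hypothetical names, sorting the entries is an assumption made to keep successive reports easy to compare, and the exact layout PyYAML emits may differ slightly from the sample above.

```python
def build_report(broken):
    """Wrap broken-link entries in the top-level mapping used by the report."""
    # Sorting is an assumption; it keeps the output stable between runs.
    entries = sorted(broken, key=lambda e: (e["file"], e["target"]))
    return {
        "broken_links": [
            {"file": e["file"], "target": e["target"]} for e in entries
        ]
    }


def write_report(broken, path="_data/broken_links.yml"):
    """Serialise the report to YAML using PyYAML."""
    import yaml  # PyYAML; imported here so build_report works without it

    with open(path, "w", encoding="utf-8") as handle:
        yaml.safe_dump(
            build_report(broken),
            handle,
            sort_keys=False,           # keep "file" before "target"
            default_flow_style=False,  # one key per line, block style
            allow_unicode=True,
        )
```

Using `safe_dump` restricts the output to plain YAML types, which suits a data file that Jekyll will later read.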

Installation Requirements

The checker was designed to minimise dependencies and remain compatible with standard Ubuntu installations.

The script requires:

Python 3
PyYAML

On Ubuntu these can be installed using:

sudo apt install python3-yaml

The implementation intentionally avoids databases, Jekyll plugins, GitHub Actions, external APIs, and additional infrastructure. The goal is to keep the workflow transparent, portable, and easy to maintain.

Running the Checker

Navigate to the root of the repository:

cd ~/Projects/sunilabraham

Run the checker:

python3 scripts/check_broken_links.py

Typical output appears as follows:

Downloading sitemap...
Loaded 2160 URLs from sitemap
Found 8 broken links
Generated _data/broken_links.yml

Once complete, the generated report may be inspected directly:

head -100 _data/broken_links.yml

or opened in an editor.

Because the checker uses the live sitemap, it should generally be run only after the site has been deployed and the sitemap reflects the current state of the website.

Maintenance Dashboard Integration

The generated YAML file integrates directly with the Maintenance dashboard.

The dashboard reads:

site.data.broken_links

and displays each reported issue together with the source file responsible for the link.

This allows editors to immediately determine:

Which file requires editing.
Which URL appears to be invalid.

A typical dashboard entry might appear as:

articles/sunil-abraham-project.md
Broken target:
/versions/1.0/

This approach makes the maintenance report actionable because editors can move directly from the report to the affected source file.
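The same grouping can be previewed outside the dashboard. The sketch below is not the dashboard's template code; it is a hypothetical Python helper, `group_by_file`, that takes the already parsed report and arranges the targets by source file. It assumes the report shape shown earlier, in which the entries sit under a top-level `broken_links` key.

```python
from collections import defaultdict


def group_by_file(report):
    """Map each source file to the sorted list of its unresolved targets."""
    grouped = defaultdict(list)
    # An empty report may parse as a missing key or as None.
    for entry in report.get("broken_links") or []:
        grouped[entry["file"]].append(entry["target"])
    return {name: sorted(targets) for name, targets in sorted(grouped.items())}
```

Grouping by file matches how an editor works through the report: open one file, fix every flagged link in it, and move on to the next.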

Development History

Development began on 2 June 2026. The original objective was straightforward: identify internal links that no longer pointed to valid pages.

The first implementation relied entirely on repository analysis. It attempted to construct a list of valid URLs by examining Markdown files and front matter. While the concept seemed attractive because it avoided any dependency on the live site, testing quickly revealed significant limitations.

The initial report identified more than one thousand alleged broken links. Investigation showed that many reported URLs were valid. Several rounds of refinement reduced the count significantly, but false positives remained common.

A turning point occurred when individual URLs reported as broken were manually checked against the live website. Pages that clearly existed continued to appear in the report. This demonstrated that the underlying strategy was flawed.

Attention then shifted toward the sitemap. Instead of treating repository structure as authoritative, the checker was redesigned to treat the published website as authoritative. Once this change was implemented, the number of reported issues dropped dramatically.

The first sitemap-based run produced:

Loaded 2160 URLs from sitemap
Found 8 broken links

Manual verification confirmed that multiple reported URLs genuinely returned 404 errors. The architecture was therefore adopted as the permanent solution.

Maintenance Workflow

The Broken Links Checker is intended as a manual maintenance utility rather than a continuously running service. A recommended workflow is:

Create, edit, rename, or reorganise content.
Deploy changes.
Run the checker.
Review the generated report.
Fix confirmed issues.
Regenerate the report.
Commit updated maintenance data.
Push changes.

Typical usage:

python3 scripts/check_broken_links.py

git add _data/broken_links.yml

git commit -m "Update broken links report"

git push

This workflow mirrors the approach used elsewhere within TSAP, including the Automatic Last Updated Dates system.

Advantages and Limitations

The selected architecture offers several important advantages. Validation is based on URLs that are actually published rather than inferred from source files. False positives are sharply reduced: in testing, the report fell from 462 entries under the refined repository-based approach to 8 under the sitemap-based one. The resulting report remains human-readable and easy to audit. The implementation is compatible with GitHub Pages and does not depend upon custom build infrastructure.

At the same time, several limitations should be recognised. The checker validates against the current sitemap, which means links to newly created pages may be reported as broken if the sitemap has not yet been updated. The checker identifies missing URLs but does not automatically determine the correct replacement, so editorial judgement remains necessary when resolving issues. The implementation also focuses on standard internal links and may require future enhancement if additional custom link formats are introduced.

These limitations are considered acceptable given the project’s emphasis on simplicity, transparency, and maintainability.

Future Improvements

Several enhancements may be considered in the future.

Potential improvements include redirect awareness, support for exclusion lists, detection of orphaned pages, detection of wanted pages, integration with additional maintenance reports, unified maintenance-script execution, and historical reporting that tracks trends over time.

Any future development should continue to prioritise transparency, maintainability, compatibility with GitHub Pages, and alignment with TSAP’s broader static-site architecture.

Lessons Learned

The development of the Broken Links Checker produced an important architectural lesson that may influence future maintenance tools across TSAP.

Source files are not necessarily an accurate representation of the published website. Jekyll generates URLs through a combination of directory structures, front matter, templates, index pages, and build-time behaviour. Attempting to reproduce all of those rules outside the build process can introduce significant complexity and still produce incorrect results.

The live sitemap provides a simpler and more reliable source of truth because it reflects what visitors can actually access. The eventual success of the Broken Links Checker came not from making the original approach more sophisticated, but from replacing assumptions with published reality.

Future maintenance tools should therefore consider using published site data whenever possible rather than attempting to infer site structure from repository content alone.

Categories:

TSAP Documentation

📄 This page was created on 2 June 2026. You can view its history on GitHub.