Web Crawler

v1.0.0

9th of June, 2026

The Internet is a vast sea full of domains, pages, and sub-pages, all endlessly referencing each other in differing patterns. Some are uni-directional, pointing towards a behemoth that has cornered a section of the Internet. Others are bi-directional, painting a cohesive brush stroke with whiskers that sometimes spur. But most help build a bird nest swarm of information. A points to B, which points to C, which points to D, and so on.

TheFarrelly.com is part of the bird nest.

Wanna know how I know?

I built a web crawler, along with a way to represent the domains and pages it found upon the vast Internet seas. If you're interested, head over to web-graph and see it for yourself.

Crawlers

In my journey I made three behaviourally distinct web crawlers: Internal, Fringe, and Follow. Each builds on the previous one, extending its functionality.

Internal only represents web pages, or nodes, which share a root or base URL (i.e. 'thefarrelly.com'). Fringe also includes external pages, but only as leaf nodes: the nodes at the very end of a graph. Lastly, Follow visits each of these external pages and includes their sub-pages, and so on.
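To make the difference between the three modes concrete, here's a minimal sketch in Python. It isn't lifted from the crawler itself; the names `Mode`, `is_internal`, and `decide` are made up for illustration, and treating 'www.' and other subdomains as internal is an assumption on my part.

```python
from enum import Enum
from urllib.parse import urlparse


class Mode(Enum):
    INTERNAL = "internal"
    FRINGE = "fringe"
    FOLLOW = "follow"


def host(url: str) -> str:
    """Return the lowercased hostname with any leading 'www.' removed."""
    name = (urlparse(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def is_internal(url: str, root: str) -> bool:
    """True when the URL's host is the root domain or one of its subdomains (assumption)."""
    h = host(url)
    return h == root or h.endswith("." + root)


def decide(url: str, root: str, mode: Mode) -> tuple[bool, bool]:
    """Return (include_as_node, crawl_its_links) for a discovered URL."""
    if is_internal(url, root):
        return True, True        # every mode includes and crawls internal pages
    if mode is Mode.INTERNAL:
        return False, False      # external pages are ignored entirely
    if mode is Mode.FRINGE:
        return True, False       # external pages become leaf nodes
    return True, True            # FOLLOW: external pages are crawled too
```

In a real run, Follow would still need a depth limit on top of this, otherwise the second value never says "stop".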

Keeping it simple

With the focus starting on my own domain and then looking outwards, I was able to keep it simple. The web crawler starts from a seed, 'thefarrelly.com', grabs all the hrefs it can find, then adds them to a queue.

It continues crawling until there are either no more hrefs in the queue, or its limits have been reached.
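The seed-and-queue loop described above can be sketched roughly like this. Treat it as an illustration under assumptions rather than the crawler's actual code: `crawl`, `extract_links`, `max_pages`, and `max_depth` are hypothetical names, depth is counted in hops from the seed, and `fetch` is any function you pass in that returns a page's HTML as a string.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class HrefParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url: str, html: str) -> list[str]:
    """Resolve hrefs against the page URL, drop #fragments, keep http(s) only."""
    parser = HrefParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urldefrag(urljoin(base_url, href)).url
        if urlparse(url).scheme in ("http", "https"):
            links.append(url)
    return links


def crawl(seed: str, fetch: Callable[[str], str],
          max_pages: int = 100, max_depth: int = 1) -> dict[str, list[str]]:
    """Breadth-first crawl from the seed. Returns {page_url: [linked_urls]}."""
    graph: dict[str, list[str]] = {}
    seen = {seed}
    queue = deque([(seed, 0)])
    # Stop when the queue is empty or the page limit has been reached.
    while queue and len(graph) < max_pages:
        url, depth = queue.popleft()
        links = extract_links(url, fetch(url))
        graph[url] = links
        if depth >= max_depth:
            continue  # record this page's links, but don't queue them
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return graph
```

Passing `fetch` in keeps the loop easy to try without touching the network: hand it a dictionary lookup for a dry run, or a small wrapper around `urllib.request.urlopen` for the real thing.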

Internal mode has 40 or so pages to check. Fringe checks the same pages, but also adds any external hrefs it finds as leaf nodes, bringing the total up to 160 nodes. Follow dwarfs both, creating over 3,675 nodes, and that's with a depth limit of 1.

Just imagine the sheer number of nodes if no depth limit was applied. The single reference to Graph Theory on Wikipedia in "where in the web is thefarrelly.com?" resulted in 1,010 nodes being created.

Don't just take my word for it. Go to web-graph, click 'Follow' and then 'See final result', and see it for yourself.

Future

I'd like to take my current learnings and apply them to what I see as the next stage of this project. As noted in a previous post, "where in the web is thefarrelly.com?", I still have a few outstanding questions I'd like to answer.

Where can you find this website referenced?
Which web pages are being referenced?

This current iteration focuses purely on starting at this domain and then going outwards to a limit, whereas finding references requires searching the breadth of the Internet, then coming back.
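Once a crawl of other sites exists, "coming back" is mostly a matter of reading the graph in reverse. Here's a rough sketch of that idea, assuming the crawl is stored as a mapping from each page to the pages it links to. The `backlinks` and `on_domain` names are hypothetical, and this says nothing about how to gather that wider crawl in the first place, which is the hard part.

```python
from collections import defaultdict
from urllib.parse import urlparse


def on_domain(url: str, domain: str) -> bool:
    """True when the URL's host is the domain or one of its subdomains."""
    h = (urlparse(url).hostname or "").lower()
    return h == domain or h.endswith("." + domain)


def backlinks(graph: dict[str, list[str]], domain: str) -> dict[str, set[str]]:
    """Map each page on `domain` to the set of external pages linking to it."""
    found: dict[str, set[str]] = defaultdict(set)
    for source, targets in graph.items():
        if on_domain(source, domain):
            continue  # internal links aren't references from elsewhere
        for target in targets:
            if on_domain(target, domain):
                found[target].add(source)
    return dict(found)
```

The keys of the result answer "which web pages are being referenced?", and the values answer "where can you find this website referenced?".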

DuckDuckGo, Google, Microsoft and other giants have solved this problem with rather complex architectures. To be fair, they're doing a lot more than just finding references. They're indexing, scoring, checking relevance, and providing a result to a query. My requirements are simpler.

Ideally I resolve the above questions and problems while keeping the complexity low, allowing the crawler to run on simple hardware, like a Raspberry Pi.

Perhaps you'd like to peek behind the curtain and see how this all works?

View on GitHub