A typical Fossil website can have millions of pages, and many of those pages (for example diffs and annotations and tarballs) can be expensive to compute. If a robot walks a Fossil-generated website, it can present a crippling bandwidth and CPU load.
A Fossil website is intended to be used interactively by humans, not walked by robots. This article describes the techniques used by Fossil to try to welcome human users while keeping out robots.
The Hyperlink User Capability
Every Fossil web session has a "user". For random passers-by on the internet (and for robots) that user is "nobody". The "anonymous" user is also available for humans who do not wish to identify themselves. The difference is that "anonymous" requires a login (using a password supplied via a CAPTCHA) whereas "nobody" does not require a login. The site administrator can also create logins with passwords for specific individuals.
Users without the Hyperlink capability do not see most Fossil-generated hyperlinks. This is a simple defense against robots, since the "nobody" user category does not have this capability by default. Users must log in (perhaps as "anonymous") before they can see any of the hyperlinks. A robot that cannot log into your Fossil repository will be unable to walk its historical check-ins, create diffs between versions, pull zip archives, etc. by visiting links, because there are no links.
A text message appears at the top of each page in this situation to invite humans to log in as anonymous in order to activate hyperlinks.
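The effect of this capability check can be pictured with a minimal Python sketch. It is not Fossil's implementation (Fossil itself is written in C). The function name `render_link`, the representation of capabilities as a string of letters, and the use of "h" to stand for the Hyperlink capability are all assumptions made for illustration.

```python
from html import escape

HYPERLINK_CAP = "h"  # assumed one-letter code for the Hyperlink capability


def render_link(user_caps: str, target: str, label: str) -> str:
    """Return an anchor for users who hold the Hyperlink capability.

    Everyone else gets the label as plain text, so a crawler that has
    not logged in finds nothing to follow.
    """
    if HYPERLINK_CAP in user_caps:
        return f'<a href="{escape(target, quote=True)}">{escape(label)}</a>'
    return escape(label)


# A visitor with no capabilities versus one who has the Hyperlink capability.
print(render_link("", "/info/abc123", "check-in abc123"))
print(render_link("h", "/info/abc123", "check-in abc123"))
```

The point of the sketch is that the decision is made on the server while the page is generated: a user without the capability never receives the link targets at all.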
But requiring a login, even an anonymous login, can be annoying. Fossil provides other techniques for blocking robots that are less cumbersome for humans.
Automatic Hyperlinks Based on UserAgent
The UserAgent string is a text identifier, included in the header of most HTTP requests, that identifies the specific maker and version of the browser (or robot) that generated the request. Typical UserAgent strings look like this:
- Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
- Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Wget/1.12 (openbsd4.9)
The first two UserAgent strings above identify Firefox 19 and Internet Explorer 8.0, both running on Windows NT. The third example is the robot used by Google to index the internet. The fourth example is the "wget" utility running on OpenBSD. Thus the first two UserAgent strings identify the requester as human, whereas the last two identify the requester as a robot. Note that the UserAgent string is completely under the control of the requester, so a malicious robot can forge a UserAgent string that makes it look like a human. But most robots want to "play nicely" on the internet and are quite open about the fact that they are robots. The UserAgent string therefore provides a good first guess about whether a request originates from a human or a robot.
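A first-guess classifier of this kind might look like the following Python sketch. The function name `guess_is_human`, the list of robot hints, and the choice to treat a missing UserAgent as a robot are assumptions chosen for illustration. They are not Fossil's actual rules.

```python
import re

# Substrings that self-identifying robots commonly put in their UserAgent.
# Illustrative only; a real deployment would tune this list.
ROBOT_HINTS = re.compile(r"bot|spider|crawl|wget|curl", re.IGNORECASE)


def guess_is_human(user_agent):
    """Return a first guess; the header is forgeable, so this proves nothing."""
    if not user_agent:
        return False  # assumption: no UserAgent at all is treated as a robot
    if ROBOT_HINTS.search(user_agent):
        return False
    # Mainstream browsers begin their UserAgent with "Mozilla/".
    return user_agent.startswith("Mozilla/")


examples = [
    "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Wget/1.12 (openbsd4.9)",
]
for ua in examples:
    print(guess_is_human(ua), ua)
```

Note that the Googlebot string also starts with "Mozilla/", which is why the robot hints are checked before the prefix test.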
The first new sub-setting is a delay (in milliseconds) before setting the "href=" attributes on anchor tags. The default value for this delay is 10 milliseconds. The idea here is that a robot will try to interpret the links on the page immediately, will not wait for delayed scripts to run, and thus will never enable the true links.
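The delayed-link idea can be sketched from the server's side as follows. This is a hypothetical Python illustration, not Fossil's code: the `data-href` attribute, the function names, and the embedded JavaScript are assumptions about one way such a scheme could be built.

```python
from html import escape

HREF_DELAY_MS = 10  # the default delay mentioned in the text


def deferred_anchor(target: str, label: str) -> str:
    """Emit an anchor with no href; the real target is parked in data-href."""
    return f'<a data-href="{escape(target, quote=True)}">{escape(label)}</a>'


def enabling_script(delay_ms: int = HREF_DELAY_MS) -> str:
    """Emit a script that copies data-href into href after delay_ms."""
    return (
        "<script>\n"
        "setTimeout(function () {\n"
        '  document.querySelectorAll("a[data-href]").forEach(function (a) {\n'
        '    a.setAttribute("href", a.getAttribute("data-href"));\n'
        "  });\n"
        f"}}, {delay_ms});\n"
        "</script>"
    )


# Assemble a tiny page body: the anchors come first, the script last.
body = deferred_anchor("/timeline", "Timeline") + "\n" + enabling_script()
print(body)
```

In this sketch the script is placed after the anchors so that they already exist when the timer fires. A client that parses the HTML without running scripts sees anchors that have no "href=" attribute and therefore has nothing to crawl.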
See also Managing Server Load for a description of how expensive pages can be disabled when the server is under heavy load.
The Ongoing Struggle
Fossil currently does a very good job of providing easy access to humans while keeping out troublesome robots. However, robots continue to grow more sophisticated, requiring ever more advanced defenses. This "arms race" is unlikely to ever end. The developers of Fossil will continue to try to improve the robot defenses of Fossil, so check back from time to time for the latest releases and updates.
Readers of this page who have suggestions on how to improve the robot defenses in Fossil are invited to submit their ideas to the Fossil Users forum: https://fossil-scm.org/forum.