How to prevent Google from hammering your server for old Linked CSE specifications
Google’s Linked CSE is a fantastic tool. It allows you to dynamically generate a custom search engine for each of your users, or even for each individual visit, based on any parameters available to your application. This functionality has been invaluable for SearchTempest.com as we use custom search engines to provide customized multi-city searches of craigslist (no affiliation).
The problem with this approach is that when you create a Google Custom Search Engine (CSE) with a linked specification file on your server, Google’s “FeedFetcher-Google-CoOp” bot requests that file in order to build the CSE. It then keeps requesting the file at regular intervals for at least several months afterward, even if no actual user ever uses that CSE again.
In our case, it got to the point where the majority of all requests for files from our web server were for useless, outdated Google CSE specification files. Unfortunately, once this is happening, it appears there is no way to stop it. The best you can do is to add a rule at either the web server or, ideally, the firewall level to block these requests. (Currently we return a 410 ‘gone’ response in as few bytes as possible.)
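As a rough illustration of the 410 approach, here is a minimal sketch written as Python WSGI middleware. It is not our actual server rule: the function name block_stale_cref_requests and the is_stale callable are hypothetical, and how you decide that a specification file is outdated is left to your application. A firewall or web server rule would be cheaper than handling this in application code.

```python
BOT_TOKEN = "feedfetcher-google-coop"  # lowercase form of the bot name quoted above


def block_stale_cref_requests(app, is_stale):
    """Wrap a WSGI app; answer the CSE bot with an empty 410 for stale files.

    `is_stale` is a hypothetical callable (path -> bool) supplied by the
    site; how a file is judged outdated is an assumption of this sketch.
    """
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        path = environ.get("PATH_INFO", "")
        if BOT_TOKEN in user_agent and is_stale(path):
            # Empty body and a single header keep the response small.
            start_response("410 Gone", [("Content-Length", "0")])
            return [b""]
        # Everything else (real users, fresh files) passes through untouched.
        return app(environ, start_response)
    return middleware
```

Requests from other user agents, and bot requests for files that are still current, fall through to the wrapped application.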
However, there is a way to avoid getting into this situation in the first place. In short, Google CSE specification files should be served from disposable subdomains. For example, create a subdomain called gcrefs1. For convenience you can point it at the same directory as your main (www) site. In your CSE setup, tell Google to access the file at http://gcrefs1.example.com/filename. Then, after a period of time (once Google’s Feedfetcher bot is making too many requests to the file for your liking), simply create a new subdomain (say, gcrefs2), update your references to point to it, and remove the DNS entries for the old one. Once the old name no longer resolves, the bot’s requests for the outdated files never reach your server.
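One way to keep the rotation painless is to build every specification URL from a single setting, so that switching subdomains is a one-line change. The sketch below assumes the gcrefsN naming from the example above. The names CREF_GENERATION, BASE_DOMAIN and cref_url are made up for illustration.

```python
CREF_GENERATION = 1          # bump when the current subdomain attracts too much bot traffic
BASE_DOMAIN = "example.com"  # placeholder domain from the example above


def cref_url(filename, generation=CREF_GENERATION):
    """Return the specification URL to hand to Google for `filename`."""
    # Strip any leading slash so the result never contains "//" after the host.
    return f"http://gcrefs{generation}.{BASE_DOMAIN}/{filename.lstrip('/')}"
```

With the defaults, cref_url("filename") gives http://gcrefs1.example.com/filename. After you create gcrefs2 and change CREF_GENERATION to 2, every newly generated CSE points at the new subdomain.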
Of course, it’d be nice if Google’s Feedfetcher just respected robots.txt, or reacted properly to 410 responses, but given the usefulness of Custom Search Engines in general, I’ll take what I can get.
Update: It appears that Google ignores 410 responses, but not 301 responses. So by 301 redirecting an outdated cref file to null.html (for example), you should be able to convince them to stop requesting it. (Although the bot will run through each of its saved sets of request arguments one last time, since it sees each as a completely separate file.)
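For illustration, the redirect could look something like the following Python WSGI middleware. This is only a sketch: the RETIRED_CREFS set and its example path are hypothetical, and null.html is simply the example target mentioned above. The match is on the path alone, so each saved set of request arguments for a retired file receives the same redirect.

```python
RETIRED_CREFS = {"/old_cref.xml"}  # hypothetical paths of outdated specification files
REDIRECT_TARGET = "/null.html"     # example target from the text above


def redirect_retired_crefs(app):
    """Wrap a WSGI app; 301-redirect requests for retired specification files."""
    def middleware(environ, start_response):
        # PATH_INFO excludes the query string, so every variation of the
        # request arguments for a retired file is redirected the same way.
        if environ.get("PATH_INFO", "") in RETIRED_CREFS:
            start_response(
                "301 Moved Permanently",
                [("Location", REDIRECT_TARGET), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)
    return middleware
```

The same effect can be had with a rewrite rule in the web server itself, which avoids invoking application code for these requests.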