Google not indexing craigslist – SearchTempest switches to Bing
As of February 28, Google has stopped indexing new craigslist posts. Or more specifically, every day between about 5pm and midnight PST, they index them as usual. Then at midnight, they throw them all away. So anyone searching Google for craigslist posts over the past couple of weeks has found a giant gap starting at the beginning of March.
SearchTempest has no affiliation with craigslist, so until recently, we used Google to power our searches. Since Google is no longer getting the job done though, we’ve switched to Bing!
To be honest, Bing’s API doesn’t hold a candle to Google Custom Search. You can’t sort by date, specify a list of URLs to search (Google’s ‘annotations’), or even reliably search within the URL at all. (Bing does have a semi-hidden option, instreamset:(url):{text}, which is similar to Google’s inurl:{text}, but we’ve found it to be unreliable.)
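For illustration, here is a minimal Python sketch of how the two URL-restricting operators mentioned above could be assembled into query strings. The function names are hypothetical and this is not SearchTempest's actual code; only the operator syntax, inurl:{text} and instreamset:(url):{text}, comes from the description above.

```python
from urllib.parse import quote_plus


def google_url_filter(terms: str, url_text: str) -> str:
    """Build a query that asks Google to match url_text inside the result URL."""
    return f"{terms} inurl:{url_text}"


def bing_url_filter(terms: str, url_text: str) -> str:
    """Build the roughly equivalent Bing query using instreamset:(url):."""
    return f"{terms} instreamset:(url):{url_text}"


if __name__ == "__main__":
    # "car" and "cto" are example values only.
    for query in (google_url_filter("car", "cto"), bing_url_filter("car", "cto")):
        # quote_plus encodes the query so it can be placed in a URL parameter.
        print(query, "->", quote_plus(query))
```

Keeping the operator syntax in one small function per engine makes it easier to swap engines later, though it does nothing about the unreliability of the Bing operator itself.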
That said, through some clever manipulation of query strings and a mess of hard-coded special cases, we’ve managed to come up with a Bing-powered craigslist search that’s quite functional. If you’re frustrated by not being able to search craigslist through Google like before, give it a try!
Hi Nathan, my name is Matt Cutts and I’m an engineer in the search quality group at Google. Thanks for asking about this; it helped the indexing team uncover an issue in how we’re indexing Craigslist, and we’re in the process of fixing it right now.
To understand what happened, you need to know about the “Expires” HTTP header and Google’s “unavailable_after” extension to the Robots Exclusion Protocol. As you can see at http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html , Google’s “unavailable_after” lets a website say “after date X, remove this page from Google’s main web search results.” In contrast, the “Expires” HTTP header relates to caching, and gives the date when a page is considered stale.
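As a rough illustration of the two mechanisms (a sketch, not code from Google or Craigslist), the Python below treats “Expires” strictly as a caching hint and reads “unavailable_after” as a separate robots directive. The function names and the sample values are assumptions made up for this example, and the directive's date string is returned unparsed because its exact format is not specified here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def cached_copy_is_stale(expires_header: str, now: datetime) -> bool:
    """Caching meaning of Expires: after this time a cached copy is stale."""
    return now >= parsedate_to_datetime(expires_header)


def unavailable_after_value(robots_content: str) -> Optional[str]:
    """Return the raw date from content like 'unavailable_after: <date>'.

    Assumes the content string holds a single directive.
    """
    name, sep, value = robots_content.partition(":")
    if sep and name.strip().lower() == "unavailable_after":
        return value.strip()
    return None


if __name__ == "__main__":
    # Illustrative values only, not taken from any real Craigslist response.
    now = datetime(2012, 3, 15, 8, 0, tzinfo=timezone.utc)
    print(cached_copy_is_stale("Thu, 15 Mar 2012 07:00:00 GMT", now))
    print(unavailable_after_value("unavailable_after: 25-Aug-2007 15:00:00 EST"))
```

The point of keeping the two functions separate is that a stale cache entry only means "fetch the page again", while an unavailable_after date means "stop showing the page".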
A few years ago, users were complaining that Google was returning pages from Craigslist that were defunct or where the offer had expired a long time ago. And at the time, Craigslist was using the “Expires” HTTP header as if it were “unavailable_after”–that is, the Expires header was describing when the listing on Craigslist was obsolete and shouldn’t be shown to users. We ended up writing an algorithm for sites that appeared to be using the Expires header (instead of “unavailable_after”) to try to list when content was defunct and shouldn’t be shown anymore.
You might be able to see where this is going. Not too long ago, Craigslist changed how they generated the “Expires” HTTP header. It looks like they moved to the traditional interpretation of Expires for caching, and our indexing system didn’t notice. We’re in the process of fixing this, and I expect it to be fixed pretty quickly. The indexing team has already corrected this, so now it’s just a matter of re-crawling Craigslist over the next few days.
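The failure mode described above can be modeled with a toy example. This is only a sketch under stated assumptions: the function name, the boolean flag, and the one-hour cache lifetime are invented for illustration, and it does not represent Google's actual indexing logic.

```python
from datetime import datetime, timedelta, timezone


def shown_in_results(expires: datetime, now: datetime,
                     treat_expires_as_removal: bool) -> bool:
    """Decide whether a still-live listing appears in search results.

    If the indexer assumes the site uses Expires to mean "listing is
    defunct", the page is hidden once that time passes. Under the ordinary
    caching meaning, Expires has no effect on visibility.
    """
    if treat_expires_as_removal:
        return now < expires
    return True


if __name__ == "__main__":
    posted = datetime(2012, 3, 1, 18, 0, tzinfo=timezone.utc)  # made-up time
    expires = posted + timedelta(hours=1)  # hypothetical short cache lifetime
    later = posted + timedelta(days=1)     # the listing is still live here

    print(shown_in_results(expires, later, treat_expires_as_removal=True))
    print(shown_in_results(expires, later, treat_expires_as_removal=False))
```

Once a site switches to short, cache-oriented Expires values, the first interpretation hides listings that are still live, which matches the symptom of posts disappearing shortly after being indexed.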
So we were trying to go the extra mile to help users not see defunct pages, but that caused an issue when Craigslist changed how they used the “Expires” HTTP header. It sounded like you preferred Google’s Custom Search API over Bing’s so it should be safe to switch back to Google if you want. Thanks again for pointing this out.
So, will you, TEMPEST NATHAN, please switch back to Google now! PLEASE!
As soon as the problem is truly fixed, yep! Matt’s post is very encouraging, but take a look – Google still does not have any new craigslist posts as of 2:15pm March 15: https://www.google.com/search?num=20&safe=off&tbs=qdr:m,sbd:1&q=table+site:craigslist.org+-intitle:classifieds
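For anyone who wants to adapt that test query, here is a small Python sketch that splits the URL above into its parts. The helper name is hypothetical. As far as we understand the parameters, qdr:m in tbs limits results to the past month and sbd:1 sorts them by date, but treat that reading as an assumption rather than documented behavior.

```python
from urllib.parse import parse_qs, urlsplit


def decode_search_url(url: str) -> dict:
    """Return the query parameters, with tbs split into key/value pairs."""
    params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
    # tbs packs several comma-separated options, each written as name:value.
    params["tbs"] = dict(
        part.split(":", 1) for part in params.get("tbs", "").split(",") if ":" in part
    )
    return params


if __name__ == "__main__":
    url = ("https://www.google.com/search?num=20&safe=off&tbs=qdr:m,sbd:1"
           "&q=table+site:craigslist.org+-intitle:classifieds")
    for name, value in decode_search_url(url).items():
        print(name, "=", value)
```

Changing the search term or the site: restriction in q is then a matter of editing one value and re-encoding the URL.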
I’ve replied to your email to get a better idea of the problems you’re having with the current Bing setup.
Sorry, I was out yesterday doing a long run near Sacramento, but the indexing team mentioned that this should be fixed now. In general, it might take 2-3 days for things to get back to a more normal state, but this is on our radar now. Thanks again for pointing it out.
No problem, thanks again for your responses! I’ll let you know if things don’t appear to be back to normal in a few days.
Nathan & Matt,
As I posted in the Google Search Forum, here is a more specific query that demonstrates that the indexing is still not correct. This is from the New York Craigslist site, which is the second busiest Craigslist site. I should be seeing, at minimum, hundreds if not thousands of ads here:
https://www.google.com/search?num=20&safe=off&tbs=qdr:d,sbd:1&q=car+inurl:cto+site:newyork.craigslist.org+-intitle:classifieds
Yet I see only one ad at this particular time (11:10pm EST).
Please get this resolved soon. This is unbearable. Thank You.
Matt, I am an investigator and I specialize in stolen vehicles and the craigslist postings are vital to a majority of my investigations. Can you confirm if the indexing issue has been resolved?