Facebook Clarifies Policy On Site Scrapers As Robots.txt Gets Updated

Want to scrape Facebook’s site for content? You may want to reconsider how you do so, as Facebook has updated its robots.txt file to be a bit more restrictive. If you aren’t aware of what a robots.txt file is, you can read more here. Until now, the file simply blocked certain pages from being indexed by any crawler. The updated version is far more explicit, limiting indexing to a whitelist of search-engine crawlers including Baidu, Google, MSN, Naver, Slurp, Yandex, and a few others. The company has also added a link to the “Automated Data Collection Terms” page.
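For readers who want to see how those rules work in practice, here’s a minimal sketch (using Python’s standard-library robotparser) of how a well-behaved crawler checks robots.txt before fetching a page. The profile URL and the second user-agent string are hypothetical, and the verdicts will change whenever Facebook edits the file:

```python
# Rough sketch: how a polite crawler consults robots.txt before fetching a page.
# Only the robots.txt URL is real; the profile URL and "SomeScraperBot" are made up.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()  # download and parse the current rules

page = "https://www.facebook.com/some.profile"  # hypothetical profile URL

for agent in ("Googlebot", "SomeScraperBot"):
    verdict = "allowed" if rp.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict} for {page}")
```

A whitelisted search-engine crawler and an unknown scraper will now get different answers for the same URL, which is exactly the point of the change.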

Developers and companies can also apply for permission to scrape Facebook data via the following form. In response to a blog post by Pete Warden (a speaker at yesterday’s Social Developer Summit), Facebook’s Bret Taylor posted to Hacker News to clarify the company’s position on scrapers:

There are a couple of things I want to clarify. First, we genuinely support data portability: we want users to be able to use their data in other applications without restriction. Our new data policies, which we deployed at f8, clearly reflect this (http://developers.facebook.com/policy/):
“Users give you their basic account information when they connect with your application. For all other data, you must obtain explicit consent from the user who provided the data to us before using it for any purpose other than displaying it back to the user on your application.”

Basically, users have complete control over their data, and as long as a user gives an application explicit consent, Facebook doesn’t get in the way of the user using their data in your applications, beyond basic protections against things like selling data to ad networks and other sleazy data collectors.

Crawling is a bit of a special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other “crawlers” don’t really meet user expectations. As Blake mentioned in his response on Pete’s blog post, some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy.

Pete’s post did bring up some real issues with the way we were handling things. In particular, I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment.

We are updating our robots.txt to explicitly allow the crawlers of search engines that we currently allow to index Facebook content and disallow all other crawlers. We will whitelist crawlers when legitimate companies that want to crawl us contact us (presumably search engines). For other purposes, we really want people using our API because it has explicit controls around privacy and additional requirements that we feel are important when a company is using users’ data from Facebook (e.g., we require that you have a privacy policy and offer users the ability to delete their data from your service).

This robots.txt change should be deployed today. The change will make our robots.txt abide by conventions and standards, which I think is the main legitimate complaint in Pete’s post.

We’ve taken a look at the robots.txt file, and it has definitely been updated. For companies scraping the site, it’s now clear that Facebook will require approval before scraped data can be used for any purpose, including selling it. You can view Facebook’s updated robots.txt file here. It will be interesting to see how this affects the relatively robust data-scraper ecosystem, if it affects it at all.
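If you’d rather check the file from a script than a browser, a quick standard-library sketch like the one below prints each User-agent group along with its Allow and Disallow rules. The spoofed browser User-Agent header is just a guess at what the server will accept, and the output will change as Facebook updates the file:

```python
# Dump the User-agent groups and their rules from Facebook's live robots.txt.
from urllib.request import Request, urlopen

req = Request(
    "https://www.facebook.com/robots.txt",
    headers={"User-Agent": "Mozilla/5.0"},  # plain urllib requests are often rejected
)
text = urlopen(req).read().decode("utf-8", errors="replace")

for line in text.splitlines():
    stripped = line.strip()
    if stripped.lower().startswith(("user-agent:", "allow:", "disallow:")):
        print(stripped)
```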