Licensing And Legal
Screen Scraping: Legal Considerations
xSkrape is a tool, much like many other existing tools that help people automate. The truth is, when I reference a URL in my browser, I'm retrieving a data stream from a host, and the host has tacitly said, "Ok client, here's some data I'm freely willing to hand out." So if I pull the same data from a tool that's not a browser - say, Excel - how is that different? (After all, Excel's Web Data Source, as one example, affords similar functionality.)
Here's where things get a bit murky. When we use a site, we're agreeing to its terms of service - assuming they've been published in a reasonably clear way. Many sites do this, and terms of service (TOS) sometimes say things about how you are - or are not - allowed to access the site. So if a TOS says "no scrapers" (or equivalent language), you're likely working with a site that believes its data is proprietary (even if you could reach it using a browser) - and that's their right. It's not hard to think of ways some businesses could be damaged by "malicious scraping," which we clearly do not advocate! There are also basic copyright considerations you should be mindful of.
xSkrape's End-User License Agreement (EULA) addresses this issue in the paragraph "Other Usage Restrictions" by noting that xSkrape users are responsible for following the restrictions set forth in the TOS of any site they touch. This means adhering to explicit restrictions, but also implicit ones - such as not issuing so many requests that the site could perceive the access as a denial of service attack.
To help the user stay on top of what's "out of bounds," xSkrape checks for any robots.txt file that site admins may publish, per commonly understood Internet standards. If there's an explicit denial aimed at a user agent string containing "xskrape", the denial will be honored. If there's a general denial, xSkrape warns the user instead. A hard error is not issued in this case because we've observed "public API" scenarios where robots.txt would imply denial even though the ability to use the API is advertised. Why might this be? It comes down to how we define a "bot" - and unfortunately that's not a black-and-white definition. We believe xSkrape is not a bot in itself, although one could certainly build a bot-like agent using elements of xSkrape. However, you could do the same with curl or other tools, so we also believe in working on a level playing field.
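The deny/warn/allow policy described above can be sketched with Python's standard-library robots.txt parser. This is an illustration only - not xSkrape's actual implementation - and the `robots_verdict` helper and its return values are names invented for this example:

```python
# Sketch (not xSkrape's actual code) of the robots.txt policy described
# above: honor an explicit denial aimed at our user agent, warn on a
# general denial, otherwise allow.
from urllib.robotparser import RobotFileParser


def robots_verdict(robots_txt: str, path: str, agent: str = "xskrape") -> str:
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    agent_denied = not rp.can_fetch(agent, path)
    general_denied = not rp.can_fetch("*", path)
    if agent_denied and not general_denied:
        # A rule singled out our agent specifically: honor the denial.
        return "deny"
    if general_denied:
        # A blanket denial: warn the user rather than hard-fail, since
        # some sites publish a restrictive robots.txt yet advertise an API.
        return "warn"
    return "allow"
```

Note one simplification: if a site denies both our agent by name and everyone via `*`, this sketch reports "warn" rather than "deny", because `RobotFileParser` doesn't directly expose whether a rule named the agent explicitly.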
xSkrape also identifies itself in the User-Agent header of its requests. This can't be overridden by users, so it offers another way for sites to turn off access to xSkrape if they feel threatened by it. Note that a unique registration number assigned when you register xSkrape is also included in the User-Agent string. It can't identify you personally to the outside world, but it does tell a site that your requests are coming from the same source, giving a site admin the option to restrict your access - which could happen if you abuse a site's TOS. (This is not our decision, and as noted in the EULA we make no warranties about which sites will be accessible to you.)
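As an illustration of the point above, a site admin who wanted to cut off access could publish a robots.txt entry along these lines (this relies only on the fact that xSkrape's User-Agent contains "xskrape"; the exact header contents and directory paths here are hypothetical):

```
User-agent: xskrape
Disallow: /
```

Honoring an entry like this is exactly the "explicit denial" case discussed earlier.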
We also reserve the right (as noted in the EULA) to use a "kill switch" to turn off your use of xSkrape if we receive a valid complaint from a site admin and have contacted you to give you an opportunity to comply (assuming we have your correct registration details).
To avoid unintentional cases where your requests could start to resemble a denial of service attack, the current version of xSkrape by default throttles requests on a per-host, per-AppDomain basis. (The defaults are 15,000 uncached requests per hour and a minimum of 25 milliseconds between requests.) These values can be changed through configuration, provided you accept responsibility for the implications (and have, for example, discussed your workload with the site admin). Note that in many cases you won't have a choice about throttling: sites may simply fail your requests if you hit them too hard, so we offer quite a few "dials" for configuring this.
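The per-host throttling described above can be sketched as follows. This is not xSkrape's actual code - the `HostThrottle` class is invented for illustration - but the default values mirror the documented ones (15,000 uncached requests per hour, 25 ms minimum spacing):

```python
# Sketch of per-host request throttling (illustration only, not
# xSkrape's implementation). Defaults follow the documented values.
import time
from collections import defaultdict, deque


class HostThrottle:
    def __init__(self, max_per_hour=15_000, min_interval=0.025):
        self.max_per_hour = max_per_hour      # uncached requests per hour
        self.min_interval = min_interval      # seconds between requests
        self._history = defaultdict(deque)    # host -> recent request times

    def wait_time(self, host, now=None):
        """Seconds the caller should sleep before hitting `host` again."""
        now = time.monotonic() if now is None else now
        hist = self._history[host]
        # Drop requests older than one hour from the sliding window.
        while hist and now - hist[0] >= 3600.0:
            hist.popleft()
        delay = 0.0
        if hist:
            # Enforce the minimum spacing since the last request.
            delay = max(delay, self.min_interval - (now - hist[-1]))
        if len(hist) >= self.max_per_hour:
            # Hourly cap reached: wait until the oldest request ages out.
            delay = max(delay, 3600.0 - (now - hist[0]))
        return max(delay, 0.0)

    def record(self, host, now=None):
        """Call after actually issuing a request to `host`."""
        self._history[host].append(time.monotonic() if now is None else now)
```

A caller would check `wait_time`, sleep for that long if it's positive, then issue the request and call `record`. Keeping the two limits separate means short bursts are smoothed by the minimum interval while the hourly cap bounds sustained load.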
Here are a few other resources that cover some of the legal considerations related to screen scraping:
- Wikipedia - Web Scraping
- Blog post: Is Web Scraping Legal? (perspectives from a patent/trademark lawyer)
- How Legal is Content Scraping?