Web Scraping, or Parsing Internet Resources: What Is It, and Is It Legal?
Web scraping, also known as web parsing, is the automated collection of information from various Internet resources, carried out by a specially developed computer program: a bot. Bots are built for tasks that are monotonous, routine, and follow one and the same algorithm of actions, but must be performed at speeds plainly unattainable by humans.
It is important to note that we are talking about collecting publicly available information, not about hacking or stealing content from a resource with restricted access. In addition, web scraping implies that the bot selects the specific information the collector is interested in rather than copying the resource's entire database. Moreover, the object of web parsing is not necessarily users' personal data; it can be information of many kinds. For example, one of the most popular applications of web scraping is monitoring product prices and analyzing assortments; ordinary consumers do the same thing manually, on a much smaller scale and far more slowly, when they search the Internet for the products that suit them best.
The web scraping mechanism is usually described as follows: the robot accesses the pages of the target site, receives the HTML code, parses it into components (scraping), searches for the data relevant to its task, and then saves that data in its own database.
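To make this cycle concrete, here is a minimal sketch in Python of the fetch, parse, extract, and store steps just described. The URL, CSS selectors, and database layout are hypothetical placeholders, not references to any real site; a production scraper would also honor robots.txt and rate limits, as discussed below.

```python
# Minimal sketch of the fetch -> parse -> extract -> store cycle.
# The URL, CSS selectors, and table layout are hypothetical examples.
import sqlite3

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/catalog"  # hypothetical target page

# 1. The bot requests the page and receives its HTML code.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# 2. The HTML is parsed into a tree of components.
soup = BeautifulSoup(response.text, "html.parser")

# 3. Only the data relevant to the task is selected --
#    here, product names and prices (the selectors are assumptions).
items = [
    (tag.select_one(".name").get_text(strip=True),
     tag.select_one(".price").get_text(strip=True))
    for tag in soup.select(".product")
]

# 4. The extracted data is saved in the scraper's own database.
con = sqlite3.connect("scraped.db")
con.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?)", items)
con.commit()
con.close()
```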
A classic example is the parsing of sites by search engines (in particular, Google and Yandex), whose robots visit a site and index it, collecting the necessary data. For this reason, site-building experts strongly recommend placing a special “invitation to the search robot” file, robots.txt, in the site's root folder, so that the site is indexed, and consequently appears in search results, faster. The need for such an “invitation” in the site directory is explained by the fact that a search bot scans a limited number of files on a given information resource and then moves on to the next site. If the robots.txt file is missing, the bot may index secondary pages, while the important pages that should drive the site's promotion remain unindexed.
The robots.txt file may also prohibit search robots from indexing certain pages of the site. And robots.txt is not the only way to limit parsing: other technical protection mechanisms exist today. At the same time, experts acknowledge that it is not yet possible to erect insurmountable barriers against all types of parsing (web scraping); the tools being developed can only be considered deterrents. Unable to prevent site scraping technically, the rights holders of information resources attempt to use legal tools to prohibit this automated collection of information.
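As an illustration, a well-behaved scraper can consult a site's robots.txt before fetching a page, using only Python's standard library. The site URL, paths, and bot name below are hypothetical, and the rules shown in the comment are merely one example of what a robots.txt file might contain.

```python
# Sketch: checking a site's robots.txt directives before scraping.
# A robots.txt file might contain rules such as:
#   User-agent: *
#   Disallow: /private/
#   Allow: /catalog/
from urllib.robotparser import RobotFileParser

# The URL, paths, and bot name here are hypothetical examples.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for url in ("https://example.com/catalog/item1",
            "https://example.com/private/profile"):
    allowed = rp.can_fetch("MyScraperBot", url)
    print(f"{url}: {'allowed' if allowed else 'disallowed'} for MyScraperBot")
```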
VK v. Double Data
A famous attempt to ban parsing (web scraping) was the case of VK (VKontakte) v. Double Data, a claim for protection of the plaintiff's related rights as the maker of the database of users of the VK social network. It was examined successively by courts of three instances: the trial court rejected the claim, the appellate court granted it, and in cassation the judicial acts issued in the case were cancelled and the case was sent back for fresh consideration to the court of first instance, which has not yet issued a new decision.
The plaintiff justified its position by its intellectual property, the related rights to the database, as well as by the need to protect the rights of personal data subjects: the users of the social network whose information was indexed by the defendant's robot. The plaintiff thus defended its intellectual rights while partly justifying its claim by the need to protect the rights and interests of a certain group of people, the users of the VK social network (which, incidentally, raises questions in the context of the special requirements of arbitration procedural law).
Objecting to the lawsuit, Double Data pointed to several circumstances that, in its opinion, precluded satisfying the claim. First, the defendant insisted that its activity was inherently no different from that of search robots indexing Internet resources and therefore could not violate the plaintiff's intellectual rights. Second, the defendant drew attention to the fact that the database created by the plaintiff was a by-product of creating and developing the social network itself, which required no independent investment in searching for, collecting, and verifying the data it contains, so there was no reason to speak of the creation of an investment database. Third, Double Data noted that there was no evidence that the defendant had extracted a substantial part of the materials of the database of the social network's users, which rules out a finding of infringement. Fourth, according to the defendant, site owners cannot acquire a monopoly on the data of those sites' users: the principle of “posted the data – granted them to the social network” is extremely dangerous. Thus, in its objections the defendant also stayed within the niche marked out by the plaintiff: intellectual rights “seasoned” with the rights to the personal data of the social network's users.
LinkedIn v. HiQ
The participants in the HiQ v. LinkedIn case justified their positions in an entirely different way. The right to investment databases is not recognized in all countries as intellectual property, nor is it everywhere designated a related right. In this regard, publications often note that, for example, Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases neutrally refers to this right as a sui generis right. For the purposes of this article, what matters is that non-creative databases receive no legal protection as intellectual property in the United States, which partly explains the fundamentally different legal reasoning in HiQ v. LinkedIn.
It is also noteworthy that, in contrast to the Russian case, in which the rights holder of the VK information resource went to court to prohibit the startup Double Data from web scraping, the American case unfolded in the diametrically opposite scenario: the startup HiQ Labs, Inc., which performs such parsing and uses the obtained data in its analytical products, applied to the court to prohibit the rights holder of the LinkedIn information resource, LinkedIn Corporation, from taking technical measures to prevent parsing.
LinkedIn first attempted to restrict web scraping of its site out of court by sending HiQ a letter demanding that it stop the automated copying of data. The letter stated that such actions by HiQ constituted a violation, and that if HiQ continued to parse the LinkedIn platform it would be violating federal and state laws, including the Computer Fraud and Abuse Act of 1986 (CFAA), the Digital Millennium Copyright Act of 1998 (DMCA), and § 502(c) of the California Penal Code.
Faced with the threat of losing its main data source, and all but accused of hacking (the CFAA is aimed at stopping hacking and prohibits accessing a computer without authorization or in excess of authorized access), the startup HiQ demanded in a response letter that LinkedIn recognize HiQ's right to access public LinkedIn pages. A week later, HiQ went to court asking it to bar LinkedIn from erecting technical barriers that prevent scraping and from taking legal or technical measures aimed at blocking HiQ's bots from accessing the public profiles of LinkedIn users.
HiQ pointed out that its business model is based on access to the publicly available data of people who have chosen to share this information on LinkedIn; deprived of this data source, HiQ would be unable to fulfill its contractual obligations, including contracts with large clients, and its business would suffer irreparable harm. The applicant also argued that LinkedIn's behavior did not comply with the rules of fair competition, since there was evidence that LinkedIn planned to create a new product closely resembling HiQ's analytical product Skill Mapper and to use its platform users' data for it. In effect, LinkedIn was accused of deliberately interfering with others' contractual relations by erecting technical barriers to HiQ's bots accessing publicly available data on its platform, which is impermissible (tortious interference with contract) and subject to a court injunction. In other words, HiQ built its legal position on the provisions of tort law, while also invoking the public significance of the case.
Objecting, LinkedIn pointed out that HiQ is an analytics company, not a data collection company, and could use data sources other than LinkedIn. LinkedIn also stressed that legalizing web scraping threatens the privacy of LinkedIn users and thereby puts the goodwill of LinkedIn Corporation itself at risk. LinkedIn's main argument was that its letter prohibiting HiQ from the automated copying of data, by virtue of the provisions of the CFAA, deprived HiQ of any further lawful access to the data of LinkedIn platform users. In other words, while defending itself against the charge of intentional interference with contractual relations, LinkedIn in turn accused HiQ of seeking unauthorized access to computer information, a very serious offense under the CFAA.
Thus, although the main issue in both cases was whether the automated collection of publicly available information from an Internet resource is permissible, the analyzed cases can hardly be called similar.
Importantly, the American courts gave an unambiguous answer to this main question, an answer that appears justified and entirely correct. To present it, it is worth briefly outlining the position of the court of appeals, which, rejecting LinkedIn's arguments, upheld the decision of the court of first instance. According to the court of appeals, the CFAA contemplates three types of computer information:
1) information to which access is open to the public and requires no permission;
2) information to which access requires permission, and it was given;
3) information to which access requires permission, but it was not given or the limits of authorized access were exceeded.
Public profiles of LinkedIn users, the court stressed, are available to anyone with an Internet connection, so they belong to the first type of information, and HiQ's activities therefore do not fall under the restrictions of the CFAA. The court's final conclusion:
“Giving companies like LinkedIn the freedom to decide who can collect and use data – data that companies don’t own, that is publicly available to everyone, and that these companies themselves collect and use – creates the risk of information monopolies that would violate public interests.”
The full text of the court's decision can be found here.