Web Scraping and AI: Playing with Fire
Web scraping is highly controversial – even more so in the age of artificial intelligence (AI).
Web scraping differs from screen scraping in that it extracts the underlying HTML code and the data stored behind it, whereas screen scraping copies only the pixels displayed on the screen.
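To make the distinction concrete, here is a minimal sketch of the web scraping side: because the scraper sees the underlying markup, it can pull out structured values directly instead of interpreting pixels. The HTML snippet and the "price" class name are invented for this illustration; a real scraper would fetch live pages and typically use a dedicated parsing library.

```python
from html.parser import HTMLParser

# Invented sample markup for illustration only.
SAMPLE_HTML = """
<html><body>
  <span class="price">19.99</span>
  <span class="price">4.50</span>
</body></html>
"""

class PriceExtractor(HTMLParser):
    """Collects the text of every element with class="price"."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Mark that the next text node belongs to a price element.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

parser = PriceExtractor()
parser.feed(SAMPLE_HTML)
print(parser.prices)  # the two price strings recovered from the markup
```

A screen scraper, by contrast, would only ever see the rendered pixels and would need OCR or template matching to recover the same two numbers.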
Over the years, data extraction methods have evolved. Developers began to write code to automate the process. Through machine learning and AI, web scraping has become increasingly sophisticated and efficient. As a result, it has become an important tool for companies to collect data for machine learning models, market research, competitive analysis and much more.
Good and bad side of web scraping
The negative aspects of web scraping are often emphasized; however, the technology also has its positive sides. It makes it possible, for example, to create and maintain a searchable index of websites, and it allows social media managers to measure sentiment on social networks.
Malicious bots, on the other hand, extract content from a website for purposes beyond the control of the website owner and often violate the terms of use. Competitors could, for example, tap into price information to gain a competitive advantage, or steal content to use themselves, and such duplicated content hurts the original site's SEO ranking. What is worrying in this context is that, according to our study, bad bot traffic now accounts for 67.5 percent of all internet traffic in Germany.
Legality of web scraping
Web scraping therefore sits in a legal gray area, which has led to court cases involving prominent parties, particularly in the USA. In 2009, for example, a US court found that Power Ventures had unlawfully accessed Facebook user data. In the landmark case hiQ Labs vs. LinkedIn, the Ninth Circuit Court of Appeals ruled in 2019 that web scraping of publicly accessible data on the internet is legal.
In Germany, the Federal Court of Justice did not classify web scraping as unlawful in 2014 as long as no “technical measures” intended to protect the data are circumvented. In 2020, the Cologne Higher Regional Court ruled that operators of online stores violate database copyright law if they extract and use data from a third-party online store for their own online store. However, according to the Higher Regional Court, scraping incidents do not automatically lead to compensation under the GDPR, but must always be considered on a case-by-case basis.
Web scraping in the age of AI
There is now a legal framework within which web scraping can be carried out lawfully. With the rapid development of AI, however, the debate about the technology's legality has flared up again, because web scraping contributes fundamentally to the training of large language models (LLMs). Models such as OpenAI's GPT-4 rely on vast amounts of data to learn and to produce coherent output.
One recent case is the lawsuit filed by the New York Times against Microsoft and OpenAI. The US newspaper argues that the copyright in millions of its articles has been infringed, because the companies allegedly used those articles to train their AI. The ruling in this case could set a precedent. If the court rules in Microsoft's favor, it would strengthen the camp that considers such data essential for these models to work at all. If the court rules in favor of the New York Times, many observers say it would be a victory for copyright and data protection.
Ethical implications could also pose a problem. The data collected for AI training could inadvertently contain private information about individuals, which the AI might then disseminate, putting those affected at risk. In addition, the data is used in a very opaque way, and data that has already been collected is difficult to remove.
Effective protection against web scraping
Since web scraping is already a legal gray area, AI will exacerbate the problem, because case law is not keeping up with developments. And even where case law exists, it will probably lag behind reality: AI is currently developing too quickly for that. The technology has therefore become a matter of playing with fire, because the line to illegality is thin. This is to the detriment of companies, whose data continues to be stolen by malicious actors.
Companies should therefore take measures to protect themselves against web scraping. A technical bot management solution that blocks unwanted scraping can help here. It should be able to protect all entry points, including websites, mobile applications and APIs, and it should take a multi-layered approach that includes machine learning models specifically trained to detect web scraping. This way, companies can protect business-critical traffic despite the challenging legal landscape and the evolution of AI.
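One building block of such a multi-layered defense is rate limiting. The sketch below is a naive in-memory sliding-window limiter that flags clients exceeding a request budget; the thresholds and the idea of keying on client IP are assumptions for illustration, and real bot management products combine many more signals (device fingerprints, behavioral ML models, and so on).

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds, not product recommendations.
WINDOW_SECONDS = 10
MAX_REQUESTS = 5

# client_ip -> timestamps of that client's recent requests
_history = defaultdict(deque)

def allow_request(client_ip, now=None):
    """Return True if the client is under its budget, False if throttled."""
    now = time.monotonic() if now is None else now
    q = _history[client_ip]
    # Drop timestamps that have aged out of the sliding window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # budget exhausted: likely automated traffic
    q.append(now)
    return True
```

In use, a burst of six requests from the same (hypothetical) address within one window would see the sixth request rejected, while ordinary human browsing stays well under the budget.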
Stephan Dykgers
is AVP DACH at Imperva.