February 24, 2024

Web Data Scraping for Generative AI Training

  • The free hand AI companies have had over publicly available web data is now being resisted by websites and companies demanding appropriate compensation for that data.
  • Websites are thus updating access to crawlers or cutting off access to data on their platforms altogether.
  • With no legal precedent favoring websites on web data scraping, the conflict presents an opportunity for the rise of a new business model based on publicly available data.

Machine learning models trained on accurate and reliable data form the bedrock of any artificial intelligence (AI) tool, as well as of a multitude of use cases that predate the rise of AI.

When ChatGPT emerged guns blazing on the web a year ago, it spurred the AI movement on a global scale, leading executives and experts to draw comparisons with some of the crucial technology milestones of decades past.

It is an exciting, if slightly disconcerting, time for the world indeed. Website proprietors, however, are not happy. The situation is conspicuously similar to when web search engines were taking off nearly three decades ago, which paved the way for Martijn Koster to create the now widely adopted Robots Exclusion Protocol in 1994.

A robots.txt file allows website owners to define crawler directives that control which parts of the content they serve can be accessed by web crawlers and, more importantly, to assuage concerns that crawler activity will overwhelm their servers.
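Python's standard library ships a parser for exactly these directives. The sketch below, using `urllib.robotparser`, shows how a well-behaved crawler checks a site's rules before fetching a URL; the robots.txt rules and the `example.com` URLs are illustrative, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt as a site owner might publish it (illustrative rules).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler consults the directives before each fetch.
print(parser.can_fetch("AnyBot", "https://example.com/public/page"))   # True
print(parser.can_fetch("AnyBot", "https://example.com/private/page"))  # False
```

Note that nothing technically forces a crawler to run this check: the Robots Exclusion Protocol is honored voluntarily, which is precisely why it is now under strain.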

However, here lies the difference between the concerns then and now: back then, owners worried crawlers would lead to traffic overload. The primary issues today are privacy, misuse, and copyright.

Privacy concerns are partly addressed by the fact that only publicly available data is crawlable; such data forms the basis of ecommerce, certain software, market research, and other commercial activities. Data scraped from social media, however, can be misused for malicious ends such as phishing, identity theft, and fraud, or fuel system infiltration attempts. Still, an argument can be made that users chose to share the data publicly, shifting liability away from the platforms.

Also, data scraping isn’t illegal.

Multiple court proceedings have upheld this, including hiQ Labs v. LinkedIn, where the judge dismissed LinkedIn's claims that hiQ violated the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), California Penal Code § 502(c), and the California common law of trespass.

Spiceworks News & Insights reached out to Dan Pinto, CEO and co-founder of Fingerprint, for his two cents. “AI scraping is a morally gray area because the value does not flow back to the original creator of the content. Businesses and creators are within their rights to demand accountability, but ultimately, they are responsible for taking proactive measures to safeguard their data from undesired scraping activities,” Pinto told Spiceworks.

See More: Breaking New Ground: A Dive Into Multimodal Generative AI

How the Rise of AI Necessitates a New Take on Web Scraping

In the past few years, publicly available data, including personal information and intellectual property, has been widely used to train AI tools such as ChatGPT, Midjourney, and Bard. The tendency of AI systems to replicate or misappropriate proprietary information is creating a furor among creators.

The laws about leveraging training data scraped from the web for large language and other models are largely undefined or nonexistent. So, the question arises: who is liable for the output generated by GenAI tools? AI developers/companies, the AI tool, or the creators?

This has resulted in several lawsuits in largely uncharted legal territory:

  • OpenAI sued for ChatGPT
  • GitHub sued for Copilot
  • Google sued for Bard, DuetAI, Imagen, and Gemini
  • Stability AI sued for Stable Diffusion by Getty Images (U.S.)

Further, companies and websites that have so far given free rein over their data are demanding appropriate compensation. Reddit's controversial change to its API pricing earlier this year may have hurt third-party developers at the outset. However, the company clarified that the objective was to shut the tap on the endless data being scraped and used to train GenAI models unless it is paid for.

Reddit, which is threatening to block Google and Bing web crawlers, brought the change on the heels of X (formerly Twitter) instituting API pricing tiers in March this year. Later in August, X also retired some legacy API endpoints.

Speaking from experience, X has a reasonably good search in place. Reddit doesn't; the discussion forum relies quite heavily on Google search. Blocking crawlers could thus negatively impact its traffic in other ways.

“Reddit should consider implementing other strategies to identify and block malicious bots without negatively impacting their discoverability on search engines. If they block well-behaved bots like Google and Bing, malicious bots will still find ways to pretend they’re legitimate visitors to collect data,” Pinto opined.

Pinto suggests the use of device intelligence solutions. He added, “Leveraging device intelligence can help businesses distinguish between sophisticated bots and legitimate website users by analyzing dozens of attributes such as IP addresses, browser version, VPNs, and operating systems. This information can help identify visitors with a history of bot-like behavior and other dubious devices.”
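The attribute-scoring approach Pinto describes can be sketched in a few lines. This is a deliberately simplified illustration, not any vendor's actual logic: the attribute names, point weights, and threshold behavior are all invented for this example.

```python
def bot_score(request: dict) -> int:
    """Return a 0-100 score for a request; higher means more bot-like.
    Attributes and weights are illustrative, not a real product's rules."""
    score = 0
    ua = request.get("user_agent", "").lower()
    if any(token in ua for token in ("bot", "crawler", "spider", "scraper")):
        score += 40  # user agent self-identifies as automation
    if request.get("headless_browser"):
        score += 30  # headless browser engines are rarely human visitors
    if request.get("requests_per_minute", 0) > 60:
        score += 20  # sustained high request rate
    if request.get("ip_on_denylist"):
        score += 10  # IP address previously seen misbehaving
    return min(score, 100)

# A likely scraper versus an ordinary visitor.
print(bot_score({"user_agent": "GPTBot/1.0", "requests_per_minute": 120}))  # 60
print(bot_score({"user_agent": "Mozilla/5.0", "requests_per_minute": 3}))   # 0
```

Real device intelligence products combine far more signals (browser fingerprints, TLS characteristics, behavioral history) and weigh them statistically, but the shape of the decision, scoring attributes and acting on a threshold, is the same.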

However, Reddit and X aren’t the only ones. Multiple news publishers, including CNN, BBC, The New York Times, Reuters, and others, have already restricted Common Crawl and OpenAI crawlers.

In fact, 25.9% of the top 1,000 websites blocked GPTBot, OpenAI’s new web crawler, as of September 22, 2023. Meanwhile, the Common Crawl Bot is blocked by 13.9%, 7% blocked ChatGPT-User, and 0.2% blocked Anthropic, according to data from Originality.ai. Widespread blocking of crawlers and data scraping bots means costs will rise.
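These per-crawler blocks are typically implemented as user-agent-specific robots.txt directives. A sketch of the kind of rules such publishers deploy follows; the crawler tokens (GPTBot, ChatGPT-User, CCBot, anthropic-ai) are the ones the respective vendors have documented, while the blanket `Disallow: /` paths are illustrative of a full block.

```
# Block AI-training crawlers site-wide (illustrative rules)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```

As noted above, these directives only deter compliant crawlers; a scraper that ignores robots.txt must be stopped by other means.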

“Data scrapers are constantly developing new methods to bypass website blocks, meaning companies must evolve their technology to stay ahead. This race will quickly escalate the complexity of automated bot operation and make data scrapers more difficult and expensive to operate, potentially stifling the data-scraping economy and forcing AI companies to develop new data-gathering approaches,” Pinto continued.

It will be interesting to see how data-guzzling demands for AI training shape the way companies weigh the trade-off between rising web scraping costs and monetary demands from websites.

Data Scraping Legislation

Pinto told Spiceworks that the ethical and legal considerations around data scraping should hinge on the actual use of the data. “There are many useful and ethical data scraping use cases. The ethics and legality ultimately come down to what is being done with the data and whether or not the original creator receives the appropriate value from their work, whether it’s money, recognition or content ownership,” Pinto said.

Even though data scraping on the web isn’t illegal per se, and there is little precedent for holding scrapers accountable, legally or otherwise, websites have the right to block access. And that’s all they can really do for now.

“Data scraping is an unresolved legislative issue. Many applicable laws have been unable to keep up with the rapid use and development of generative AI. Lawmakers are still grappling with regulations, policies, and best practices, but recent rulings point toward allowing bots to access any information that is openly available,” Pinto explained.

“If policy moves this way, companies must protect the information they don’t want scrapers to access. By incorporating device intelligence software, businesses can detect data-scraping bots and prevent access to protected content.”

How can companies collaborate on data scraping? Share your thoughts with us on LinkedIn.