Google admits using public web data to train its AI services, including Bard, raising questions

The impact of this updated policy is significant, as it signifies Google's acknowledgment of using publicly available information for training its AI products.

Google using public web data to train its AI

Highlights

  • This approach raises intriguing questions concerning the compliance of Google's practices with global regulations like the GDPR
  • Many publicly accessible websites have policies in place that prohibit data collection or web scraping

In an update to its privacy policy on 1 July 2023, Google disclosed that it may use publicly available data obtained through web scraping to train its various AI services, including Bard and Cloud AI.

The update aims to provide transparency by clarifying that newer services like Bard are also covered by the practice of training AI models on publicly accessible information. Google spokesperson Christa Muldoon emphasised that privacy principles and safeguards are integral to the development of the company's AI technologies, in line with its AI Principles.

According to Google's privacy policy:

"We may collect information that's publicly available online or from other public sources to help train Google's AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities. Or, if your business's information appears on a website, we may index and display it on Google services."

Google

The revised privacy policy now explicitly states that Google employs such information to enhance its services, as well as to develop new products, features, and technologies that benefit users and the wider public.

Notably, the updated policy expands the scope beyond language models, referring to the utilisation of public data for 'AI Models,' granting Google greater freedom in training and constructing systems. However, specific measures to prevent copyrighted materials from being included in the data pool remain undisclosed.

AI training on public data raises questions about user privacy

This approach raises intriguing questions concerning the compliance of Google's practices with global regulations like the General Data Protection Regulation (GDPR). Many publicly accessible websites have policies in place that prohibit data collection or web scraping for the purpose of training large language models and other AI toolsets.

It remains to be seen how Google's approach will fare in light of these regulations, especially with regard to protecting individuals' data from being misused without explicit permission. The revised policy's wording now provides additional clarity on the services that will benefit from the collected data.

Notably, it moves away from the term "language models" and instead refers to "AI Models." By doing so, Google grants itself more flexibility in training and building systems. However, this particular information is buried within an embedded link for "publicly accessible sources" under the policy's "Your Local Information" tab, requiring users to actively seek it out.

The impact of this updated policy is significant, as it signifies Google's acknowledgment of using publicly available information for training its AI products. Nevertheless, the inclusion of copyrighted materials and compliance with data protection regulations remain key concerns.

Going forward, the implementation of these practices and their implications will undoubtedly attract scrutiny from privacy advocates and regulatory bodies.

Overall, Google's updated privacy policy aims to increase transparency regarding its AI training practices, expand the scope beyond language models, and emphasise the integration of privacy principles and safeguards. While the revised policy clarifies certain aspects, it also raises important questions about data usage, copyright protection, and adherence to global regulations.