AI: How Safe is Your Data?
Posted April 3, 2024
Is your personal or proprietary data safe with AI services?

When I speak with others about AI, one of the biggest concerns they raise is the safety of their data. The most common questions I get are:
- Is my public data being used to train AI?
- Is ChatGPT using the data I give it to train itself?
- How can I protect my data?
In this article, I'll try to clear up some of that confusion and explore:
- Where AI services get their data
- How you can prevent AI services from using your personal information for training
- How you can safely use AI services with your data
First Things First: ChatGPT Is Not Safe
Since some of you might be reading this to see if the information you submit to ChatGPT is safe, let me be very clear: it is not. As OpenAI itself states on its security page:
Will OpenAI use my content to improve models and services?
Data submitted through the OpenAI API is not used to train OpenAI models or improve OpenAI's service offering. Data submitted through non-API consumer services ChatGPT or DALL·E may be used to improve our models.
If you are using ChatGPT with personal or business information, stop. Don't give it your documents. Don't give it your code. Don't have it proofread your documents, emails, or presentations. Companies like Samsung learned this the hard way and banned their employees from using it at all. Some branches and departments of the government have done so as well. If you find that ChatGPT is basing its output on your proprietary information, send a deletion request to OpenAI.
As the saying goes: if a product is free, you are the product.
The good news is that anything that uses OpenAI's API, which has access to the same AI models that ChatGPT does, has much stronger data privacy protections. If you are making your own service that uses OpenAI's API or are using a service that does, OpenAI states that it will not use the data you give it for training purposes.
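To make the distinction concrete, here's a minimal sketch of what calling the API directly looks like in Python. It assumes the openai package (v1 or later) is installed and an API key is set in the OPENAI_API_KEY environment variable; the model name and prompt are just placeholders.

# Minimal sketch: calling the OpenAI API directly instead of the ChatGPT interface.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this quarterly report: ..."},
    ],
)
print(response.choices[0].message.content)

Because this request goes through the API rather than the ChatGPT interface, it falls under the stronger data-use policy quoted above.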
Part One: The Data You Don't Voluntarily Give AI
Where Do AI Services Get Their Data?
When ChatGPT answers a question, where did it learn the answer? OpenAI collects data from various public sources to create a gigantic training dataset. This data comes from the internet, publications, public research, and even public code repositories. OpenAI also asks the community to contribute data. For the most part, this training data isn't based on live, up-to-the-moment information. Think of this training data like a giant printed encyclopedia. At the time of publication, the newest GPT model's training data cut off in December 2023.
Just like search engines, OpenAI has its own web crawler, called GPTBot. Any information that is public is fair game for AI training. This means that the information on your company's public website can be (and probably is) used for training. Does your company have a file store that isn't protected? That can also become training data. Do you have a LinkedIn profile? Your resume just might be used.
As you can imagine, intellectual property is included in this training data, often without its owners' knowledge. Books, newspaper articles, poems, lyrics, scripts - if it's public, it could be used to train AI. This has subjected OpenAI to lawsuits from newspapers, studios, and artists. This is uncharted territory with regard to fair use, and it will take years for the courts to sort it out.
Open-source code repositories on GitHub are used to train AI assistants for software development, like GitHub Copilot. Things can go wrong when repositories that should be private are actually public. Even Microsoft, OpenAI's largest investor, has accidentally exposed terabytes of code and data through a misconfigured public repository.
How to Protect Your Data
Practice good data hygiene
The first thing you should do, and this isn't particular to AI, is make sure that data you want to be private actually is private. It's important not to practice security by obscurity here - if a resource can be reached over the public internet without authentication, assume it isn't safe.
- Make sure your resources (pages, documents, media) that should be protected are behind some kind of authentication mechanism.
- Make sure your code repositories that aren't open source are private.
Protect your public information
If you would like to prevent OpenAI from training its LLMs on your public data, you can edit the robots.txt file on your website to disallow GPTBot. This is the same file you would use to make such requests of search engines.
User-agent: GPTBot
Disallow: /
This can also be done on a per-directory basis, as shown below.
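For example, to block GPTBot from a hypothetical /private/ directory while leaving the rest of the site crawlable:

User-agent: GPTBot
Disallow: /private/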
Make sure public information about you is accurate
As AI services get much of their information from the web, perform a web search for yourself or your company and make sure the results are accurate. This can sometimes be out of your control, especially when false statements live on sites where you can't moderate the content.
Part Two: The Data You Do Voluntarily Give AI
That being said, generative AI really shines when you can safely give it access to some of your data. It can quickly find information, summarize documents, and spot patterns. Remember, the training data available to AI models is public information, so if your questions involve specific information a model doesn't possess, you must provide it yourself.
Since I already gave you the flashing red warning about ChatGPT, I'll assume that if you want to do this, you will use an AI service with a good privacy and security policy. As previously mentioned, OpenAI's API, Azure OpenAI, and HuggingFace have good privacy policies. For instance, here is an excerpt from Microsoft Azure OpenAI's privacy policy:
Your prompts (inputs) and completions (outputs), your embeddings, and your training data:
- are NOT available to other customers.
- are NOT available to OpenAI.
- are NOT used to improve OpenAI models.
- are NOT used to improve any Microsoft or 3rd party products or services.
- are NOT used for automatically improving Azure OpenAI models for your use in your resource (The models are stateless, unless you explicitly fine-tune models with your training data).
- Your fine-tuned Azure OpenAI models are available exclusively for your use.
- The Azure OpenAI Service is fully controlled by Microsoft; Microsoft hosts the OpenAI models in Microsoft's Azure environment and the Service does NOT interact with any services operated by OpenAI (e.g. ChatGPT, or the OpenAI API).
This is a great example of what you should look for in a privacy policy for an AI service provider:
- Your data is yours and only you (or those who you give permission to) can access it
- Your data is not used to train AI models
- Your data is not used for product improvement or analytics
- Only you can use the things you built
Whichever service you choose, whether directly from an AI service provider or from a product built on one, review its privacy policy and make sure it adheres to these standards.
The Ways You Can Provide an AI Model with Your Data
In the prompt
In simple terms, a prompt tells an AI model what its purpose is and what you want from it. It can also include information that the AI needs to know to accomplish its task. This is called giving a prompt context.
Take this simple prompt for example.
Identity: You are an AI assistant that answers questions about Josh Greenwald.
[Information about Josh]
- Josh was born in Florida
- Josh can slam dunk a basketball from a standing start
- Josh makes the best chocolate chip cookies
User question: {{$question}}
Answer:
Without this context, OpenAI probably doesn't know that I have superhuman strength or that I'm a great baker (disclaimer: only one of those things is actually true).
In more advanced scenarios, you can provide context to a prompt from an API endpoint that supplies custom data. For example, here is a prompt that calls a Semantic Kernel plugin to get product data:
Identity: You answer questions about the top selling products from Josh's store.
[Product information]
{{getTopSellingProducts()}}
User question: {{$question}}
Answer:
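The double-brace syntax above is Semantic Kernel's templating, but the underlying mechanics are simple string substitution. Here's a rough Python sketch of the same pattern, where get_top_selling_products is a hypothetical stand-in for the plugin:

# Rough sketch: injecting live data into a prompt before sending it to a model.
# get_top_selling_products() is a hypothetical stand-in for the Semantic Kernel plugin.
def get_top_selling_products() -> str:
    # In a real service, this would call your store's API or database.
    return "1. Widget Pro\n2. Widget Mini\n3. Widget Max"

def build_prompt(question: str) -> str:
    return (
        "Identity: You answer questions about the top selling products from Josh's store.\n"
        "[Product information]\n"
        f"{get_top_selling_products()}\n"
        f"User question: {question}\n"
        "Answer:"
    )

print(build_prompt("What is the best-selling widget?"))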
From a Document
Some services allow you to upload or link to resources for data, Copilot Builder being one of them. When you upload a document, the data in it is converted into vector embeddings so that AI models can use it more efficiently and effectively. Some services let you do this once; others store the file and vector data for later use. In either case, make sure you understand what is being done with this data and that you have the option of permanently removing it whenever you need to.
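To give a sense of what that embedding step looks like, here's a hedged sketch using OpenAI's embeddings endpoint; the chunking and model name are my own assumptions, and real services layer vector storage and retrieval on top:

# Sketch: turning document text into embeddings (vectors) for later retrieval.
# Chunking and model name are illustrative; real services add vector storage on top.
from openai import OpenAI

client = OpenAI()
chunks = ["First section of the document...", "Second section of the document..."]

response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in response.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))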
From Media Files
Models like GPT-4 Vision and Azure AI Vision can analyze images, producing descriptions of objects, scenes, people, and colors.
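As a rough illustration (the model name and image URL are placeholders), an image-analysis request through the OpenAI API might look like this:

# Sketch: asking a vision-capable model to describe an image.
# The model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the objects and colors in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)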
Some models can transcribe audio files as well, like OpenAI Whisper or Azure Speech.
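Transcription follows a similar pattern. Here's a minimal sketch using OpenAI's Whisper endpoint, assuming a local audio file named meeting.mp3:

# Sketch: transcribing a local audio file with OpenAI's Whisper API.
# The filename is a placeholder.
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(transcript.text)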
As with document uploads, make sure you can remove your media files if you so choose.
Trust vs. Convenience
Even with robust security and privacy policies, is your data really safe? You might have questions like:
- Can we really trust companies to adhere to their policies?
- How secure are these platforms?
- What happens in the case of a data leak?
A certain amount of trust and faith is required when using any vendor that has access to your data, whether it be OpenAI, Amazon, or practically anywhere you use a credit card. What it comes down to is risk vs. reward, and it's up to you to decide whether that trade-off is worth it.
The Near Future: Local AI and Small Language Models
The most popular AI models, such as the GPT variants, run in the cloud because of the massive amount of computing power required to run them (their enormous size is where we get the large in Large Language Model). But what if there were an alternative that didn't involve sending potentially sensitive information to third parties?
This is where local AI comes into play, and it's coming sooner than you might think. We're already seeing AI-optimized hardware thanks to chips from NVIDIA and Intel, which will allow you to run smaller, more focused AI models in-house. As AI models become leaner and hardware more powerful, this type of computation will become much more common.
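To give a feel for what running a model in-house can look like today, here's a minimal sketch using the Hugging Face transformers library; the model choice is just one example of a small language model, and you'll need enough RAM or GPU memory to hold it:

# Sketch: running a small language model locally with Hugging Face transformers.
# The model name is one example among many small open models.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/phi-2")
result = generator("Summarize the benefits of local AI models:", max_new_tokens=100)
print(result[0]["generated_text"])

Nothing here leaves your machine, which sidesteps the trust questions above entirely.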
Summary
In this article, we've explored:
- What to do about your data that you don't want AI services to access
- How to handle your data that you do want AI services to access
- What's coming in the near future with small language models
As always, stay vigilant and beware of the AI snake oil salespeople on LinkedIn!