How to prepare your data to train your AI model

Learn how to transform raw data into a format that trains your AI model effectively.


This tutorial is for product managers, founders, and anyone else contributing data to an AI model. In most cases, you’ll need an engineering partner to build the model; however, this tutorial will help you prepare your data in a way your model will easily understand.

The outcome of this tutorial is a succinct prompt that you’ll enter into ChatGPT accompanied by your raw data set. There are trade-offs you need to consider for your specific project. This tutorial aims to inform you of those trade-offs.

In this tutorial we'll cover:

  • Popular data transformation techniques
  • Decisions and trade-offs you’ll have to make with how you handle your data
  • How to prepare your data with ChatGPT as your sidekick

What do we mean by data transformation and data types?

Training your AI model requires information. And that information can’t simply be an export of raw data or collection of files. Part of training your model is showing it how it should structure your data.

Data transformation is crucial because machine learning models require data to be in a consistent, numerical format to process it effectively. By transforming categorical and numerical variables into a standardized format, we ensure that the model can easily interpret and learn from the data, leading to better performance and more accurate predictions.

Data transformation is the process of turning abstract categories and numbers into systematized numerical values that a model can ingest, analyze, and understand. There are two data types we’ll focus on today: categorical and numerical.

Categorical variables

Countries, colours, and candy bar brands are categorical. When the values in a column fall into a fixed set of groups, that column is a categorical variable. Another clue is that the values are often not represented by a number. Your data set might include a column with country data: Singapore, United Kingdom, Canada, Argentina. This is referred to as categorical data.

Numerical variables

You probably guessed this one already. Numerical variables refer to data that is represented by numbers. Think: speed, height, weight, temperature, and many others.

How to encode categorical variables

To feed categorical data into your model, you need to encode it. That means creating a system that represents each category with numbers. There are generally two approaches to this: label encoding and one-hot encoding.

Label encoding

Label encoding assigns a numerical value to each category. Back to our country example: each country is represented by a number, so our list might look like this:

  • Afghanistan: 1
  • Albania: 2
  • Algeria: 3
  • and so on
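For the engineering partner, label encoding can be sketched in a few lines of plain Python. The sample column and the choice to number the countries alphabetically from 1 are assumptions made for illustration; any consistent mapping would work.

```python
# Hypothetical sample column; the values are illustrative only.
countries = ["Albania", "Afghanistan", "Algeria", "Albania"]

# Give each distinct category an integer code, alphabetically, starting at 1.
label_map = {name: code for code, name in enumerate(sorted(set(countries)), start=1)}

# Replace every value in the column with its code.
encoded = [label_map[name] for name in countries]

print(label_map)  # {'Afghanistan': 1, 'Albania': 2, 'Algeria': 3}
print(encoded)    # [2, 1, 3, 2]
```

Keep the mapping alongside the encoded data so the numbers can be translated back into country names later.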

One-hot encoding

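One-hot encoding gives each value within a category its own column, i.e. each country within the country category gets a column. Each row has a 1 in the column for its own country and a 0 in every other column. So the columns would look more like this:

  Country       Afghanistan  Albania  Algeria
  Afghanistan   1            0        0
  Albania       0            1        0
  Algeria       0            0        1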

Keep in mind that one-hot encoding can lead to high dimensionality when there are many categories, which can impact model performance and increase computational costs.
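As a rough sketch of one-hot encoding in plain Python, assuming a small hypothetical country column: each row becomes a list with one slot per category, and the number of slots grows with the number of distinct categories, which is where the dimensionality cost comes from.

```python
# Hypothetical sample column; the values are illustrative only.
countries = ["Singapore", "Canada", "Argentina", "Canada"]

# One output column per distinct category, in alphabetical order.
categories = sorted(set(countries))

# Each row gets a 1 in its own category's column and 0 everywhere else.
one_hot_rows = [
    [1 if name == category else 0 for category in categories]
    for name in countries
]

print(categories)  # ['Argentina', 'Canada', 'Singapore']
for name, row in zip(countries, one_hot_rows):
    print(name, row)
# Singapore [0, 0, 1]
# Canada [0, 1, 0]
# Argentina [1, 0, 0]
# Canada [0, 1, 0]
```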

There are trade-offs with each approach.

Use label encoding when the categories have a clear ranking or order, like 'low', 'medium', 'high'. It's also a good choice when there are many categories and when using models that make decisions based on rules, such as decision trees.

Use one-hot encoding when dealing with categories that don't have a natural order, such as colors or types of cars. It works well when there aren't too many different categories and is ideal for models that treat each category separately.
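When the categories are ranked, it helps to spell out the order yourself rather than rely on alphabetical sorting, which would put 'high' before 'low'. A minimal sketch, assuming a hypothetical priority column:

```python
# The ranking is stated explicitly, lowest to highest.
order = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(order)}

# Hypothetical sample column; the values are illustrative only.
priorities = ["medium", "low", "high", "low"]
encoded_priorities = [rank[level] for level in priorities]

print(rank)                # {'low': 0, 'medium': 1, 'high': 2}
print(encoded_priorities)  # [1, 0, 2, 0]
```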

How to encode numerical variables

Even though some of your data is already represented as numbers, you still need to put it on a standardized scale. You might have the height (in cm) of a list of people: 180, 165, 191, 138, 136. Then you might also have each person’s local average temperature (in F): 48, 76, 90, 28, 78.

While these are both sets of numbers, they’re measured in different units and on different ranges, so the value of person 4’s height can’t be compared directly with the value of person 3’s local average temperature. Scaling puts both columns on a common footing. There are two approaches to scaling numerical variables: normalization and standardization.

Normalization

Normalization finds the minimum and maximum values in your data set and assigns them 0 and 1 respectively. All the other values fit in between. In our list of heights, 136 is the lowest value, so it’s represented by 0, and 191 is the highest, so it’s represented by 1. A middle height like 165 would be represented by 0.53.

The equation is:

Normalized value  =  (value − min) / (max − min)
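As a quick check of the arithmetic, here is a minimal sketch that applies min-max normalization to the tutorial's list of heights. The variable names are illustrative, and the results are rounded to two decimals for readability.

```python
heights_cm = [180, 165, 191, 138, 136]

lowest, highest = min(heights_cm), max(heights_cm)

# (value - min) / (max - min), rounded for display.
normalized = [round((h - lowest) / (highest - lowest), 2) for h in heights_cm]

print(normalized)  # [0.8, 0.53, 1.0, 0.04, 0.0]
```

The 165 cm height comes out as 0.53, matching the worked example above.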

Outliers can significantly impact normalization because they can dramatically extend the range between the minimum and maximum values, causing the majority of the data to be compressed into a smaller portion of the 0-1 scale. This can lead to a loss of information and reduced effectiveness of the normalization process.
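To see the compression effect, the sketch below normalizes the same heights with and without one extreme entry. The 300 cm value is a made-up outlier added purely for illustration, and `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(values):
    """Scale values to the 0-1 range using (value - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

heights_cm = [180, 165, 191, 138, 136]
with_outlier = heights_cm + [300]  # hypothetical outlier

scaled_clean = min_max_scale(heights_cm)
scaled_outlier = min_max_scale(with_outlier)

# Without the outlier, the original heights span the full 0-1 range.
print(max(scaled_clean))  # 1.0

# With the outlier, the same heights are squeezed into roughly the bottom third.
print(round(max(scaled_outlier[:-1]), 2))  # 0.34
```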

Keep in mind that you’re not doing these calculations yourself; ChatGPT will do them. In this tutorial, we’re simply getting informed about the different methods and trade-offs.

Standardization

Standardization expresses each value as the number of standard deviations it sits from the mean of the data set. Using the same heights as before, the mean is 162, so the 136 height is represented by -1.18 and 191 is represented by 1.32.

The equation is:

Standardized value  =  (value − mean) / (standard deviation)
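The same heights can be standardized with Python's built-in `statistics` module. This sketch assumes the population standard deviation (`pstdev`), since that is what reproduces the -1.18 and 1.32 quoted above; the sample standard deviation would give slightly different numbers.

```python
from statistics import mean, pstdev

heights_cm = [180, 165, 191, 138, 136]

mu = mean(heights_cm)       # 162
sigma = pstdev(heights_cm)  # population standard deviation, about 22.03

# (value - mean) / standard deviation, rounded for display.
standardized = [round((h - mu) / sigma, 2) for h in heights_cm]

print(standardized)  # [0.82, 0.14, 1.32, -1.09, -1.18]
```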

One of the main challenges of normalization is that it’s sensitive to outliers. One extremely tall person or one extremely cold temperature stretches the range, and the rest of the data is skewed quite dramatically. Standardization doesn’t force every value into a fixed 0-to-1 range, so it is often the preferred method of scaling numerical data when outliers are present.

Writing your data transformation prompt for ChatGPT

The following is a prompt that was tested with numerous data sets. It reliably produced the desired outcome of (1) a recommendation, and (2) successful execution of the work.

Include the raw data as an attached Excel file or CSV.

I have attached a spreadsheet containing various data attributes that I need to organize and preprocess for a machine learning project. The spreadsheet includes columns such as [example data]. I would like your help with the following tasks:

Organizing into a table: Help me structure these attributes into a clear table format that is suitable for machine learning. This should include assigning appropriate headers and organizing data rows.

Suggesting processing: Based on the variable types, provide recommendations on how to process the data, including whether to use label encoding or one-hot encoding for categorical data. And whether to use normalization or standardization to scale numerical variables.

Once I have received your suggestions, I will decide and then ask you to execute the work.

After ChatGPT replies with its recommendations, review them and send a follow-up prompt:

Thank you, I agree with your recommendation. Please execute the work. During the processing, make a note of any data points or scenarios that you weren’t sure about. List them in bullet points at the end of your output.
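Once ChatGPT returns the processed data, your engineering partner may want to spot-check the scaled columns. The sketch below is one possible sanity check, not part of the tested prompt; the helper names and tolerances are assumptions, and the standardization check assumes the population standard deviation was used.

```python
from statistics import mean, pstdev

def looks_normalized(values, tol=1e-9):
    """True if the smallest value is 0 and the largest is 1, within tol."""
    return abs(min(values)) <= tol and abs(max(values) - 1) <= tol

def looks_standardized(values, tol=1e-6):
    """True if the mean is 0 and the population standard deviation is 1, within tol."""
    return abs(mean(values)) <= tol and abs(pstdev(values) - 1) <= tol

# Example: rebuild both scaled columns from the tutorial's heights and check them.
heights_cm = [180, 165, 191, 138, 136]
low, high = min(heights_cm), max(heights_cm)
mu, sigma = mean(heights_cm), pstdev(heights_cm)

normalized_column = [(h - low) / (high - low) for h in heights_cm]
standardized_column = [(h - mu) / sigma for h in heights_cm]

print(looks_normalized(normalized_column))      # True
print(looks_standardized(standardized_column))  # True
print(looks_normalized(heights_cm))             # False: raw heights are not scaled
```

If the returned values were rounded, loosen `tol` accordingly.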

You’ve now transformed your abstract categorical and numerical values into numbers your model will understand. Happy training!

This tutorial was created by Jonah.
