This tutorial is for product managers, founders, and anyone else contributing data to an AI model. In most cases, you’ll need an engineering partner to build the model; however, this tutorial will help you prepare your data in a way your model will easily understand.
The outcome of this tutorial is a succinct prompt that you’ll enter into ChatGPT accompanied by your raw data set. There are trade-offs you need to consider for your specific project. This tutorial aims to inform you of those trade-offs.
In this tutorial we'll cover:
- Popular data transformation techniques
- Decisions and trade-offs you’ll have to make with how you handle your data
- How to prepare your data with ChatGPT as your sidekick
What do we mean by data transformation and data types?
Training your AI model requires information. And that information can’t simply be an export of raw data or collection of files. Part of training your model is showing it how it should structure your data.
Data transformation is the process of turning abstract categories and numbers into systematized numerical values that a model can ingest, analyze, and understand.
It matters because machine learning models need data in a consistent, numerical format to process it effectively. Transforming categorical and numerical variables into a standardized format ensures the model can easily interpret and learn from the data, leading to better performance and more accurate predictions. There are two data types we’ll focus on today: categorical and numerical.
Categorical variables
Countries, colours, and candy bar brands are categorical. When variables can be grouped into a fixed set of options, they’re categorical variables. Another clue is that they’re usually not represented by a number. For example, your data set might include a country column with values like singapore, united kingdom, canada, and argentina. This is referred to as categorical data.
Numerical variables
You probably guessed this one already. Numerical variables refer to data that is represented by numbers. Think: speed, height, weight, temperature, and many others.
How to encode categorical variables
To input categorical data into your model, you need to encode it. That is, you need to create a system that represents category variables as numbers. There are generally two approaches to this: label encoding and one-hot encoding.
Label encoding
Label encoding assigns a numerical value to each of the variables. Back to our country example: Afghanistan becomes 1, Albania becomes 2, Algeria becomes 3, and so on, with each country in the list represented by a single number.
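You won’t need to write code for this yourself, but it helps to see how small the transformation really is. Here’s a minimal sketch of label encoding with pandas (the column name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"country": ["singapore", "united kingdom", "canada", "argentina"]})

# Assign each unique country an integer code.
# pandas sorts string categories alphabetically, so argentina=0, canada=1,
# singapore=2, united kingdom=3.
df["country_encoded"] = df["country"].astype("category").cat.codes
print(df)
```

Note that the numbers are arbitrary labels: the model shouldn’t read "united kingdom = 3" as being three times anything.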
One-hot encoding
One-hot encoding is where each variable within a category, i.e. each country within the country category, gets its own column. Each row then has a 1 in the column matching its category and a 0 everywhere else.
Keep in mind that one-hot encoding can lead to high dimensionality when there are many categories, which can impact model performance and increase computational costs.
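Again, ChatGPT can do this for you, but here’s a minimal sketch of one-hot encoding with pandas so you can see the column-per-category shape (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["singapore", "canada", "canada"]})

# One column per country; 1 marks the row's country, 0 everywhere else.
one_hot = pd.get_dummies(df["country"], prefix="country", dtype=int)
print(one_hot)
```

Notice that three rows with only two distinct countries produce two columns; a data set with 195 countries would produce 195 columns, which is the dimensionality concern mentioned above.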
There are trade-offs with each approach.
Use label encoding when the categories have a clear ranking or order, like 'low', 'medium', 'high'. It's also a good choice when there are many categories, and when using models that make decisions based on rules, such as decision trees.
Use one-hot encoding for categories that don't have a natural order, such as colors or types of cars. It works well when there aren't too many different categories and is ideal for models that treat each category separately.
How to encode numerical variables
Even though some of your data is already represented as numbers, you still need to rescale it to a standardized range. You might have the heights (in cm) of a list of people: 180, 165, 191, 138, 136. You might also have each person’s local average temperature (in °F): 48, 76, 90, 28, 78.
While these are both sets of numbers, the value of person 4’s height has nothing to do with the value of person 3’s local average temperature. There are two approaches to scaling the numerical variables: normalization and standardization.
Normalization
Normalization finds the minimum and maximum values in your data set and assigns them 0 and 1 respectively; every other value falls in between. In our list of heights, 136 is the lowest, so it’s represented by 0, and 191 is the highest, so it’s represented by 1. A middle height like 165 would be represented by 0.53.
The equation is:
Normalized value = (value − min) / (max − min)
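Here’s what that calculation looks like in plain Python, using the height list from above (remember, ChatGPT will run the numbers for you; this is just to make the formula concrete):

```python
heights = [180, 165, 191, 138, 136]

# Min-max normalization: map the smallest value to 0 and the largest to 1.
lo, hi = min(heights), max(heights)
normalized = [round((v - lo) / (hi - lo), 2) for v in heights]

print(normalized)  # 165 maps to (165 - 136) / (191 - 136) ≈ 0.53
```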
Outliers can significantly impact normalization because they can dramatically extend the range between the minimum and maximum values, causing the majority of the data to be compressed into a smaller portion of the 0-1 scale. This can lead to a loss of information and reduced effectiveness of the normalization process.
Keep in mind, you’re not doing these calculations, ChatGPT will do them. In this tutorial, we’re simply getting informed about the different methods and trade-offs.
Standardization
Standardization uses the more traditional standard deviation approach: each transformed value represents how many standard deviations the original data point sits from the mean of the dataset. Using the same numbers as before, the height of 136 is represented by -1.18 and 191 is represented by 1.32.
The equation is:
Standardized value = (value − mean) / standard deviation
Normalization’s sensitivity to outliers is its main weakness: one extremely tall person or one extremely cold temperature skews the scale quite dramatically, while standardization absorbs outliers more gracefully. For this reason, standardization is often the preferred method of encoding numerical data.
Writing your data transformation prompt for ChatGPT
The following is a prompt that was tested with numerous data sets. It reliably produced the desired outcome of (1) a recommendation, and (2) successful execution of the work.
Include the raw data as an attached Excel file or CSV.
I have attached a spreadsheet containing various data attributes that I need to organize and preprocess for a machine learning project. The spreadsheet includes columns such as [example data]. I would like your help with the following tasks:
Organizing into a table: Help me structure these attributes into a clear table format that is suitable for machine learning. This should include assigning appropriate headers and organizing data rows.
Suggesting processing: Based on the variable types, provide recommendations on how to process the data, including whether to use label encoding or one-hot encoding for categorical data, and whether to use normalization or standardization to scale numerical variables.
Once I have received your suggestions, I will decide and then ask you to execute the work.
Thank you, I agree with your recommendation. Please execute the work. During the processing, make a note of any data points or scenarios that you weren’t sure about. List them in bullet points at the end of your output.
You’ve now transformed your abstract categorical and numerical values into numbers your model will understand. Happy training!
This tutorial was created by Jonah.