
# Exploratory Data Analysis

In order to understand the data we are working with, it's necessary to do some exploratory data analysis.
The following list describes the process; a short code sketch for each step follows the list:

- Data Cleaning: Remove or fill missing values, remove outliers, fix structural errors, and correct typos. It's also important to drop useless features, such as an ID column, which has no meaning for the model. If the dataset has a timestamp column, it can be converted to a datetime object with pd.to_datetime(df['date']) (first sketch below).
- Statistical Summary: Use methods such as .describe().transpose(), .info(), .value_counts(), and .unique() to better understand the data (second sketch below).
- Data Visualization: Use plots such as histograms, boxplots, scatter plots, pair plots, and bar plots to explore distributions and relationships (third sketch below).
- Correlation: Use methods such as .corr() and .corrwith() to measure the correlation between features, and a heatmap to visualize the correlation matrix. It's common to use the target feature as the reference, e.g. df.corr()['price'].sort_values() (fourth sketch below).
- Feature Engineering: Create new features from the existing ones. For example, if the dataset has a timestamp column, new features such as year, month, day, day of week, and hour can be derived from it. Categorical features can be converted to numerical ones with pd.get_dummies(), so it's important to understand which features are categorical and which are numerical; the .info() method shows the data type of each feature. Keep in mind that a feature such as zipcode looks numerical but is actually categorical, because using it in a mathematical operation would be meaningless. Also consider that a categorical feature with many distinct labels, if not highly correlated with the target, can cause overfitting and slow down the model. Another trick is to convert a variable like year_renovated into a binary one (0 = no renovation, 1 = renovated) to make it more meaningful. Textual features require NLP techniques to be converted into numerical ones. Finally, the .apply() method can be used to derive new features from existing ones (fifth sketch below).
- Scaling: Use scalers such as StandardScaler() and MinMaxScaler() to scale the data. Scaling is important for algorithms that use distance as a metric (e.g. KNN) and when using regularization (e.g. Ridge and Lasso).
- Train Test Split: Split the data into training and testing sets. Fit scalers and any transformations learned from the data on the training set only, to avoid data leakage; likewise, perform feature selection on the training set only, to avoid overfitting. By convention, a capitalized X represents the feature matrix and a lowercase y represents the target (final sketch below).
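
A minimal cleaning sketch. The file name (housing.csv) and the column names (id, date) are assumptions; adjust them to your dataset:

```python
import pandas as pd

# Hypothetical dataset: 'housing.csv' with 'id' and 'date' columns.
df = pd.read_csv('housing.csv')

# Drop features with no predictive meaning, such as an ID column.
df = df.drop('id', axis=1)

# Handle missing values: drop rows here; filling with a statistic also works,
# e.g. df['col'] = df['col'].fillna(df['col'].mean()).
df = df.dropna()

# Parse the timestamp column into a proper datetime object.
df['date'] = pd.to_datetime(df['date'])
```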
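A statistical summary sketch using the methods named above; zipcode is an assumed column:

```python
# Numeric summary (count, mean, std, quartiles), one feature per row.
print(df.describe().transpose())

# Data types and non-null counts per column.
df.info()

# Label frequencies and distinct labels of a single column.
print(df['zipcode'].value_counts())
print(df['zipcode'].unique())
```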
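A visualization sketch with seaborn; the price, bedrooms, and sqft_living columns are assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of the (hypothetical) target.
sns.histplot(df['price'], bins=50)
plt.show()

# Boxplot to spot outliers of a numeric feature across a categorical one.
sns.boxplot(x='bedrooms', y='price', data=df)
plt.show()

# Scatter plot of two numeric features.
sns.scatterplot(x='sqft_living', y='price', data=df)
plt.show()
```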
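A correlation sketch, again assuming price is the target:

```python
# Correlation of every numeric feature with the target, sorted.
corr = df.corr(numeric_only=True)  # numeric_only skips non-numeric columns
print(corr['price'].sort_values())

# Heatmap of the full correlation matrix.
sns.heatmap(corr, annot=True, cmap='viridis')
plt.show()
```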
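A feature engineering sketch covering the tricks above; yr_renovated, zipcode, and yr_built are assumed columns:

```python
# Datetime components from the parsed 'date' column.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Binary flag from a year-like column (assuming 0 means "never renovated").
df['renovated'] = (df['yr_renovated'] > 0).astype(int)

# One-hot encode a categorical-in-disguise column such as zipcode.
df = pd.get_dummies(df, columns=['zipcode'], drop_first=True)

# .apply() for an arbitrary derived feature (hypothetical decade built).
df['decade_built'] = df['yr_built'].apply(lambda y: (y // 10) * 10)
```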
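Finally, a split-then-scale sketch with scikit-learn, showing why splitting comes first: the scaler is fitted on the training set only, so no statistics from the test set leak into training. The target name (price) is an assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# X: feature matrix, y: target vector.
# The raw 'date' column is dropped because scalers expect numeric input.
X = df.drop(['price', 'date'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit on the training set only, then transform both sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```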
