
11 Linear Regression

Linear regression is a supervised machine learning algorithm that predicts a continuous value; it aims to minimize the distance between the data points and the line of best fit.
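
As a minimal, hypothetical sketch of this idea, the snippet below fits a straight line to a few made-up points with NumPy's polyfit, which finds the slope and intercept that minimize the sum of squared vertical distances (least squares); the data values are arbitrary and just for illustration.

import numpy as np

# Hypothetical data points, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Degree-1 polyfit returns the slope m and intercept b that minimize
# the sum of squared residuals between y and m*x + b
m, b = np.polyfit(x, y, deg=1)
print(f"slope: {m:.3f}, intercept: {b:.3f}")

# Predicted values on the line of best fit
y_hat = m * x + b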

Representing Linear Regression

The equation of a line is y = mx + b, where m is the slope and b is the y-intercept; with multiple features this generalizes to y = b + m₁x₁ + m₂x₂ + … + mₙxₙ.
Good plots for representing linear regression are scatter plots, joint plots, linear model plots and residual plots, as sketched below.
It's always good practice to also use a PairGrid to see the relationships between all the features.
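
A rough sketch of these plots with seaborn, assuming a hypothetical DataFrame df with columns 'x' and 'y' (substitute your own data and column names):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data, purely for demonstration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(scale=0.5, size=200)})

sns.scatterplot(x='x', y='y', data=df)   # scatter plot
sns.jointplot(x='x', y='y', data=df)     # joint plot
sns.lmplot(x='x', y='y', data=df)        # linear model plot (scatter + fitted line)
sns.residplot(x='x', y='y', data=df)     # residual plot
sns.PairGrid(df).map(plt.scatter)        # pair grid over all features
plt.show()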

Measures for Linear Regression

Common metrics for linear regression (a quick sketch of computing them follows this list) are:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Explained Variance Score (EVS)
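
As a rough illustration of what these metrics measure, they can be computed by hand with NumPy, assuming arrays y_true and y_pred of actual and predicted values (scikit-learn's metrics module, used later, gives the same results):

import numpy as np

# Hypothetical actual and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))               # Mean Absolute Error
mse  = np.mean(errors ** 2)                  # Mean Squared Error
rmse = np.sqrt(mse)                          # Root Mean Squared Error
evs  = 1 - np.var(errors) / np.var(y_true)   # Explained Variance Score
print(mae, mse, rmse, evs)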

Residuals

Residuals are the difference between the actual value and the predicted value.
Residuals are a good way to evaluate the performance of a model, and they should be normally distributed.
It's possible to represent residuals with a displot like this:

sns.displot((y_test-predictions),bins=50);
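
To complement the histogram, a Q-Q plot against a normal distribution is one way to check whether the residuals are roughly normally distributed; a sketch with SciPy, assuming y_test and predictions exist as in the example below:

import scipy.stats as stats
import matplotlib.pyplot as plt

# Points lying close to the reference line suggest roughly normal residuals
stats.probplot(y_test - predictions, dist="norm", plot=plt)
plt.show()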

Linear Regression with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the dataset and take a first look at it
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns

# Explore pairwise relationships and the distribution of the target variable
sns.pairplot(USAhousing)
sns.displot(USAhousing['Price'])

# Correlation heatmap (restrict to numeric columns on newer pandas versions)
sns.heatmap(USAhousing.corr(numeric_only=True))

# Features (X) and target (y)
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

from sklearn.model_selection import train_test_split

# Hold out 40% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

lm.fit(X_train,y_train)

# Intercept and per-feature coefficients of the fitted model
print(lm.intercept_)
print(lm.coef_)
# Each coefficient is the expected change in Price per unit increase in that feature
cdf = pd.DataFrame(lm.coef_,X.columns,columns=['Coeff'])

# Predict on the test set, compare predictions to actual prices,
# and check that the residuals look roughly normally distributed
predictions = lm.predict(X_test)

plt.scatter(y_test,predictions)

sns.displot((y_test-predictions),bins=50);

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('EVS:', metrics.explained_variance_score(y_test, predictions))
