Understanding Logistic Regression


Aug 04, 2023

Logistic Regression is a popular machine learning algorithm that predicts the probability that an instance belongs to a particular class. Despite its name, it is used for classification, not regression. In this article, we will walk through how Logistic Regression works, step by step, to understand how it classifies data.

Step 1: Sigmoid Function

At the core of Logistic Regression lies the Sigmoid Function, also known as the Logistic Function. The Sigmoid Function transforms the output of the linear equation into a value between 0 and 1. The formula for the Sigmoid Function is:

f(z) = 1 / (1 + e^(-z))

where ‘z’ is the linear combination of input features and their respective weights.
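
As a minimal sketch of the formula above, the Sigmoid Function can be written in a few lines of plain Python. The function name `sigmoid` and the two-branch form are illustrative choices, not part of the original article; the second branch uses the algebraically equivalent form e^z / (1 + e^z) so that very negative inputs do not overflow.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use e^z / (1 + e^z), which is the same value
    # but never calls exp() on a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
# -5.0 0.0067
# 0.0 0.5
# 5.0 0.9933
```

Large negative inputs map close to 0, large positive inputs map close to 1, and z = 0 maps to exactly 0.5.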

Step 2: Hypothesis Function

The Hypothesis Function in Logistic Regression uses the Sigmoid Function to calculate the probability that an instance belongs to the positive class (usually represented as ‘1’). It is denoted by ‘hθ(x)’ and is defined as:

hθ(x) = f(θ^T * x) = 1 / (1 + e^(-θ^T * x))

Here, ‘x’ is the feature vector, ‘θ’ is the vector of weights, and ‘θ^T’ is the transpose of ‘θ’. The goal of training is to find values of ‘θ’ for which ‘hθ(x)’ is close to 1 for positive instances and close to 0 for negative ones.
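
The Hypothesis Function can be sketched as a dot product followed by the sigmoid. The weights and features below are made-up values for illustration, and the code assumes that the first entry of ‘x’ is a constant 1.0 so that the first weight acts as a bias term (a common convention that the article does not spell out).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T * x) for plain Python lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]  # hypothetical weights: bias -1.0, feature weight 2.0
x = [1.0, 0.5]       # x[0] = 1.0 is the assumed bias input

# z = -1.0 * 1.0 + 2.0 * 0.5 = 0.0, so the predicted probability is 0.5
print(hypothesis(theta, x))
```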

Step 3: Cost Function

To train the Logistic Regression model, we need a Cost Function that measures how well the model is performing. The most commonly used Cost Function for Logistic Regression is the Log Loss (also known as Cross-Entropy Loss). For a single instance, the Log Loss is defined as:

J(θ) = -[y * log(hθ(x)) + (1 - y) * log(1 - hθ(x))]

Here, ‘y’ is the true label (0 or 1) of the instance, and ‘hθ(x)’ is the predicted probability from the Hypothesis Function. The objective is to minimize the Log Loss across all training instances.
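
To see how the Log Loss behaves, here is a small sketch that evaluates it for a single instance and averages it over several. The helper names and the clipping constant `eps` are assumptions added for illustration; clipping keeps log(0) from occurring when a predicted probability is exactly 0 or 1.

```python
import math

def log_loss_single(y, p, eps=1e-15):
    """Log Loss for one instance with true label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_log_loss(ys, ps):
    """Average Log Loss over a list of labels and predicted probabilities."""
    return sum(log_loss_single(y, p) for y, p in zip(ys, ps)) / len(ys)

# A confident correct prediction is penalized lightly,
# a confident wrong prediction heavily.
print(round(log_loss_single(1, 0.9), 4))  # 0.1054
print(round(log_loss_single(1, 0.1), 4))  # 2.3026
print(round(mean_log_loss([1, 0], [0.9, 0.1]), 4))  # 0.1054
```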

Step 4: Gradient Descent

To minimize the Cost Function and find the optimal values of ‘θ’, we use the Gradient Descent algorithm. Gradient Descent iteratively updates the weights based on the gradient of the Cost Function. The update rule for the weights is:

θj := θj - α * ∂J(θ) / ∂θj

where ‘α’ is the learning rate, and ∂J(θ) / ∂θj is the partial derivative of the Cost Function with respect to the j-th weight.
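
The update rule can be sketched in plain Python. When the Log Loss is averaged over m training instances, the partial derivative works out to ∂J(θ) / ∂θj = (1/m) * Σ (hθ(x_i) - y_i) * x_ij, which is what the loop below computes. The toy dataset, the learning rate of 0.5, and the iteration count are arbitrary assumptions for illustration, and the first column of each row is a constant 1.0 acting as a bias input.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """h_theta(x) = sigmoid(theta^T * x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient_descent(X, y, alpha=0.5, n_iters=2000):
    """Batch gradient descent on the average Log Loss."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(n_iters):
        # Prediction error h_theta(x_i) - y_i for every training instance
        errors = [predict_proba(theta, row) - label for row, label in zip(X, y)]
        # Partial derivative of the average Log Loss for each weight
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        # Update all weights together: theta_j := theta_j - alpha * dJ/dtheta_j
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up): bias column plus one feature; larger values are labeled 1
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
print("weights:", theta)
print("P(y=1 | feature=0.5):", predict_proba(theta, [1.0, 0.5]))
print("P(y=1 | feature=4.0):", predict_proba(theta, [1.0, 4.0]))
```

Because this toy dataset is perfectly separable, the weights keep growing the longer the loop runs; in practice a regularization term or a stopping criterion is usually added.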

Step 5: Decision Boundary

Once the model is trained and the optimal weights are found, we can use the Hypothesis Function to make predictions. Since the Hypothesis Function outputs the probability of an instance belonging to the positive class, we need to set a threshold (usually 0.5) to classify the instance as either positive or negative. The decision boundary is the line or hyperplane that separates the two classes in the feature space. With a threshold of 0.5, it is the set of points where θ^T * x = 0, because f(0) = 0.5.
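
A short sketch of thresholding, using hypothetical weights rather than trained ones. With a bias input and a single feature, the 0.5 decision boundary reduces to the single point where θ0 + θ1 * x1 = 0, that is x1 = -θ0 / θ1. The function names and the choice to treat a probability equal to the threshold as positive are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(theta, x, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return 1 if p >= threshold else 0

theta = [-4.5, 2.0]  # hypothetical weights: bias -4.5, feature weight 2.0

# For a threshold of 0.5 the boundary is where theta^T * x = 0
boundary = -theta[0] / theta[1]
print("decision boundary at feature value:", boundary)  # 2.25

print(predict_label(theta, [1.0, 1.0]))                 # 0 (below the boundary)
print(predict_label(theta, [1.0, 3.0]))                 # 1 (above the boundary)
print(predict_label(theta, [1.0, 3.0], threshold=0.9))  # 0 (probability is about 0.82)
```

Raising the threshold makes the model more conservative about predicting the positive class, which shifts the boundary without retraining.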

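The following example trains and evaluates a Logistic Regression model with scikit-learn. Note that the Iris dataset has three classes, so scikit-learn fits a multiclass generalization of the binary model described above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix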

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
# (max_iter is raised above the default of 100 to give the solver room to converge)
model = LogisticRegression(max_iter=200)

# Train the model using the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Create a confusion matrix to evaluate the performance of the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Evaluating the Logistic Regression Model:

After training the Logistic Regression model, it is essential to evaluate its performance. Several metrics can be used to assess how well the model is making predictions. Let’s discuss some of the commonly used evaluation metrics:

1. Accuracy:
Accuracy is the most basic metric for classification tasks. It measures the percentage of correct predictions made by the model over the total number of instances. Accuracy can be misleading on imbalanced datasets, where one class dominates the other: a model that always predicts the majority class can still score highly. Accuracy is calculated as:
Accuracy = (True Positives + True Negatives) / Total Samples

2. Precision:
Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It indicates how many of the positive predictions were correct. Precision is valuable when the cost of false positives is high. It is calculated as:
Precision = True Positives / (True Positives + False Positives)

3. Recall (Sensitivity or True Positive Rate):
Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset. It indicates how well the model identifies positive instances. Recall is essential when the cost of false negatives is high. It is calculated as:
Recall = True Positives / (True Positives + False Negatives)

4. F1 Score:
The F1 Score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall. A high F1 Score indicates both high precision and high recall. It is calculated as:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

5. ROC Curve (Receiver Operating Characteristic):
The ROC curve is a graphical representation of the model’s performance across various classification thresholds. It plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at different threshold values. The area under the ROC curve (AUC-ROC) measures the model’s ability to distinguish between positive and negative classes: an AUC of 0.5 corresponds to random guessing, and an AUC of 1.0 to perfect separation.

6. Confusion Matrix:
The confusion matrix is a tabular representation that summarizes the model’s predictions against the true labels. It shows the number of true positives, true negatives, false positives, and false negatives. From the confusion matrix, other metrics like accuracy, precision, and recall can be calculated.

7. Log Loss (Cross-Entropy Loss):
Log Loss measures the error between predicted probabilities and actual class labels. It is used to evaluate the probabilities output by the Logistic Regression model. A lower log loss indicates better model performance.
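
The count-based metrics above can be sketched directly from their formulas for the binary case. The labels and predictions below are invented for illustration, and returning 0.0 when a denominator is zero is an assumption (libraries differ in how they handle that case).

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 Score computed from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))        # (3, 5, 1, 1)
print(classification_metrics(y_true, y_pred))  # accuracy 0.8; precision, recall, f1 all 0.75
```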

Choosing the appropriate evaluation metrics depends on the specific problem and the relative importance of false positives and false negatives. For instance, in medical diagnosis, recall is often more critical, because the goal is to identify as many true positive cases as possible, even at the cost of more false positives. On the other hand, in spam detection, precision might be more important, to avoid flagging legitimate emails as spam.
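
The trade-off between precision and recall can be made concrete by moving the classification threshold over a fixed set of predicted probabilities. The scores and labels below are invented for illustration, and the helper `precision_recall_at` is a hypothetical name, not a library function.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.35, 0.5, 0.8):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(threshold, precision, recall)
# 0.35 0.8 1.0
# 0.5 0.75 0.75
# 0.8 1.0 0.5
```

On this data, a low threshold catches every positive at the cost of a false alarm, while a high threshold eliminates false alarms but misses half of the positives.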

By understanding these evaluation metrics, data scientists can effectively assess the performance of their Logistic Regression models and make informed decisions to improve their predictions.

Conclusion

Logistic Regression is a simple yet effective algorithm for binary classification problems. By using the Sigmoid Function, the Hypothesis Function transforms the linear combination of input features into probabilities. The model is trained by minimizing the Log Loss with Gradient Descent; because the Log Loss of Logistic Regression is convex in the weights, a suitable learning rate moves the weights toward the global minimum rather than a local one. Logistic Regression serves as a fundamental building block in machine learning and finds application in various domains, including healthcare, finance, and natural language processing.
