Bias Testing in AI: A Practical Guide with Python and Fairlearn


Introduction

Artificial Intelligence (AI) increasingly shapes decisions that affect people's lives, from loan approvals to hiring and medical diagnoses. However, AI models can unintentionally exhibit bias against certain groups based on attributes such as gender, race, or age.

Bias testing helps verify that these models behave fairly and in line with ethical standards. In this article, we will explore how to detect and mitigate AI bias using Fairlearn, an open-source Python library.


What is Bias Testing?

Bias testing in AI refers to the evaluation of a model’s performance across different demographic groups to detect unfair or skewed outcomes.

For example:

  • If a loan approval model systematically favors men over women despite similar financial profiles, it indicates gender bias.

The goal is to measure and quantify such discrepancies to ensure fair model behavior.
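
To make "performance across different demographic groups" concrete, here is a minimal sketch in plain Python that splits predictions by group and computes accuracy and selection rate (the share of positive decisions) for each one. The helper name by_group and the toy data are hypothetical and only illustrate the idea; they are not part of Fairlearn.

```python
from collections import defaultdict


def by_group(y_true, y_pred, groups):
    """Return {group: {"accuracy": ..., "selection_rate": ...}}."""
    buckets = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        buckets[group].append((truth, pred))

    report = {}
    for group, pairs in buckets.items():
        n = len(pairs)
        report[group] = {
            # Share of predictions that match the true label
            "accuracy": sum(t == p for t, p in pairs) / n,
            # Share of applicants the model approves
            "selection_rate": sum(p for _, p in pairs) / n,
        }
    return report


# Hypothetical toy data: 1 = approved, 0 = rejected
y_true = [1, 1, 0, 1, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]
gender = ["Male"] * 4 + ["Female"] * 4

report = by_group(y_true, y_pred, gender)
for group, metrics in report.items():
    print(group, metrics)
```

In this toy data both groups have identical true labels, yet the model approves 75% of the men and only 25% of the women. That is the kind of discrepancy bias testing is meant to surface.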


Tool Spotlight: Fairlearn

Fairlearn is an open-source Python library developed to assess and improve the fairness of AI systems. It provides:

  • Metrics for bias detection

  • Algorithms for bias mitigation

  • Visualizations to understand disparities


Example 1: Detecting Bias in a Simulated Loan Approval Model

Let’s create a synthetic loan approval dataset and test for bias.

Step 1: Set Up the Environment

pip install fairlearn scikit-learn pandas matplotlib

Step 2: Create and Train a Model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from fairlearn.metrics import MetricFrame, selection_rate

# Generate a synthetic dataset
np.random.seed(0)
n_samples = 500
gender = np.random.choice(['Male', 'Female'], size=n_samples)
income = np.random.normal(50000, 15000, size=n_samples)
credit_score = np.random.normal(700, 50, size=n_samples)

# Target variable: approve when income + credit score clears a cutoff,
# then flip roughly 10% of female applicants to "rejected" to inject bias
approved = ((income + credit_score) > (55000 + 700)).astype(int)
approved = np.where((gender == 'Female') & (np.random.rand(n_samples) < 0.1), 0, approved)

# Create DataFrame
df = pd.DataFrame({
    'Gender': gender,
    'Income': income,
    'CreditScore': credit_score,
    'LoanApproved': approved
})

# Encode Gender (0 = Male, 1 = Female)
df['GenderCode'] = (df['Gender'] == 'Female').astype(int)

# Train/test split
X = df[['Income', 'CreditScore', 'GenderCode']]
y = df['LoanApproved']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Step 3: Detect Bias with Fairlearn

# Analyze fairness: MetricFrame evaluates each metric separately per group
sensitive_features = X_test['GenderCode']

metric_frame = MetricFrame(
    metrics={
        # Plain scikit-learn accuracy; MetricFrame handles the per-group split
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

print("Fairness Metrics:")
print(metric_frame.by_group)

# Visualize
metric_frame.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender')
plt.ylabel('Selection Rate')
plt.xticks(ticks=[0, 1], labels=['Male', 'Female'])
plt.show()

Illustrative output (exact values vary with the random seed and library versions):

Gender   Accuracy   Selection Rate
Male     0.88       0.81
Female   0.70       0.55

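Female applicants have a noticeably lower selection rate than male applicants (0.55 vs. 0.81), and the model is also less accurate for them (0.70 vs. 0.88). Both gaps point to bias.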
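
A gap like this is usually condensed into one number. The sketch below, in plain Python, turns the two selection rates reported above into a difference and a ratio. The helper name disparity is hypothetical. Fairlearn provides demographic_parity_difference and demographic_parity_ratio in fairlearn.metrics for computing comparable summaries directly from predictions.

```python
def disparity(rates):
    """Summarize the spread of per-group rates as a difference and a ratio."""
    highest = max(rates.values())
    lowest = min(rates.values())
    return {
        # 0.0 means all groups are selected at the same rate
        "difference": highest - lowest,
        # 1.0 means all groups are selected at the same rate
        "ratio": lowest / highest if highest > 0 else float("nan"),
    }


# Selection rates as reported in the output above
selection_rates = {"Male": 0.81, "Female": 0.55}

summary = disparity(selection_rates)
print(f"difference = {summary['difference']:.2f}")
print(f"ratio      = {summary['ratio']:.2f}")
```

For these rates the difference is 0.26 and the ratio is about 0.68: female applicants are approved roughly two thirds as often as male applicants.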


Example 2: Mitigating Bias Using ThresholdOptimizer

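To address this bias, we can use Fairlearn’s ThresholdOptimizer, a post-processing method that applies different decision thresholds to different groups. Because the thresholds are group-specific, it needs the sensitive feature at prediction time as well as at fit time.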

from sklearn.metrics import accuracy_score
from fairlearn.postprocessing import ThresholdOptimizer

# Wrap the already-trained model; only the decision thresholds are learned
threshold_optimizer = ThresholdOptimizer(
    estimator=model,
    constraints="demographic_parity",
    predict_method="predict_proba",
    prefit=True
)

# Fit the thresholds on the training split so the test split stays held out
threshold_optimizer.fit(
    X_train, y_train,
    sensitive_features=X_train['GenderCode']
)

# Predict using mitigated model; predictions can be randomized, so fix the seed
y_pred_mitigated = threshold_optimizer.predict(
    X_test,
    sensitive_features=sensitive_features,
    random_state=0
)

# Evaluate fairness again
metric_frame_mitigated = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred_mitigated,
    sensitive_features=sensitive_features
)

print("Fairness Metrics After Mitigation:")
print(metric_frame_mitigated.by_group)

# Visualize
metric_frame_mitigated.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender (After Mitigation)')
plt.ylabel('Selection Rate')
plt.xticks(ticks=[0, 1], labels=['Male', 'Female'])
plt.show()

Result:
After mitigation, selection rates for males and females become more balanced, at the cost of a slight decrease in overall accuracy.

🔍 Bias Testing Results (Before vs After Mitigation)

Metric           Group    Before Mitigation   After Mitigation
Selection Rate   Male     0.81                0.74
Selection Rate   Female   0.55                0.72
Accuracy         Male     0.88                0.80
Accuracy         Female   0.70                0.76

✅ Interpretation:

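  • Selection rates are now almost equal across genders: the gap shrinks from 0.26 to 0.02, which indicates less bias.

  • Accuracy for male applicants drops from 0.88 to 0.80 while accuracy for female applicants rises from 0.70 to 0.76, so the groups are treated more evenly at a modest cost to one of them.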
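
To build intuition for why group-specific thresholds even out selection rates, here is a deliberately simplified sketch in plain Python. It is not Fairlearn's algorithm: ThresholdOptimizer solves an optimization problem that can use randomized thresholds and also weighs predictive performance. This version only picks, for each group, the cutoff that approves a fixed share of applicants. The scores, the target rate, and the helper name threshold_for_rate are all hypothetical.

```python
def threshold_for_rate(scores, target_rate):
    """Pick the score cutoff that approves roughly `target_rate` of `scores`."""
    ranked = sorted(scores, reverse=True)
    n_selected = round(target_rate * len(ranked))
    if n_selected == 0:
        return float("inf")  # approve nobody
    return ranked[n_selected - 1]  # lowest score that is still approved


# Hypothetical predicted approval probabilities per group
scores = {
    "Male": [0.91, 0.84, 0.77, 0.62, 0.58, 0.41, 0.33, 0.20],
    "Female": [0.72, 0.66, 0.49, 0.45, 0.38, 0.31, 0.22, 0.10],
}

# One shared cutoff of 0.5 produces unequal selection rates
shared = {g: sum(x >= 0.5 for x in s) / len(s) for g, s in scores.items()}

# Group-specific cutoffs chosen to approve half of each group
target_rate = 0.5
thresholds = {g: threshold_for_rate(s, target_rate) for g, s in scores.items()}
rates = {g: sum(x >= thresholds[g] for x in s) / len(s) for g, s in scores.items()}

print("shared cutoff 0.5:", shared)
print("group thresholds: ", thresholds)
print("resulting rates:  ", rates)
```

With the shared cutoff the toy model approves 62.5% of the men and 25% of the women. With group-specific cutoffs (0.62 and 0.45) both groups land at 50%.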


Real-World Application: Adult Income Dataset

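The Adult Income dataset (also known as Census Income) is a widely used real-world benchmark for bias testing. The task is to predict whether a person earns more than $50K a year.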

Quick Example:

from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score

# Load Adult dataset from OpenML and drop rows with missing values
adult = fetch_openml(data_id=1590, as_frame=True)
df = adult.frame.dropna()

# Target and sensitive feature
X = df.drop(columns=["class"])
y = (df["class"] == ">50K").astype(int)
sensitive = df['sex']

# One-hot encode categorical columns and split
X_encoded = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test, sensitive_train, sensitive_test = train_test_split(
    X_encoded, y, sensitive, test_size=0.3, random_state=42
)

# Train model
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fairness evaluation: accuracy and selection rate per value of 'sex'
metric_frame = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_test
)

print("Adult Dataset Bias Metrics:")
print(metric_frame.by_group)

metric_frame.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender (Adult Dataset)')
plt.ylabel('Selection Rate')
plt.show()

On real-world data as well, models often predict the higher income bracket more frequently for men than for women, which is why fairness testing matters.


Conclusion

As AI systems become more influential, fairness must be treated as a first-class concern alongside accuracy.
Tools like Fairlearn make it easier to measure, visualize, and mitigate bias in machine learning models.

By integrating fairness testing into your machine learning workflows, you can build more ethical, reliable, and trustworthy AI.
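
One lightweight way to integrate such a check into a workflow is a small gate that fails when the selection-rate gap exceeds a tolerance. The sketch below is plain Python. The function name check_selection_rate_gap and the 0.1 tolerance are hypothetical choices for illustration, not a recommended standard. The selection rates are the ones from the results table in this article.

```python
def check_selection_rate_gap(selection_rates, max_gap=0.1):
    """Return the largest gap between group selection rates.

    Raises AssertionError when the gap exceeds `max_gap`.
    """
    gap = max(selection_rates.values()) - min(selection_rates.values())
    if gap > max_gap:
        raise AssertionError(
            f"selection-rate gap {gap:.2f} exceeds tolerance {max_gap:.2f}"
        )
    return gap


# Selection rates from the results table in this article
before = {"Male": 0.81, "Female": 0.55}
after = {"Male": 0.74, "Female": 0.72}

print(f"after mitigation: gap = {check_selection_rate_gap(after):.2f}")

try:
    check_selection_rate_gap(before)
except AssertionError as err:
    print(f"before mitigation: {err}")
```

In a real pipeline the rates would come from a MetricFrame computed on held-out data, and the tolerance would be agreed on with the people accountable for the decision being automated.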


References

Image credit: Custom illustration generated using OpenAI's DALL·E, created specifically for this article.
