Bias Testing in AI: A Practical Guide with Python and Fairlearn

Introduction
Artificial Intelligence (AI) is increasingly shaping decisions that affect people's lives — from loan approvals to hiring and medical diagnoses. However, AI models can sometimes unintentionally exhibit bias against certain groups based on attributes like gender, race, or age.
Bias testing helps verify that these models treat groups fairly and behave in line with ethical standards. In this article, we explore how to detect and mitigate AI bias using Fairlearn, an open-source Python library.
What is Bias Testing?
Bias testing in AI refers to the evaluation of a model’s performance across different demographic groups to detect unfair or skewed outcomes.
For example:
- If a loan approval model systematically favors men over women despite similar financial profiles, it indicates gender bias.
The goal is to measure and quantify such discrepancies to ensure fair model behavior.
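One simple way to quantify such a discrepancy is to compare approval rates between groups. The sketch below uses a small set of made-up decisions; the records and the helper name `selection_rate_by_group` are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

# Hypothetical loan decisions as (group, approved) pairs, made up for illustration.
decisions = [
    ("Male", 1), ("Male", 1), ("Male", 1), ("Male", 0),
    ("Female", 1), ("Female", 0), ("Female", 0), ("Female", 0),
]

def selection_rate_by_group(records):
    """Return the fraction of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in records:
        totals[group] += 1
        positives[group] += approved
    return {group: positives[group] / totals[group] for group in totals}

rates = selection_rate_by_group(decisions)
gap = max(rates.values()) - min(rates.values())

print(rates)                              # {'Male': 0.75, 'Female': 0.25}
print(f"Selection rate gap: {gap:.2f}")   # Selection rate gap: 0.50
```

A gap of zero would mean both groups are approved at the same rate; the larger the gap, the stronger the signal that the outcomes deserve a closer look.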
Tool Spotlight: Fairlearn
Fairlearn is an open-source Python library developed to assess and improve the fairness of AI systems. It provides:
- Metrics for bias detection
- Algorithms for bias mitigation
- Visualizations to understand disparities
Example 1: Detecting Bias in a Simulated Loan Approval Model
Let’s create a synthetic loan approval dataset and test for bias.
Step 1: Set Up the Environment
pip install fairlearn scikit-learn pandas matplotlib
Step 2: Create and Train a Model
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from fairlearn.metrics import MetricFrame, selection_rate
import matplotlib.pyplot as plt
# Generate a synthetic dataset
np.random.seed(0)
n_samples = 500
gender = np.random.choice(['Male', 'Female'], size=n_samples)
income = np.random.normal(50000, 15000, size=n_samples)
credit_score = np.random.normal(700, 50, size=n_samples)
# Target variable: approve when income plus credit score clears a cutoff,
# then inject a slight bias by forcing a rejection for roughly 10% of female applicants
approved = ((income + credit_score) > (55000 + 700)).astype(int)
approved = np.where((gender == 'Female') & (np.random.rand(n_samples) < 0.1), 0, approved)
# Create DataFrame
df = pd.DataFrame({
    'Gender': gender,
    'Income': income,
    'CreditScore': credit_score,
    'LoanApproved': approved
})
# Encode Gender (0 = Male, 1 = Female)
df['GenderCode'] = (df['Gender'] == 'Female').astype(int)
# Train/test split
X = df[['Income', 'CreditScore', 'GenderCode']]
y = df['LoanApproved']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
# Train logistic regression model (max_iter raised because the features are unscaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Step 3: Detect Bias with Fairlearn
# Analyze fairness. MetricFrame evaluates each metric separately per group,
# so it needs a plain metric such as scikit-learn's accuracy_score
sensitive_features = X_test['GenderCode']
metric_frame = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)
print("Fairness Metrics:")
print(metric_frame.by_group)
# Visualize
metric_frame.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender')
plt.ylabel('Selection Rate')
plt.xticks(ticks=[0, 1], labels=['Male', 'Female'])
plt.show()
Typical Output:
| Gender | Accuracy | Selection Rate |
| Male | 0.88 | 0.81 |
| Female | 0.70 | 0.55 |
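Female applicants have a markedly lower selection rate (0.55 versus 0.81 for males). This gap of 0.26 indicates that the model has picked up the bias built into the training labels.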
Example 2: Mitigating Bias Using ThresholdOptimizer
To address this bias, we can use Fairlearn’s ThresholdOptimizer, which applies different decision thresholds for different groups.
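To make the idea of group-specific thresholds concrete, here is a simplified sketch in plain Python. It is not how ThresholdOptimizer is implemented (Fairlearn solves an optimization problem and may randomize between thresholds); it only illustrates how a separate cutoff per group can equalize selection rates. The scores and the helpers `rate_at_threshold` and `threshold_for_rate` are hypothetical, and the sketch assumes there are no tied scores within a group.

```python
# Hypothetical model scores (probability of approval) for two groups.
scores = {
    "Male":   [0.9, 0.8, 0.7, 0.6, 0.4, 0.3],
    "Female": [0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
}

def rate_at_threshold(values, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(v >= threshold for v in values) / len(values)

def threshold_for_rate(values, target_rate):
    """Pick the cutoff that selects the top `target_rate` share of a group."""
    ranked = sorted(values, reverse=True)
    n_selected = round(target_rate * len(ranked))
    if n_selected == 0:
        return float("inf")  # no score reaches this cutoff, so nobody is selected
    return ranked[n_selected - 1]

# A single shared threshold yields unequal selection rates.
shared = {g: rate_at_threshold(v, 0.5) for g, v in scores.items()}

# Group-specific thresholds chosen so that each group selects 50%.
thresholds = {g: threshold_for_rate(v, 0.5) for g, v in scores.items()}
adjusted = {g: rate_at_threshold(v, thresholds[g]) for g, v in scores.items()}

print("Shared threshold 0.5:", shared)
print("Per-group thresholds:", thresholds)
print("Adjusted rates:", adjusted)
```

With the shared cutoff, four of six male scores but only three of six female scores pass. With cutoffs of 0.7 and 0.5 respectively, both groups end up with a selection rate of 0.5.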
from sklearn.metrics import accuracy_score
from fairlearn.postprocessing import ThresholdOptimizer
# Wrap the already-trained model in a ThresholdOptimizer
threshold_optimizer = ThresholdOptimizer(
    estimator=model,
    constraints="demographic_parity",
    predict_method="predict_proba",
    prefit=True
)
# Fit the thresholds on the training split so the test split stays unseen
threshold_optimizer.fit(
    X_train, y_train,
    sensitive_features=X_train['GenderCode']
)
# Predict using mitigated model (random_state makes the randomized thresholding repeatable)
y_pred_mitigated = threshold_optimizer.predict(
    X_test,
    sensitive_features=sensitive_features,
    random_state=0
)
# Evaluate fairness again
metric_frame_mitigated = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred_mitigated,
    sensitive_features=sensitive_features
)
print("Fairness Metrics After Mitigation:")
print(metric_frame_mitigated.by_group)
# Visualize
metric_frame_mitigated.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender (After Mitigation)')
plt.ylabel('Selection Rate')
plt.xticks(ticks=[0, 1], labels=['Male', 'Female'])
plt.show()
Result:
After mitigation, selection rates for males and females become more balanced, at the cost of a slight decrease in overall accuracy.
🔍 Bias Testing Results (Before vs After Mitigation)
| Metric | Group | Before Mitigation | After Mitigation |
| Selection Rate | Male | 0.81 | 0.74 |
| Selection Rate | Female | 0.55 | 0.72 |
| Accuracy | Male | 0.88 | 0.80 |
| Accuracy | Female | 0.70 | 0.76 |
✅ Interpretation:
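Selection rates are now almost equal across genders (0.74 versus 0.72), so the gap shrinks from 0.26 to 0.02.
Accuracy falls for males (0.88 to 0.80) and rises for females (0.70 to 0.76): the model gives up some accuracy on the previously favored group in exchange for a much smaller disparity.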
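The before-and-after comparison can be recomputed directly from the table values. The short sketch below does that arithmetic; the helper names `gap` and `ratio` are illustrative, and the numbers are copied from the results table rather than produced by a fresh run.

```python
# Per-group values copied from the results table.
results = {
    "selection_rate": {
        "before": {"Male": 0.81, "Female": 0.55},
        "after": {"Male": 0.74, "Female": 0.72},
    },
    "accuracy": {
        "before": {"Male": 0.88, "Female": 0.70},
        "after": {"Male": 0.80, "Female": 0.76},
    },
}

def gap(by_group):
    """Difference between the highest and lowest group value."""
    return max(by_group.values()) - min(by_group.values())

def ratio(by_group):
    """Lowest group value divided by the highest (1.0 means parity)."""
    return min(by_group.values()) / max(by_group.values())

for metric, stages in results.items():
    for stage, by_group in stages.items():
        print(f"{metric:15s} {stage:6s} gap={gap(by_group):.2f} ratio={ratio(by_group):.2f}")
```

Reporting both a difference and a ratio is a common choice because the two can rank disparities differently when the underlying rates are small.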
Real-World Application: Adult Income Dataset
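The Adult Income dataset, derived from US census data, is a widely used real-world benchmark for bias testing. The task is to predict whether a person's annual income exceeds $50K.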
Quick Example:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
# Load Adult dataset from OpenML and drop rows with missing values
adult = fetch_openml(data_id=1590, as_frame=True)
df = adult.frame.dropna()
# Target and sensitive feature
X = df.drop(columns=["class"])
y = (df["class"] == ">50K").astype(int)
sensitive = df['sex']
# One-hot encode categorical columns and split
X_encoded = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test, sensitive_train, sensitive_test = train_test_split(
    X_encoded, y, sensitive, test_size=0.3, random_state=42
)
# Train model
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Fairness evaluation with a plain per-group metric
metric_frame = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_test
)
print("Adult Dataset Bias Metrics:")
print(metric_frame.by_group)
metric_frame.by_group['selection_rate'].plot(kind='bar')
plt.title('Selection Rate by Gender (Adult Dataset)')
plt.ylabel('Selection Rate')
plt.show()
On real-world data, too, models often predict the higher income bracket for males more frequently than for females, which underlines the importance of fairness testing.
Conclusion
As AI systems become more influential, fairness must be treated as a first-class concern alongside accuracy.
Tools like Fairlearn make it easier to measure, visualize, and mitigate bias in machine learning models.
By integrating fairness testing into your machine learning workflows, you build more ethical, reliable, and trustworthy AI.
Image Credit: Custom illustration generated using OpenAI's DALL·E, created specifically for this article.



