FORMATIVE ASSESSMENT 2 [100 MARKS] Read the project scenario below and answer ALL the questions that follow. SCENARIO A healthcare startup is interested in developing a machine learning model to pre

Question

FORMATIVE ASSESSMENT 2 [100 ﻿MARKS] ﻿Read the project scenario below and answer ALL the questions that follow. SCENARIO A healthcare startup is interested in developing a machine learning model to predict patient readmission rates based on various patient data points collected from their hospital visits. The company aims to utilise Python's extensive libraries and machine learning frameworks to analyse the provided historical patient data and build a predictive model that can help hospitals reduce unnecessary readmissions, thus saving costs and improving patient care. Your task is to develop a predictive model using Python to determine the likelihood of patient readmissions within 30 ﻿days of discharge based on the provided dataset. This project will involve data preprocessing, exploratory data analysis, feature selection, model building, and model evaluation. SOURCE: • ﻿The dataset provided includes various patient data points such as age, gender, type of admission, length of stay, number of diagnoses, blood pressure, blood sugar levels, previous admissions, and readmission status. YOUR REPORT SHOULD FOCUS ON THE FOLLOWING: • ﻿Data Preprocessing and Cleanup • ﻿Exploratory Data Ana

VIPSolutions · Accepted Answer

**Introduction**

Predicting hospital readmissions is a critical task in healthcare analytics. Early identification of patients at risk of being readmitted within 30 days allows hospitals to intervene proactively, improving patient outcomes and reducing unnecessary costs. In this report, we use Python’s data science stack to preprocess hospital data, analyze patterns, select features, build a machine learning model, and evaluate its effectiveness in predicting patient readmissions.

---

### **Step 1: Data Preprocessing (20 Marks)**

**Concept Introduction:**  
Data preprocessing is the foundational step in any machine learning pipeline. It involves cleaning the data, handling missing values, normalizing numerical features, and encoding categorical variables to ensure the dataset is suitable for model building.

**Implementation:**

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load dataset
df = pd.read_csv('patient_data.csv')

# 1. Handling Missing Values
# Check missing values
print(df.isnull().sum())

# Impute numerical columns with median
num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    df[col].fillna(df[col].median(), inplace=True)

# Impute categorical columns with mode
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    df[col].fillna(df[col].mode()[0], inplace=True)

# 2. Encoding Categorical Variables
le = LabelEncoder()
for col in cat_cols:
    df[col] = le.fit_transform(df[col])

# 3. Normalizing Data
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df.head())
```

**Explanation:**  
- Missing values are imputed: numerical with median, categorical with mode.
- Categorical variables are label-encoded.
- Numerical features are standardized to ensure uniform scale.

---

### **Step 2: Exploratory Data Analysis (20 Marks)**

**Concept Introduction:**  
EDA helps us understand the underlying structure and patterns in the data. We visualize distributions, spot outliers, and analyze relationships between variables and readmission status.

**Implementation:**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of readmissions
sns.countplot(x='readmission_status', data=df)
plt.title('Readmission Status Distribution')
plt.show()

# Age distribution
sns.histplot(df['age'], bins=10, kde=True)
plt.title('Age Distribution')
plt.show()

# Correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

# Boxplot for length of stay vs readmission
sns.boxplot(x='readmission_status', y='length_of_stay', data=df)
plt.title('Length of Stay by Readmission Status')
plt.show()
```

**Insights:**
- The readmission rate is imbalanced.
- Certain features (e.g., age, length of stay) show differences by readmission status.
- Some variables are correlated, which may influence feature selection.

---

### **Step 3: Feature Selection (20 Marks)**

**Concept Introduction:**  
Feature selection identifies the most relevant variables for predicting the target, improving model performance and interpretability.

**Implementation:**

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X = df.drop('readmission_status', axis=1)
y = df['readmission_status']

# Feature Importance with Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar')
plt.title('Feature Importances')
plt.show()

# Select top features
selector = SelectFromModel(rf, prefit=True, threshold='median')
X_selected = selector.transform(X)
selected_features = X.columns[selector.get_support()]

print("Selected Features:", selected_features)
```

**Justification:**  
- Features like length of stay, previous admissions, number of diagnoses, and age are most influential based on feature importance.
- Categorical variables like type of admission and gender may also contribute.

---

### **Step 4: Model Building (20 Marks)**

**Concept Introduction:**  
Model building involves training machine learning algorithms on the processed data to learn patterns for predicting readmissions.

**Implementation:**

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(X[selected_features], y, test_size=0.2, random_state=42)

# Try different models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)

✨ VIPSolutions

Question:

Final Answer:

📢 Instructions:

Study Help, Fast Answers

✨ VIPSolutions

Question:

Final Answer:

📢 Instructions:

Related Answers

Study Help, Fast Answers