```python
import numpy as np
import pandas as pd

# Data URLs
main_url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv"
nbhd_url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"

# Load main modeling data
df = pd.read_csv(main_url)

# Basic labels for readable charts/tables
df["era"] = np.where(df["before1980"] == 1, "Before 1980", "1980 or Later")

# Quick profile
profile = pd.DataFrame(
    {
        "rows": [len(df)],
        "columns": [df.shape[1]],
        "pre_1980_rate": [df["before1980"].mean()],
    }
)
profile
```
| rows | columns | pre_1980_rate |
|---|---|---|
| 22913 | 52 | 0.624929 |
## Elevator pitch
A leakage-safe Random Forest model (excluding yrbuilt, which is unavailable in the missing-year use case) classified pre-1980 homes at 92.84% accuracy with 0.9788 ROC-AUC on holdout data. The strongest predictors were structural and form-factor signals (livearea, stories, arcstyle_ONE-STORY, numbaths), which aligns with historical construction patterns in Denver. Adding one-hot neighborhood indicators in the stretch analysis raised the best model to 96.07% accuracy, so I recommend the neighborhood-enhanced Random Forest as the production option.
## QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
Homes in the pre-1980 class are visibly smaller and lower-density. Mean livearea is about 1,289.84 sq ft for pre-1980 homes versus 1,878.92 sq ft for newer homes, and pre-1980 homes average fewer bathrooms (1.97 vs 2.98). Style indicators also separate classes clearly: about 56.3% of pre-1980 homes are one-story versus only 6.4% in 1980+ homes.
These relationships are useful for machine learning because they are high-signal, non-random patterns tied to construction era. A classifier can exploit combinations of structural size, layout, and style features to estimate the probability that a home is pre-1980.
For visual clarity, the first and third charts use display-only axis zoom (high-percentile limits) so extreme outliers do not compress the main pattern. The full, untrimmed data is still used for modeling.
```python
# Summary stats used in interpretation
summary_stats = (
    df.groupby("before1980")[["livearea", "numbaths", "stories", "tasp", "netprice"]]
    .mean()
    .round(2)
)
summary_stats
```
| before1980 | livearea | numbaths | stories | tasp | netprice |
|---|---|---|---|---|---|
| 0 | 1878.92 | 2.98 | 1.76 | 1090878.11 | 1086428.29 |
| 1 | 1289.84 | 1.97 | 1.21 | 247490.60 | 244883.83 |
```python
# Display-only caps for visual readability (do not affect modeling data)
livearea_p99 = float(df["livearea"].quantile(0.99))

# Chart 1: Live area by class
p1 = (
    ggplot(df, aes(x="era", y="livearea", fill="era"))
    + geom_boxplot(alpha=0.75, outlier_alpha=0.08)
    + coord_cartesian(ylim=[0, livearea_p99])
    + labs(
        title="Livable Area by Class (99th Percentile Zoom)",
        subtitle=f"Display capped at {livearea_p99:,.0f} sq ft to reduce outlier distortion",
        x="Class",
        y="Live Area (sq ft)",
    )
    + theme_minimal()
    + theme(legend_position="none")
    + ggsize(900, 480)
)

# Chart 2: Tax assessed selling price by class (log scale)
p2 = (
    ggplot(df, aes(x="era", y="tasp", fill="era"))
    + geom_boxplot(alpha=0.75, outlier_alpha=0.05)
    + scale_y_log10()
    + labs(
        title="Tax Assessed Price by Class (log scale)",
        x="Class",
        y="Tax Assessed Selling Price (log10)",
    )
    + theme_minimal()
    + theme(legend_position="none")
    + ggsize(900, 480)
)

# Chart 3: Relationship between size and bathrooms
sample_df = df.sample(5000, random_state=42)
livearea_sample_p995 = float(sample_df["livearea"].quantile(0.995))
p3 = (
    ggplot(sample_df, aes(x="livearea", y="numbaths", color="era"))
    + geom_point(alpha=0.35, size=1.7)
    + coord_cartesian(xlim=[0, livearea_sample_p995], ylim=[0, 9.5])
    + labs(
        title="Live Area vs Bathrooms (sample, 99.5th Percentile X Zoom)",
        subtitle=f"X-axis capped at {livearea_sample_p995:,.0f} sq ft for readability",
        x="Live Area (sq ft)",
        y="Number of Bathrooms",
        color="Class",
    )
    + theme_minimal()
    + ggsize(900, 500)
)

p1
```
## QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
I tested three algorithms using the same 80/20 stratified split:

- Logistic Regression (scaled features)
- Gradient Boosting Classifier
- Random Forest Classifier
Then I tuned Random Forest with GridSearchCV and selected the best cross-validated configuration. The tuned Random Forest reached 92.84% test accuracy, exceeding the 90% target.
I selected Random Forest as the final model because it outperformed the alternatives in holdout accuracy and ROC-AUC while handling nonlinear interactions well.
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Avoid target leakage:
# - drop yrbuilt because the target before1980 is derived from year built.
# - drop parcel because it is an identifier, not a generalizable feature.
X = df.drop(columns=["before1980", "yrbuilt", "parcel", "era"], errors="ignore")
y = df["before1980"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y,
)

baseline_models = {
    "Logistic Regression": Pipeline(
        [
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(max_iter=5000, solver="liblinear")),
        ]
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest (baseline)": RandomForestClassifier(
        random_state=42, n_estimators=200, n_jobs=-1,
    ),
}

baseline_results = []
for name, model in baseline_models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    baseline_results.append(
        {
            "Model": name,
            "Accuracy": accuracy_score(y_test, preds),
            "Precision": precision_score(y_test, preds),
            "Recall": recall_score(y_test, preds),
            "F1": f1_score(y_test, preds),
            "ROC_AUC": roc_auc_score(y_test, probs),
        }
    )

baseline_results_df = pd.DataFrame(baseline_results).sort_values("Accuracy", ascending=False)
baseline_results_df
```
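The grid search itself is collapsed out of the output above. A minimal sketch of the tuning step follows, with synthetic data standing in for `X_train`/`y_train` and a reduced parameter grid; the real search covered the parameters reported below.

```python
# Hedged sketch of the GridSearchCV tuning step described in the text.
# X_demo/y_demo are synthetic stand-ins; the real search ran on X_train/y_train.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 20],
    "max_features": ["sqrt"],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,               # 5-fold cross-validation on the training split
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 4))
```

`grid.best_estimator_` is then refit on the full training split and evaluated once on the holdout set.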
Best RF parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 350}
Best cross-val accuracy: 0.9233
| Metric | Value |
|---|---|
| Accuracy | 0.928431 |
| Precision | 0.943047 |
| Recall | 0.942388 |
| F1 | 0.942717 |
| ROC_AUC | 0.978810 |
## QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
The final Random Forest ranked livearea, arcstyle_ONE-STORY, stories, numbaths, and price-assessment features (tasp, netprice, sprice) among the most important predictors. This is consistent with the domain context: home form factor and footprint differ meaningfully between older and newer construction periods.
Practically, this means the model is not relying on a single feature. It is using a pattern of size, layout, and valuation signals, which generally improves robustness compared to one-variable rules.
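The importance-extraction step is not shown in the output above. A minimal sketch follows, using a synthetic model and the feature names discussed here as stand-ins for the tuned Random Forest and `X_train`:

```python
# Hedged sketch: ranking features by a Random Forest's feature_importances_.
# rf_demo, X_demo, and cols are synthetic stand-ins for the real tuned model.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
cols = ["livearea", "stories", "numbaths", "tasp", "netprice", "sprice"]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

importance_df = (
    pd.DataFrame({"feature": cols, "importance": rf_demo.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Tree-based importances sum to 1.0, so each value reads as a share of the model's total split improvement, and the sorted frame feeds directly into a horizontal bar chart.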
## QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
I used Accuracy, Precision, Recall, and ROC-AUC:

- Accuracy (0.9284): 92.84% of all homes in the test set were classified correctly.
- Precision (0.9430): when the model predicts “before 1980,” it is correct 94.30% of the time.
- Recall (0.9424): the model correctly finds 94.24% of truly pre-1980 homes.
- ROC-AUC (0.9788): across thresholds, the model has excellent class-separation ability (close to 1.0 is best).
The confusion matrix shows balanced performance in both classes, not just one-sided accuracy.
```python
from sklearn.metrics import confusion_matrix

# Confusion matrix for final model
cm = confusion_matrix(y_test, rf_preds)
cm_df = pd.DataFrame(
    cm,
    index=["Actual: 1980 or later", "Actual: Before 1980"],
    columns=["Pred: 1980 or later", "Pred: Before 1980"],
)
cm_df
```
| | Pred: 1980 or later | Pred: Before 1980 |
|---|---|---|
| Actual: 1980 or later | 1556 | 163 |
| Actual: Before 1980 | 165 | 2699 |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(6, 4.5))
ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["1980 or Later", "Before 1980"],
).plot(ax=ax, colorbar=False)
ax.set_title("Final Random Forest - Confusion Matrix")
plt.show()
```
## STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.
Across the three algorithms on the same leakage-safe feature set, Random Forest performed best overall (accuracy and ROC-AUC), followed by Gradient Boosting, then Logistic Regression. Logistic Regression produced a simpler linear decision boundary and lower accuracy, while tree-based models captured nonlinear interactions better.
Recommendation for the client on this feature set: Random Forest.
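One practical wrinkle when comparing the three algorithms is that only the tree-based models expose `feature_importances_`; for Logistic Regression a common proxy is the magnitude of the scaled coefficients, which is not directly comparable but shows which features dominate the linear boundary. A hedged sketch, with synthetic data and generic column names standing in for the real feature set:

```python
# Hedged sketch: side-by-side "importance" measures for the three model families.
# X_demo/y_demo/cols are synthetic stand-ins for the leakage-safe feature set.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
cols = [f"f{i}" for i in range(5)]

rf = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
gb = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)
lr = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X_demo), y_demo)

importance_table = pd.DataFrame(
    {
        "feature": cols,
        "rf_importance": rf.feature_importances_,
        "gb_importance": gb.feature_importances_,
        "lr_abs_coef": np.abs(lr.coef_[0]),  # proxy: magnitude of scaled coefficients
    }
)
print(importance_table)
```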
## STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.
After adding neighborhood one-hot features, model performance improved substantially for tree-based methods. The best neighborhood-enhanced Random Forest achieved about 96.07% accuracy and 0.9910 ROC-AUC, which is stronger than the non-neighborhood model.
Important join detail: both source tables have repeated parcel values. To avoid row multiplication in the merge, I first aggregated neighborhood rows to one row per parcel using max() across the one-hot neighborhood columns.
This update strengthens my recommendation: use Random Forest with neighborhood features included when those fields are available.
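The aggregate-then-merge step described above can be sketched with toy frames (the real code reads the two CSVs at `main_url` and `nbhd_url`):

```python
import pandas as pd

# Toy stand-ins for dwellings_ml and dwellings_neighborhoods_ml.
homes = pd.DataFrame({"parcel": ["A", "A", "B"], "livearea": [1200, 1250, 2000]})
nbhd = pd.DataFrame(
    {"parcel": ["A", "A", "B"], "nbhd_101": [1, 0, 0], "nbhd_102": [0, 0, 1]}
)

# One row per parcel: max() keeps a 1 if the parcel ever carried that indicator.
nbhd_unique = nbhd.groupby("parcel", as_index=False).max()

# Left join preserves the home-level row count; validate guards against fan-out.
joined = homes.merge(nbhd_unique, on="parcel", how="left", validate="many_to_one")
print(joined)
```

The `validate="many_to_one"` check makes pandas raise immediately if the right side still has duplicate parcels, which is how the row-multiplication bug would surface.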
## STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
Yes. I built a Random Forest Regressor to predict yrbuilt, again excluding leakage features (before1980 and parcel).
- MAE (Mean Absolute Error): average absolute miss in years.
- RMSE (Root Mean Squared Error): penalizes larger misses more heavily than MAE.
- R-squared: the proportion of variance in year built explained by the model.
Results show moderate quality for year-level prediction: MAE ~10.12 years, RMSE ~16.60 years, R2 ~0.797. This is useful for rough historical estimation, but less reliable for precise year assignment.
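The metric computation can be sketched as follows, with small synthetic arrays standing in for the real `yr_test`/`yr_pred` from the holdout split:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-ins for yr_test / yr_pred from the regression holdout split.
yr_test = np.array([1950, 1975, 1990, 2005])
yr_pred = np.array([1958, 1970, 1985, 2010])

mae = mean_absolute_error(yr_test, yr_pred)          # average miss in years
rmse = np.sqrt(mean_squared_error(yr_test, yr_pred)) # root of squared-error mean
r2 = r2_score(yr_test, yr_pred)                      # variance explained

print(mae, rmse, r2)  # → 5.75, ~5.89, ~0.916 on this toy data
```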
```python
# Predicted vs actual year built
pred_df = pd.DataFrame({"actual": yr_test, "predicted": yr_pred})
year_plot = (
    ggplot(pred_df.sample(3000, random_state=42), aes(x="actual", y="predicted"))
    + geom_point(alpha=0.35, color="#2f6690")
    + geom_abline(slope=1, intercept=0, linetype="dashed", color="#c1121f")
    + labs(
        title="Predicted vs Actual Year Built (sample of 3,000)",
        x="Actual Year Built",
        y="Predicted Year Built",
    )
    + theme_minimal()
    + ggsize(900, 500)
)
year_plot
```