```python
import numpy as np
import pandas as pd

# Data URLs
main_url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv"
nbhd_url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"

# Load main modeling data
df = pd.read_csv(main_url)

# Basic labels for readable charts/tables
df["era"] = np.where(df["before1980"] == 1, "Before 1980", "1980 or Later")

# Quick profile
profile = pd.DataFrame(
    {
        "rows": [len(df)],
        "columns": [df.shape[1]],
        "pre_1980_rate": [df["before1980"].mean()],
    }
)
profile
```
| rows | columns | pre_1980_rate |
|---|---|---|
| 22913 | 52 | 0.624929 |
## Elevator pitch
A leakage-safe Random Forest model (excluding yrbuilt, which is unavailable in the missing-year use case) classified pre-1980 homes at 92.84% accuracy with 0.9788 ROC-AUC on holdout data. The strongest predictors were structural and form-factor signals (livearea, stories, arcstyle_ONE-STORY, numbaths), which aligns with historical construction patterns in Denver. Adding one-hot neighborhood indicators in the stretch analysis raised the best model to 96.07% accuracy, so I recommend the neighborhood-enhanced Random Forest as the production option.
## QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
Homes in the pre-1980 class are visibly smaller and lower-density. Mean livearea is about 1,289.84 sq ft for pre-1980 homes versus 1,878.92 sq ft for newer homes, and pre-1980 homes average fewer bathrooms (1.97 vs 2.98). Style indicators also separate classes clearly: about 56.3% of pre-1980 homes are one-story versus only 6.4% in 1980+ homes.
These relationships are useful for machine learning because they are high-signal, non-random patterns tied to construction era. A classifier can exploit combinations of structural size, layout, and style features to estimate the probability that a home is pre-1980.
For visual clarity, the first and third charts use display-only axis zoom (high-percentile limits) so extreme outliers do not compress the main pattern. The full, untrimmed data is still used for modeling.
```python
# Summary stats used in interpretation
summary_stats = (
    df.groupby("before1980")[["livearea", "numbaths", "stories", "tasp", "netprice"]]
    .mean()
    .round(2)
)
summary_stats
```
| before1980 | livearea | numbaths | stories | tasp | netprice |
|---|---|---|---|---|---|
| 0 | 1878.92 | 2.98 | 1.76 | 1090878.11 | 1086428.29 |
| 1 | 1289.84 | 1.97 | 1.21 | 247490.60 | 244883.83 |
```python
# Display-only caps for visual readability (do not affect modeling data)
livearea_p99 = float(df["livearea"].quantile(0.99))

# Chart 1: Live area by class
p1 = (
    ggplot(df, aes(x="era", y="livearea", fill="era"))
    + geom_boxplot(alpha=0.75, outlier_alpha=0.08)
    + coord_cartesian(ylim=[0, livearea_p99])
    + labs(
        title="Livable Area by Class (99th Percentile Zoom)",
        subtitle=f"Display capped at {livearea_p99:,.0f} sq ft to reduce outlier distortion",
        x="Class",
        y="Live Area (sq ft)",
    )
    + theme_minimal()
    + theme(legend_position="none")
    + ggsize(900, 480)
)

# Chart 2: Tax assessed selling price by class (log scale)
p2 = (
    ggplot(df, aes(x="era", y="tasp", fill="era"))
    + geom_boxplot(alpha=0.75, outlier_alpha=0.05)
    + scale_y_log10()
    + labs(
        title="Tax Assessed Price by Class (log scale)",
        x="Class",
        y="Tax Assessed Selling Price (log10)",
    )
    + theme_minimal()
    + theme(legend_position="none")
    + ggsize(900, 480)
)

# Chart 3: Relationship between size and bathrooms
sample_df = df.sample(5000, random_state=42)
livearea_sample_p995 = float(sample_df["livearea"].quantile(0.995))
p3 = (
    ggplot(sample_df, aes(x="livearea", y="numbaths", color="era"))
    + geom_point(alpha=0.35, size=1.7)
    + coord_cartesian(xlim=[0, livearea_sample_p995], ylim=[0, 9.5])
    + labs(
        title="Live Area vs Bathrooms (sample, 99.5th Percentile X Zoom)",
        subtitle=f"X-axis capped at {livearea_sample_p995:,.0f} sq ft for readability",
        x="Live Area (sq ft)",
        y="Number of Bathrooms",
        color="Class",
    )
    + theme_minimal()
    + ggsize(900, 500)
)

p1
```
## QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
I tested three algorithms using the same 80/20 stratified split:

- Logistic Regression (scaled features)
- Gradient Boosting Classifier
- Random Forest Classifier
Then I tuned Random Forest with GridSearchCV and selected the best cross-validated configuration. The tuned Random Forest reached 92.84% test accuracy, exceeding the 90% target.
I selected Random Forest as the final model because it outperformed the alternatives in holdout accuracy and ROC-AUC while handling nonlinear interactions well.
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Avoid target leakage:
# - drop yrbuilt because the target before1980 is derived from year built.
# - drop parcel because it is an identifier, not a generalizable feature.
X = df.drop(columns=["before1980", "yrbuilt", "parcel", "era"], errors="ignore")
y = df["before1980"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y,
)

baseline_models = {
    "Logistic Regression": Pipeline(
        [
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(max_iter=5000, solver="liblinear")),
        ]
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest (baseline)": RandomForestClassifier(
        random_state=42, n_estimators=200, n_jobs=-1,
    ),
}

baseline_results = []
for name, model in baseline_models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    baseline_results.append(
        {
            "Model": name,
            "Accuracy": accuracy_score(y_test, preds),
            "Precision": precision_score(y_test, preds),
            "Recall": recall_score(y_test, preds),
            "F1": f1_score(y_test, preds),
            "ROC_AUC": roc_auc_score(y_test, probs),
        }
    )

baseline_results_df = pd.DataFrame(baseline_results).sort_values("Accuracy", ascending=False)
baseline_results_df
```
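The grid search itself is collapsed out of the output above. A minimal sketch of the tuning step follows, with synthetic data standing in for `X_train`/`y_train` and a reduced parameter grid; the real search covered the parameters reported below.

```python
# Hedged sketch of the GridSearchCV tuning step described in the text.
# X_demo/y_demo are synthetic stand-ins; the real search ran on X_train/y_train.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 20],
    "max_features": ["sqrt"],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,               # 5-fold cross-validation on the training split
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 4))
```

`grid.best_estimator_` is then refit on the full training split and evaluated once on the holdout set.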
Best RF parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 350}
Best cross-val accuracy: 0.9233
| Metric | Value |
|---|---|
| Accuracy | 0.928431 |
| Precision | 0.943047 |
| Recall | 0.942388 |
| F1 | 0.942717 |
| ROC_AUC | 0.978810 |
## QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
The final Random Forest ranked livearea, arcstyle_ONE-STORY, stories, numbaths, and price-assessment features (tasp, netprice, sprice) among the most important predictors. This is consistent with the domain context: home form factor and footprint differ meaningfully between older and newer construction periods.
Practically, this means the model is not relying on a single feature. It is using a pattern of size, layout, and valuation signals, which generally improves robustness compared to one-variable rules.
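The importance-extraction step is not shown in the output above. A minimal sketch follows, using a synthetic model and the feature names discussed here as stand-ins for the tuned Random Forest and `X_train`:

```python
# Hedged sketch: ranking features by a Random Forest's feature_importances_.
# rf_demo, X_demo, and cols are synthetic stand-ins for the real tuned model.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
cols = ["livearea", "stories", "numbaths", "tasp", "netprice", "sprice"]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

importance_df = (
    pd.DataFrame({"feature": cols, "importance": rf_demo.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Tree-based importances sum to 1.0, so each value reads as a share of the model's total split improvement, and the sorted frame feeds directly into a horizontal bar chart.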
## QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
I used Accuracy, Precision, Recall, and ROC-AUC:

- Accuracy (0.9284): 92.84% of all homes in the test set were classified correctly.
- Precision (0.9430): when the model predicts “before 1980,” it is correct 94.30% of the time.
- Recall (0.9424): the model correctly finds 94.24% of truly pre-1980 homes.
- ROC-AUC (0.9788): across thresholds, the model has excellent class-separation ability (close to 1.0 is best).
The confusion matrix shows balanced performance in both classes, not just one-sided accuracy.
```python
from sklearn.metrics import confusion_matrix

# Confusion matrix for final model
cm = confusion_matrix(y_test, rf_preds)
cm_df = pd.DataFrame(
    cm,
    index=["Actual: 1980 or later", "Actual: Before 1980"],
    columns=["Pred: 1980 or later", "Pred: Before 1980"],
)
cm_df
```
| | Pred: 1980 or later | Pred: Before 1980 |
|---|---|---|
| Actual: 1980 or later | 1556 | 163 |
| Actual: Before 1980 | 165 | 2699 |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(6, 4.5))
ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["1980 or Later", "Before 1980"],
).plot(ax=ax, colorbar=False)
ax.set_title("Final Random Forest - Confusion Matrix")
plt.show()
```
## STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.
Across the three algorithms on the same leakage-safe feature set, Random Forest performed best overall (accuracy and ROC-AUC), followed by Gradient Boosting, then Logistic Regression. Logistic Regression produced a simpler linear decision boundary and lower accuracy, while tree-based models captured nonlinear interactions better.
Recommendation for the client on this feature set: Random Forest.
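One practical wrinkle when comparing the three algorithms is that only the tree-based models expose `feature_importances_`; for Logistic Regression a common proxy is the magnitude of the scaled coefficients, which is not directly comparable but shows which features dominate the linear boundary. A hedged sketch, with synthetic data and generic column names standing in for the real feature set:

```python
# Hedged sketch: side-by-side "importance" measures for the three model families.
# X_demo/y_demo/cols are synthetic stand-ins for the leakage-safe feature set.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
cols = [f"f{i}" for i in range(5)]

rf = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
gb = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)
lr = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X_demo), y_demo)

importance_table = pd.DataFrame(
    {
        "feature": cols,
        "rf_importance": rf.feature_importances_,
        "gb_importance": gb.feature_importances_,
        "lr_abs_coef": np.abs(lr.coef_[0]),  # proxy: magnitude of scaled coefficients
    }
)
print(importance_table)
```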
## STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.
After adding neighborhood one-hot features, model performance improved substantially for tree-based methods. The best neighborhood-enhanced Random Forest achieved about 96.07% accuracy and 0.9910 ROC-AUC, which is stronger than the non-neighborhood model.
Important join detail: both source tables have repeated parcel values. To avoid row multiplication in the merge, I first aggregated neighborhood rows to one row per parcel using max() across the one-hot neighborhood columns.
This update strengthens my recommendation: use Random Forest with neighborhood features included when those fields are available.
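The aggregate-then-merge step described above can be sketched with toy frames (the real code reads the two CSVs at `main_url` and `nbhd_url`):

```python
import pandas as pd

# Toy stand-ins for dwellings_ml and dwellings_neighborhoods_ml.
homes = pd.DataFrame({"parcel": ["A", "A", "B"], "livearea": [1200, 1250, 2000]})
nbhd = pd.DataFrame(
    {"parcel": ["A", "A", "B"], "nbhd_101": [1, 0, 0], "nbhd_102": [0, 0, 1]}
)

# One row per parcel: max() keeps a 1 if the parcel ever carried that indicator.
nbhd_unique = nbhd.groupby("parcel", as_index=False).max()

# Left join preserves the home-level row count; validate guards against fan-out.
joined = homes.merge(nbhd_unique, on="parcel", how="left", validate="many_to_one")
print(joined)
```

The `validate="many_to_one"` check makes pandas raise immediately if the right side still has duplicate parcels, which is how the row-multiplication bug would surface.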
## STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
Yes. I built a Random Forest Regressor to predict yrbuilt, again excluding leakage features (before1980 and parcel).
- MAE (Mean Absolute Error): average absolute miss in years.
- RMSE (Root Mean Squared Error): penalizes larger misses more heavily than MAE.
- R-squared: the proportion of variance in year built explained by the model.
Results show moderate quality for year-level prediction: MAE ~10.12 years, RMSE ~16.60 years, R2 ~0.797. This is useful for rough historical estimation, but less reliable for precise year assignment.
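The metric computation can be sketched as follows, with small synthetic arrays standing in for the real `yr_test`/`yr_pred` from the holdout split:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-ins for yr_test / yr_pred from the regression holdout split.
yr_test = np.array([1950, 1975, 1990, 2005])
yr_pred = np.array([1958, 1970, 1985, 2010])

mae = mean_absolute_error(yr_test, yr_pred)          # average miss in years
rmse = np.sqrt(mean_squared_error(yr_test, yr_pred)) # root of squared-error mean
r2 = r2_score(yr_test, yr_pred)                      # variance explained

print(mae, rmse, r2)  # → 5.75, ~5.89, ~0.916 on this toy data
```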
```python
# Predicted vs actual year built
pred_df = pd.DataFrame({"actual": yr_test, "predicted": yr_pred})
year_plot = (
    ggplot(pred_df.sample(3000, random_state=42), aes(x="actual", y="predicted"))
    + geom_point(alpha=0.35, color="#2f6690")
    + geom_abline(slope=1, intercept=0, linetype="dashed", color="#c1121f")
    + labs(
        title="Predicted vs Actual Year Built (sample of 3,000)",
        x="Actual Year Built",
        y="Predicted Year Built",
    )
    + theme_minimal()
    + ggsize(900, 500)
)
year_plot
```