Housing Regression MLE
End-to-End Deployment
Predicting housing prices with a complete machine learning engineering pipeline designed for production deployment and scalability.
Project Prerequisites
To successfully embark on this Machine Learning Engineering journey, ensure you have the following foundations in place:
Integrated Development Environment
Familiarity with modern IDEs like VSCode, PyCharm, or Cursor for efficient coding and project management.
Python Proficiency
A solid grasp of Python fundamentals, including data structures, functions, and object-oriented programming.
Version Control System
An active GitHub account for collaborative development, code sharing, and robust version management.
Cloud Access
An AWS account (free tier is recommended) for deploying cloud resources and leveraging essential services.
Debugging Mindset
A strong commitment to problem-solving, meticulous debugging, and iterative refinement of code.
Persistence & Patience
The ability to remain persistent and patient through the complex challenges inherent in ML engineering.
Problem Statement
Why Housing Price Prediction Matters
Real estate markets require accurate pricing models for buyers, sellers, and investors to make informed decisions in a volatile market.
Dataset & Scope
US housing dataset with geographic, demographic, and economic features spanning multiple years of market data.
Link to dataset:
https://www.kaggle.com/datasets/shengkunwang/housets-dataset
Project Goal
Build a production-ready system with automated pipelines, monitoring, and scalable cloud infrastructure to predict housing prices in the US.
ML Engineering Pipeline
Complete workflow from raw data to production deployment with automated orchestration and monitoring.
Load
Data ingestion
Preprocess
Data split / data cleaning / data quality checks (Great Expectations)
Feature Engineering
Transform features / encoding / tests
Train, Tune, Evaluate & Model Tracking
Model optimization (Optuna), performance metrics, model tracking (MLflow)
Set pipelines
Feature, training, and inference pipelines
Containerize & CI/CD
Reproducibility (Docker); build, test, and push to AWS (GitHub Actions)
Deploy & Serve
Production API (FastAPI) on AWS ECS
Frontend
App UI (Streamlit)
Technology Stack
Machine Learning
XGBoost for regression
Optuna for hyperparameter tuning
MLflow for experiment tracking
Data Processing & Analytics
Pandas for data manipulation
NumPy for numerical computing
Scikit-learn for preprocessing
Great Expectations for data quality checks
Pytest for unit tests
API & Dashboard
FastAPI for REST endpoints
Streamlit for the interactive UI
Plotly for visualizations
Cloud Infrastructure
AWS: S3, ECS Fargate, ECR, ALB
Docker for containers
GitHub Actions for CI/CD
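The Docker piece of the stack can be sketched as a minimal Dockerfile. The paths and the entrypoint module here are assumptions, not the repository's actual layout.

```dockerfile
# Minimal sketch; requirements.txt and the api.main:app entrypoint
# are illustrative, not the project's real file layout.
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

GitHub Actions would build this image, run the test suite inside it, and push it to ECR on a successful merge.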
Data Engineering Excellence
01
Time-Aware Data Splits
Chronological validation to prevent data leakage and ensure realistic model performance.
02
Robust Preprocessing
Automated handling of missing values, outliers, and feature scaling with reproducible transformations.
03
Feature Engineering
Domain-specific features including location clusters, price per square foot, and temporal patterns.
Bottleneck Solved:
Eliminated data leakage through strict temporal validation, ensuring the production model's performance estimates are realistic.
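The time-aware split above can be sketched in a few lines of pandas: everything before a cutoff date trains, everything on or after it validates, so no future rows leak into training. Column names and the cutoff are illustrative.

```python
import pandas as pd

def time_aware_split(df: pd.DataFrame, date_col: str, cutoff: str):
    """Chronological split: rows before `cutoff` train, rows on/after
    it validate. Prevents future information leaking into training."""
    df = df.sort_values(date_col)
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff_ts]
    valid = df[df[date_col] >= cutoff_ts]
    return train, valid

sales = pd.DataFrame({
    "sale_date": pd.to_datetime(
        ["2019-06-01", "2020-02-10", "2021-01-05", "2021-08-20"]),
    "price": [250_000, 275_000, 310_000, 330_000],
})
train, valid = time_aware_split(sales, "sale_date", "2021-01-01")
# train holds the 2019-2020 rows; valid holds the 2021 rows
```

Contrast this with a random split, where a 2021 sale could land in training while a 2019 sale lands in validation, inflating measured accuracy.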
Model Training & Optimization
XGBoost Regression
Gradient boosting algorithm optimized for tabular data with robust handling of mixed feature types and missing values.
Optuna Hyperparameter Tuning
Bayesian optimization for efficient parameter search with MLflow integration for comprehensive experiment tracking.
Performance Metrics
MAE: $12,450 | RMSE: $18,230 | Mean % Error: 8.2%, validated across multiple time periods for consistency.
AWS Cloud Deployment
S3 Storage
Centralized data lake for raw datasets, processed features, trained models, and batch prediction outputs with versioning.
Docker & ECR
Containerized applications stored in Elastic Container Registry for consistent deployment across environments.
ECS Fargate
Serverless container orchestration with auto-scaling and load balancing through Application Load Balancer.
ALB
Application Load Balancer that exposes the HTTP endpoint and makes the app available to end users.
Interactive Streamlit Dashboard
Key Features
Dynamic filters for location, price range, and property features
Real-time prediction vs actual comparison tables
Interactive performance metrics and error analysis
Yearly trend visualizations with seasonal patterns
Built with Plotly for responsive charts and an intuitive user experience, enabling stakeholders to explore model insights.
Future Enhancements
Production Domain
Custom domain with SSL certificates for professional deployment
Monitoring & Observability
Real-time performance tracking with alerting and drift detection
A/B Testing Framework
Compare model versions with statistical significance testing
Automated Retraining
CI/CD pipeline for model updates based on performance thresholds
Scaling & Cost Optimization
Resource optimization and multi-region deployment strategies
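For the drift-detection enhancement, one common starting point is the Population Stability Index (PSI) between the training distribution and live traffic. This sketch is a generic PSI implementation, not code from the project; thresholds like 0.2 are a widely used rule of thumb.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample
    and a live sample. Rule of thumb: PSI > 0.2 suggests meaningful drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins at a tiny proportion to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(300_000, 50_000, size=5_000)   # training-time prices
same_dist = rng.normal(300_000, 50_000, size=5_000)  # stable production data
shifted = rng.normal(360_000, 50_000, size=5_000)    # drifted production data
```

A monitoring job could compute this per feature on a schedule and trigger the automated-retraining pipeline when the index crosses the threshold.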