End-to-end Machine Learning pipeline designed to predict individual motor insurance claim severity using 70,000+ Belgian insurance records.

This project predicts individual motor insurance claim severity from a dataset of over 70,000 Belgian insurance records. The pipeline cleans the raw data, performs feature engineering (including target encoding for high-cardinality vehicle models), and evaluates seven regression models. The final solution is a Random Forest model selected for the lowest Mean Absolute Percentage Error (MAPE), which favors accuracy on individual claim predictions. To bridge development and production, I integrated MLflow for experiment tracking, hosted the versioned model on the Hugging Face Hub, and deployed a real-time prediction interface with Streamlit.
Insurance companies need to predict claim severity (the expected cost of a claim) to set appropriate premium prices, allocate reserves for future claims, and identify high-risk policies.
The solution is a Random Forest model trained on 70,000+ Belgian motor insurance claims and selected for individual claim accuracy (lowest MAPE) while maintaining strong overall performance.
The architecture consists of two main components: Data & Training Pipeline and Production Deployment. The data flows from the Belgian MTPL dataset through Jupyter notebooks for processing, with MLflow tracking experiments and Hugging Face Hub storing models. The production deployment uses Streamlit Cloud to serve predictions to end users.
Belgian MTPL Insurance Data (beMTPL16) contains 70,791 insurance records from the period 2004-2016. The target variable is claim severity (amount in EUR).
A key part of the project was converting raw claim time to Day/Night categories using a custom function. Night was defined as 20:00 - 06:00, and Day as 06:00 - 20:00.
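The Day/Night rule above can be sketched as follows. This is a minimal illustration, not the notebook's actual function: the name `day_or_night`, the use of `datetime.time` as input, and the choice to treat 06:00 as Day and 20:00 as Night (half-open intervals) are assumptions, since the README does not say how the boundary times are handled.

```python
from datetime import time

# Boundaries taken from the README; the half-open convention is an assumption.
DAY_START = time(6, 0)
NIGHT_START = time(20, 0)


def day_or_night(claim_time: time) -> str:
    """Return "Day" for 06:00 <= t < 20:00 and "Night" otherwise."""
    return "Day" if DAY_START <= claim_time < NIGHT_START else "Night"


if __name__ == "__main__":
    for t in (time(2, 30), time(6, 0), time(19, 59), time(20, 0)):
        print(t.strftime("%H:%M"), day_or_night(t))
```

Because Night wraps around midnight, testing for Day and treating everything else as Night avoids a two-sided comparison across the date boundary.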
Numeric features use StandardScaler for normalization. Low-cardinality categoricals use OneHotEncoder, while high-cardinality categoricals use TargetEncoder to handle the 1000+ unique vehicle models.
For individual claim predictions, Random Forest achieves the lowest MAPE (Mean Absolute Percentage Error). It gives the best accuracy on individual claim amounts for three reasons: averaging over an ensemble of decision trees reduces overfitting, it handles mixed feature types well, and it is robust to outliers once the log-transformation is applied. Note: for portfolio-level reserve calculations, where large claims matter more, the Actuarial XGBoost (Gamma loss) model achieves a lower RMSE.
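To see why one model can win on MAPE while another wins on RMSE, the sketch below computes both metrics by hand. The claim amounts and predictions are made-up numbers chosen for illustration, not results from this project: `pred_a` is close on typical claims but misses the large one, while `pred_b` is closer on the large claim and worse on the small ones.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.1 == 10%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the claims (EUR)."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Hypothetical claim amounts in EUR, including one large claim.
actual = [500, 1000, 2000, 50000]
pred_a = [520, 950, 2100, 30000]   # accurate on typical claims
pred_b = [900, 1600, 3000, 45000]  # accurate on the large claim

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(f"model {name}: MAPE={mape(actual, pred):.3f}  RMSE={rmse(actual, pred):.0f}")
```

MAPE weights every claim equally in relative terms, so model A scores better there; RMSE is dominated by the error on the largest claim, so model B scores better on it. That is the trade-off between individual accuracy and portfolio-level reserving described above.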
All seven trained models were compared, and the best model was selected by lowest MAPE for individual claim accuracy.
The trained model and artifacts are stored on Hugging Face Hub for version control, easy access (download with one line of code), and collaboration.
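Downloading the model could look roughly like this, using `hf_hub_download` from the `huggingface_hub` package. The repository id and filename below are placeholders (the README does not give them), and loading with `joblib` assumes the model was serialized that way.

```python
import os

# Placeholders: neither value is given in this README.
REPO_ID = "your-username/your-model-repo"
MODEL_FILENAME = "model.joblib"


def load_model(repo_id: str = REPO_ID, filename: str = MODEL_FILENAME):
    """Download a model file from the Hugging Face Hub and deserialize it."""
    # Imported here so this module can be imported without the optional packages.
    import joblib
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        token=os.environ.get("HF_TOKEN"),  # None is fine for public repositories
    )
    # Assumption: the artifact is a joblib-serialized estimator.
    return joblib.load(local_path)
```

`hf_hub_download` caches the file locally and returns its path, so repeated calls do not download the model again.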
The web app can be deployed to Streamlit Cloud with no infrastructure to manage (no servers), automatic scaling, and a free tier that suits portfolio projects.




Clone the repository: git clone https://github.com/charleskwakye/be-insurance-ai.git && cd be-insurance-ai
Create a virtual environment: python -m venv venv
Activate it: venv\Scripts\activate (Windows) or source venv/bin/activate (macOS/Linux)
Install dependencies: pip install -r requirements.txt
Open ml_belgium.ipynb in Jupyter or VS Code and run all cells sequentially. To view the experiments, start the MLflow UI with: mlflow ui. The notebook uploads the best model to Hugging Face Hub.
Run the web app locally: cd streamlit_app && pip install -r requirements.txt && streamlit run app.py
Copy .env.example to .env and edit to add your HF_TOKEN for Hugging Face authentication.
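For reference, a dependency-free way to pick up the token from that file might look like the sketch below. It assumes the file holds simple `KEY=VALUE` lines; the helper names `parse_env` and `get_hf_token` are hypothetical, and the project itself may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(env_path: str = ".env"):
    """Prefer the process environment, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    path = Path(env_path)
    if path.exists():
        return parse_env(path.read_text(encoding="utf-8")).get("HF_TOKEN")
    return None
```

Keeping the token in `.env` (which should stay out of version control) avoids hard-coding credentials in the notebook or the app.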
be-insurance-ai/
├── ml_belgium.ipynb (Main ML pipeline notebook)
├── data/
│   └── beMTPL16.rda (Belgian MTPL insurance dataset)
├── mlruns/
│   └── 653770752589350789/ (Experiment runs, metrics, artifacts)
├── streamlit_app/
│   ├── app.py (Streamlit application)
│   ├── requirements.txt (App dependencies)
│   └── README.md (App documentation)
├── requirements.txt (Training dependencies)
├── .env.example (Environment variables template)
├── .gitignore
└── README.md (This file)