Predicting Calorie Expenditure

How many calories were burned during a workout?

The objective of this prediction task is to estimate the number of calories burned during a workout session based on various input features such as physical activity data and physiological metrics. Accurately predicting calorie expenditure is important for applications in health monitoring, fitness tracking, and personalized training plans.

The evaluation metric for this task is Root Mean Squared Logarithmic Error (RMSLE). Because it compares the logarithms of the predicted and actual values, RMSLE measures relative rather than absolute error: a 10-calorie miss counts for more on a 20-calorie session than on a 200-calorie one. This makes it well suited to a target that spans a wide range of values. For the same absolute error, RMSLE also penalizes underestimates more heavily than overestimates.

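$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + \hat{y}_i) - \log(1 + y_i) \right)^2}$$

Where:

$n$ is the total number of observations in the test set.
$\hat{y}_i$ is the predicted value of the target for instance $i$.
$y_i$ is the actual value of the target for instance $i$.
$\log$ is the natural logarithm.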
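
The metric can be computed directly from its definition. The following is a minimal sketch using only the standard library; the function name rmsle and the sample values are illustrative and not taken from the project code.

```python
import math

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), which stays finite when x is 0
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Same absolute error (50 calories), opposite directions
under = rmsle([100.0], [50.0])
over = rmsle([100.0], [150.0])
print(under > over)  # the underestimate receives the larger penalty
```

A common companion trick, assuming the model minimizes squared error, is to train on log1p of the target and invert the predictions with expm1, so that the training loss lines up with RMSLE.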

Modeling Approach Overview:

My approach combined advanced feature engineering, CatBoost modeling, 5-fold cross-validation, a hill climbing optimization strategy, and rigorous prevention of data leakage.

Feature Engineering: I began with thorough exploratory data analysis to uncover meaningful relationships between variables. Based on these insights, I created new features, such as linear combinations and interactions, that helped the model capture more complex patterns in the data. Examples include IMC (body mass index), BSA (body surface area), Log_Weight, and Burned_Calories.
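
As a rough illustration of this kind of feature, the sketch below derives IMC, BSA, and Log_Weight from a single record. The input keys (Weight in kg, Height in cm, Duration, Heart_Rate), the Mosteller formula for BSA, the use of log1p, and the interaction term are all assumptions made for the example; the exact definitions used in the project, including that of Burned_Calories, are not given here.

```python
import math

def add_features(row):
    """Return a copy of one workout record with derived features added.

    Assumed input keys (hypothetical): Weight in kg, Height in cm,
    Duration in minutes, Heart_Rate in beats per minute.
    """
    out = dict(row)
    height_m = row["Height"] / 100.0
    # IMC (body mass index): weight divided by height squared
    out["IMC"] = row["Weight"] / height_m ** 2
    # BSA (body surface area, m^2), Mosteller approximation (assumed choice)
    out["BSA"] = math.sqrt(row["Height"] * row["Weight"] / 3600.0)
    # Log transform compresses the right tail of the weight distribution
    out["Log_Weight"] = math.log1p(row["Weight"])
    # Example interaction term; not necessarily one used in the project
    out["Duration_x_Heart_Rate"] = row["Duration"] * row["Heart_Rate"]
    return out

sample = {"Weight": 70.0, "Height": 175.0, "Duration": 30.0, "Heart_Rate": 110.0}
features = add_features(sample)
print(round(features["IMC"], 2), round(features["BSA"], 2))
```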

Modeling with CatBoost: I tested several regressors: CatBoost, XGBoost (XGB), LightGBM (LGBM), and TabNet. I settled on CatBoost for its robust handling of categorical features and for its ordered boosting scheme, which guards against target leakage during training. With ordered boosting, the estimate for each training example is built only from examples that precede it in a random permutation, so an example's own target never influences its own prediction. This supports better generalization on unseen data.
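
The permutation idea is easiest to see on a categorical encoding. The sketch below applies it to a target statistic for one categorical column: each row is encoded using only the rows that come before it in a random order. It illustrates the principle only and is not CatBoost's actual implementation; the function name, the smoothing parameter, and the sample data are hypothetical.

```python
import random
from collections import defaultdict

def ordered_target_stats(categories, targets, prior, smoothing=1.0, seed=0):
    """Encode a categorical column using only 'earlier' rows in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = defaultdict(float)   # running target sum per category
    counts = defaultdict(int)   # running row count per category
    encoded = [0.0] * len(categories)

    for idx in order:
        cat = categories[idx]
        # The statistic for this row uses only rows seen before it,
        # so the row's own target never leaks into its own feature.
        encoded[idx] = (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        sums[cat] += targets[idx]
        counts[cat] += 1
    return encoded

sex = ["male", "female", "male", "female"]
calories = [150.0, 90.0, 200.0, 110.0]
prior = sum(calories) / len(calories)
print(ordered_target_stats(sex, calories, prior))
```

The first row of each category in the permutation falls back to the prior, since no earlier rows of that category exist yet.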

Cross-Validation & Optimization: I used 5-fold cross-validation (KFold) to obtain a reliable estimate of model performance and to keep overfitting in check. To fine-tune the model further, I applied a hill climbing approach that iteratively adjusts hyperparameters to improve RMSLE.
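
A minimal sketch of hill climbing over hyperparameters follows. It assumes a scoring function that returns a cross-validated RMSLE for a given parameter set; here that function is replaced by a toy stand-in (toy_cv_rmsle) so the example runs on its own. The parameter names, step sizes, and starting point are illustrative, not the values used in the project.

```python
import random

def hill_climb(score_fn, start, step_sizes, n_iter=200, seed=0):
    """Greedy hill climbing that minimizes score_fn over numeric parameters."""
    rng = random.Random(seed)
    best_params = dict(start)
    best_score = score_fn(best_params)
    for _ in range(n_iter):
        # Perturb one randomly chosen parameter by +/- its step size
        name = rng.choice(list(step_sizes))
        candidate = dict(best_params)
        candidate[name] += rng.choice([-1, 1]) * step_sizes[name]
        score = score_fn(candidate)
        # Keep the move only if the score (lower is better) improves
        if score < best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

def toy_cv_rmsle(params):
    """Hypothetical stand-in for a 5-fold CV score (minimum at depth=8, learning_rate=0.05)."""
    return 0.059 + 0.001 * (params["depth"] - 8) ** 2 + (params["learning_rate"] - 0.05) ** 2

best, score = hill_climb(
    toy_cv_rmsle,
    start={"depth": 4, "learning_rate": 0.2},
    step_sizes={"depth": 1, "learning_rate": 0.01},
)
print(best, round(score, 5))
```

In practice each call to the scoring function would train and evaluate a model on every fold, so the number of iterations is limited by training cost rather than by the search itself.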

Data Leakage Prevention: I took particular care to avoid data leakage by fitting all feature transformations and encodings on the training folds only, so that no information from the validation or test sets reached the model during training.
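
The pattern can be sketched with a simple standardization step: the mean and standard deviation are computed on the training folds and then applied to the held-out fold. This is an illustration under assumed names (kfold_indices, standardize_per_fold) and made-up heart-rate values, not the project's actual preprocessing.

```python
import random
import statistics

def kfold_indices(n_rows, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

def standardize_per_fold(values, n_splits=5, seed=0):
    """Scale each validation fold using statistics from its training folds only."""
    scaled = [None] * len(values)
    for train_idx, valid_idx in kfold_indices(len(values), n_splits, seed):
        train_values = [values[i] for i in train_idx]
        mean = statistics.fmean(train_values)
        std = statistics.pstdev(train_values) or 1.0  # guard against zero variance
        for i in valid_idx:
            # Held-out rows never contribute to the statistics applied to them
            scaled[i] = (values[i] - mean) / std
    return scaled

heart_rate = [88.0, 95.0, 102.0, 110.0, 97.0, 105.0, 91.0, 99.0, 108.0, 93.0]
print(standardize_per_fold(heart_rate))
```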

Results:

By combining these techniques, I built a model that performed competitively on the leaderboard and generalized well to unseen data.

Final Rank: 1204/4318
RMSLE Score: 0.05914
Best RMSLE Score: 0.05841

GitHub Project