Open-Source ML Pipeline for Hong Kong Horse Racing Prediction
Learn how to build a robust ML pipeline for betting markets and identify common pitfalls like data leakage.
30-Second TL;DR
What Changed
The project provides LightGBM and XGBoost training pipelines with ensemble prediction capabilities.
Why It Matters
This project offers a practical case study for practitioners interested in time-series forecasting and market efficiency analysis. It highlights the risks of data leakage in financial and betting-related ML models.
What To Do Next
Clone the repository and run the provided unit tests to evaluate how your own time-series models handle data leakage and feature engineering.
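One way to check a time-series model for leakage is to confirm that every feature is computed only from races that happened earlier, and that the train/test split is chronological. The following is a minimal sketch of that idea, not the repository's own code: the record keys (`date`, `horse`, `won`) and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def add_prior_win_rate(runs):
    """Attach each horse's win rate over strictly earlier races.

    `runs` is a list of dicts with hypothetical keys 'date', 'horse', 'won'.
    The counters are updated only AFTER the feature is written, so the
    current race's result never leaks into its own feature.
    """
    ordered = sorted(runs, key=lambda r: r["date"])
    starts = defaultdict(int)
    wins = defaultdict(int)
    out = []
    for run in ordered:
        horse = run["horse"]
        rate = wins[horse] / starts[horse] if starts[horse] else None
        out.append({**run, "prior_win_rate": rate})
        starts[horse] += 1
        wins[horse] += int(run["won"])
    return out


def chronological_split(runs, cutoff):
    """Train on races before `cutoff`, test on races on or after it."""
    train = [r for r in runs if r["date"] < cutoff]
    test = [r for r in runs if r["date"] >= cutoff]
    return train, test


# Illustrative data only; not taken from the project.
runs = [
    {"date": date(2023, 1, 1), "horse": "A", "won": True},
    {"date": date(2023, 2, 1), "horse": "A", "won": False},
    {"date": date(2023, 3, 1), "horse": "A", "won": True},
    {"date": date(2023, 1, 1), "horse": "B", "won": False},
]
features = add_prior_win_rate(runs)
train, test = chronological_split(features, date(2023, 3, 1))
```

A random shuffle split, by contrast, would let races from the future inform predictions about the past, which is the kind of leakage the unit tests are meant to catch.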
Deep Insight
Web-grounded analysis with 12 cited sources.
Enhanced Key Takeaways
- The pipeline likely leverages publicly available Hong Kong Jockey Club (HKJC) data, a common source for such projects, often supplemented by feature engineering from historical race results, horse profiles, and track conditions.
- Beyond the mentioned LightGBM and XGBoost models, related research in horse racing prediction frequently employs hyperparameter optimization techniques (e.g., HyperOpt, Optuna) to fine-tune these models for improved predictive accuracy and profitability.
- The comprehensive betting simulations in the pipeline may incorporate advanced capital management strategies, such as variations of the Kelly criterion, to optimize bet sizing and maximize long-term profitability, a common practice in quantitative betting models.
- The project found that models trained without public odds outperformed those trained with odds on Quinella ROI. This suggests that, even if betting markets are broadly semi-strong efficient, ML models can still identify and exploit specific inefficiencies, which challenges the assumption that all public information is fully priced in.
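The Kelly criterion mentioned in the takeaways can be illustrated with a short sketch. It is an assumption that the pipeline sizes bets this way; the function name and the `multiplier` parameter (for fractional Kelly) are hypothetical. The sketch also treats the odds as fixed, which is a simplification.

```python
def kelly_fraction(p_win, decimal_odds, multiplier=1.0):
    """Fraction of bankroll to stake on a single bet under the Kelly criterion.

    p_win        : model's estimated probability that the bet wins
    decimal_odds : total payout per unit staked, including the stake
    multiplier   : scale factor below 1.0 for fractional Kelly (e.g. 0.5)
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1.0")
    net_odds = decimal_odds - 1.0
    # f* = (b*p - q) / b with b = net odds and q = 1 - p,
    # which simplifies to (p * decimal_odds - 1) / b.
    fraction = (p_win * decimal_odds - 1.0) / net_odds
    # A negative value means the bet has no edge, so stake nothing.
    return max(0.0, fraction) * multiplier


full_stake = kelly_fraction(0.25, 5.0)       # positive edge
half_stake = kelly_fraction(0.25, 5.0, 0.5)  # half Kelly on the same bet
no_stake = kelly_fraction(0.10, 5.0)         # negative edge
```

Fractional Kelly is a common choice when the probability estimates are themselves uncertain, since full Kelly staking is sensitive to overestimated edges.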
Technical Deep Dive
- Data Sources: The pipeline primarily utilizes historical data from the Hong Kong Jockey Club (HKJC), often obtained through scraping or from publicly available datasets like those found on Kaggle. This data typically includes race results, horse information, jockey and trainer statistics, and track conditions.
- Feature Engineering: Commonly engineered features include a horse's win percentage, the jockey's win percentage, the trainer's win percentage, actual weight carried, declared weight, days since last race, draw (starting gate), and derived metrics such as Beyer Speed Figures. More advanced feature engineering can involve rolling aggregations of past performances and the difference between the race distance and a horse's preferred distance.
- Model Architectures: While LightGBM and XGBoost are central, the pipeline's ensemble capabilities suggest the use of stacking or blending multiple gradient boosting models. Some related projects also explore deep learning models (e.g., using Keras and TensorFlow) for comparative analysis or integration.
- Optimization and Evaluation: Models are often optimized for metrics directly related to profitability, such as win-log-loss or return on investment (ROI) within betting simulations. Hyperparameter tuning, using tools like Optuna or HyperOpt, is crucial for enhancing predictive performance. Evaluation metrics also extend to precision, recall, F1-score, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R².
- Betting Simulation: The pipeline simulates various betting markets including Quinella, QPL, Tierce, and Quartet. This involves identifying "value bets" where the model's estimated probability of an outcome exceeds the implied probability from public odds, potentially incorporating capital allocation strategies like the Kelly criterion to manage risk and maximize returns.
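The blending and value-bet steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the blend is a simple weighted average, Quinella probabilities are derived from win probabilities with the Harville formula (a swapped-in technique the source does not mention), and all function names, horse labels, and odds are hypothetical.

```python
from itertools import combinations


def blend_win_probs(model_a, model_b, weight_a=0.5):
    """Weighted average of two models' win probabilities, renormalized per race."""
    blended = {h: weight_a * model_a[h] + (1.0 - weight_a) * model_b[h]
               for h in model_a}
    total = sum(blended.values())
    return {h: p / total for h, p in blended.items()}


def harville_quinella(win_probs):
    """Probability that each unordered pair fills the first two places.

    Harville assumption: P(i first, j second) = p_i * p_j / (1 - p_i).
    A Quinella pays in either order, so both orderings are summed.
    """
    pair_probs = {}
    for i, j in combinations(sorted(win_probs), 2):
        p_i, p_j = win_probs[i], win_probs[j]
        pair_probs[(i, j)] = p_i * p_j / (1.0 - p_i) + p_j * p_i / (1.0 - p_j)
    return pair_probs


def value_bets(model_probs, decimal_odds):
    """Keep outcomes whose model probability exceeds the odds-implied one.

    Returns the expected profit per unit staked for each retained outcome.
    """
    return {k: model_probs[k] * decimal_odds[k] - 1.0
            for k in model_probs
            if k in decimal_odds and model_probs[k] > 1.0 / decimal_odds[k]}


# Illustrative three-horse race; the numbers are made up.
lgbm_probs = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
xgb_probs = {"H1": 0.4, "H2": 0.4, "H3": 0.2}
win_probs = blend_win_probs(lgbm_probs, xgb_probs)
quinella_probs = harville_quinella(win_probs)
quinella_odds = {("H1", "H2"): 1.8, ("H1", "H3"): 4.0, ("H2", "H3"): 5.0}
bets = value_bets(quinella_probs, quinella_odds)
```

In this made-up race only the H1/H3 pair is flagged, because its model probability exceeds the 25% implied by odds of 4.0. The other two pairs fall below their implied probabilities and are skipped.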
Future Implications
AI analysis grounded in cited sources
Sources (12)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Reddit r/MachineLearning
