
Open-Source ML Pipeline for Hong Kong Horse Racing Prediction

Original source: Reddit r/MachineLearning

💡 Learn how to build a robust ML pipeline for betting markets and identify common pitfalls like data leakage.

⚡ 30-Second TL;DR

What Changed

The project ships LightGBM and XGBoost training pipelines with ensemble prediction capabilities.
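
As a rough illustration of what ensemble prediction can look like, the sketch below averages per-horse win probabilities from two models and renormalizes them within a race. The function name `blend_race_probs`, the weight `w_lgbm`, and the sample scores are hypothetical. How the repository actually combines its models is not described here, and it may use a different method such as stacking.

```python
def blend_race_probs(lgbm_probs, xgb_probs, w_lgbm=0.5):
    """Weighted average of two models' win probabilities for one race,
    renormalized so the blended values sum to 1 across the field."""
    if len(lgbm_probs) != len(xgb_probs):
        raise ValueError("both models must score the same runners")
    if not 0.0 <= w_lgbm <= 1.0:
        raise ValueError("w_lgbm must be in [0, 1]")
    blended = [w_lgbm * a + (1.0 - w_lgbm) * b
               for a, b in zip(lgbm_probs, xgb_probs)]
    total = sum(blended)
    if total <= 0:
        raise ValueError("blended scores must have a positive sum")
    return [p / total for p in blended]


# Hypothetical scores for a four-horse race (not real model output).
lgbm = [0.40, 0.30, 0.20, 0.10]
xgb = [0.30, 0.35, 0.25, 0.10]
blended = blend_race_probs(lgbm, xgb)
```

Renormalizing per race keeps the blended scores usable as probabilities even when the two models' raw outputs do not sum to 1 over the field.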

Why It Matters

This project offers a practical case study for practitioners interested in time-series forecasting and market efficiency analysis. It highlights the risks of data leakage in financial and betting-related ML models.

What To Do Next

Clone the repository and run the provided unit tests to evaluate how your own time-series models handle data leakage and feature engineering.
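
To make the leakage concern concrete, here is a minimal sketch of a leak-safe historical feature: a horse's win rate computed only from races that took place before the one being scored. The function `prior_win_rates` and the `(race_date, horse, won)` record layout are assumptions for illustration, not the repository's code or tests. The sketch also assumes a horse runs at most once per date.

```python
from collections import defaultdict
from datetime import date


def prior_win_rates(races):
    """For each (race_date, horse, won) record, return the horse's win rate
    over earlier races only (None if the horse has no earlier starts)."""
    # Process records in date order but write results back in input order.
    ordered = sorted(enumerate(races), key=lambda item: item[1][0])
    starts = defaultdict(int)
    wins = defaultdict(int)
    rates = [None] * len(races)
    for idx, (_, horse, won) in ordered:
        if starts[horse]:
            rates[idx] = wins[horse] / starts[horse]
        # Counters are updated only after the feature is emitted, so the
        # current race's result never feeds its own feature.
        starts[horse] += 1
        wins[horse] += int(won)
    return rates


# Hypothetical records, deliberately out of chronological order.
races = [
    (date(2024, 1, 1), "A", True),
    (date(2024, 1, 8), "A", False),
    (date(2024, 1, 15), "A", True),
    (date(2024, 1, 8), "B", False),
]
rates = prior_win_rates(races)
```

A leaky version would compute horse A's win rate over all three of its races (2/3) and attach that value to every row, letting each race's outcome inform its own feature.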

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 12 cited sources.

🔑 Enhanced Key Takeaways

  • The pipeline likely draws on publicly available Hong Kong Jockey Club (HKJC) data, a common source for such projects, often supplemented by features engineered from historical race results, horse profiles, and track conditions.
  • Beyond the LightGBM and XGBoost models mentioned, related research in horse racing prediction frequently uses hyperparameter optimization tools (e.g., HyperOpt, Optuna) to tune these models for better predictive accuracy and profitability.
  • The pipeline's betting simulations may incorporate capital management strategies, such as variations of the Kelly criterion, to size bets and maximize long-term profitability, a common practice in quantitative betting models.
  • The project found that models trained without public odds outperformed those trained with odds on Quinella ROI. This suggests that, despite the semi-strong efficiency of betting markets, specific inefficiencies can still be identified and exploited with ML models, which challenges the assumption that all public information is fully priced in.
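
For readers unfamiliar with the Kelly criterion mentioned above, the sketch below computes the Kelly stake fraction for a single bet at decimal odds, with an optional scale factor for fractional Kelly. The function `kelly_fraction` and the example numbers are hypothetical. It treats the odds as fixed at bet time, which is a simplification for pool (pari-mutuel) betting, and it is not taken from the project's simulation code.

```python
def kelly_fraction(p_win, decimal_odds, scale=1.0):
    """Fraction of bankroll to stake on one bet under the Kelly criterion.

    p_win        -- estimated probability that the bet wins
    decimal_odds -- total payout per unit staked, including the stake
    scale        -- 1.0 for full Kelly, 0.5 for half Kelly, and so on
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1.0")
    b = decimal_odds - 1.0                      # net profit per unit staked
    full = (b * p_win - (1.0 - p_win)) / b      # classic Kelly formula
    return max(0.0, full) * scale               # never stake on negative edge


# Hypothetical inputs: 25% estimated win chance at decimal odds of 5.0.
full_kelly = kelly_fraction(0.25, 5.0)
half_kelly = kelly_fraction(0.25, 5.0, scale=0.5)
no_bet = kelly_fraction(0.10, 5.0)
```

Scaling the stake down (fractional Kelly) is a common way to reduce variance when the probability estimates themselves are uncertain.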

🛠️ Technical Deep Dive

  • Data Sources: The pipeline primarily utilizes historical data from the Hong Kong Jockey Club (HKJC), often obtained through scraping or from publicly available datasets like those found on Kaggle. This data typically includes race results, horse information, jockey and trainer statistics, and track conditions.
  • Feature Engineering: Commonly engineered features include each horse's, jockey's, and trainer's win percentage, actual weight carried, declared weight, days since last race, draw (starting gate), and derived metrics such as Beyer Speed Figures. More advanced feature engineering can involve rolling aggregations of past performances and the difference between the race distance and a horse's preferred distance.
  • Model Architectures: While LightGBM and XGBoost are central, the pipeline's ensemble capabilities suggest the use of stacking or blending multiple gradient boosting models. Some related projects also explore deep learning models (e.g., using Keras and TensorFlow) for comparative analysis or integration.
  • Optimization and Evaluation: Models are often optimized for metrics directly related to profitability, such as win-log-loss or return on investment (ROI) within betting simulations. Hyperparameter tuning, using tools like Optuna or HyperOpt, is crucial for enhancing predictive performance. Evaluation metrics also extend to precision, recall, F1-score, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R².
  • Betting Simulation: The pipeline simulates various betting markets including Quinella, QPL, Tierce, and Quartet. This involves identifying "value bets" where the model's estimated probability of an outcome exceeds the implied probability from public odds, potentially incorporating capital allocation strategies like the Kelly criterion to manage risk and maximize returns.
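
The value-bet idea in the last bullet can be sketched as follows, assuming a simple win market quoted in decimal odds rather than the exotic pools (Quinella, QPL, Tierce, Quartet) the pipeline simulates. The helper names `implied_probs` and `find_value_bets`, the `min_edge` threshold, and the sample numbers are all hypothetical.

```python
def implied_probs(decimal_odds):
    """Convert decimal odds to implied probabilities, normalized to sum to 1
    so the market's built-in margin (overround) is removed."""
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must exceed 1.0")
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


def find_value_bets(model_probs, decimal_odds, min_edge=0.0):
    """Flag runners whose model probability beats the market's implied
    probability by more than min_edge."""
    if len(model_probs) != len(decimal_odds):
        raise ValueError("need one probability and one price per runner")
    market = implied_probs(decimal_odds)
    picks = []
    for runner, (p, q, odds) in enumerate(zip(model_probs, market, decimal_odds)):
        edge = p - q
        if edge > min_edge:
            picks.append({
                "runner": runner,
                "edge": edge,
                # Expected profit per unit staked if the model is right.
                "expected_return": p * odds - 1.0,
            })
    return picks


# Hypothetical four-runner race (not real odds or model output).
odds = [2.0, 4.0, 5.0, 10.0]
model = [0.40, 0.30, 0.20, 0.10]
picks = find_value_bets(model, odds, min_edge=0.02)
```

A positive edge against normalized probabilities does not guarantee a positive expected return at the quoted price, which is why the sketch reports both figures.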

🔮 Future Implications

AI analysis grounded in cited sources.

The open-source nature will accelerate innovation in algorithmic betting for horse racing.
Providing a reproducible framework allows a wider community of data scientists and enthusiasts to contribute, test new models, and identify further market inefficiencies.
The findings on public odds could shift research focus towards more complex, non-obvious features.
Demonstrating that models without public odds can outperform those with them encourages exploration of less correlated or proprietary data points to gain an edge.

📎 Sources (12)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. teddykoker.com
  2. kaggle.com
  3. apify.com
  4. medium.com
  5. github.io
  6. marcoseduardoelias.com
  7. turf-wise.com
  8. reddit.com
  9. researchgate.net
  10. medium.com
  11. codeworks.fr
  12. kaggle.com