LakeMLB is a benchmark for machine learning in data lakes. It focuses on multi-table union and join scenarios built from real datasets in government, finance, and other domains. The benchmark supports pre-training and augmentation strategies, evaluates tabular ML methods, and releases its datasets and code.
Key Points
1. Multi-source, multi-table scenarios
2. Three datasets for each of the union and join scenarios
3. Evaluation of table integration strategies
Impact Analysis
LakeMLB fills a gap in benchmarks for machine learning on data lakes. It enables fair comparisons between methods and could help drive research in scalable data lake analytics.
Technical Details
The datasets cover government, finance, and Wikipedia sources. The benchmark tests state-of-the-art tabular learners, and the code is available on GitHub.
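To make the union and join scenarios concrete, here is a minimal pandas sketch of how an auxiliary table can augment a task table in each case. It is an illustration only, not LakeMLB's code: the table names, columns, and values are all hypothetical, and the benchmark's actual integration strategies may differ.

```python
import pandas as pd

# Hypothetical toy task table; none of these columns or values come from LakeMLB.
task = pd.DataFrame({
    "id": [1, 2, 3],
    "region": ["north", "south", "east"],
    "label": [0, 1, 0],
})

# Union scenario (assumed form): an auxiliary table with the same schema
# contributes additional rows.
union_aux = pd.DataFrame({
    "id": [4, 5],
    "region": ["west", "north"],
    "label": [1, 1],
})
unioned = pd.concat([task, union_aux], ignore_index=True)

# Join scenario (assumed form): an auxiliary table sharing a key column
# contributes additional feature columns.
join_aux = pd.DataFrame({
    "region": ["north", "south", "west"],
    "population": [120, 340, 90],
})
joined = task.merge(join_aux, on="region", how="left")

print(unioned.shape)  # (5, 3): two extra rows, same columns
print(joined.shape)   # (3, 4): same rows, one extra column
```

In this sketch the union grows the number of training rows, while the left join keeps the task rows fixed and adds features. Task rows with no match in the auxiliary table (here, "east") get a missing value that a downstream learner has to handle.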