Text2GQL-Bench is a unified benchmark for Text-to-Graph-Query-Language (Text-to-GQL) systems, with 178,184 question-query pairs across 13 domains and multiple graph query languages (GQLs). It pairs a scalable dataset generation framework with a multi-metric evaluation covering grammatical validity, similarity, semantic alignment, and execution accuracy (EX). Evaluations show that LLMs struggle with ISO-GQL: zero-shot execution accuracy is only 4%, rising to 50% with 3-shot prompting and to 45.1% with fine-tuning.
Key Points
1. 178k question-query pairs spanning 13 domains and multiple GQLs
2. Scalable framework for generating diverse datasets
3. Comprehensive evaluation beyond end-to-end metrics
4. LLM gaps on ISO-GQL: 4% zero-shot EX, 45.1% EX after fine-tuning
Impact Analysis
The benchmark addresses gaps in domain coverage and evaluation for Text-to-GQL systems, enabling systematic model comparisons. It highlights the challenges that query-language dialects pose for graph queries, which may spur LLM advances for GDBMS agents. More broadly, it supports making graph data analysis accessible through natural language.
Technical Details
The dataset covers multiple GQLs and draws on heterogeneous resources at different abstraction levels. Reported metrics include grammatical validity (up to 90.8% after fine-tuning) and execution accuracy (45.1% for a fine-tuned 8B model). 3-shot prompting raises EX to 50%, but grammatical validity stays below 70%.
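
To make the two headline metrics concrete, here is a minimal Python sketch of how grammatical validity and execution accuracy could be computed over a set of predictions. It is an illustration under stated assumptions, not the benchmark's own scorer: the summary does not give exact metric definitions, so EX is assumed here to mean an order-insensitive match between the result rows of the predicted and gold queries. The names `evaluate`, `is_valid`, `run`, and the toy `FAKE_DB` are hypothetical placeholders for a real GQL parser and graph database.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Rows = List[Tuple]  # one tuple per returned row


def same_rows(pred: Rows, gold: Rows) -> bool:
    """Compare two result sets as multisets, ignoring row order (an assumption)."""
    return Counter(pred) == Counter(gold)


def evaluate(
    pairs: List[Tuple[str, str]],
    is_valid: Callable[[str], bool],
    run: Callable[[str], Optional[Rows]],
) -> Dict[str, float]:
    """Score (predicted_query, gold_query) pairs.

    is_valid: hypothetical syntax check, True if the query parses.
    run: hypothetical executor, returns rows or None on failure.
    """
    if not pairs:
        raise ValueError("no examples to evaluate")
    valid = 0
    correct = 0
    for predicted, gold in pairs:
        if not is_valid(predicted):
            continue  # an unparsable query cannot be executed
        valid += 1
        pred_rows = run(predicted)
        gold_rows = run(gold)
        if pred_rows is not None and gold_rows is not None:
            if same_rows(pred_rows, gold_rows):
                correct += 1
    total = len(pairs)
    return {
        "grammatical_validity": valid / total,
        "execution_accuracy": correct / total,
    }


# Toy stand-ins for a parser and a graph database (illustrative only).
FAKE_DB: Dict[str, Rows] = {
    "MATCH (p:Person) RETURN p.name": [("Ann",), ("Bob",)],
    "MATCH (x:Person) RETURN x.name": [("Bob",), ("Ann",)],
    "MATCH (p:Person) RETURN p.age": [(30,), (41,)],
}


def toy_is_valid(query: str) -> bool:
    return query.startswith("MATCH") and "RETURN" in query


def toy_run(query: str) -> Optional[Rows]:
    return FAKE_DB.get(query)


GOLD = "MATCH (p:Person) RETURN p.name"
pairs = [
    ("MATCH (x:Person) RETURN x.name", GOLD),  # valid, same rows
    ("MATCH (p:Person) RETURN p.age", GOLD),   # valid, wrong rows
    ("SELECT name FROM Person", GOLD),         # fails the syntax check
    ("MATCH (p:Person) RETURN p.city", GOLD),  # valid, execution fails
]
scores = evaluate(pairs, toy_is_valid, toy_run)
print(scores)
```

Scoring both metrics over the same denominator shows why they can diverge: a query can parse yet return the wrong rows, which is one way validity can sit well above execution accuracy.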