diff --git a/docs/dataset.md b/docs/dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..ef65bd679f3acba7f34bea2e7ce5f4d9d907e358
--- /dev/null
+++ b/docs/dataset.md
@@ -0,0 +1,50 @@
+# CRAG Dataset Documentation
+
+## Dataset Version Information
+
+- **DATASET_DESCRIPTION_VERSION**: `v1`
+- **DATASET_VERSION**: `v1`
+
+## Overview
+
+The CRAG dataset is designed to support the development and evaluation of Retrieval-Augmented Generation (RAG) models. It consists of two main types of data:
+
+1. **Question Answering Pairs:** Pairs of questions and their corresponding answers.
+2. **Retrieval Contents:** Contents for information retrieval to support answer generation.
+
+Retrieval contents are divided into two types to simulate practical scenarios for RAG:
+
+1. **Web Search Results:** For each question, up to `50` **full HTML pages** are stored, retrieved using the question text as a search query. For Task #1, `5` pages are **randomly selected** from the top `10` pages. These pages are likely relevant to the question, but relevance is not guaranteed.
+2. **Mock KGs and APIs:** The Mock APIs are designed to mimic real-world **Knowledge Graphs (KGs)** or **API searches**. Given some input parameters, they return relevant results, which may or may not help answer the user's question.
+
+## Download CRAG Data
+
+- **Task #1:** [Retrieval Summarization Task Page](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/problems/retrieval-summarization/dataset_files)
+- **Task #2:** [Mock API Repository](https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/crag-mock-api)
+- **Task #3:** [End-to-End Retrieval-Augmented Generation Task Page](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/problems/end-to-end-retrieval-augmented-generation)
+
+## Data Schema
+
+| Field Name | Type | Description |
+|------------|------|-------------|
+| `interaction_id` | string | A unique identifier for each example. |
+| `query_time` | string | Date and time when the query and the web search occurred. |
+| `domain` | string | Domain label for the query. Possible values: "finance", "music", "movie", "sports", "open". "Open" covers any factual query outside the other four domains. |
+| `question_type` | string | Type label for the query. Possible values include: "simple", "simple_w_condition", "comparison", "aggregation", "set", "false_premise", "post-processing", "multi-hop". |
+| `static_or_dynamic` | string | Indicates whether the answer to a question changes and the expected rate of change. Possible values: "static", "slow-changing", "fast-changing", and "real-time". |
+| `query` | string | The question for RAG to answer. |
+| `answer` | string | The gold standard answer to the question. |
+| `alternative_answers` | list | Other valid gold standard answers to the question. |
+| `split` | integer | Data split indicator, where 0 is for validation and 1 is for the public test. |
+| `search_results` | list of JSON | Contains up to `k` HTML pages for each query (`k=5` for Task #1 and `k=50` for Task #3), including page name, URL, snippet, full HTML, and last modified time. |
+
+### Search Results Detail
+
+| Key | Type | Description |
+|-----|------|-------------|
+| `page_name` | string | The name of the webpage. |
+| `page_url` | string | The URL of the webpage. |
+| `page_snippet` | string | A short paragraph describing the major content of the page. |
+| `page_result` | string | The full HTML of the webpage. |
+| `page_last_modified` | string | The time when the page was last modified. |
+
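To make the schema in this file concrete, here is a minimal Python sketch of reading records and walking their `search_results`. It rests on assumptions the document does not confirm: that each example is serialized as one JSON object per line, and that the timestamp format shown is representative. The helper names (`parse_records`, `filter_by_split`, `summarize_pages`) and all values in `sample_record` are hypothetical and exist only for illustration; only the field names and the split codes come from the tables.

```python
import json

# Split codes as described in the Data Schema table.
VALIDATION_SPLIT = 0
PUBLIC_TEST_SPLIT = 1


def parse_records(lines):
    """Parse JSON strings into dicts, skipping blank lines.

    Assumption: one JSON object per line. The on-disk format is not
    specified in the documentation, so adapt this to the actual files.
    """
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_split(records, split):
    """Keep only the records whose `split` field equals `split`."""
    return [r for r in records if r.get("split") == split]


def summarize_pages(record):
    """Return (page_name, page_url, HTML length) for each search result."""
    pages = record.get("search_results") or []
    return [
        (p.get("page_name", ""), p.get("page_url", ""), len(p.get("page_result", "")))
        for p in pages
    ]


# Hypothetical record: every value below is made up for illustration.
sample_record = {
    "interaction_id": "example-0001",
    "query_time": "2024-01-01 00:00:00",  # timestamp format is an assumption
    "domain": "open",
    "question_type": "simple",
    "static_or_dynamic": "static",
    "query": "An example question?",
    "answer": "An example answer.",
    "alternative_answers": [],
    "split": VALIDATION_SPLIT,
    "search_results": [
        {
            "page_name": "Example page",
            "page_url": "https://example.com/page",
            "page_snippet": "A short description of the page.",
            "page_result": "<html><body>Example</body></html>",
            "page_last_modified": "",
        }
    ],
}

records = parse_records([json.dumps(sample_record), ""])
validation = filter_by_split(records, VALIDATION_SPLIT)
for name, url, html_chars in summarize_pages(validation[0]):
    print(f"{name} ({url}): {html_chars} characters of HTML")
```

Because `page_result` holds full HTML, a real pipeline would likely extract text from it before retrieval or summarization; that step is left out here since the documentation does not prescribe a method.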