Commit 35b32ad8 authored by mohanty

Merge branch 'baseline-v0' into 'master'

add baseline related docs

See merge request !1
parents 44ee7a6e 01007dcc
@@ -24,6 +24,7 @@ This repository is the CRAG: Comprehensive RAG Benchmark **Submission template an
- [How to make a submission?](#-how-to-make-a-submission)
- [What hardware does my code run on?](#-what-hardware-does-my-code-run-on-)
- [How are my model responses parsed by the evaluators?](#-how-are-my-model-responses-parsed-by-the-evaluators-)
- [Baselines](#baselines)
6. [Frequently Asked Questions](#-frequently-asked-questions)
7. [Important Links](#-important-links)
@@ -101,6 +102,8 @@ This also includes instructions on [specifying your software runtime](docs/submi
You can find more details about the hardware and system configuration in [docs/hardware-and-system-config.md](docs/hardware-and-system-config.md).
In summary, we provide you `4` x [NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/).
## 🏁 Baselines
We include two baselines for demonstration purposes, and you can read more about them in [docs/baselines.md](docs/baselines.md).
# ❓ Frequently Asked Questions
## Which track is this starter kit for?
# CRAG Baselines
For the CRAG benchmark, we provide two baseline models to help participants get started. Detailed implementations of these baselines are accessible through the links provided below, and participants are encouraged to use them as a starting point for the competition.
Please note that these baselines are **NOT** tuned for performance or efficiency, and are provided as is for demonstration.
## Available Baseline Models:
1. **Vanilla Llama 2 Model**: For an implementation guide and further details, refer to the Vanilla Llama 2 model documentation [here](../models/vanilla_llama_baseline.py).
2. **RAG Baseline Model**: For an implementation guide and further details, refer to the RAG Baseline model documentation [here](../models/rag_llm_model.py).
## Preparing Your Submission:
Before you can submit your solutions using these baselines, it is necessary to download the model weights and incorporate them into this repository. To do this, follow the step-by-step instructions outlined in the document: [download_baseline_model_weights.md](download_baseline_model_weights.md).
Additionally, ensure that your configurations in [user_config.py](../models/user_config.py) correctly reference the model class you intend to use for your submission.
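For example, to submit with the RAG baseline, `user_config.py` only needs to point `UserModel` at that class. A minimal sketch (the module path is assumed; adjust it to match where the class actually lives in your checkout):

```python
# models/user_config.py -- tell the evaluator which model class to instantiate
from models.rag_llama_baseline import RAGModel  # assumed module path

UserModel = RAGModel
```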
These steps are crucial for a successful submission. Make sure to follow them carefully. Good luck!
### Setting Up and Downloading Baseline Model Weights with Hugging Face
This guide outlines the steps to download (and check in) the model weights required for the baseline models.
We will focus on the `Llama-2-7b-chat-hf` and `all-MiniLM-L6-v2` models.
The steps should work equally well for any other model on Hugging Face.
#### Preliminary Steps:
1. **Install the Hugging Face Hub Package**:
Begin by installing the `huggingface_hub` package, which includes the `hf_transfer` utility, by running the following command in your terminal:
```bash
pip install "huggingface_hub[hf_transfer]"
```
2. **Accept the LLaMA Terms**:
You must accept the LLaMA model's terms of use by visiting: [LLaMA-2-7b-chat-hf Terms](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
3. **Create a Hugging Face CLI Token**:
Generate a CLI token by navigating to: [Hugging Face Token Settings](https://huggingface.co/settings/tokens). You will need this token for authentication.
#### Hugging Face Authentication:
1. **Login via CLI**:
Authenticate yourself with the Hugging Face CLI using the token created in the previous step. Run:
```bash
huggingface-cli login
```
When prompted, enter the token.
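If you prefer to authenticate non-interactively (for example from a setup script), the `huggingface_hub` Python API provides the same login; a minimal sketch, assuming your token is exported in a (hypothetical) `HF_TOKEN` environment variable:

```python
import os

from huggingface_hub import login

# Reads the access token from the environment instead of prompting for it.
login(token=os.environ["HF_TOKEN"])
```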
#### Model Downloads:
1. **Download LLaMA-2-7b Model**:
Execute the following command to download the `Llama-2-7b-chat-hf` model to a local subdirectory. This command excludes unnecessary files to save space:
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
meta-llama/Llama-2-7b-chat-hf \
--local-dir-use-symlinks False \
--local-dir models/meta-llama/Llama-2-7b-chat-hf \
--exclude "*.bin" # These are alternates to the safetensors and hence not needed
```
2. **Download MiniLM-L6-v2 Model (for sentence embeddings)**:
Similarly, download the `sentence-transformers/all-MiniLM-L6-v2` model using the following command:
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
sentence-transformers/all-MiniLM-L6-v2 \
--local-dir-use-symlinks False \
--local-dir models/sentence-transformers/all-MiniLM-L6-v2 \
--exclude "*.bin" "*.h5" "*.ot" # These are alternates to the safetensors and hence not needed
```
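If you would rather script these downloads, the `huggingface_hub` Python API offers an equivalent route; a minimal sketch using `snapshot_download` with the same target directories and exclusion patterns as the commands above:

```python
from huggingface_hub import snapshot_download

# (Optionally set HF_HUB_ENABLE_HF_TRANSFER=1 in the environment to keep the faster transfer backend.)

# Llama-2 chat model: skip the *.bin files, the safetensors weights are sufficient.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="models/meta-llama/Llama-2-7b-chat-hf",
    ignore_patterns=["*.bin"],
)

# Sentence-embedding model: keep only the safetensors weights here as well.
snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="models/sentence-transformers/all-MiniLM-L6-v2",
    ignore_patterns=["*.bin", "*.h5", "*.ot"],
)
```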
#### Version Control with Git LFS:
1. **Track Model Weights**:
Use Git Large File Storage (LFS) to track the model directories. This ensures efficient handling of large files:
```bash
git lfs track "models/meta-llama/*"
git lfs track "models/sentence-transformers/*"
```
2. **Commit and Push**:
Add the models to your Git repository, commit the changes, and push them to your remote repository:
```bash
git add models/
git commit -am "add weights"
git push origin master
```
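Before relying on the checked-in weights at evaluation time, it can help to sanity-check that both models load from the local paths the baselines expect; a minimal sketch (requires the same dependencies as the baseline code):

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Both of these should load purely from the local directories checked into the repo.
tokenizer = AutoTokenizer.from_pretrained("models/meta-llama/Llama-2-7b-chat-hf")
embedder = SentenceTransformer("models/sentence-transformers/all-MiniLM-L6-v2")

print("Llama tokenizer vocab size:", tokenizer.vocab_size)
print("MiniLM embedding dimension:", embedder.get_sentence_embedding_dimension())
```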
import os
from typing import List
import numpy as np
import torch
from blingfire import text_to_sentences_and_offsets
from bs4 import BeautifulSoup
from models.utils import trim_predictions_to_max_token_length
from sentence_transformers import SentenceTransformer
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
pipeline,
)
######################################################################################################
######################################################################################################
###
### IMPORTANT !!!
### Before submitting, please follow the instructions in the docs below to download
### and check in the model weights.
###
### https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/download_baseline_model_weights.md
###
###
### DISCLAIMER: This baseline has NOT been tuned for performance
### or efficiency, and is provided as is for demonstration.
######################################################################################################
# Load the environment variable that specifies the URL of the MockAPI. This URL is essential
# for accessing the correct API endpoint in Task 2 and Task 3. The value of this environment variable
# may vary across different evaluation settings, emphasizing the importance of dynamically obtaining
# the API URL to ensure accurate endpoint communication.
# Please refer to https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/crag-mock-api
# for more information on the MockAPI.
#
# **Note**: This environment variable will not be available for Task 1 evaluations.
CRAG_MOCK_API_URL = os.getenv("CRAG_MOCK_API_URL", "http://localhost:8000")
class RAGModel:
def __init__(self):
"""
Initialize your model(s) here if necessary.
        This is the constructor for your RAGModel class, where you can set up any
required initialization steps for your model(s) to function correctly.
"""
self.sentence_model = SentenceTransformer('models/sentence-transformers/all-MiniLM-L6-v2', device='cuda')
self.num_context = 10
self.max_ctx_sentence_length = 1000 # characters
        self.prompt_template = """You are given a question and references which may or may not help answer the question.
You are to respond with just the answer and no surrounding sentences.
If you are unsure about the answer, respond with "I don't know".
### Question
{query}
### References
{references}
### Answer"""
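        # Quantization config: load the 7B chat model in 4-bit (NF4) so it fits
        # comfortably in the memory of the provided NVIDIA T4 GPUs.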
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=False,
)
model_name = "models/meta-llama/Llama-2-7b-chat-hf"
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.llm = AutoModelForCausalLM.from_pretrained(
model_name,
device_map='auto',
quantization_config=bnb_config,
torch_dtype=torch.float16,
)
self.generation_pipe = pipeline(task="text-generation",
model=self.llm,
tokenizer=self.tokenizer,
max_new_tokens=10)
def generate_answer(self, query: str, search_results: List[str]) -> str:
"""
Generate an answer based on a provided query and a list of pre-cached search results.
Parameters:
- query (str): The user's question or query input.
        - search_results (List[str]): A list of pre-fetched search results for the query.
          Each element is accessed below as a dict whose 'page_result' field contains
          the HTML text of a web page retrieved for the query.
Returns:
- (str): A plain text response that answers the query. This response is limited to 75 tokens.
If the generated response exceeds 75 tokens, it will be truncated to fit within this limit.
Notes:
- If the correct answer is uncertain, it's preferable to respond with "I don't know" to avoid
the penalty for hallucination.
- Response Time: Ensure that your model processes and responds to each query within 10 seconds.
Failing to adhere to this time constraint **will** result in a timeout during evaluation.
"""
all_sentences = []
for html_text in search_results:
soup = BeautifulSoup(html_text['page_result'], features="html.parser")
text = soup.get_text().replace('\n', '')
if len(text) > 0:
offsets = text_to_sentences_and_offsets(text)[1]
for ofs in offsets:
sentence = text[ofs[0]:ofs[1]]
all_sentences.append(sentence[:self.max_ctx_sentence_length])
else:
all_sentences.append('')
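        # Embed every candidate sentence and the query; because the embeddings are
        # L2-normalized, the dot products computed below are cosine similarities.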
all_embeddings = self.sentence_model.encode(all_sentences, normalize_embeddings=True)
query_embedding = self.sentence_model.encode(query, normalize_embeddings=True)[None, :]
cosine_scores = (all_embeddings * query_embedding).sum(1)
top_sentences = np.array(all_sentences)[(-cosine_scores).argsort()[:self.num_context]]
references = ''
for snippet in top_sentences:
references += '<DOC>\n' + snippet + '\n</DOC>\n'
references = ' '.join(references.split()[:500])
final_prompt = self.prompt_template.format(query=query, references=references)
result = self.generation_pipe(final_prompt)[0]['generated_text']
        answer = result.split("### Answer")[1].strip()
# Trim prediction to a max of 75 tokens
trimmed_answer = trim_predictions_to_max_token_length(answer)
return trimmed_answer
from models.dummy_model import DummyModel

UserModel = DummyModel
# Uncomment the lines below to use the Vanilla LLAMA baseline
# from models.vanilla_llama import ChatModel
# UserModel = ChatModel
# Uncomment the lines below to use the RAG LLAMA baseline
# from models.rag_llama_baseline import RAGModel
# UserModel = RAGModel
import os
from typing import List
import numpy as np
import torch
from models.utils import trim_predictions_to_max_token_length
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
pipeline,
)
######################################################################################################
######################################################################################################
###
### IMPORTANT !!!
### Before submitting, please follow the instructions in the docs below to download
### and check in the model weights.
###
### https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/download_baseline_model_weights.md
###
###
### DISCLAIMER: This baseline has NOT been tuned for performance
### or efficiency, and is provided as is for demonstration.
######################################################################################################
# Load the environment variable that specifies the URL of the MockAPI. This URL is essential
# for accessing the correct API endpoint in Task 2 and Task 3. The value of this environment variable
# may vary across different evaluation settings, emphasizing the importance of dynamically obtaining
# the API URL to ensure accurate endpoint communication.
# Please refer to https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/crag-mock-api
# for more information on the MockAPI.
#
# **Note**: This environment variable will not be available for Task 1 evaluations.
CRAG_MOCK_API_URL = os.getenv("CRAG_MOCK_API_URL", "http://localhost:8000")
class ChatModel:
def __init__(self):
"""
Initialize your model(s) here if necessary.
        This is the constructor for your ChatModel class, where you can set up any
required initialization steps for your model(s) to function correctly.
"""
        self.prompt_template = """You are given a question and references which may or may not help answer the question. Your goal is to answer the question in as few words as possible.
### Question
{query}
### Answer"""
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=False,
)
model_name = "models/meta-llama/Llama-2-7b-chat-hf"
if not os.path.exists(model_name):
raise Exception(f"""
The evaluators expect the model weights to be checked into the repository,
but we could not find the model weights at {model_name}
Please follow the instructions in the docs below to download and check in the model weights.
https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/dataset.md
""")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.llm = AutoModelForCausalLM.from_pretrained(
model_name,
device_map='auto',
quantization_config=bnb_config,
torch_dtype=torch.float16,
)
self.generation_pipe = pipeline(task="text-generation",
model=self.llm,
tokenizer=self.tokenizer,
max_new_tokens=75)
def generate_answer(self, query: str, search_results: List[str]) -> str:
"""
Generate an answer based on a provided query and a list of pre-cached search results.
Parameters:
- query (str): The user's question or query input.
- search_results (List[str]): A list containing the text content from web pages
retrieved as search results for the query. Each element in the list is a string
representing the HTML text of a web page.
Returns:
- (str): A plain text response that answers the query. This response is limited to 75 tokens.
If the generated response exceeds 75 tokens, it will be truncated to fit within this limit.
Notes:
- If the correct answer is uncertain, it's preferable to respond with "I don't know" to avoid
the penalty for hallucination.
- Response Time: Ensure that your model processes and responds to each query within 10 seconds.
Failing to adhere to this time constraint **will** result in a timeout during evaluation.
"""
final_prompt = self.prompt_template.format(query=query)
result = self.generation_pipe(final_prompt)[0]['generated_text']
answer = result.split("### Answer")[1].strip()
# Trim prediction to a max of 75 tokens
trimmed_answer = trim_predictions_to_max_token_length(answer)
return trimmed_answer