Commit 02dddb6d authored by mohanty

Merge branch 'batch_predict_interface_v0' into 'master'

Batch predict interface v0

See merge request !4
parents f5ab8ab7 5e714bdf
## This is an example Dockerfile you can change to make submissions on aicrowd
## To use it, place it in the base of the repo, and remove the underscore (_) from the filename
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive

COPY apt.txt /tmp/apt.txt
RUN apt -qq update && apt -qq install -y --no-install-recommends `cat /tmp/apt.txt` \
 && rm -rf /var/cache/*

RUN apt install -y locales wget build-essential

# Unicode support:
RUN locale-gen en_US.UTF-8
@@ -49,6 +49,7 @@ docker run \
    -v "$(pwd)":/submission \
    -w /submission \
    -e OPENAI_API_KEY=$OPENAI_API_KEY \
    --ipc=host \
    $IMAGE_NAME python local_evaluation.py

# Note: We assume you have nvidia-container-toolkit installed and configured
@@ -7,9 +7,9 @@ Please note that these baselines are **NOT** tuned for performance or efficiency
## Available Baseline Models:
1. [**Vanilla Llama 3 Model**](../models/vanilla_llama_baseline.py): For an implementation guide and further details, refer to the Vanilla Llama 3 model inline documentation [here](../models/vanilla_llama_baseline.py).
2. [**RAG Baseline Model**](../models/rag_llm_model.py): For an implementation guide and further details, refer to the RAG Baseline model inline documentation [here](../models/rag_llm_model.py).
## Preparing Your Submission:
## Batch Prediction Interface
- Date: `14-05-2024`
Your submitted models can now make batch predictions on the test set, allowing you to fully utilize the multi-GPU setup available during evaluations.
### Changes to Your Code
1. **Add a `get_batch_size()` Function:**
- This function should return an integer in the range `[1, 16]`; the maximum batch size currently supported is 16.
- You can also choose the batch size dynamically.
- This function is a **required** interface for your model class.
2. **Replace `generate_answer` with `batch_generate_answer`:**
- Update your code to replace the `generate_answer` function with `batch_generate_answer`.
- For more details on the `batch_generate_answer` interface, please refer to the inline documentation in [dummy_model.py](../models/dummy_model.py).
```python
# Old Interface
def generate_answer(self, query: str, search_results: List[Dict], query_time: str) -> str:
    ....
    ....
    return answer

# New Interface
def batch_generate_answer(self, batch: Dict[str, Any]) -> List[str]:
    batch_interaction_ids = batch["interaction_id"]
    queries = batch["query"]
    batch_search_results = batch["search_results"]
    query_times = batch["query_time"]
    ....
    ....
    return [answer1, answer2, ......, answerN]
```
- The new function should return a list of answers (`List[str]`) instead of a single answer (`str`).
- The simplest example of a valid submission with the new interface is as follows:
```python
class DummyModel:
    def get_batch_size(self) -> int:
        return 4

    def batch_generate_answer(self, batch: Dict[str, Any]) -> List[str]:
        queries = batch["query"]
        answers = ["i don't know" for _ in queries]
        return answers
```
### Backward Compatibility
To ensure a smooth transition, the evaluators will maintain backward compatibility with the `generate_answer` interface for a short period. However, we strongly recommend updating your code to use the `batch_generate_answer` interface to avoid any disruptions when support for the older interface is removed in the coming weeks.
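If you still have a model written against the old interface, a minimal adapter along the following lines can bridge the two while you migrate. This is an illustrative sketch, not part of the starter kit; `legacy_model` is assumed to expose the old `generate_answer(query, search_results, query_time)` method:

```python
from typing import Any, Dict, List


class BatchAdapter:
    """Illustrative wrapper exposing the new batch interface on top of a legacy single-query model."""

    def __init__(self, legacy_model):
        # legacy_model is assumed to implement the old generate_answer(query, search_results, query_time) -> str
        self.legacy_model = legacy_model

    def get_batch_size(self) -> int:
        return 1  # smallest allowed value; raise it once your model is genuinely batched

    def batch_generate_answer(self, batch: Dict[str, Any]) -> List[str]:
        answers = []
        for query, search_results, query_time in zip(
            batch["query"], batch["search_results"], batch["query_time"]
        ):
            answers.append(self.legacy_model.generate_answer(query, search_results, query_time))
        return answers
```

Wrapping a legacy model this way processes queries one at a time, so it forfeits the multi-GPU batching benefit; treat it only as a stopgap.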
### Setting Up and Downloading Baseline Model Weights with Hugging Face
This guide outlines the steps to download (and check in) the model weights required for the baseline models.
We will focus on the `Meta-Llama-3-8B-Instruct` and `all-MiniLM-L6-v2` models.
But the steps should work equally well for any other models on Hugging Face.
#### Preliminary Steps:
@@ -16,7 +16,7 @@ But the steps should work equally well for any other models on hugging face.
2. **Accept the LLaMA Terms**:
You must accept the LLaMA model's terms of use by visiting: [meta-llama/Meta-Llama-3-8B-Instruct Terms](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
3. **Create a Hugging Face CLI Token**:
@@ -38,14 +38,14 @@ But the steps should work equally well for any other models on hugging face.
1. **Download the Meta-Llama-3-8B-Instruct Model**:
Execute the following command to download the `Meta-Llama-3-8B-Instruct` model to a local subdirectory. This command excludes unnecessary files to save space (a Python alternative covering steps 1 and 3 is sketched after this list):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir-use-symlinks False \
    --local-dir models/meta-llama/Meta-Llama-3-8B-Instruct \
    --exclude *.pth # These are alternates to the safetensors, hence not needed
```
3. **Download MiniLM-L6-v2 Model (for sentence embeddings)**:
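If you prefer scripting these downloads from Python rather than the CLI, a roughly equivalent sketch uses `huggingface_hub.snapshot_download`. The repositories and `*.pth` exclusion mirror the `huggingface-cli` command in step 1 above; the MiniLM target path is an assumption that follows the same `models/` layout:

```python
from huggingface_hub import snapshot_download

# Step 1 equivalent: same repo id, local directory, and *.pth exclusion as the CLI command above.
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="models/meta-llama/Meta-Llama-3-8B-Instruct",
    ignore_patterns=["*.pth"],  # the original .pth checkpoints duplicate the safetensors weights
)

# Step 3 equivalent: sentence-embedding model; target path assumed to mirror the models/ layout.
snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="models/sentence-transformers/all-MiniLM-L6-v2",
)
```

This assumes you have already authenticated with a Hugging Face token that has accepted the Llama 3 license (for example via `huggingface-cli login`).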
@@ -20,7 +20,6 @@ Besides, the following restrictions will also be imposed:
- Each team will be able to make up to **4 submissions per week per track**, and will be allowed an additional quota of up to **4 failed submissions per task per week**.
Based on the hardware and system configuration, we recommend that participants begin with 7B and 13B models. According to our experiments, models like Llama-2 13B can perform inference smoothly on 4 NVIDIA T4 GPUs, while larger models will result in OOM.
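As a rough, illustrative aid for sizing your setup (not part of the starter kit), you can derive a tensor-parallel degree from the GPUs actually visible on the node; `pick_tensor_parallel_size` is a hypothetical helper name:

```python
import torch


def pick_tensor_parallel_size(max_size: int = 4) -> int:
    """Use as many GPUs as are visible, capped at max_size, falling back to 1 on CPU-only machines."""
    available = torch.cuda.device_count()
    return max(1, min(available, max_size))


print(pick_tensor_parallel_size())  # e.g. 4 on a node with 4 x NVIDIA T4 GPUs
```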
@@ -36,9 +36,7 @@ def attempt_api_call(client, model_name, messages, max_retries=10):
            )
            return response.choices[0].message.content
        except (APIConnectionError, RateLimitError):
            logger.warning(f"API call failed on attempt {attempt + 1}, retrying...")
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            break
@@ -69,9 +67,7 @@ def parse_response(resp: str):
        ):
            answer = 1
        else:
            raise ValueError(f"Could not parse answer from response: {model_resp}")
        return answer
    except:
@@ -79,56 +75,97 @@ def parse_response(resp: str):
def trim_predictions_to_max_token_length(prediction):
    """Trims prediction output to 75 tokens using Llama2 tokenizer"""
    max_token_length = 75
    tokenized_prediction = tokenizer.encode(prediction)
    trimmed_tokenized_prediction = tokenized_prediction[1 : max_token_length + 1]
    trimmed_prediction = tokenizer.decode(trimmed_tokenized_prediction)
    return trimmed_prediction
def load_data_in_batches(dataset_path, batch_size):
    """
    Generator function that reads data from a compressed file and yields batches of data.
    Each batch is a dictionary containing lists of interaction_ids, queries, search results, query times, and answers.

    Args:
    dataset_path (str): Path to the dataset file.
    batch_size (int): Number of data items in each batch.

    Yields:
    dict: A batch of data.
    """
    def initialize_batch():
        """ Helper function to create an empty batch. """
        return {"interaction_id": [], "query": [], "search_results": [], "query_time": [], "answer": []}

    try:
        with bz2.open(dataset_path, "rt") as file:
            batch = initialize_batch()
            for line in file:
                try:
                    item = json.loads(line)
                    for key in batch:
                        batch[key].append(item[key])

                    if len(batch["query"]) == batch_size:
                        yield batch
                        batch = initialize_batch()
                except json.JSONDecodeError:
                    logger.warn("Warning: Failed to decode a line.")

            # Yield any remaining data as the last batch
            if batch["query"]:
                yield batch
    except FileNotFoundError as e:
        logger.error(f"Error: The file {dataset_path} was not found.")
        raise e
    except IOError as e:
        logger.error(f"Error: An error occurred while reading the file {dataset_path}.")
        raise e
def generate_predictions(dataset_path, participant_model):
    """
    Processes batches of data from a dataset to generate predictions using a model.

    Args:
    dataset_path (str): Path to the dataset.
    participant_model (object): UserModel that provides `get_batch_size()` and `batch_generate_answer()` interfaces.

    Returns:
    tuple: A tuple containing lists of queries, ground truths, and predictions.
    """
    queries, ground_truths, predictions = [], [], []

    batch_size = participant_model.get_batch_size()

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating predictions"):
        batch_ground_truths = batch.pop("answer")  # Remove answers from batch and store them
        batch_predictions = participant_model.batch_generate_answer(batch)
        queries.extend(batch["query"])
        ground_truths.extend(batch_ground_truths)
        predictions.extend(batch_predictions)

    return queries, ground_truths, predictions
def evaluate_predictions(queries, ground_truths, predictions, evaluation_model_name, openai_client):
    n_miss, n_correct, n_correct_exact = 0, 0, 0
    system_message = get_system_message()

    for _idx, prediction in enumerate(tqdm(
        predictions, total=len(predictions), desc="Evaluating Predictions"
    )):
        query = queries[_idx]
        ground_truth = ground_truths[_idx].strip()
        # trim prediction to 75 tokens using Llama2 tokenizer
        prediction = trim_predictions_to_max_token_length(prediction)
        prediction = prediction.strip()

        ground_truth_lowercase = ground_truth.lower()
        prediction_lowercase = prediction.lower()

        messages = [
            {"role": "system", "content": system_message},
            {
@@ -136,17 +173,15 @@ def evaluate_predictions(predictions, evaluation_model_name, openai_client):
                "content": f"Question: {query}\n Ground truth: {ground_truth}\n Prediction: {prediction}\n",
            },
        ]
        if "i don't know" in prediction_lowercase:
            n_miss += 1
            continue
        elif prediction_lowercase == ground_truth_lowercase:
            n_correct_exact += 1
            n_correct += 1
            continue

        response = attempt_api_call(openai_client, evaluation_model_name, messages)
        if response:
            log_response(messages, response)
            eval_res = parse_response(response)
@@ -173,16 +208,14 @@ if __name__ == "__main__":
    from models.user_config import UserModel

    DATASET_PATH = "example_data/dev_data.jsonl.bz2"
    EVALUATION_MODEL_NAME = os.getenv("EVALUATION_MODEL_NAME", "gpt-4-0125-preview")

    # Generate predictions
    participant_model = UserModel()
    queries, ground_truths, predictions = generate_predictions(DATASET_PATH, participant_model)

    # Evaluate Predictions
    openai_client = OpenAI()
    evaluation_results = evaluate_predictions(
        queries, ground_truths, predictions, EVALUATION_MODEL_NAME, openai_client
    )
@@ -4,7 +4,7 @@
For a streamlined experience, we suggest placing the code for all your models within the `models` directory. This is a recommendation for organizational purposes, but it's not a strict requirement.
## Model Base Class
Your models should follow the format from the `DummyModel` class found in [dummy_model.py](dummy_model.py). We provide the example model, `dummy_model.py`, to illustrate the structure of your own model. Crucially, your model class must implement the `batch_generate_answer` method.
## Selecting which model to use
To ensure your model is recognized and utilized correctly, please specify your model class name in the [`user_config.py`](user_config.py) file, by following the instructions in the inline comments.
@@ -12,13 +12,19 @@ To ensure your model is recognized and utilized correctly, please specify your m
## Model Inputs and Outputs
### Inputs
Your model will receive a batch of input queries as a dictionary, where the dictionary has the following keys:

```
- 'query' (List[str]): List of user queries.
- 'search_results' (List[List[Dict]]): List of search result lists, each corresponding
                                       to a query. Please refer to the following link for
                                       more details about the individual search objects:
                                       https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/dataset.md#search-results-detail
- 'query_time' (List[str]): List of timestamps (represented as a string), each corresponding to when a query was made.
```
### Outputs
The output from your model's `batch_generate_answer` function should be a list of string responses for all the queries in the input batch.
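To make the expected shapes concrete, here is a small illustrative sketch using the example `DummyModel` from [dummy_model.py](dummy_model.py); the batch values below are placeholders, not real dataset entries:

```python
from models.dummy_model import DummyModel

model = DummyModel()

# Illustrative batch with two queries; every list has one entry per query.
batch = {
    "interaction_id": ["toy-id-1", "toy-id-2"],                      # placeholder ids
    "query": ["who wrote hamlet?", "what is the capital of peru?"],
    "search_results": [[], []],                                      # one list of search result dicts per query
    "query_time": ["03/05/2024, 23:18:51 PT", "03/05/2024, 23:19:02 PT"],  # placeholder timestamps
}

answers = model.batch_generate_answer(batch)
assert len(answers) == len(batch["query"])  # one string response per query
```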
## Internet Access
Your model will not have access to the internet during evaluation.
import os
from typing import Any, Dict, List

from models.utils import trim_predictions_to_max_token_length
@@ -24,22 +24,35 @@ class DummyModel:
        """
        pass

    def get_batch_size(self) -> int:
        """
        Determines the batch size that is used by the evaluator when calling the `batch_generate_answer` function.

        Returns:
            int: The batch size, an integer between 1 and 16. This value indicates how many
                 queries should be processed together in a single batch. It can be dynamic
                 across different batch_generate_answer calls, or stay a static value.
        """
        self.batch_size = 4
        return self.batch_size

    def batch_generate_answer(self, batch: Dict[str, Any]) -> List[str]:
        """
        Generates answers for a batch of queries using associated (pre-cached) search results and query times.

        Parameters:
            batch (Dict[str, Any]): A dictionary containing a batch of input queries with the following keys:
                - 'interaction_id' (List[str]): List of interaction_ids for the associated queries
                - 'query' (List[str]): List of user queries.
                - 'search_results' (List[List[Dict]]): List of search result lists, each corresponding
                                                       to a query. Please refer to the following link for
                                                       more details about the individual search objects:
                                                       https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/dataset.md#search-results-detail
                - 'query_time' (List[str]): List of timestamps (represented as a string), each corresponding to when a query was made.

        Returns:
            List[str]: A list of plain text responses for each query in the batch. Each response is limited to 75 tokens.
            If the generated response exceeds 75 tokens, it will be truncated to fit within this limit.

        Notes:
        - If the correct answer is uncertain, it's preferable to respond with "I don't know" to avoid
@@ -47,10 +60,14 @@ class DummyModel:
        - Response Time: Ensure that your model processes and responds to each query within 10 seconds.
          Failing to adhere to this time constraint **will** result in a timeout during evaluation.
        """
        batch_interaction_ids = batch["interaction_id"]
        queries = batch["query"]
        search_results = batch["search_results"]
        query_times = batch["query_time"]

        answers = []
        for idx, query in enumerate(queries):
            # Implement logic to generate answers based on search results and query times
            answers.append("i don't know")  # Default placeholder response

        return answers
# isort: skip_file
from models.dummy_model import DummyModel
UserModel = DummyModel

# Uncomment the lines below to use the Vanilla LLAMA baseline
# from models.vanilla_llama_baseline import InstructModel
# UserModel = InstructModel

# Uncomment the lines below to use the RAG LLAMA baseline
import os
from typing import Any, Dict, List

import numpy as np
import torch
import vllm

from models.utils import trim_predictions_to_max_token_length

######################################################################################################
######################################################################################################
@@ -20,6 +15,8 @@ from transformers import (
###
### https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/download_baseline_model_weights.md
###
### And please pay special attention to the comments that start with "TUNE THIS VARIABLE"
### as they depend on your model and the available GPU resources.
###
### DISCLAIMER: This baseline has NOT been tuned for performance
### or efficiency, and is provided as is for demonstration.
@@ -38,33 +35,35 @@ from transformers import (
CRAG_MOCK_API_URL = os.getenv("CRAG_MOCK_API_URL", "http://localhost:8000")
#### CONFIG PARAMETERS ---
# Batch size you wish the evaluators to use when calling the `batch_generate_answer` function
AICROWD_SUBMISSION_BATCH_SIZE = 8  # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.

# VLLM Parameters
VLLM_TENSOR_PARALLEL_SIZE = 4  # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
VLLM_GPU_MEMORY_UTILIZATION = 0.85  # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
#### CONFIG PARAMETERS END---
class InstructModel:
    def __init__(self):
        """
        Initialize your model(s) here if necessary.
        This is the constructor for your InstructModel class, where you can set up any
        required initialization steps for your model(s) to function correctly.
        """
        self.initialize_models()

    def initialize_models(self):
        # Initialize Meta Llama 3 - 8B Instruct Model
        self.model_name = "models/meta-llama/Meta-Llama-3-8B-Instruct"

        if not os.path.exists(self.model_name):
            raise Exception(
                f"""
            The evaluators expect the model weights to be checked into the repository,
            but we could not find the model weights at {self.model_name}

            Please follow the instructions in the docs below to download and check in the model weights.
@@ -72,38 +71,46 @@ class ChatModel:
                """
            )

        # initialize the model with vllm
        self.llm = vllm.LLM(
            self.model_name,
            tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
            gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
            trust_remote_code=True,
            dtype="half",  # note: bfloat16 is not supported on nvidia-T4 GPUs
            enforce_eager=True
        )
        self.tokenizer = self.llm.get_tokenizer()

    def get_batch_size(self) -> int:
        """
        Determines the batch size that is used by the evaluator when calling the `batch_generate_answer` function.

        Returns:
            int: The batch size, an integer between 1 and 16. This value indicates how many
                 queries should be processed together in a single batch. It can be dynamic
                 across different batch_generate_answer calls, or stay a static value.
        """
        self.batch_size = AICROWD_SUBMISSION_BATCH_SIZE
        return self.batch_size
    def batch_generate_answer(self, batch: Dict[str, Any]) -> List[str]:
        """
        Generates answers for a batch of queries using associated (pre-cached) search results and query times.

        Parameters:
            batch (Dict[str, Any]): A dictionary containing a batch of input queries with the following keys:
                - 'interaction_id' (List[str]): List of interaction_ids for the associated queries
                - 'query' (List[str]): List of user queries.
                - 'search_results' (List[List[Dict]]): List of search result lists, each corresponding
                                                       to a query. Please refer to the following link for
                                                       more details about the individual search objects:
                                                       https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/meta-comphrehensive-rag-benchmark-starter-kit/-/blob/master/docs/dataset.md#search-results-detail
                - 'query_time' (List[str]): List of timestamps (represented as a string), each corresponding to when a query was made.

        Returns:
            List[str]: A list of plain text responses for each query in the batch. Each response is limited to 75 tokens.
            If the generated response exceeds 75 tokens, it will be truncated to fit within this limit.

        Notes:
        - If the correct answer is uncertain, it's preferable to respond with "I don't know" to avoid
@@ -111,12 +118,64 @@ class ChatModel:
        - Response Time: Ensure that your model processes and responds to each query within 10 seconds.
          Failing to adhere to this time constraint **will** result in a timeout during evaluation.
        """
        batch_interaction_ids = batch["interaction_id"]
        queries = batch["query"]
        batch_search_results = batch["search_results"]
        query_times = batch["query_time"]

        formatted_prompts = self.format_prompts(queries, query_times)

        # Generate responses via vllm
        responses = self.llm.generate(
            formatted_prompts,
            vllm.SamplingParams(
                n=1,  # Number of output sequences to return for each prompt.
                top_p=0.9,  # Float that controls the cumulative probability of the top tokens to consider.
                temperature=0.1,  # randomness of the sampling
                skip_special_tokens=True,  # Whether to skip special tokens in the output.
                max_tokens=50,  # Maximum number of tokens to generate per output sequence.
                # Note: We are using 50 max new tokens instead of 75,
                # because the 75 max token limit is checked using the Llama2 tokenizer.
                # The Llama3 model instead uses a different tokenizer with a larger vocabulary.
                # This allows it to represent the same content more efficiently, using fewer tokens.
            ),
            use_tqdm=False
        )

        # Aggregate answers into List[str]
        answers = []
        for response in responses:
            answers.append(response.outputs[0].text)

        return answers
    def format_prompts(self, queries, query_times):
        """
        Formats queries and corresponding query_times using the chat_template of the model.

        Parameters:
        - queries (list of str): A list of queries to be formatted into prompts.
        - query_times (list of str): A list of query_time strings corresponding to each query.
        """
        system_prompt = "You are provided with a question and various references. Your task is to answer the question succinctly, using the fewest words possible. If the references do not contain the necessary information to answer the question, respond with 'I don't know'."
        formatted_prompts = []

        for _idx, query in enumerate(queries):
            query_time = query_times[_idx]
            user_message = ""
            user_message += f"Current Time: {query_time}\n"
            user_message += f"Question: {query}\n"

            formatted_prompts.append(
                self.tokenizer.apply_chat_template(
                    [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    tokenize=False,
                    add_generation_prompt=True,
                )
            )

        return formatted_prompts
@@ -9,4 +9,5 @@ lxml
openai==1.13.3
sentence_transformers
torch
transformers
vllm>=0.4.2