Commit b9ced1ec authored by ashivani's avatar ashivani

added baseline

parent 917abf17
Pipeline #5674 failed with stages
in 10 minutes and 25 seconds
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![AIcrowd-Logo](https://raw.githubusercontent.com/AIcrowd/AIcrowd/master/app/assets/images/misc/aicrowd-horizontal.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started Code for [CPTCHA Challenge](https://www.aicrowd.com/challenges/CPTCHA) on AIcrowd\n",
"#### Author : Sanjay Pokkali"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Necessary Packages 📚"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install numpy\n",
"!pip install pandas\n",
"!pip install scikit-learn\n",
"!pip install textdistance\n",
"!pip install pytesseract\n",
"# pytesseract also needs the tesseract binary (on Debian/Colab):\n",
"!apt-get install -y tesseract-ocr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Data\n",
"The first step is to download our train and test data. We will train a model on the train data, make predictions on the test data, and submit those predictions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!rm -rf data\n",
"!mkdir data\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train.tar.gz\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test.tar.gz\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train_info.csv\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test_info.csv\n",
" \n",
"!tar -xvzf train.tar.gz\n",
"!tar -xvzf test.tar.gz\n",
"!mv train data/train\n",
"!mv test data/test\n",
"!mv train_info.csv data/train_info.csv\n",
"!mv test_info.csv data/test_info.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Import packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from PIL import Image\n",
"import cv2\n",
"import matplotlib.pyplot as plt\n",
"import pytesseract\n",
"import textdistance\n",
"import glob\n",
"import tempfile\n",
"import os\n",
"import random\n",
"import tqdm\n",
"import tarfile\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data\n",
"- We use the pandas 🐼 library to load our data. \n",
"- Pandas loads the data into dataframes and makes it easy to analyse. \n",
"- Learn more about it [here](https://www.tutorialspoint.com/python_data_science/python_pandas.htm) 🤓"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_info_path = \"data/train_info.csv\"\n",
"test_info_path = \"data/test_info.csv\"\n",
"\n",
"train_images_path = \"data/train/\"\n",
"test_images_path = \"data/test/\"\n",
"train_info = pd.read_csv(train_info_path)\n",
"test_info = pd.read_csv(test_info_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the images👀"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def plot_image(img_path):\n",
"    img = cv2.imread(img_path)\n",
"    # OpenCV loads images as BGR; convert to RGB so matplotlib shows correct colors\n",
"    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)\n",
"    plt.imshow(img)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig=plt.figure(figsize=(20,20))\n",
"columns = 3\n",
"rows = 3\n",
"for i in range(1, columns*rows +1):\n",
" img = train_images_path + train_info['filename'][i]\n",
" fig.add_subplot(rows, columns, i)\n",
" plot_image(img)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Data into Train and Validation 🔪\n",
"- The next step is to decide how to measure how well our model is performing. We cannot use the given test data for this, as it does not contain labels for us to verify against. \n",
"- The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how the model will perform on unseen data: we hold back a chunk of data during training and use it only for testing. It is also the standard way to fine-tune hyperparameters. \n",
"- There are multiple ways to split a dataset into training and validation sets. Two popular approaches are [k-fold](https://machinelearningmastery.com/k-fold-cross-validation/) and [leave one out](https://en.wikipedia.org/wiki/Cross-validation_statistics) cross-validation. 🧐\n",
"- Validation sets also help you detect when your model is [overfitting](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) the train dataset."
]
},
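{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the k-fold idea mentioned above, the training indices can be shuffled once and cut into `k` disjoint folds, with each fold serving once as the validation set. In practice you would likely use `sklearn.model_selection.KFold`; this standard-library version is only for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def k_fold_indices(n, k, seed=42):\n",
"    # Shuffle the indices once, then slice them into k disjoint folds.\n",
"    indices = list(range(n))\n",
"    random.Random(seed).shuffle(indices)\n",
"    fold_size = n // k\n",
"    folds = []\n",
"    for f in range(k):\n",
"        start = f * fold_size\n",
"        end = start + fold_size if f < k - 1 else n  # last fold takes any remainder\n",
"        val_idx = indices[start:end]\n",
"        train_idx = indices[:start] + indices[end:]\n",
"        folds.append((train_idx, val_idx))\n",
"    return folds\n",
"\n",
"# 5 folds over 10 samples: every index appears in exactly one validation fold.\n",
"folds = k_fold_indices(10, 5)"
]
},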
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_val = train_test_split(train_info, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- We split the data with 20% for validation and 80% for training. \n",
"- To learn more about the train_test_split function, [click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 🧐 \n",
"- This is the simplest way to validate a model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Now that we have split our data into train and validation sets, we need to separate the corresponding labels from the data. \n",
"- With this step we are all set to move on with a prepared dataset."
]
},
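{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that separation, assuming the `label` column from `train_info.csv` (used later for evaluation) holds the ground-truth captcha text. The cell falls back to a tiny toy frame so it also runs standalone:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"try:\n",
"    train_info\n",
"except NameError:\n",
"    # Toy fallback so this sketch also runs outside the notebook.\n",
"    train_info = pd.DataFrame({'filename': ['a.png', 'b.png', 'c.png', 'd.png', 'e.png'],\n",
"                               'label': ['w8pdh', 'q2r4t', 'abc12', 'zz9yx', 'k3m7n']})\n",
"\n",
"X_train, X_val = train_test_split(train_info, test_size=0.2, random_state=42)\n",
"# Separate the target column (ground-truth captcha text) from each split.\n",
"y_train = X_train['label'].values\n",
"y_val = X_val['label'].values"
]
},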
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TRAINING PHASE 🏋️"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We will use PyTesseract, an Optical Character Recognition library, to recognize the characters in the test captchas directly and make a submission in this notebook. But first, let's see how it performs on the train set.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using PyTesseract on Training Set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = []\n",
"all_filenames = []\n",
"\n",
"for index, row in train_info.iterrows():\n",
"    filename = row['filename']\n",
"    img_path = train_images_path + filename\n",
"    label = pytesseract.image_to_string(Image.open(img_path))\n",
"    # Remove control characters emitted by tesseract\n",
"    label = label.replace(\"\\x0c\", \"\").replace(\"\\n\", \"\")\n",
"    labels.append(label)\n",
"    all_filenames.append(filename)\n",
"    print(f'{index}/{train_info.shape[0]}\\r', end=\"\")\n",
"\n",
"labels = np.asarray(labels)\n",
"all_filenames = np.asarray(all_filenames)\n",
"\n",
"submission = pd.DataFrame()\n",
"submission['filename'] = all_filenames\n",
"submission['label'] = labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate the Performance\n",
"\n",
"**For evaluation, the mean of the normalised [Levenshtein Similarity Score](https://en.wikipedia.org/wiki/Levenshtein_distance) over all samples will be used to measure the model's performance.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cal_lshtein_score(s_true, s_pred):\n",
"    # Missing predictions come through as NaN (a float); give them a score of 0.\n",
"    if isinstance(s_pred, float):\n",
"        return 0\n",
"    return textdistance.levenshtein.normalized_similarity(s_true, s_pred)\n"
]
},
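{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for the metric, here is a small standard-library sketch of normalized Levenshtein similarity (`textdistance.levenshtein.normalized_similarity` computes the same quantity):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def levenshtein_distance(a, b):\n",
"    # Classic dynamic-programming edit distance, one row at a time.\n",
"    prev = list(range(len(b) + 1))\n",
"    for i, ca in enumerate(a, 1):\n",
"        curr = [i]\n",
"        for j, cb in enumerate(b, 1):\n",
"            curr.append(min(prev[j] + 1,                    # deletion\n",
"                            curr[j - 1] + 1,                # insertion\n",
"                            prev[j - 1] + (ca != cb)))      # substitution\n",
"        prev = curr\n",
"    return prev[-1]\n",
"\n",
"def normalized_similarity(a, b):\n",
"    # 1 - distance / max length; identical strings score 1.0.\n",
"    if not a and not b:\n",
"        return 1.0\n",
"    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))\n",
"\n",
"print(normalized_similarity(\"w8pdh\", \"w8pdn\"))  # one wrong character out of five -> 0.8"
]
},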
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lst_scores = []\n",
"for idx in range(0,len(train_info)):\n",
" lst_scores.append(cal_lshtein_score(train_info['label'][idx],submission['label'][idx]))\n",
"\n",
"mean_lst_score = np.mean(lst_scores)\n",
"\n",
"print(\"The mean of normalised Levenshtein Similarity score is \" ,mean_lst_score)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Testing Phase\n",
"\n",
"## Generate Output for Test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = []\n",
"all_filenames = []\n",
"\n",
"for index, row in test_info.iterrows():\n",
"    filename = row['filename']\n",
"    img_path = test_images_path + filename\n",
"    label = pytesseract.image_to_string(Image.open(img_path))\n",
"    # Remove control characters emitted by tesseract\n",
"    label = label.replace(\"\\x0c\", \"\").replace(\"\\n\", \"\")\n",
"    labels.append(label)\n",
"    all_filenames.append(filename)\n",
"    print(f'{index}/{test_info.shape[0]}\\r', end=\"\")\n",
"\n",
"labels = np.asarray(labels)\n",
"all_filenames = np.asarray(all_filenames)\n",
"\n",
"submission_df = pd.DataFrame()\n",
"submission_df['filename'] = all_filenames\n",
"submission_df['label'] = labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the predictions to CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"submission_df.to_csv('submission.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🚧 Note : \n",
"- Do take a look at the submission format. \n",
"- The submission file should contain a header. \n",
"- Follow all submission guidelines strictly to avoid inconvenience."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## To download the generated CSV in Colab, run the command below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    from google.colab import files\n",
"    files.download('submission.csv')\n",
"except ImportError:\n",
"    print(\"Option only available in Google Colab\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Well Done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/CPTCHA) and make one."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"colab": {
"name": "JIGSAW_baseline.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "r_UoYN5PzsZh"
},
"source": [
"![AIcrowd-Logo](https://raw.githubusercontent.com/AIcrowd/AIcrowd/master/app/assets/images/misc/aicrowd-horizontal.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QnktOs5tzsZi"
},
"source": [
"# Getting Started Code for [JIGSAW Challenge](https://www.aicrowd.com/challenges/jigsaw) on AIcrowd\n",
"#### Author : Sharada Mohanty"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FSQaFNqj4Xqg"
},
"source": [
"This baseline creates an image of the desired size and places the puzzle pieces at random locations."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g3NTJjh6zsZj"
},
"source": [
"## Download Necessary Packages 📚"
]
},
{
"cell_type": "code",
"metadata": {
"id": "52VNLEdxzsZl"
},
"source": [
"!pip install numpy\n",
"!pip install pandas"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "v1252SIXzsZp"
},
"source": [
"## Download Data\n",
"The first step is to download the puzzle pieces and their metadata. We will assemble solved images from the pieces and submit them.\n"
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "y2jOSZWezsZq"
},
"source": [
"!rm -rf data\n",
"!mkdir data\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/jigsaw/v0.1/puzzles.tar.gz\n",
"!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/jigsaw/v0.1/metadata.csv\n",
"\n",
"!mkdir data/puzzles \n",
"!tar -C data/puzzles -xvzf puzzles.tar.gz \n",
"\n",
"!mv metadata.csv data/metadata.csv"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2HnIJcEHzsZu"
},
"source": [
"\n",
"## Import packages"
]
},
{
"cell_type": "code",
"metadata": {
"id": "XKUyBzLWzsZv"
},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from PIL import Image\n",
"import glob\n",
"import tempfile\n",
"import os \n",
"import random\n",
"import tqdm\n",
"import tarfile\n",
"%matplotlib inline"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "OYmcu6cezsZy"
},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nIvcQSYNzsZz"
},
"source": [
"PUZZLES_DIRECTORY = \"data/puzzles\"\n",
"METADATA_FILE = \"data/metadata.csv\"\n",
"\n",
"OUTPUT_PATH = \"data/submission.tar.gz\"\n",
"metadata_df = pd.read_csv(METADATA_FILE)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YsvIsudZ4j-d"
},
"source": [
"Create a temporary directory to store the solved images."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Da3JIibY3W-9"
},
"source": [
"TEMP_SUBMISSION_DIR = tempfile.TemporaryDirectory()\n",
"TEMP_SUBMISSION_DIR_PATH = TEMP_SUBMISSION_DIR.name\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eGBW_ioK4tPd"
},
"source": [
"**This is a very naive approach: create a new image of the desired size and paste all the individual puzzle pieces at random locations.**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qh7mGFfr3bbA"
},
"source": [
"for index, row in tqdm.tqdm(metadata_df.iterrows(), total=metadata_df.shape[0]):\n",
" # Get the width and height of each image from metadata.csv.\n",
" \n",
"\n",
" puzzle_id = row[\"puzzle_id\"]\n",
" image_width = row[\"width\"]\n",
" image_height = row[\"height\"]\n",
"\n",
" puzzle_directory = os.path.join(\n",
" PUZZLES_DIRECTORY,\n",
" str(puzzle_id)\n",
" )\n",
" solved_puzzle_im = Image.new(\"RGBA\",\n",
" (image_width, image_height)\n",
" ) # Initially create RGBA images, and then drop A channel later\n",
"\n",
" for _puzzle_piece in glob.glob(os.path.join(puzzle_directory, \"*.png\")):\n",
" puzzle_piece_im = Image.open(_puzzle_piece)\n",
" pp_width, pp_height = puzzle_piece_im.size\n",
"\n",
" # Find Random location \n",
" random_x = random.randint(0, image_width - pp_width)\n",
" random_y = random.randint(0, image_height - pp_height)\n",
"\n",
" solved_puzzle_im.paste(puzzle_piece_im, (random_x, random_y))\n",
"\n",
" del puzzle_piece_im\n",
"\n",
" solved_puzzle_im.convert(\"RGB\").save(\n",
" os.path.join(TEMP_SUBMISSION_DIR_PATH, \"{}.jpg\".format(str(puzzle_id)))\n",
" )\n",
" del solved_puzzle_im\n",
"\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "j4GEh4rb_Irj"
},
"source": [
"# Visualize a generated image"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_FSJixiE6M7y"
},
"source": [
"img_path = os.path.join(TEMP_SUBMISSION_DIR_PATH,'2.jpg')\n",
"img = Image.open(img_path)\n",
"img"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BhEwJBmW5VGR"
},
"source": [
"# Create the tar file with all the solved images and store it at `OUTPUT_PATH`"
]
},
{
"cell_type": "code",
"metadata": {
"id": "KMRVv9Cw3n29"
},
"source": [
"with tarfile.open(OUTPUT_PATH, mode=\"w:gz\") as tar_file:\n",
" for _filepath in glob.glob(os.path.join(TEMP_SUBMISSION_DIR_PATH, \"*.jpg\")):\n",
" print(_filepath)\n",
" _, filename = os.path.split(_filepath)\n",
" tar_file.add(_filepath, arcname=filename)\n",
"\n",
"print(\"Wrote output file to : \", OUTPUT_PATH)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "k-WPsT0Szsag"
},
"source": [
"## To download the generated tar.gz in Colab, run the command below"
]
},
{
"cell_type": "code",
"metadata": {
"id": "c2lFyA17zsah"
},
"source": [
"try:\n",
"    from google.colab import files\n",
"    files.download(OUTPUT_PATH)\n",
"except ImportError:\n",
"    print(\"Option only available in Google Colab\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KKZfXEPUzsak"
},
"source": [
"### Well Done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/jigsaw) and make one."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Pi2lTDqd51u6"
},