Commit b9ced1ec authored by ashivani

added baseline

parent 917abf17
Pipeline #5674 failed with stages in 10 minutes and 25 seconds
@@ -4,23 +4,25 @@
%% Cell type:markdown id: tags:
# Getting Started Code for [JIGSAW Challenge](https://www.aicrowd.com/challenges/jigsaw) on AIcrowd
#### Author : Sanjay Pokkali
#### Author : Sharada Mohanty
%% Cell type:markdown id: tags:
This baseline creates an image of the desired size and places the puzzle pieces at random locations.
%% Cell type:markdown id: tags:
## Download Necessary Packages 📚
%% Cell type:code id: tags:
``` python
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install textdistance
```
%% Cell type:markdown id: tags:
## Download Data
@@ -30,21 +32,17 @@
%% Cell type:code id: tags:
``` python
!rm -rf data
!mkdir data
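# CAPTCHA challenge data (train/test images and label CSVs)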
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train_info.csv
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test_info.csv
!tar -xvzf train.tar.gz
!tar -xvzf test.tar.gz
!mv train data/train
!mv test data/test
!mv train_info.csv data/train_info.csv
!mv test_info.csv data/test_info.csv
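# JIGSAW challenge data (puzzle pieces and metadata)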
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/jigsaw/v0.1/puzzles.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/jigsaw/v0.1/metadata.csv
!mkdir data/puzzles
!tar -C data/puzzles -xvzf puzzles.tar.gz
!mv metadata.csv data/metadata.csv
```
%% Cell type:markdown id: tags:
@@ -67,215 +65,125 @@
```
%% Cell type:markdown id: tags:
## Load Data
- We use pandas 🐼 library to load our data.
- Pandas loads the data into dataframes and makes it easy to analyse the data.
- Learn more about it [here](https://www.tutorialspoint.com/python_data_science/python_pandas.htm) 🤓
%% Cell type:code id: tags:
``` python
train_info_path = "data/train_info.csv"
test_info_path = "data/test_info.csv"
PUZZLES_DIRECTORY = "data/puzzles"
METADATA_FILE = "data/metadata.csv"
train_images_path = "data/train/"
test_images_path = "data/test/"
train_info = pd.read_csv(train_info_path)
test_info = pd.read_csv(test_info_path)
OUTPUT_PATH = "data/submission.tar.gz"
metadata_df = pd.read_csv(METADATA_FILE)
```
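%% Cell type:markdown id: tags:
As a quick, optional sanity check (a minimal sketch that only uses the dataframes loaded in the cell above), you can peek at what was read in:
%% Cell type:code id: tags:
``` python
# Inspect the loaded dataframes before going further
print(train_info.shape)
print(metadata_df.shape)
metadata_df.head()
```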
%% Cell type:markdown id: tags:
## Visualize the images👀
Create a directory to store the solved images.
%% Cell type:code id: tags:
``` python
def plot_image(img_path):
    img = cv2.imread(img_path)
    # OpenCV reads images in BGR order; convert to RGB so matplotlib shows the true colours
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # print("Shape of the captcha ", img.shape)
    plt.imshow(img)
```
%% Cell type:code id: tags:
``` python
fig=plt.figure(figsize=(20,20))
columns = 3
rows = 3
for i in range(1, columns*rows +1):
img = train_images_path + train_info['filename'][i]
fig.add_subplot(rows, columns, i)
plot_image(img)
plt.show()
```
%% Cell type:markdown id: tags:
## Split Data into Train and Validation 🔪
- The next step is to think of a way to test how well our model is performing. We cannot use the given test data, since it does not contain the labels needed to verify our predictions.
- The workaround is to split the given training data into a training set and a validation set. Validation sets give us an idea of how our model will perform on unseen data: it is like holding back a chunk of data while training the model and then using it purely for testing. It is also a standard way to fine-tune hyperparameters in a model.
- There are multiple ways to split a dataset into validation and training sets. The following are two popular approaches: [k-fold](https://machinelearningmastery.com/k-fold-cross-validation/) and [leave one out](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). 🧐
- Validation sets also help keep your model from [overfitting](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) on the train dataset.
%% Cell type:code id: tags:
``` python
# Temporary directory where the solved puzzle images will be written before packaging
TEMP_SUBMISSION_DIR = tempfile.TemporaryDirectory()
TEMP_SUBMISSION_DIR_PATH = TEMP_SUBMISSION_DIR.name
```
%% Cell type:code id: tags:
``` python
X_train, X_val= train_test_split(train_info, test_size=0.2, random_state=42)
```
%% Cell type:markdown id: tags:
- We have decided to split the data with 20% as validation and 80% as training.
- To learn more about the train_test_split function [click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 🧐
- This is, of course, the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned above, you can experiment 🔬 with more sophisticated techniques, such as the k-fold sketch below, to make your model better.
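A minimal k-fold cross-validation sketch with scikit-learn (purely illustrative: it reuses the `train_info` dataframe loaded above, and the model fitting/scoring step is left as a placeholder):
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import KFold

# Rotate through 5 train/validation splits instead of a single hold-out split
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_info)):
    fold_train = train_info.iloc[train_idx]
    fold_val = train_info.iloc[val_idx]
    # Fit your model on fold_train and score it on fold_val here
    print(f"Fold {fold}: {len(fold_train)} train rows, {len(fold_val)} validation rows")
```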
%% Cell type:markdown id: tags:
- Now that we have split our data into train and validation sets, we need to separate the corresponding labels from the data.
- With this step, we are all set to move on with a prepared dataset.
**This is a very naive approach: create a new image of the desired size and paste all the individual puzzle pieces at random locations (see the generation cell below).**
%% Cell type:markdown id: tags:
# TRAINING PHASE 🏋️
%% Cell type:markdown id: tags:
**We will use PyTesseract, an Optical Character Recognition library, to recognize the characters in the test captchas directly and make a submission in this notebook. But let's first see its performance on the train set.**
%% Cell type:markdown id: tags:
## Using PyTesseract on Training Set
%% Cell type:code id: tags:
``` python
labels = []
all_filenames = []
for index,rows in train_info.iterrows():
    i = rows['filename']
    img_path = train_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters
    label = label.replace("\x0c","")
    label = label.replace("\n","")
    labels.append(label)
    all_filenames.append(i)
    print(f'{str(index)+"/" + str(train_info.shape[0])}\r',end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission = pd.DataFrame()
submission['filename'] = all_filenames
submission['label'] = labels
```
%% Cell type:markdown id: tags:
## Place the puzzle pieces at random locations
%% Cell type:code id: tags:
``` python
# Jigsaw baseline: paste every puzzle piece at a random location on a blank canvas of the target size
for index, row in tqdm.tqdm(metadata_df.iterrows(), total=metadata_df.shape[0]):
    # Get the height and width of each image from metadata.csv
    puzzle_id = row["puzzle_id"]
    image_width = row["width"]
    image_height = row["height"]
    puzzle_directory = os.path.join(
        PUZZLES_DIRECTORY,
        str(puzzle_id)
    )
    solved_puzzle_im = Image.new("RGBA",
        (image_width, image_height)
    ) # Initially create RGBA images, and then drop A channel later
    for _puzzle_piece in glob.glob(os.path.join(puzzle_directory, "*.png")):
        puzzle_piece_im = Image.open(_puzzle_piece)
        pp_width, pp_height = puzzle_piece_im.size
        # Find a random location
        random_x = random.randint(0, image_width - pp_width)
        random_y = random.randint(0, image_height - pp_height)
        solved_puzzle_im.paste(puzzle_piece_im, (random_x, random_y))
        del puzzle_piece_im
    solved_puzzle_im.convert("RGB").save(
        os.path.join(TEMP_SUBMISSION_DIR_PATH, "{}.jpg".format(str(puzzle_id)))
    )
    del solved_puzzle_im
```
%% Cell type:markdown id: tags:
## Evaluate the Performance
**Here, the mean of the normalised [Levenshtein Similarity Score](https://en.wikipedia.org/wiki/Levenshtein_distance) over all samples will be used to test the efficiency of the model.**
%% Cell type:code id: tags:
``` python
def cal_lshtein_score(s_true,s_pred):
    # Empty predictions can come through as float NaN; score them 0
    if type(s_pred) == type(1.0):
        return 0
    score = textdistance.levenshtein.normalized_similarity(s_true,s_pred)
    return score
```
%% Cell type:code id: tags:
``` python
lst_scores = []
for idx in range(0,len(train_info)):
    lst_scores.append(cal_lshtein_score(train_info['label'][idx],submission['label'][idx]))
mean_lst_score = np.mean(lst_scores)
print("The mean of normalised Levenshtein Similarity score is ", mean_lst_score)
```
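%% Cell type:markdown id: tags:
To get a feel for the metric, here is a tiny worked example (the strings are made up purely for illustration). The normalised similarity is `1 - distance / max(len(s_true), len(s_pred))`, so a single wrong character in a 5-character label scores 0.8:
%% Cell type:code id: tags:
``` python
import textdistance

print(textdistance.levenshtein.normalized_similarity("7a3kx", "7a3kx"))  # identical -> 1.0
print(textdistance.levenshtein.normalized_similarity("7a3kx", "7a8kx"))  # one substitution out of 5 characters -> 0.8
print(textdistance.levenshtein.normalized_similarity("7a3kx", ""))       # empty prediction -> 0.0
```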
%% Cell type:markdown id: tags:
# Testing Phase
## Generate Output for Test set
# Visualize the generated Image
%% Cell type:code id: tags:
``` python
labels = []
all_filenames = []
for index,rows in test_info.iterrows():
    i = rows['filename']
    img_path = test_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters
    label = label.replace("\x0c","")
    label = label.replace("\n","")
    labels.append(label)
    all_filenames.append(i)
    print(f'{str(index)+"/" + str(test_info.shape[0])}\r',end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission_df = pd.DataFrame()
submission_df['filename'] = all_filenames
submission_df['label'] = labels
```
%% Cell type:code id: tags:
``` python
# Open one of the generated puzzle images to inspect the baseline output
img_path = os.path.join(TEMP_SUBMISSION_DIR_PATH,'2.jpg')
img = Image.open(img_path)
img
```
%% Cell type:markdown id: tags:
## Save the prediction to csv
# Create the tar file with all the solved images and store it in the OUTPUT_PATH
%% Cell type:code id: tags:
``` python
submission_df.to_csv('submission.csv', index=False)
```
%% Cell type:code id: tags:
``` python
with tarfile.open(OUTPUT_PATH, mode="w:gz") as tar_file:
    for _filepath in glob.glob(os.path.join(TEMP_SUBMISSION_DIR_PATH, "*.jpg")):
        print(_filepath)
        _, filename = os.path.split(_filepath)
        tar_file.add(_filepath, arcname=filename)
print("Wrote output file to : ", OUTPUT_PATH)
```
%% Cell type:markdown id: tags:
### 🚧 Note :
- Do take a look at the submission format.
- The submission file should contain a header.
- Follow all submission guidelines strictly to avoid inconvenience.
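%% Cell type:markdown id: tags:
Optionally, a quick sanity check (an illustrative snippet, not part of the baseline) to list what ended up inside the archive before submitting:
%% Cell type:code id: tags:
``` python
import tarfile

# List the members of the generated archive
with tarfile.open(OUTPUT_PATH, "r:gz") as tar_file:
    archive_members = tar_file.getnames()
print(len(archive_members), "files in", OUTPUT_PATH)
print(archive_members[:5])
```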
%% Cell type:markdown id: tags:
## To download the generated csv in Colab, run the command below
## To download the generated tar.gz in Colab, run the command below
%% Cell type:code id: tags:
``` python
try:
from google.colab import files
files.download('submission.csv')
files.download(OUTPUT_PATH)
except:
print("Option Only avilable in Google Colab")
```
%% Cell type:markdown id: tags:
### Well Done! 👍 You are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/CPTCHA) and make one.
### Well Done! 👍 You are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/jigsaw) and make one.
%% Cell type:code id: tags:
``` python
```