I was searching Google for the English translation of this Hindi poem, “with time”. Little did I know it was actually sarcastic code... well, I had a suspicion... MFs!
from google.colab import drive
import pandas as pd
import numpy as np
from numpy import array
from numpy import asarray
from numpy import zeros
import nltk
from nltk.corpus import stopwords
import re
import string
from itertools import groupby
from collections import Counter
import matplotlib.pyplot as plt
from scipy.sparse import hstack
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from fuzzywuzzy import process
In [ ]:
pip install fuzzywuzzy
(FUZZY WUZZY WAS A “BEAR”, wasn’t he? Stupid mfs!!! All year long!!! I have been plagued by the hackers (and my roommate) calling me black bear and making reference to it constantly. I knew it was something, but not being technologically advanced, I didn’t know what! Do you know how hard it is to look someone in the face whom you know is lying to you, talking shit and making fun of you, when you can’t prove it (other than by gut intuition) and you have to respond like the dumb twat they think you are, and continue being nice and in the dark? But no, really, I want to kill them. And they keep doing it, constantly degrading your very self-esteem, and not only them but everybody you know. Until that’s happened to you, you’ve never walked a mile in my shoes.)
Collecting fuzzywuzzy
  Downloading <https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl>
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
In [ ]:
drive.mount('/content/drive')
Enter your authorization code:
··········
Mounted at /content/drive
In [ ]:
# sarcasm data
with open('/content/drive/My Drive/Data Files/code-mixed analysis data/Sarcasm_tweets.txt') as f:
    lines = [line.rstrip() for line in f]

# drop blank lines
new_lines = []
for line in lines:
    if line == '':
        continue
    new_lines.append(line)

tweet_ids = []
tweets = []

# even positions hold tweet IDs, odd positions hold the tweet text
for i in range(len(new_lines)):
    if i%2 == 0:
        tweet_ids.append(new_lines[i])
    else:
        tweets.append(new_lines[i])

# annotations
with open('/content/drive/My Drive/Data Files/code-mixed analysis data/Sarcasm_tweet_truth.txt') as f1:
    lines1 = [line.rstrip() for line in f1]

labels = []

# odd positions hold the YES/NO label
for i in range(len(lines1)):
    if i%2 != 0:
        labels.append(lines1[i])

# tweets with language
f2 = open('/content/drive/My Drive/Data Files/code-mixed analysis data/Sarcasm_tweets_with_language.txt', 'r')
tokens_list = []
tokens = []
languages_list = []
languages = []
for line in f2:
    line = line.strip()
    line = line.split(' ')
    line = [token.strip() for token in line if token != '' and token != ' ' and token != '\n']
    if len(line) == 0:
        # blank line: the current tweet is complete
        tokens_list.append(tokens)
        languages_list.append(languages)
        tokens = []
        languages = []
    elif len(line) == 1:
        continue
    else:
        tokens.append(line[0])
        languages.append(line[1])

# flush the last tweet
tokens_list.append(tokens)
languages_list.append(languages)
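The token/language loop above can be tried without the Drive files. What follows is only a minimal sketch: the helper name `parse_token_language_blocks` is hypothetical, and the sample assumes one whitespace-separated token/language pair per line with a blank line between tweets, which is inferred from the loop rather than confirmed by the data files. Unlike the cell above, it skips empty groups, so repeated blank lines do not produce empty tweets.

```python
import io


def parse_token_language_blocks(handle):
    """Group 'token language' lines into per-tweet lists; a blank line ends a tweet."""
    tokens_list, languages_list = [], []
    tokens, languages = [], []
    for raw in handle:
        parts = raw.split()  # split on any run of whitespace
        if not parts:
            # Blank line: close the current tweet, but only if it has content.
            if tokens:
                tokens_list.append(tokens)
                languages_list.append(languages)
                tokens, languages = [], []
        elif len(parts) >= 2:
            tokens.append(parts[0])
            languages.append(parts[1])
        # Lines with a single field carry no language tag and are skipped.
    if tokens:
        # Flush the last tweet when the input does not end with a blank line.
        tokens_list.append(tokens)
        languages_list.append(languages)
    return tokens_list, languages_list


# Hypothetical sample data in the assumed format (not taken from the real file).
sample = io.StringIO("Triple\ten\nTalaq\thi\n\nJab\thi\nThak\thi\n")
tokens_list, languages_list = parse_token_language_blocks(sample)
print(tokens_list)
print(languages_list)
```

Passing an open file handle instead of the `io.StringIO` object should work the same way, since the function only iterates over lines.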
In [ ]:
df = pd.DataFrame(data=tweet_ids, columns=['Tweet ID'])
df['Tweet'] = tweets
df['Label'] = labels
df['Tokens'] = tokens_list
df['Languages'] = languages_list
df
Out[ ]:
 | Tweet ID | Tweet | Label | Tokens | Languages |
---|---|---|---|---|---|
0 | 866871160725794816 | Triple Talaq par Burbak Kuchh nahi bolega | NO | [Triple, Talaq, par, Burbak, Kuchh, nahi, bolega] | [en, hi, hi, hi, hi, hi, hi] |
1 | 880356789358743553 | Batao ye uss site pr se akki sir ke verdict ni... | YES | [Batao, ye, uss, site, pr, se, akki, sir, ke, ... | [hi, hi, hi, en, hi, hi, hi, en, hi, en, hi, h... |
2 | 877751493889105920 | Hindu baheno par julam bardas nahi hoga @Tripl... | NO | [Hindu, baheno, par, julam, bardas, nahi, hoga... | [hi, hi, hi, hi, hi, hi, hi, rest, hi, hi, hi,... |
3 | 901806457871466496 | Naa bhai.. aisa nhi hai.. mere handle karne se... | NO | [Naa, bhai, .., aisa, nhi, hai, .., mere, hand... | [hi, hi, rest, hi, hi, hi, rest, hi, en, hi, h... |
4 | 866264330748219392 | #RememberingRajiv aaj agar musalman auraten tr... | NO | [#RememberingRajiv, aaj, agar, musalman, aurat... | [rest, hi, hi, hi, hi, en, hi, hi, hi, hi, hi,... |
... | ... | ... | ... | ... | ... |
5245 | 256002351670898688 | Khiladi anari, aur shaamat equipment ki aye! B... | NO | [Khiladi, anari, ,, aur, shaamat, equipment, k... | [hi, hi, rest, hi, hi, en, hi, hi, rest, hi, e... |
5246 | 256306978811441152 | #irony RT @techno_charan: pallu k neche chhupa... | NO | [#irony, RT, @techno_charan:, pallu, k, neche,... | [rest, hi, rest, hi, hi, hi, hi, hi, hi, hi, h... |
5247 | 256416888568045569 | Jab Thak Hai Jaan. #Irony | NO | [Jab, Thak, Hai, Jaan, ., #Irony] | [hi, hi, hi, hi, rest, rest] |
5248 | 257194830449487872 | @beeba_puttar Acha! Aur koi nae mila tha #sarc... | NO | [@beeba_puttar, Acha, !, Aur, koi, nae, mila, ... | [rest, hi, rest, hi, hi, en, hi, hi, rest, hi,... |
5249 | 257448839827578880 | @Nirmalogy sacchi mucchi mein? Yah ye bhi #Sar... | NO | [@Nirmalogy, sacchi, mucchi, mein, ?, Yah, ye,... | [rest, hi, hi, hi, rest, hi, hi, hi, rest, hi,... |
5250 rows × 5 columns
In [ ]:
np.random.seed(10)
df_y = df[df.Label =="YES"]
df_n = df[df.Label == "NO"]
# randomly drop 4000 of the NO rows to reduce the class imbalance
drop_indices = np.random.choice(df_n.index, 4000, replace=False)
df_subset_n = df_n.drop(drop_indices)
frames = [df_y , df_subset_n]
df = pd.concat(frames, ignore_index = True)
df
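The cell above reduces the class imbalance by randomly dropping 4000 of the NO rows. The same idea can be sketched without pandas on toy data. This is an illustration only, not the notebook's method: the helper `downsample_majority` and the toy rows are hypothetical, and it uses the standard library's `random.Random` in place of `np.random.choice`, so it will not select the same rows as the cell above even with the same seed.

```python
import random
from collections import Counter


def downsample_majority(rows, label_key, majority_label, n_drop, seed=10):
    """Return rows with n_drop randomly chosen majority-class rows removed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    majority_idx = [i for i, row in enumerate(rows) if row[label_key] == majority_label]
    if n_drop > len(majority_idx):
        raise ValueError("n_drop exceeds the number of majority-class rows")
    drop = set(rng.sample(majority_idx, n_drop))  # sampling without replacement
    return [row for i, row in enumerate(rows) if i not in drop]


# Toy data: 3 YES rows and 10 NO rows (made up for illustration).
rows = [{"Label": "YES"} for _ in range(3)] + [{"Label": "NO"} for _ in range(10)]
balanced = downsample_majority(rows, "Label", "NO", n_drop=7)
counts = Counter(row["Label"] for row in balanced)
print(counts)
```

Checking the label counts after the drop, as done here with `Counter`, is a quick way to confirm that the subset has the intended balance.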