# [Ai] NLP-Toy-Project
## Description
A model that classifies song lyrics as expressing positive or negative sentiment.

https://github.com/sooking87/NLP_Toy_Proj/blob/master/fine_tunning.ipynb
## Environment
Python
## Contributions
| 손수경 | 김수빈 | 김은비 | 정승훈 | 정지영 | 정회성 |
|---|---|---|---|---|---|
| 👑 Team lead | Member | Member | Member | Member | Member |
## Progress
### 1️⃣ Get dataset

- Fetch lyrics with the Genius API and compute sentiment scores with VADER:
```python
import lyricsgenius
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

# Return the lyrics of a song given its title and artist
def get_lyrics(title, artist):
    try:
        return genius.search_song(title, artist).lyrics
    except Exception:
        return 'not found'

# Return the VADER sentiment scores of a song's lyrics
def get_lyric_sentiment(lyrics):
    return sid_obj.polarity_scores(lyrics)

# Fetch lyrics and sentiment scores for each artist/title in the original file
genius = lyricsgenius.Genius(
    "DuO42xKa4Ts70InLe_Y_strEpeL_CxowzCtXyAMaiNlbAOVOTfFpt2q5FdP4lo_U")
sid_obj = SentimentIntensityAnalyzer()

lyrics_sentiment_dataset = pd.read_csv(
    './lyrics_sentiment_dataset/tcc_ceds_music.csv', encoding='cp949')
print(lyrics_sentiment_dataset.keys())
lyrics_sentiment_dataset = lyrics_sentiment_dataset.drop(
    ['Unnamed: 0', 'genre', 'lyrics', 'len', 'dating', 'violence', 'world/life',
     'night/time', 'shake the audience', 'family/gospel', 'romantic',
     'communication', 'obscene', 'music', 'movement/places',
     'light/visual perceptions', 'family/spiritual', 'like/girls', 'sadness',
     'feelings', 'danceability', 'loudness', 'acousticness', 'instrumentalness',
     'valence', 'energy', 'topic', 'age'], axis=1)
lyrics_sentiment_dataset.drop_duplicates(subset='track_name', inplace=True)
lyrics_sentiment_dataset = lyrics_sentiment_dataset.reset_index(drop=True)
lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset.iloc[9460:14190, 0:]

# Fetch the lyrics of each song
lyrics1 = lyrics_sentiment_dataset_3.apply(lambda row: get_lyrics(
    row['track_name'], row['artist_name']), axis=1)
lyrics_sentiment_dataset_3['lyrics'] = lyrics1
print(lyrics_sentiment_dataset_3.shape)

# Drop rows whose lyrics were not found
lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset_3.drop(
    lyrics_sentiment_dataset_3[lyrics_sentiment_dataset_3['lyrics'] == 'not found'].index)

# Use get_lyric_sentiment to get a sentiment score for every song's lyrics
sentiment = lyrics_sentiment_dataset_3.apply(
    lambda row: get_lyric_sentiment(row['lyrics']), axis=1)
for i in lyrics_sentiment_dataset_3.index.tolist():
    lyrics_sentiment_dataset_3.loc[i, 'neg_sentiment'] = sentiment[i]['neg']
    lyrics_sentiment_dataset_3.loc[i, 'neu_sentiment'] = sentiment[i]['neu']
    lyrics_sentiment_dataset_3.loc[i, 'pos_sentiment'] = sentiment[i]['pos']
    lyrics_sentiment_dataset_3.loc[i, 'com_sentiment'] = sentiment[i]['compound']

lyrics_sentiment_dataset_3.to_csv("lyrics_sentiment_dataset.csv", index=False)
```
- Drop songs where any sentiment score is 0, or where the fetched lyrics do not match the song:
```python
import pandas as pd

df = pd.read_csv('./lyrics_sentiment_dataset.csv')

# Drop a row if any of its sentiment scores is exactly 0
filtered_df = df[(df['neg_sentiment'] != 0.000) &
                 (df['neu_sentiment'] != 0.000) &
                 (df['pos_sentiment'] != 0.000) &
                 (df['com_sentiment'] != 0.000)].copy()

# Add a lyrics_len column so overly long lyrics can be dropped
for i in filtered_df.index.tolist():
    filtered_df.loc[i, 'lyrics_len'] = len(filtered_df.loc[i, 'lyrics'])

# Sort by lyrics_len in descending order
filtered_df.sort_values('lyrics_len', ascending=False, inplace=True)

# Manual inspection showed the first 91 rows of filtered_df must be removed
fin_filtered_df = filtered_df.iloc[91:, 0:]
fin_filtered_df.to_csv("full_lyrics_sentiment_dataset.csv")
```
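The zero-score filter above is just a boolean mask over the score columns. A minimal standalone sketch (the score values here are made up for illustration):

```python
import pandas as pd

# Toy rows with hypothetical sentiment scores; only the first row has no zeros
df = pd.DataFrame({
    'neg_sentiment': [0.1, 0.0, 0.2],
    'neu_sentiment': [0.5, 0.6, 0.7],
    'pos_sentiment': [0.4, 0.4, 0.0],
    'com_sentiment': [0.3, 0.2, -0.1],
})
# Keep a row only if every score column is non-zero
filtered = df[(df != 0.0).all(axis=1)]
print(len(filtered))  # -> 1
```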
- Compute polarity using com_sentiment and TextBlob:
  - polarity: using VADER, labeled pos if com_sentiment >= 0.05, otherwise neg.
  - textblob_pol: uses the polarity score provided by TextBlob.

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

sid_obj = SentimentIntensityAnalyzer()
df = pd.read_csv('./full_lyrics_sentiment_dataset.csv')

def sent_textblob_polarity(lyrics):
    # TextBlob polarity score: label positive if > 0
    polarity_score = TextBlob(lyrics).sentiment.polarity
    if float(polarity_score) > 0:
        return 1
    return 0

def sent_vader_compound(com):
    # VADER convention: compound >= 0.05 counts as positive
    if float(com) >= 0.05:
        return 1
    return 0

for i in df.index.tolist():
    lyrics = df.loc[i, "lyrics"]
    df.loc[i, "polarity"] = sent_vader_compound(df.loc[i, "com_sentiment"])
    df.loc[i, "textblob_pol"] = sent_textblob_polarity(lyrics)

df.to_csv("full_lyrics_polarity_dataset.csv", index=False)
```
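For reference, VADER's documentation suggests a three-way convention around the same ±0.05 compound cutoff; the project collapses this to a binary label (neg and neu together become 0). A minimal sketch of the three-way rule:

```python
# Common three-way reading of VADER's compound score (>= 0.05 positive,
# <= -0.05 negative, otherwise neutral)
def vader_label(compound: float) -> str:
    if compound >= 0.05:
        return 'pos'
    if compound <= -0.05:
        return 'neg'
    return 'neu'

print(vader_label(0.42))   # -> pos
print(vader_label(-0.3))   # -> neg
print(vader_label(0.01))   # -> neu
```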
- Data preprocessing:
```python
import nltk
import pandas as pd
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
nltk.download('wordnet')

new_stopwords = stopwords.words('english')
add_stopwords = ["[", "]", "(", ")", ",", "lyrics", "chorus", "!", "?", "``",
                 "oh", "ha", "ah", "-", "yo", "yeah", "uh", "uhh", ":"]
new_stopwords.extend(add_stopwords)

def tokenize(str_lyrics):
    # Flatten line breaks, split into sentences, then lowercase word tokens
    split_lyrics = " ".join(str_lyrics.splitlines())
    init_words = []
    for sentence in sent_tokenize(split_lyrics):
        for word in word_tokenize(sentence):
            init_words.append(word.lower())
    return init_words

def remove_not_alpha(words_list):
    # Keep only purely alphabetic tokens longer than one character
    return [word for word in words_list
            if "'" not in word and word.isalpha() and len(word) > 1]

def remove_stopwords(words_list):
    return [word for word in words_list if word not in new_stopwords]

def lemmatizer(tokenized_list):
    return [lemma.lemmatize(word) for word in tokenized_list]

lemma = WordNetLemmatizer()
df = pd.read_csv('./lyrics_sentiment_dataset/lyrics_sentiment_dataset_3.csv')

for i in df.index.tolist():
    lyrics = df.loc[i, 'lyrics']
    # tokenize and deduplicate
    init_words_list = list(set(tokenize(lyrics)))
    # remove non-alphabetic tokens
    alpha_words_list = remove_not_alpha(init_words_list)
    # remove stopwords
    filtered_word_list = remove_stopwords(alpha_words_list)
    # lemmatize and deduplicate again
    lemmatized_list = list(set(lemmatizer(filtered_word_list)))
    # remove non-alphabetic tokens once more (lemmatization can change tokens)
    final_list = remove_not_alpha(lemmatized_list)
    df.loc[i, 'lyrics'] = " ".join(final_list)

df.to_csv("lyrics_sentiment_dataset_3_tokenized.csv", index=False)
```
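One detail worth noting in the token-cleanup step: removing items from a Python list while iterating over it silently skips elements, which is why a filtering helper should build a new list instead of calling `list.remove` in the loop. A small illustration:

```python
def buggy_filter(words):
    # Removing during iteration shifts indices, so the element after each
    # removal is never inspected
    for w in words:
        if not w.isalpha():
            words.remove(w)
    return words

def safe_filter(words):
    # Building a new list inspects every element exactly once
    return [w for w in words if w.isalpha()]

print(buggy_filter(["ab", "1", "2", "cd"]))  # -> ['ab', '2', 'cd'] ("2" slips through)
print(safe_filter(["ab", "1", "2", "cd"]))   # -> ['ab', 'cd']
```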
### 2️⃣ Build input and target data for the BERT model
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

df = pd.read_csv('./full_lyrics_polarity_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df['lyrics'], df['polarity'], test_size=0.25, random_state=32)
X_train.shape

# input set
# Attach the BERT special tags: [CLS] at the start, [SEP] at every line break
train_lyrics = []
test_lyrics = []
for i in X_train:
    one = []
    start = "[CLS] " + str(i)
    split_lyrics = " [SEP]".join(start.split('\n'))
    split_lyrics += " [SEP]"
    one.append(split_lyrics)
    train_lyrics.append(one)
for i in X_test:
    one = []
    start = "[CLS] " + str(i)
    split_lyrics = " [SEP]".join(start.split('\n'))
    split_lyrics += " [SEP]"
    one.append(split_lyrics)
    test_lyrics.append(one)

# alternatives tried: 'bert-base-multilingual-cased', 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)

tokenized_data = []
# The tags were attached inside a 2-D list, so a nested loop reaches each lyric
for each_lyrics in train_lyrics:
    for j in each_lyrics:
        tokens = tokenizer.tokenize(j)
        tokenized_data.append(tokens)
tokenized_test_data = []
for each_lyrics in test_lyrics:
    for j in each_lyrics:
        tokens = tokenizer.tokenize(j)
        tokenized_test_data.append(tokens)

# target set
y_train_list = y_train.to_list()
y_test_list = y_test.to_list()
```
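The tagging loop above can be condensed into one helper; this sketch (the function name is ours) shows exactly what a two-line lyric turns into:

```python
# [CLS] opens the sequence and every original line break becomes a [SEP]
# boundary, with one trailing [SEP] appended at the end
def add_bert_tags(lyric: str) -> str:
    start = "[CLS] " + lyric
    return " [SEP]".join(start.split("\n")) + " [SEP]"

print(add_bert_tags("first line\nsecond line"))
# -> [CLS] first line [SEP]second line [SEP]
```

Note that the joiner is `" [SEP]"` with no trailing space, so the next line follows the tag directly; this matches the original loop.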
### 3️⃣ Model initialization
```python
import tensorflow_datasets as tfds
import tensorflow as tf

def convert_example_to_feature(review):
    return tokenizer.encode_plus(review,
                                 add_special_tokens=False,   # [CLS]/[SEP] were added manually above
                                 max_length=256,             # max length of text that can go to BERT
                                 pad_to_max_length=True,     # add [PAD] tokens
                                 return_attention_mask=True, # mask so BERT ignores the padding
                                 )

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label

def encode_examples(data):
    # prepare lists so the final TensorFlow dataset can be built from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    target_list = []
    for DATA_COL, LABEL_COL in data.to_numpy():
        bert_input = convert_example_to_feature(DATA_COL)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        target_list.append([LABEL_COL])
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, target_list)
    ).map(map_example_to_dict)

train = pd.DataFrame({'DATA_COL': tokenized_data, 'LABEL_COL': y_train_list})
test = pd.DataFrame({'DATA_COL': tokenized_test_data, 'LABEL_COL': y_test_list})

# train dataset
train_encoded = encode_examples(train).shuffle(100).batch(16)
# test dataset
test_encoded = encode_examples(test).batch(16)

from transformers import TFBertForSequenceClassification

# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# 3 epochs; more might help as long as the model does not overfit
number_of_epochs = 3
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')
# Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# labels are integers, not one-hot vectors, so use sparse categorical
# cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

import torch, gc
gc.collect()
torch.cuda.empty_cache()

model.fit(train_encoded, epochs=number_of_epochs, validation_data=test_encoded)
```
```
>>>
Epoch 1/3
1020/1020 [==============================] - 1050s 1s/step - loss: 0.5040 - accuracy: 0.7483 - val_loss: 0.4533 - val_accuracy: 0.7975
Epoch 2/3
1020/1020 [==============================] - 1024s 1s/step - loss: 0.3648 - accuracy: 0.8360 - val_loss: 0.3790 - val_accuracy: 0.8326
Epoch 3/3
1020/1020 [==============================] - 1025s 1s/step - loss: 0.2689 - accuracy: 0.8896 - val_loss: 0.3668 - val_accuracy: 0.8438
```
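Since the labels are plain integers, the loss above needs no one-hot encoding: per example it is simply the negative log of the softmax probability assigned to the true class. A pure-Python sketch of what `SparseCategoricalCrossentropy(from_logits=True)` computes for a single example:

```python
import math

def sparse_ce_from_logits(logits, y):
    # softmax over raw logits, then negative log-likelihood of true class y
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[y])

print(round(sparse_ce_from_logits([2.0, 0.5], 0), 4))  # small loss: class 0 dominates
```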
### 4️⃣ Model prediction
```python
def model_predict(test_sentence):
    predict_input = tokenizer.encode(test_sentence, truncation=True, padding=True, return_tensors="tf")
    tf_output = model.predict(predict_input)[0]
    tf_prediction = tf.nn.softmax(tf_output, axis=1)
    labels = ['Negative', 'Positive']  # (0: negative, 1: positive)
    label = tf.argmax(tf_prediction, axis=1)
    label = label.numpy()
    tf_prediction = tf_prediction.numpy()
    print(test_sentence, ":: ", "{:.2f}%".format(abs(tf_prediction.take(label[0])) * 100), labels[label[0]])

test_sentence_1 = '''
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
'''
model_predict(test_sentence_1)
```
```
>>>
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
:: 96.57% Positive
```
```python
test_sentence_2 = '''
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
'''
model_predict(test_sentence_2)
```
```
>>>
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
:: 55.15% Negative
```
```python
lyrics = '''
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
'''
model_predict(lyrics)
```
```
>>>
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
:: 53.11% Positive
```
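The decoding inside `model_predict` is a softmax over the two logits followed by an argmax; the winning class's probability becomes the printed confidence. A standalone sketch (the logit values here are made up):

```python
import math

def decode(logits, labels=('Negative', 'Positive')):
    # softmax -> probabilities; argmax -> predicted label and its confidence
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

label, conf = decode([-0.2, 1.1])
print(label, "{:.2f}%".format(conf * 100))  # -> Positive 78.58%
```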