Description

A model that classifies song lyrics as expressing positive or negative sentiment.

https://github.com/sooking87/NLP_Toy_Proj/blob/master/fine_tunning.ipynb

Environment

Python

Contributions

손수경 (👑 Team Lead), 김수빈, 김은비, 정승훈, 정지영, 정회성 (Members)

Progress

1️⃣ Get Dataset

  1. Fetch lyrics with the Genius API and compute sentiment scores with VADER

    import lyricsgenius
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    import pandas as pd
    
    # Function to return song lyrics
    
    
    def get_lyrics(title, artist):
        try:
            return genius.search_song(title, artist).lyrics
        except Exception:  # song not on Genius, or the request failed
            return 'not found'
    
    # Function to return sentiment score of each song
    
    
    def get_lyric_sentiment(lyrics):
        sentiment = sid_obj.polarity_scores(lyrics)
        return sentiment
    
    
    ''' Fetch lyrics and sentiment scores via the artist and title columns of the source file '''
    
    genius = lyricsgenius.Genius(
        "YOUR_GENIUS_API_TOKEN")  # use your own Genius client access token
    sid_obj = SentimentIntensityAnalyzer()
    
    lyrics_sentiment_dataset = pd.read_csv(
        './lyrics_sentiment_dataset/tcc_ceds_music.csv', encoding='cp949')
    print(lyrics_sentiment_dataset.keys())
    lyrics_sentiment_dataset = lyrics_sentiment_dataset.drop(
        ['Unnamed: 0', 'genre',
        'lyrics', 'len', 'dating', 'violence', 'world/life', 'night/time',
        'shake the audience', 'family/gospel', 'romantic', 'communication',
        'obscene', 'music', 'movement/places', 'light/visual perceptions',
        'family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability',
        'loudness', 'acousticness', 'instrumentalness', 'valence', 'energy',
        'topic', 'age'], axis=1)
    
    lyrics_sentiment_dataset.drop_duplicates(subset='track_name', inplace=True)
    lyrics_sentiment_dataset = lyrics_sentiment_dataset.reset_index(drop=True)
    lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset.iloc[9460:14190, 0:]
    
    # Fetch lyrics for every song in this slice
    lyrics1 = lyrics_sentiment_dataset_3.apply(lambda row: get_lyrics(
        row['track_name'], row['artist_name']), axis=1)
    lyrics_sentiment_dataset_3['lyrics'] = lyrics1
    print(lyrics_sentiment_dataset_3.shape)
    # Drop songs whose lyrics were not found
    lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset_3.drop(
        lyrics_sentiment_dataset_3[lyrics_sentiment_dataset_3['lyrics'] == 'not found'].index)
    
    # Use get_lyric_sentiment to get sentiment score for all the song lyrics
    sentiment = lyrics_sentiment_dataset_3.apply(
        lambda row: get_lyric_sentiment(row['lyrics']), axis=1)
    
    for i in lyrics_sentiment_dataset_3.index.tolist():
        lyrics_sentiment_dataset_3.loc[i, 'neg_sentiment'] = sentiment[i]['neg']
        lyrics_sentiment_dataset_3.loc[i, 'neu_sentiment'] = sentiment[i]['neu']
        lyrics_sentiment_dataset_3.loc[i, 'pos_sentiment'] = sentiment[i]['pos']
        lyrics_sentiment_dataset_3.loc[i, 'com_sentiment'] = sentiment[i]['compound']
    
    lyrics_sentiment_dataset_3.to_csv(
        "lyrics_sentiment_dataset.csv", index=False)
    
  2. Drop songs where any sentiment score is 0 or the fetched lyrics do not match the song

    import pandas as pd

    df = pd.read_csv('./lyrics_sentiment_dataset.csv')

    # Drop a song if any of its sentiment scores is exactly 0
    filtered_df = df[(df['neg_sentiment'] != 0.000) & (df['neu_sentiment'] != 0.000) & (df['pos_sentiment'] != 0.000) & (df['com_sentiment'] != 0.000)].copy()

    # Add a lyrics_len column so overly long entries can be sorted out
    for i in filtered_df.index.tolist():
        filtered_df.loc[i, 'lyrics_len'] = len(filtered_df.loc[i, 'lyrics'])

    # Sort by lyrics_len in descending order
    filtered_df.sort_values('lyrics_len', ascending=False, inplace=True)

    # Manual inspection showed the 91 longest entries (positions 0-90 of filtered_df) are not real lyrics, so drop them
    fin_filtered_df = filtered_df.iloc[91:, 0:]

    fin_filtered_df.to_csv("full_lyrics_sentiment_dataset.csv")
    
  3. Compute polarity from com_sentiment and with TextBlob

  • polarity: using VADER, labeled positive (1) if com_sentiment >= 0.05 and negative (0) otherwise.

  • textblob_pol: used the polarity score provided by TextBlob.

    import pandas as pd
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob

    sid_obj = SentimentIntensityAnalyzer()

    # output of step 2
    df = pd.read_csv('./full_lyrics_sentiment_dataset.csv')


    def sent_textblob_polarity(lyrics):
        # TextBlob polarity lies in [-1, 1]; treat > 0 as positive
        polarity_score = TextBlob(lyrics).sentiment.polarity
        if float(polarity_score) > 0:
            return 1
        return 0


    def sent_vader_compound(com):
        # VADER convention: compound >= 0.05 counts as positive
        if float(com) >= 0.05:
            return 1
        return 0


    for i in df.index.tolist():
        lyrics = df.loc[i, "lyrics"]
        df.loc[i, "polarity"] = sent_vader_compound(df.loc[i, "com_sentiment"])
        df.loc[i, "textblob_pol"] = sent_textblob_polarity(lyrics)

    df.to_csv("full_lyrics_polarity_dataset.csv", index=False)
    
  4. Data preprocessing

    import pandas as pd
    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import nltk
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    import re
    
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('omw-1.4')
    nltk.download('wordnet')

    new_stopwords = stopwords.words('english')
    add_stopwords = ["[", "]", "(", ")", ",", "lyrics",
                     "chorus", "!", "?", "``", "oh", "ha", "ah", "-", "yo", "yeah", "uh", "uhh", ":"]
    new_stopwords.extend(add_stopwords)
    
    
    def tokenize(str_lyrics):
        split_lyrics = " ".join(str_lyrics.splitlines())
        sentences = sent_tokenize(split_lyrics)
        init_words = []
        for sentence in sentences:
            word_tokens = word_tokenize(sentence)
            for word in word_tokens:
                word = word.lower()
                init_words.append(word)
    
        return init_words
    
    
    def remove_not_alpha(words_list):
        # keep only purely alphabetic words longer than one character
        # (a list comprehension avoids the skipped-element bug of removing
        # items from a list while iterating over it)
        return [word for word in words_list
                if ("'" not in word) and word.isalpha() and (len(word) > 1)]
    
    
    def remove_stopwords(words_list):
        filtered_words = []
        for word in words_list:
            if word not in new_stopwords:
                filtered_words.append(word)
    
        return filtered_words
    
    
    def lemmatizer(tokenized_list):
        lemmatized_list = []
        for word in tokenized_list:
            word = lemma.lemmatize(word)
            lemmatized_list.append(word)
        return lemmatized_list
    
    
    def get_lyric_sentiment(lyrics):
        sentiment = sid_obj.polarity_scores(lyrics)
        return sentiment
    
    
    lemma = WordNetLemmatizer()
    sid_obj = SentimentIntensityAnalyzer()
    
    df = pd.read_csv('./lyrics_sentiment_dataset/lyrics_sentiment_dataset_3.csv')
    
    
    for i in df.index.tolist():
        lyrics = df.loc[i, 'lyrics']
        # tokenize
        init_words_list = tokenize(lyrics)
        init_words_list = list(set(init_words_list))
        # remove non-alphabetic tokens
        alpha_words_list = remove_not_alpha(init_words_list)
        # remove stopwords
        filtered_word_list = remove_stopwords(alpha_words_list)
        # lemmatizer
        lemmatized_list = lemmatizer(filtered_word_list)
        lemmatized_list = list(set(lemmatized_list))
        # remove non-alphabetic tokens again after lemmatizing
        final_list = remove_not_alpha(lemmatized_list)
        final_str = " ".join(final_list)
    
        df.loc[i, 'lyrics'] = final_str
    
    df.to_csv("lyrics_sentiment_dataset_3_tokenized.csv", index=False)
    

2️⃣ Build input and target data for the BERT model

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('./full_lyrics_polarity_dataset.csv')

X_train, X_test, y_train, y_test = train_test_split(df['lyrics'], df['polarity'], test_size = 0.25, random_state = 32)
X_train.shape
# input set
from transformers import BertTokenizer

# Add BERT-style [CLS]/[SEP] tags to each lyric
train_lyrics = []
test_lyrics = []

for i in X_train:
  one = []
  start = "[CLS] " + str(i)
  split_lyrics = " [SEP]".join(start.split('\n'))
  split_lyrics += " [SEP]"
  one.append(split_lyrics)
  train_lyrics.append(one)

for i in X_test:
  one = []
  start = "[CLS] " + str(i)
  split_lyrics = " [SEP]".join(start.split('\n'))
  split_lyrics += " [SEP]"
  one.append(split_lyrics)
  test_lyrics.append(one)

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)
# alternative checkpoint tried: 'bert-base-uncased'
# note: do_lower_case=True lowercases input even though the checkpoint is cased
tokenized_data = []

# each tagged lyric sits inside its own list, so a double for loop reaches the text
for each_lyrics in train_lyrics:
  for j in each_lyrics:
    tokens = tokenizer.tokenize(j)
    tokenized_data.append(tokens)


tokenized_test_data = []
for each_lyrics in test_lyrics:
  for j in each_lyrics:
    tokens = tokenizer.tokenize(j)
    tokenized_test_data.append(tokens)


# target set
y_train_list = y_train.to_list()
y_test_list = y_test.to_list()
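
Each entry of tokenized_data is now the WordPiece token list for one song, aligned index-for-index with y_train_list. A quick peek (exact tokens depend on the tokenizer checkpoint):

print(len(tokenized_data) == len(y_train_list))  # inputs and targets stay aligned
print(tokenized_data[0][:8])  # e.g. ['[CLS]', 'the', 'moon', ...]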

3️⃣ Model Initialization

import tensorflow_datasets as tfds
import tensorflow as tf
import torch

def convert_example_to_feature(review):
  return tokenizer.encode_plus(review,
                add_special_tokens = False, # [CLS]/[SEP] were already added manually above
                max_length = 256, # max length of the text that can go to BERT
                pad_to_max_length = True, # add [PAD] tokens
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

def encode_examples(data):
  # prepare list, so that we can build up final TensorFlow dataset from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  target_list = []

  for DATA_COL, LABEL_COL in data.to_numpy():
    bert_input = convert_example_to_feature(DATA_COL)
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    target_list.append([LABEL_COL])
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, target_list)).map(map_example_to_dict)

train = pd.DataFrame({'DATA_COL' : tokenized_data, 'LABEL_COL' : y_train_list})
test = pd.DataFrame({'DATA_COL' : tokenized_test_data, 'LABEL_COL' : y_test_list})


# train dataset
train_encoded = encode_examples(train).shuffle(100).batch(16)
# test dataset
test_encoded = encode_examples(test).batch(16)


from transformers import TFBertForSequenceClassification
import tensorflow as tf
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# train for three epochs; more might help as long as the model does not overfit
number_of_epochs = 3
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')

# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# labels are not one-hot, so use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
import torch, gc
gc.collect()
torch.cuda.empty_cache()

model.fit(train_encoded, epochs=number_of_epochs, validation_data=test_encoded)

>>>
Epoch 1/3
1020/1020 [==============================] - 1050s 1s/step - loss: 0.5040 - accuracy: 0.7483 - val_loss: 0.4533 - val_accuracy: 0.7975
Epoch 2/3
1020/1020 [==============================] - 1024s 1s/step - loss: 0.3648 - accuracy: 0.8360 - val_loss: 0.3790 - val_accuracy: 0.8326
Epoch 3/3
1020/1020 [==============================] - 1025s 1s/step - loss: 0.2689 - accuracy: 0.8896 - val_loss: 0.3668 - val_accuracy: 0.8438
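
After training, the final held-out score can be read off with Keras's evaluate, and the fine-tuned weights saved for reuse (a sketch; the save path is a hypothetical choice):

# final held-out metrics, returned in the order (loss, accuracy)
test_loss, test_acc = model.evaluate(test_encoded)
print("test accuracy: {:.4f}".format(test_acc))

# persist the fine-tuned model and tokenizer for later prediction runs
model.save_pretrained("./lyrics_bert_finetuned")
tokenizer.save_pretrained("./lyrics_bert_finetuned")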

4️⃣ Model Prediction

def model_predict(test_sentence):
  predict_input = tokenizer.encode(test_sentence, truncation=True, padding=True, return_tensors="tf")
  tf_output = model.predict(predict_input)[0]
  tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()
  labels = ['Negative', 'Positive']  # (0: negative, 1: positive)
  label = tf.argmax(tf_prediction, axis=1).numpy()

  print(test_sentence, ":: ", "{:.2f}%".format(tf_prediction[0][label[0]] * 100), labels[label[0]])
test_sentence_1 = '''
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
'''
model_predict(test_sentence_1)

>>>
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
 ::  96.57% Positive


test_sentence_2 = '''
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
'''
model_predict(test_sentence_2)
>>>
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
 ::  55.15% Negative


lyrics = '''
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
'''
model_predict(lyrics)
>>>
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
 ::  53.11% Positive
