# [Ai] NLP-Toy-Project
## Description
A model that classifies song lyrics as expressing positive or negative sentiment.

https://github.com/sooking87/NLP_Toy_Proj/blob/master/fine_tunning.ipynb
## Environment
Python
## Contributions
| 손수경 | 김수빈 | 김은비 | 정승훈 | 정지영 | 정회성 |
|---|---|---|---|---|---|
| 👑 Team lead | Member | Member | Member | Member | Member |
## Progress
### 1️⃣ Get dataset

- Fetch lyrics with the Genius API and compute sentiment scores with VADER:
```python
import lyricsgenius
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

# Return the lyrics of a song given its title and artist
def get_lyrics(title, artist):
    try:
        return genius.search_song(title, artist).lyrics
    except Exception:
        return 'not found'

# Return the VADER sentiment scores of a song's lyrics
def get_lyric_sentiment(lyrics):
    return sid_obj.polarity_scores(lyrics)

# Fetch lyrics and sentiment scores for each artist/title in the original file
genius = lyricsgenius.Genius(
    "DuO42xKa4Ts70InLe_Y_strEpeL_CxowzCtXyAMaiNlbAOVOTfFpt2q5FdP4lo_U")
sid_obj = SentimentIntensityAnalyzer()

lyrics_sentiment_dataset = pd.read_csv(
    './lyrics_sentiment_dataset/tcc_ceds_music.csv', encoding='cp949')
print(lyrics_sentiment_dataset.keys())
lyrics_sentiment_dataset = lyrics_sentiment_dataset.drop(
    ['Unnamed: 0', 'genre', 'lyrics', 'len', 'dating', 'violence', 'world/life',
     'night/time', 'shake the audience', 'family/gospel', 'romantic',
     'communication', 'obscene', 'music', 'movement/places',
     'light/visual perceptions', 'family/spiritual', 'like/girls', 'sadness',
     'feelings', 'danceability', 'loudness', 'acousticness', 'instrumentalness',
     'valence', 'energy', 'topic', 'age'], axis=1)
lyrics_sentiment_dataset.drop_duplicates(subset='track_name', inplace=True)
lyrics_sentiment_dataset = lyrics_sentiment_dataset.reset_index(drop=True)
lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset.iloc[9460:14190, 0:]

# Fetch the lyrics of each song
lyrics1 = lyrics_sentiment_dataset_3.apply(lambda row: get_lyrics(
    row['track_name'], row['artist_name']), axis=1)
lyrics_sentiment_dataset_3['lyrics'] = lyrics1
print(lyrics_sentiment_dataset_3.shape)

# Drop rows whose lyrics were not found
lyrics_sentiment_dataset_3 = lyrics_sentiment_dataset_3.drop(
    lyrics_sentiment_dataset_3[lyrics_sentiment_dataset_3['lyrics'] == 'not found'].index)

# Use get_lyric_sentiment to get a sentiment score for every song's lyrics
sentiment = lyrics_sentiment_dataset_3.apply(
    lambda row: get_lyric_sentiment(row['lyrics']), axis=1)
for i in lyrics_sentiment_dataset_3.index.tolist():
    lyrics_sentiment_dataset_3.loc[i, 'neg_sentiment'] = sentiment[i]['neg']
    lyrics_sentiment_dataset_3.loc[i, 'neu_sentiment'] = sentiment[i]['neu']
    lyrics_sentiment_dataset_3.loc[i, 'pos_sentiment'] = sentiment[i]['pos']
    lyrics_sentiment_dataset_3.loc[i, 'com_sentiment'] = sentiment[i]['compound']

lyrics_sentiment_dataset_3.to_csv("lyrics_sentiment_dataset.csv", index=False)
```
- Drop songs where any sentiment score is 0, or where the fetched lyrics do not match the song:
```python
import pandas as pd

df = pd.read_csv('./lyrics_sentiment_dataset.csv')

# Drop a row if any of its sentiment scores is exactly 0
filtered_df = df[(df['neg_sentiment'] != 0.000) &
                 (df['neu_sentiment'] != 0.000) &
                 (df['pos_sentiment'] != 0.000) &
                 (df['com_sentiment'] != 0.000)].copy()

# Add a lyrics_len column so overly long lyrics can be dropped
for i in filtered_df.index.tolist():
    filtered_df.loc[i, 'lyrics_len'] = len(filtered_df.loc[i, 'lyrics'])

# Sort by lyrics_len in descending order
filtered_df.sort_values('lyrics_len', ascending=False, inplace=True)

# Manual inspection showed the first 91 rows of filtered_df must be removed
fin_filtered_df = filtered_df.iloc[91:, 0:]
fin_filtered_df.to_csv("full_lyrics_sentiment_dataset.csv")
```
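The zero-score filter above is just a boolean mask over the score columns. A minimal standalone sketch (the score values here are made up for illustration):

```python
import pandas as pd

# Toy rows with hypothetical sentiment scores; only the first row has no zeros
df = pd.DataFrame({
    'neg_sentiment': [0.1, 0.0, 0.2],
    'neu_sentiment': [0.5, 0.6, 0.7],
    'pos_sentiment': [0.4, 0.4, 0.0],
    'com_sentiment': [0.3, 0.2, -0.1],
})
# Keep a row only if every score column is non-zero
filtered = df[(df != 0.0).all(axis=1)]
print(len(filtered))  # -> 1
```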
- Compute polarity using com_sentiment and TextBlob:
  - polarity: using VADER, labeled pos if com_sentiment >= 0.05, otherwise neg.
  - textblob_pol: uses the polarity score provided by TextBlob.

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

sid_obj = SentimentIntensityAnalyzer()
df = pd.read_csv('./full_lyrics_sentiment_dataset.csv')

def sent_textblob_polarity(lyrics):
    # TextBlob polarity score: label positive if > 0
    polarity_score = TextBlob(lyrics).sentiment.polarity
    if float(polarity_score) > 0:
        return 1
    return 0

def sent_vader_compound(com):
    # VADER convention: compound >= 0.05 counts as positive
    if float(com) >= 0.05:
        return 1
    return 0

for i in df.index.tolist():
    lyrics = df.loc[i, "lyrics"]
    df.loc[i, "polarity"] = sent_vader_compound(df.loc[i, "com_sentiment"])
    df.loc[i, "textblob_pol"] = sent_textblob_polarity(lyrics)

df.to_csv("full_lyrics_polarity_dataset.csv", index=False)
```
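For reference, VADER's documentation suggests a three-way convention around the same ±0.05 compound cutoff; the project collapses this to a binary label (neg and neu together become 0). A minimal sketch of the three-way rule:

```python
# Common three-way reading of VADER's compound score (>= 0.05 positive,
# <= -0.05 negative, otherwise neutral)
def vader_label(compound: float) -> str:
    if compound >= 0.05:
        return 'pos'
    if compound <= -0.05:
        return 'neg'
    return 'neu'

print(vader_label(0.42))   # -> pos
print(vader_label(-0.3))   # -> neg
print(vader_label(0.01))   # -> neu
```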
- Data preprocessing:
```python
import nltk
import pandas as pd
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
nltk.download('wordnet')

new_stopwords = stopwords.words('english')
add_stopwords = ["[", "]", "(", ")", ",", "lyrics", "chorus", "!", "?", "``",
                 "oh", "ha", "ah", "-", "yo", "yeah", "uh", "uhh", ":"]
new_stopwords.extend(add_stopwords)

def tokenize(str_lyrics):
    # Flatten line breaks, split into sentences, then lowercase word tokens
    split_lyrics = " ".join(str_lyrics.splitlines())
    init_words = []
    for sentence in sent_tokenize(split_lyrics):
        for word in word_tokenize(sentence):
            init_words.append(word.lower())
    return init_words

def remove_not_alpha(words_list):
    # Keep only purely alphabetic tokens longer than one character
    return [word for word in words_list
            if "'" not in word and word.isalpha() and len(word) > 1]

def remove_stopwords(words_list):
    return [word for word in words_list if word not in new_stopwords]

def lemmatizer(tokenized_list):
    return [lemma.lemmatize(word) for word in tokenized_list]

lemma = WordNetLemmatizer()
df = pd.read_csv('./lyrics_sentiment_dataset/lyrics_sentiment_dataset_3.csv')

for i in df.index.tolist():
    lyrics = df.loc[i, 'lyrics']
    # tokenize and deduplicate
    init_words_list = list(set(tokenize(lyrics)))
    # remove non-alphabetic tokens
    alpha_words_list = remove_not_alpha(init_words_list)
    # remove stopwords
    filtered_word_list = remove_stopwords(alpha_words_list)
    # lemmatize and deduplicate again
    lemmatized_list = list(set(lemmatizer(filtered_word_list)))
    # remove non-alphabetic tokens once more (lemmatization can change tokens)
    final_list = remove_not_alpha(lemmatized_list)
    df.loc[i, 'lyrics'] = " ".join(final_list)

df.to_csv("lyrics_sentiment_dataset_3_tokenized.csv", index=False)
```
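One detail worth noting in the token-cleanup step: removing items from a Python list while iterating over it silently skips elements, which is why a filtering helper should build a new list instead of calling `list.remove` in the loop. A small illustration:

```python
def buggy_filter(words):
    # Removing during iteration shifts indices, so the element after each
    # removal is never inspected
    for w in words:
        if not w.isalpha():
            words.remove(w)
    return words

def safe_filter(words):
    # Building a new list inspects every element exactly once
    return [w for w in words if w.isalpha()]

print(buggy_filter(["ab", "1", "2", "cd"]))  # -> ['ab', '2', 'cd'] ("2" slips through)
print(safe_filter(["ab", "1", "2", "cd"]))   # -> ['ab', 'cd']
```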
### 2️⃣ Build input and target data for the BERT model
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

df = pd.read_csv('./full_lyrics_polarity_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df['lyrics'], df['polarity'], test_size=0.25, random_state=32)
X_train.shape

# input set
# Attach the BERT special tags: [CLS] at the start, [SEP] at every line break
train_lyrics = []
test_lyrics = []
for i in X_train:
    one = []
    start = "[CLS] " + str(i)
    split_lyrics = " [SEP]".join(start.split('\n'))
    split_lyrics += " [SEP]"
    one.append(split_lyrics)
    train_lyrics.append(one)
for i in X_test:
    one = []
    start = "[CLS] " + str(i)
    split_lyrics = " [SEP]".join(start.split('\n'))
    split_lyrics += " [SEP]"
    one.append(split_lyrics)
    test_lyrics.append(one)

# alternatives tried: 'bert-base-multilingual-cased', 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)

tokenized_data = []
# The tags were attached inside a 2-D list, so a nested loop reaches each lyric
for each_lyrics in train_lyrics:
    for j in each_lyrics:
        tokens = tokenizer.tokenize(j)
        tokenized_data.append(tokens)
tokenized_test_data = []
for each_lyrics in test_lyrics:
    for j in each_lyrics:
        tokens = tokenizer.tokenize(j)
        tokenized_test_data.append(tokens)

# target set
y_train_list = y_train.to_list()
y_test_list = y_test.to_list()
```
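The tagging loop above can be condensed into one helper; this sketch (the function name is ours) shows exactly what a two-line lyric turns into:

```python
# [CLS] opens the sequence and every original line break becomes a [SEP]
# boundary, with one trailing [SEP] appended at the end
def add_bert_tags(lyric: str) -> str:
    start = "[CLS] " + lyric
    return " [SEP]".join(start.split("\n")) + " [SEP]"

print(add_bert_tags("first line\nsecond line"))
# -> [CLS] first line [SEP]second line [SEP]
```

Note that the joiner is `" [SEP]"` with no trailing space, so the next line follows the tag directly; this matches the original loop.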
### 3️⃣ Model initialization
```python
import tensorflow_datasets as tfds
import tensorflow as tf

def convert_example_to_feature(review):
    return tokenizer.encode_plus(review,
                                 add_special_tokens=False,   # [CLS]/[SEP] were added manually above
                                 max_length=256,             # max length of text that can go to BERT
                                 pad_to_max_length=True,     # add [PAD] tokens
                                 return_attention_mask=True, # mask so BERT ignores the padding
                                 )

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label

def encode_examples(data):
    # prepare lists so the final TensorFlow dataset can be built from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    target_list = []
    for DATA_COL, LABEL_COL in data.to_numpy():
        bert_input = convert_example_to_feature(DATA_COL)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        target_list.append([LABEL_COL])
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, target_list)
    ).map(map_example_to_dict)

train = pd.DataFrame({'DATA_COL': tokenized_data, 'LABEL_COL': y_train_list})
test = pd.DataFrame({'DATA_COL': tokenized_test_data, 'LABEL_COL': y_test_list})

# train dataset
train_encoded = encode_examples(train).shuffle(100).batch(16)
# test dataset
test_encoded = encode_examples(test).batch(16)

from transformers import TFBertForSequenceClassification

# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# 3 epochs; more might help as long as the model does not overfit
number_of_epochs = 3
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')
# Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# labels are integers, not one-hot vectors, so use sparse categorical
# cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

import torch, gc
gc.collect()
torch.cuda.empty_cache()

model.fit(train_encoded, epochs=number_of_epochs, validation_data=test_encoded)
```
```
>>>
Epoch 1/3
1020/1020 [==============================] - 1050s 1s/step - loss: 0.5040 - accuracy: 0.7483 - val_loss: 0.4533 - val_accuracy: 0.7975
Epoch 2/3
1020/1020 [==============================] - 1024s 1s/step - loss: 0.3648 - accuracy: 0.8360 - val_loss: 0.3790 - val_accuracy: 0.8326
Epoch 3/3
1020/1020 [==============================] - 1025s 1s/step - loss: 0.2689 - accuracy: 0.8896 - val_loss: 0.3668 - val_accuracy: 0.8438
```
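Since the labels are plain integers, the loss above needs no one-hot encoding: per example it is simply the negative log of the softmax probability assigned to the true class. A pure-Python sketch of what `SparseCategoricalCrossentropy(from_logits=True)` computes for a single example:

```python
import math

def sparse_ce_from_logits(logits, y):
    # softmax over raw logits, then negative log-likelihood of true class y
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[y])

print(round(sparse_ce_from_logits([2.0, 0.5], 0), 4))  # small loss: class 0 dominates
```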
### 4️⃣ Model prediction
```python
def model_predict(test_sentence):
    predict_input = tokenizer.encode(test_sentence, truncation=True, padding=True, return_tensors="tf")
    tf_output = model.predict(predict_input)[0]
    tf_prediction = tf.nn.softmax(tf_output, axis=1)
    labels = ['Negative', 'Positive']  # (0: negative, 1: positive)
    label = tf.argmax(tf_prediction, axis=1)
    label = label.numpy()
    tf_prediction = tf_prediction.numpy()
    print(test_sentence, ":: ", "{:.2f}%".format(abs(tf_prediction.take(label[0])) * 100), labels[label[0]])

test_sentence_1 = '''
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
'''
model_predict(test_sentence_1)
```
```
>>>
There once was a king named Midas who did a good deed for a Satyr.
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold.
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
:: 96.57% Positive
```
```python
test_sentence_2 = '''
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
'''
model_predict(test_sentence_2)
```
```
>>>
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
:: 55.15% Negative
```
```python
lyrics = '''
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
'''
model_predict(lyrics)
```
```
>>>
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
:: 53.11% Positive
```
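The decoding inside `model_predict` is a softmax over the two logits followed by an argmax; the winning class's probability becomes the printed confidence. A standalone sketch (the logit values here are made up):

```python
import math

def decode(logits, labels=('Negative', 'Positive')):
    # softmax -> probabilities; argmax -> predicted label and its confidence
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

label, conf = decode([-0.2, 1.1])
print(label, "{:.2f}%".format(conf * 100))  # -> Positive 78.58%
```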