7 분 소요

🔮 IMDB 리뷰 데이터셋

이 데이터 셋은 인터넷 영화 데이터베이스인 imdb.com에서 수집한 리뷰를 감상평에 따라 긍정과 부정으로 분류해 놓은 데이터 셋이다. 각 단어에 고유한 정수를 부여하여 컴퓨터에서 처리한다.

  • 토큰: 분리된 단어

1️⃣ 데이터셋 가져오기

전체 데이터셋에서 가장 자주 등장하는 단어 500개만 사용하고, 검증 세트 20%를 떼어 놓는다.

from tensorflow.keras.datasets import imdb
from sklearn.model_selection import train_test_split

(train_input, train_target), (test_input, test_target) = imdb.load_data(num_words=500)

train_input, val_input, train_target, val_target = train_test_split(train_input, train_target, test_size = 0.2, random_state = 42)

그 후 훈련 세트의 단어 길이를 살펴보니 평균이 178개의 단어였고, 100개 미만인 경우도 많았다. => 💡 sol. 길이가 100이 되도록 시퀀스 데이터 길이를 맞추는 pad_sequences() 함수 를 사용한다.

from tensorflow.keras.preprocessing.sequence import pad_sequences
train_seq = pad_sequences(train_input, maxlen=100)
val_seq = pad_sequences(val_input, maxlen = 100)
  • train_seq: (20000, 100)의 크기를 가지고 있으며 20000개의 샘플의 개수, 100개의 단어 개수를 가지고 있는 훈련 세트임을 알 수 있다.
  • val_seq: (5000, 100)의 크기를 가지고 있으며 5000개의 샘플의 개수와 100개의 단어 개수를 가지고 있는 테스트 세트임을 알 수 있다.

2️⃣ 순환 신경망 만들기

📍 원-핫 인코딩 방식으로

from tensorflow import keras

model = keras.Sequential()
model.add(keras.layers.SimpleRNN(8, input_shape=(100, 500)))
model.add(keras.layers.Dense(1, activation='sigmoid'))
train_oh = keras.utils.to_categorical(train_seq)
val_oh = keras.utils.to_categorical(val_seq)
model.summary()

>>>
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 simple_rnn (SimpleRNN)      (None, 8)                 4072

 dense (Dense)               (None, 1)                 9

=================================================================
Total params: 4,081
Trainable params: 4,081
Non-trainable params: 0
_________________________________________________________________
  1. Dense와 Conv2D 클래스 대신 SimpleRNN 클래스를 사용한다.
  2. (100, 500) 차원의 input을 넣어준다. => 왜 100, 500? => 샘플의 길이가 100이고, 정수들 사이에 관련이 없도록 만들어주기 위해서 원-핫 인코딩으로 바꾸어 준다.(토큰 마다 500개의 숫자를 사용해 표현)

📍 단어 임베딩을 사용하기

단어 임베딩이란 순환 신경망에서 텍스트를 처리할 때 즐겨 사용하는 방법이다. 단어 임베딩은 각 단어를 고정된 크기의 실수 벡터로 바꾸어 준다.

model2 = keras.Sequential()
model2.add(keras.layers.Embedding(500, 16, input_length=100))
model2.add(keras.layers.SimpleRNN(8))
model2.add(keras.layers.Dense(1, activation='sigmoid'))
model2.summary()

>>>
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 100, 16)           8000

 simple_rnn_1 (SimpleRNN)    (None, 8)                 200

 dense_1 (Dense)             (None, 1)                 9

=================================================================
Total params: 8,209
Trainable params: 8,209
Non-trainable params: 0
_________________________________________________________________

3️⃣ 순환 신경망 훈련하기

📍 원-핫 인코딩 방식

rmsprop = keras.optimizers.RMSprop(learning_rate=1e-4)
model.compile(optimizer=rmsprop, loss='binary_crossentropy', metrics=['accuracy'])

checkpoint_cb = keras.callbacks.ModelCheckpoint('best-simplernn-model.h5', save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

history = model.fit(train_oh, train_target, epochs=100, batch_size=64,  validation_data=(val_oh, val_target), callbacks=[checkpoint_cb, early_stopping_cb])

>>>
Epoch 1/100
313/313 [==============================] - 14s 42ms/step - loss: 0.6858 - accuracy: 0.5464 - val_loss: 0.6697 - val_accuracy: 0.5910
Epoch 2/100
313/313 [==============================] - 14s 45ms/step - loss: 0.6547 - accuracy: 0.6227 - val_loss: 0.6414 - val_accuracy: 0.6368
Epoch 3/100
313/313 [==============================] - 13s 42ms/step - loss: 0.6289 - accuracy: 0.6618 - val_loss: 0.6185 - val_accuracy: 0.6788
Epoch 4/100
313/313 [==============================] - 13s 41ms/step - loss: 0.6045 - accuracy: 0.6939 - val_loss: 0.5949 - val_accuracy: 0.7040
Epoch 5/100
313/313 [==============================] - 13s 41ms/step - loss: 0.5813 - accuracy: 0.7167 - val_loss: 0.5769 - val_accuracy: 0.7244
Epoch 6/100
313/313 [==============================] - 13s 41ms/step - loss: 0.5629 - accuracy: 0.7358 - val_loss: 0.5603 - val_accuracy: 0.7394
Epoch 7/100
313/313 [==============================] - 14s 44ms/step - loss: 0.5459 - accuracy: 0.7474 - val_loss: 0.5510 - val_accuracy: 0.7330
Epoch 8/100
313/313 [==============================] - 13s 41ms/step - loss: 0.5315 - accuracy: 0.7571 - val_loss: 0.5342 - val_accuracy: 0.7554
Epoch 9/100
313/313 [==============================] - 13s 41ms/step - loss: 0.5185 - accuracy: 0.7672 - val_loss: 0.5238 - val_accuracy: 0.7570
Epoch 10/100
313/313 [==============================] - 13s 41ms/step - loss: 0.5069 - accuracy: 0.7745 - val_loss: 0.5138 - val_accuracy: 0.7672
Epoch 11/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4964 - accuracy: 0.7801 - val_loss: 0.5057 - val_accuracy: 0.7682
Epoch 12/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4875 - accuracy: 0.7857 - val_loss: 0.5018 - val_accuracy: 0.7656
Epoch 13/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4794 - accuracy: 0.7891 - val_loss: 0.4972 - val_accuracy: 0.7672
Epoch 14/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4728 - accuracy: 0.7922 - val_loss: 0.4867 - val_accuracy: 0.7758
Epoch 15/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4663 - accuracy: 0.7941 - val_loss: 0.4807 - val_accuracy: 0.7826
Epoch 16/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4606 - accuracy: 0.7982 - val_loss: 0.4770 - val_accuracy: 0.7832
Epoch 17/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4556 - accuracy: 0.8000 - val_loss: 0.4725 - val_accuracy: 0.7818
Epoch 18/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4505 - accuracy: 0.8023 - val_loss: 0.4688 - val_accuracy: 0.7848
Epoch 19/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4472 - accuracy: 0.8048 - val_loss: 0.4678 - val_accuracy: 0.7848
Epoch 20/100
313/313 [==============================] - 14s 44ms/step - loss: 0.4431 - accuracy: 0.8073 - val_loss: 0.4644 - val_accuracy: 0.7866
Epoch 21/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4399 - accuracy: 0.8073 - val_loss: 0.4633 - val_accuracy: 0.7850
Epoch 22/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4368 - accuracy: 0.8075 - val_loss: 0.4629 - val_accuracy: 0.7864
Epoch 23/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4336 - accuracy: 0.8102 - val_loss: 0.4617 - val_accuracy: 0.7828
Epoch 24/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4316 - accuracy: 0.8102 - val_loss: 0.4600 - val_accuracy: 0.7882
Epoch 25/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4290 - accuracy: 0.8116 - val_loss: 0.4596 - val_accuracy: 0.7858
Epoch 26/100
313/313 [==============================] - 13s 42ms/step - loss: 0.4268 - accuracy: 0.8120 - val_loss: 0.4552 - val_accuracy: 0.7884
Epoch 27/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4248 - accuracy: 0.8136 - val_loss: 0.4550 - val_accuracy: 0.7876
Epoch 28/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4232 - accuracy: 0.8126 - val_loss: 0.4606 - val_accuracy: 0.7844
Epoch 29/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4209 - accuracy: 0.8146 - val_loss: 0.4552 - val_accuracy: 0.7886
Epoch 30/100
313/313 [==============================] - 13s 41ms/step - loss: 0.4199 - accuracy: 0.8163 - val_loss: 0.4620 - val_accuracy: 0.7846
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train', 'val'])
plt.show()

download1

토큰 1개를 500차원으로 늘렸기 때문에 약 500배가 커진다. -> 썩 좋은 방법은 아니므로 더 좋은 단어 표현 방법을 알아보자

📍 단어 임베딩 방식

rmsprop = keras.optimizers.RMSprop(learning_rate=1e-4)
model2.compile(optimizer=rmsprop, loss='binary_crossentropy', metrics=['accuracy'])

checkpoint_cb = keras.callbacks.ModelCheckpoint('best-embedding-model.h5', save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

history = model2.fit(train_seq, train_target, epochs=100, batch_size=64, validation_data=(val_seq, val_target), callbacks=[checkpoint_cb, early_stopping_cb])

>>>
Epoch 1/100
313/313 [==============================] - 9s 24ms/step - loss: 0.6820 - accuracy: 0.5606 - val_loss: 0.6515 - val_accuracy: 0.6328
Epoch 2/100
313/313 [==============================] - 8s 25ms/step - loss: 0.6274 - accuracy: 0.6729 - val_loss: 0.6079 - val_accuracy: 0.7062
Epoch 3/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5927 - accuracy: 0.7193 - val_loss: 0.5888 - val_accuracy: 0.7222
Epoch 4/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5696 - accuracy: 0.7453 - val_loss: 0.5679 - val_accuracy: 0.7460
Epoch 5/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5521 - accuracy: 0.7605 - val_loss: 0.5535 - val_accuracy: 0.7526
Epoch 6/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5366 - accuracy: 0.7703 - val_loss: 0.5489 - val_accuracy: 0.7490
Epoch 7/100
313/313 [==============================] - 7s 24ms/step - loss: 0.5235 - accuracy: 0.7763 - val_loss: 0.5343 - val_accuracy: 0.7684
Epoch 8/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5125 - accuracy: 0.7832 - val_loss: 0.5256 - val_accuracy: 0.7648
Epoch 9/100
313/313 [==============================] - 7s 23ms/step - loss: 0.5000 - accuracy: 0.7898 - val_loss: 0.5170 - val_accuracy: 0.7676
Epoch 10/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4894 - accuracy: 0.7944 - val_loss: 0.5087 - val_accuracy: 0.7732
Epoch 11/100
313/313 [==============================] - 8s 24ms/step - loss: 0.4803 - accuracy: 0.7979 - val_loss: 0.5032 - val_accuracy: 0.7740
Epoch 12/100
313/313 [==============================] - 8s 24ms/step - loss: 0.4715 - accuracy: 0.8026 - val_loss: 0.4963 - val_accuracy: 0.7764
Epoch 13/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4628 - accuracy: 0.8050 - val_loss: 0.4897 - val_accuracy: 0.7800
Epoch 14/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4560 - accuracy: 0.8070 - val_loss: 0.4838 - val_accuracy: 0.7800
Epoch 15/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4494 - accuracy: 0.8098 - val_loss: 0.4826 - val_accuracy: 0.7782
Epoch 16/100
313/313 [==============================] - 7s 24ms/step - loss: 0.4440 - accuracy: 0.8116 - val_loss: 0.4768 - val_accuracy: 0.7798
Epoch 17/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4372 - accuracy: 0.8144 - val_loss: 0.4763 - val_accuracy: 0.7762
Epoch 18/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4335 - accuracy: 0.8150 - val_loss: 0.4740 - val_accuracy: 0.7782
Epoch 19/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4286 - accuracy: 0.8167 - val_loss: 0.4896 - val_accuracy: 0.7726
Epoch 20/100
313/313 [==============================] - 7s 24ms/step - loss: 0.4248 - accuracy: 0.8178 - val_loss: 0.4798 - val_accuracy: 0.7752
Epoch 21/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4207 - accuracy: 0.8201 - val_loss: 0.4704 - val_accuracy: 0.7806
Epoch 22/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4180 - accuracy: 0.8195 - val_loss: 0.4732 - val_accuracy: 0.7784
Epoch 23/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4140 - accuracy: 0.8220 - val_loss: 0.4710 - val_accuracy: 0.7804
Epoch 24/100
313/313 [==============================] - 7s 23ms/step - loss: 0.4122 - accuracy: 0.8225 - val_loss: 0.4709 - val_accuracy: 0.7784
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train', 'val'])
plt.show()

download2

출력 결과는 비슷하지만 원-핫 인코딩을 사용한 모델보다 사용한 훈련 세트의 크기가 줄었다. 또한 순환층의 가중치 개수 역시 준 것을 확인할 수 있다.

댓글남기기