# Exercise 10

In [61]:
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification, TFDistilBertForSequenceClassification, TFBertForMaskedLM

import numpy as np
import datasets
from datasets import load_dataset
import pandas as pd
import tensorflow as tf

Load IMDB dataset in HuggingFace dataset format

Labels: 0 = neg; 1 = pos

In [62]:
dataset = load_dataset("imdb")

In [63]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

These datasets can also be converted to pandas dataframes

In [64]:
df = pd.DataFrame(dataset["train"])
df

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0
...,...,...
24995,A hit at the time but now better categorised a...,1
24996,I love this movie like no other. Another time ...,1
24997,This film and it's sequel Barry Mckenzie holds...,1
24998,'The Adventures Of Barry McKenzie' started lif...,1


And you can also create a dataset in the datasets format if you have pandas dataframes available

In [65]:
new_dataset = datasets.DatasetDict()
new_dataset["train"] = datasets.Dataset.from_pandas(df.head(50))
new_dataset["test"] = datasets.Dataset.from_pandas(df.tail(50))
new_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
})

In [66]:
train = dataset["train"].shuffle(seed=42).select(range(500))

In [67]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
# bert-base-uncased

In [68]:
x_train = dict(tokenizer(train["text"], return_tensors="np", padding='max_length', truncation=True))
y_train = np.array(train["label"])

In [69]:
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")
model.distilbert.trainable = False
model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Model: "tf_distil_bert_for_sequence_classification_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_860 (Dropout)       multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 592130 (2.26 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________


In [70]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
 14/100 [===>..........................] - ETA: 1:47 - loss: 4.8629 - accuracy: 0.5857

KeyboardInterrupt: 

In [39]:
test = dataset["test"].shuffle(seed=42).select(range(100))
x_test = dict(tokenizer(test["text"], return_tensors="np", padding='max_length', truncation=True))
y_test = np.array(test["label"])

In [40]:
print(model.evaluate(x_test, y_test))

[0.3891203701496124, 0.17000000178813934]


In [73]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of Germany is [MASK].", return_tensors="np")
logits = model(**inputs).logits
mask_token_index = tf.where((inputs.input_ids == tokenizer.mask_token_id)[0])
selected_logits = tf.gather_nd(logits[0], indices=mask_token_index)
predicted_token_id = tf.math.argmax(selected_logits, axis=-1)
tokenizer.decode(predicted_token_id)

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


'bonn'