python: tokenizing with PyTorch and BERT
In [1]:
import torch
from transformers import AutoModel, AutoTokenizer, BertTokenizer

print(torch.__version__)
# Disable gradient tracking: this post only runs inference.
torch.set_grad_enabled(False)
Out[1]:
In [2]:
# Store the name of the model we want to use
MODEL_NAME = "bert-base-cased"

# We need to create the model and tokenizer
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print('model : ', model)
In [3]:
# Split the sentence into sub-word tokens (WordPiece, not simple whitespace splitting)
tokens = tokenizer.tokenize("This is an input example")
print("Tokens: {}".format(tokens))

# Map each token to its integer id in the vocabulary
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens id: {}".format(tokens_ids))

# Add the required special tokens ([CLS] at the start, [SEP] at the end)
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)
print('tokens_ids: ', tokens_ids, type(tokens_ids))

# We need to convert to a deep-learning-framework-specific format; let's use PyTorch for now.
# list -> tensor
tokens_pt = torch.tensor([tokens_ids])
print("Tokens PyTorch: {}".format(tokens_pt), type(tokens_pt))

# Now we're ready to go through BERT with our input.
# return_dict=False makes the model return a (sequence_output, pooled_output) tuple,
# matching this unpacking (transformers v4+ returns a ModelOutput object by default).
outputs, pooled = model(tokens_pt, return_dict=False)
print(type(outputs))
print("Token-wise output: {}, Pooled output: {}".format(outputs.shape, pooled.shape))
print('')
print(len(outputs))        # batch size
print(len(outputs[0]))     # sequence length
print(len(outputs[0][0]))  # hidden size
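As a quick sanity check (a small added sketch, not part of the original notebook), tokenizer.decode() reverses the id mapping and makes the special tokens added by build_inputs_with_special_tokens visible:

# Round-trip the ids back to text; [CLS]/[SEP] show up explicitly.
print(tokenizer.decode(tokens_ids))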
In [4]:
### transformers: getting started
In [5]:
# tokens = tokenizer.tokenize("This is an input example")
# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
# tokens_pt = torch.tensor([tokens_ids])

# The three steps above can be factored into a single call:
tokens_pt2 = tokenizer("This is an input example", return_tensors="pt")
for key, value in tokens_pt2.items():
    print("{}:\n\t{}".format(key, value))

outputs2, pooled2 = model(**tokens_pt2, return_dict=False)
print("Difference with previous code: ({}, {})".format((outputs2 - outputs).sum(), (pooled2 - pooled).sum()))
In [6]:
# token_type_ids: this tensor maps every token to its corresponding segment (see below).
# attention_mask: this tensor is used to "mask" padded values in a batch of sequences with different lengths (see below).

# Single segment input
single_seg_input = tokenizer("This is a sample input")
print("Single segment token (str): {}".format(tokenizer.convert_ids_to_tokens(single_seg_input['input_ids'])))
print("Single segment token (int): {}".format(single_seg_input['input_ids']))
print("Single segment type       : {}".format(single_seg_input['token_type_ids']))

# Multiple segment input
multi_seg_input = tokenizer("This is segment A", "This is segment B")

# Segments are concatenated in the input to the model, with a [SEP] token in between.
print()
print("Multi segment token (str): {}".format(tokenizer.convert_ids_to_tokens(multi_seg_input['input_ids'])))
print("Multi segment token (int): {}".format(multi_seg_input['input_ids']))
print("Multi segment type       : {}".format(multi_seg_input['token_type_ids']))
In [7]:
# Padding highlight
tokens = tokenizer(
    ["This is a sample", "This is another longer sample text"],
    padding=True  # First sentence will have some PADDED tokens to match second sequence length
)

for i in range(2):
    print("Tokens (int)      : {}".format(tokens['input_ids'][i]))
    print("Tokens (str)      : {}".format([tokenizer.convert_ids_to_tokens(s) for s in tokens['input_ids'][i]]))
    print("Tokens (attn_mask): {}".format(tokens['attention_mask'][i]))
    print()
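Padding's counterpart is truncation. A minimal sketch (the max_length value here is chosen for illustration, not taken from the original post) that pads and caps every sequence at a fixed length:

# Pad AND truncate every sequence to exactly 8 tokens.
capped = tokenizer(
    ["This is a sample", "This is another longer sample text"],
    padding="max_length",
    truncation=True,
    max_length=8,
)
for ids in capped['input_ids']:
    print(len(ids), tokenizer.convert_ids_to_tokens(ids))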
In [8]:
# from transformers import BertTokenizer, BertModel
# tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
# model = BertModel.from_pretrained("bert-base-multilingual-uncased")
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt')
# output = model(**encoded_input)
In [10]:
# Load the same BERT checkpoint in both TensorFlow and PyTorch
from transformers import TFBertModel, BertModel

model_tf = TFBertModel.from_pretrained('bert-base-uncased')
model_pt = BertModel.from_pretrained('bert-base-uncased')
In [11]:
# transformers generates a ready-to-use dictionary with all the required parameters for the specific framework.
input_tf = tokenizer("This is a sample input", return_tensors="tf")
input_pt = tokenizer("This is a sample input", return_tensors="pt")

# Let's compare the outputs. return_dict=False gives plain tuples we can zip over.
output_tf = model_tf(input_tf, return_dict=False)
output_pt = model_pt(**input_pt, return_dict=False)

# The models output 2 values: the hidden state for each token, and the pooled representation of the input sentence.
# Here we compare the output differences between PyTorch and TensorFlow.
for name, o_tf, o_pt in zip(["output", "pooled"], output_tf, output_pt):
    print("{} differences: {:.5}".format(name, (o_tf.numpy() - o_pt.numpy()).sum()))
In [13]:
from transformers import DistilBertModel

bert_distil = DistilBertModel.from_pretrained('distilbert-base-uncased')
input_pt = tokenizer(
    'This is a sample input to demonstrate performance of distilled models especially inference time',
    return_tensors="pt"
)

%time _ = bert_distil(input_pt['input_ids'])
%time _ = model_pt(input_pt['input_ids'])
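To see where the speedup comes from, a small added comparison (not in the original notebook) of the two models' parameter counts; DistilBERT has roughly 40% fewer parameters than BERT-base:

# Count trainable parameters in each model.
n_distil = sum(p.numel() for p in bert_distil.parameters())
n_base = sum(p.numel() for p in model_pt.parameters())
print("DistilBERT params: {:,} / BERT-base params: {:,}".format(n_distil, n_base))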
In [31]:
from transformers import TFBertModel, BertModel, DistilBertModel, AutoModel
from transformers import AutoTokenizer, BertTokenizer

# Let's load a Korean-capable BERT from the multilingual checkpoint
de_bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
de_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

de_input = de_tokenizer(
    "오늘따라 날씨도 안좋게 비가 오지 않고 있어서 버트 테스트를 해보고 있다.그래서 기분이 좋았다. 이유는 날씨가 좋지 않아서다",
    return_tensors="pt"
)
print("Tokens (int)      : {}".format(de_input['input_ids'].tolist()[0]))
print("Tokens (str)      : {}".format([de_tokenizer.convert_ids_to_tokens(s) for s in de_input['input_ids'].tolist()[0]]))
print("Tokens (attn_mask): {}".format(de_input['attention_mask'].tolist()[0]))
print()

output_de, pooled_de = de_bert(**de_input, return_dict=False)
print("Token wise output: {}, Pooled output: {}".format(output_de.shape, pooled_de.shape))