python: 200608-python, LDA topic extraction test 001

무회blog

python: 200608-python, LDA topic extraction test 001_success, tomoto

최무회 2020. 6. 8. 16:12

cankaoshu001.xlsx

0.02MB

point

k_cnt      = 5                         # number of topics (rows of the result table), an integer from 1 to 32767
top_n_cnt  = 7                        # number of top words listed per topic
min_cf_cnt = 10                       # minimum collection frequency; words occurring fewer times are dropped (0 keeps every word)
alpha_cnt  = 0.1                        # Dirichlet hyperparameter for the document-topic distribution
eta_cnt    = 0.01                       # Dirichlet hyperparameter for the topic-word distribution
tran_cnt   = 500                        # number of training iterations


model     = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt)
model_PMI = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.PMI)
model_IDF = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.IDF)
model_ONE = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.ONE)  # ONE: every term is weighted equally

testLda.startfunc(model,model_PMI,model_IDF,model_ONE)

df01 = pd.DataFrame(dic01)
df01
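The four models above differ only in `min_cf` and in the term-weighting scheme `tw`. As a rough illustration of what a minimum collection frequency and IDF-style weighting do, the standalone sketch below computes both by hand on a tiny hypothetical corpus. It uses only the standard library and the common textbook formula log(N / df); it is not tomotopy's internal implementation, and the names `docs`, `collection_frequency`, `idf_weights`, and `filter_by_min_cf` are made up for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["국민", "대통령", "국민", "나라"],
    ["국민", "정부", "대통령"],
    ["국민", "약속", "나라"],
]

def collection_frequency(docs):
    """Count how often each word occurs across the whole corpus."""
    return Counter(word for doc in docs for word in doc)

def idf_weights(docs):
    """Textbook IDF: log(N / df), where df is the number of documents containing the word."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def filter_by_min_cf(docs, min_cf):
    """Drop words whose collection frequency is below min_cf."""
    cf = collection_frequency(docs)
    return [[w for w in doc if cf[w] >= min_cf] for doc in docs]

cf = collection_frequency(docs)
idf = idf_weights(docs)
filtered = filter_by_min_cf(docs, min_cf=2)
```

In this toy corpus a word that appears in every document gets an IDF weight of zero, while rarer words get larger weights; with `TermWeight.ONE` all words would count the same.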

# for_005_topikTs_tomoto
# %pip install tomotopy
# %pip install nltk
# Korean preprocessing
# %pip install --upgrade kiwipiepy
# %pip install konlpy
# import nltk
# nltk.download()

import tomotopy as tp 
import pandas as pd
import numpy as np
import nltk.stem, nltk.corpus, nltk.tokenize, re
# from konlpy.tag import Kkma
# from konlpy.utils import pprint
from kiwipiepy import Kiwi
# kkma = Kkma()
kiwi = Kiwi()
kiwi.prepare()  # needed by the kiwipiepy release used here; newer releases may no longer require this call


# ######################## # # #  Korean preprocessing

filepath = './testfile/문재인대통령취임연설문_ansi.txt'
stemmer = nltk.stem.porter.PorterStemmer()
stopwords = set()  # NLTK's stopwords corpus does not appear to include a 'korean' list, so stopwords.words('korean') would fail; short words are filtered in startfunc instead
def tokenize(sent):
    res, score = kiwi.analyze(sent)[0] # use the first analysis result
    return [word
            for word, tag, _, _ in res
            if not tag.startswith('E')
            and not tag.startswith('J')
            and not tag.startswith('S')] # drop endings (E), particles (J), and symbols (S)

class testLda:
    @staticmethod
    def startfunc(model,model_PMI,model_IDF,model_ONE):
        models = (model, model_PMI, model_IDF, model_ONE)
        with open(filepath) as f:                                      # the file name suggests ANSI encoding; pass encoding= if the platform default differs
            for line in f:
                token0 = [wd for wd in tokenize(line) if len(wd) > 1]  # single-character tokens are treated as stopwords and dropped
                if not token0:                                         # skip lines that leave no tokens
                    continue
                for m in models:                                       # add the preprocessed tokens to every model
                    m.add_doc(token0)
        for m in models:                                               # train all four models, not only the first one
            m.train(tran_cnt)
        for i in range(model.k):
            ttx1= ', '.join(w for w, p in model.get_topic_words(i,top_n=top_n_cnt))
            ttx2= ', '.join(w for w, p in model_PMI.get_topic_words(i, top_n=top_n_cnt))
            ttx3= ', '.join(w for w, p in model_IDF.get_topic_words(i, top_n=top_n_cnt))
            ttx4= ', '.join(w for w, p in model_ONE.get_topic_words(i, top_n=top_n_cnt))

            # strip Latin letters, '@', '.', and ',' from the joined strings
            ttx1 = re.sub('[a-zA-Z@.,]','',ttx1)
            ttx2 = re.sub('[a-zA-Z@.,]','',ttx2)
            ttx3 = re.sub('[a-zA-Z@.,]','',ttx3)
            ttx4 = re.sub('[a-zA-Z@.,]','',ttx4)

            li_model.append(ttx1)
            li_model_PMI.append(ttx2)
            li_model_IDF.append(ttx3)
            li_model_ONE.append(ttx4)

        # one column per model, one row per topic
        dic01['lda_model'] = li_model
        dic01['lda_PMI'] = li_model_PMI
        dic01['lda_IDF'] = li_model_IDF
        dic01['lda_ONE'] = li_model_ONE
    #     print('Topic #{}'.format(i), end='\t')
    
# containers for the tokens and for the per-model topic strings
dic01 = {}
token0 = []
li_model = [] 
li_model_PMI = []
li_model_IDF = []
li_model_ONE = []

k_cnt      = 5                         # number of topics (rows of the result table), an integer from 1 to 32767
top_n_cnt  = 7                        # number of top words listed per topic
min_cf_cnt = 10                       # minimum collection frequency; words occurring fewer times are dropped (0 keeps every word)
alpha_cnt  = 0.1                        # Dirichlet hyperparameter for the document-topic distribution
eta_cnt    = 0.01                       # Dirichlet hyperparameter for the topic-word distribution
tran_cnt   = 500                        # number of training iterations


model     = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt)
model_PMI = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.PMI)
model_IDF = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.IDF)
model_ONE = tp.LDAModel(k=k_cnt, alpha=alpha_cnt,eta = eta_cnt, min_cf=min_cf_cnt,tw=tp.TermWeight.ONE)  # ONE: every term is weighted equally

testLda.startfunc(model,model_PMI,model_IDF,model_ONE)

df01 = pd.DataFrame(dic01)
df01
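To make the two text-cleaning steps in the script easier to follow without installing kiwipiepy, here is a small standalone sketch. It assumes the analyzer output has already been reduced to `(word, tag)` pairs; the sample pairs and the helper names `keep_content_words` and `clean_topic_string` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical analyzer output as (word, tag) pairs; real output comes from kiwi.analyze.
sample = [("국민", "NNG"), ("의", "JKG"), ("나라", "NNG"), ("를", "JKO"),
          ("만들", "VV"), ("겠", "EP"), ("습니다", "EF"), (".", "SF")]

def keep_content_words(pairs, min_len=2):
    """Drop endings (E*), particles (J*), symbols (S*), and words shorter than min_len."""
    return [word for word, tag in pairs
            if not tag.startswith(("E", "J", "S")) and len(word) >= min_len]

def clean_topic_string(words):
    """Join words with ', ' and then strip Latin letters, '@', '.', and ',' as the script does."""
    return re.sub('[a-zA-Z@.,]', '', ', '.join(words))

tokens = keep_content_words(sample)
line = clean_topic_string(tokens + ["LDA"])
```

Because the regular expression removes the commas as well, the words in each table cell end up separated by spaces only, and any purely Latin-letter word disappears entirely.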

Attribution-NonCommercial-NoDerivatives
