무회blog
python: 200601 - Drawing a word cloud in Python (konlpy, nltk)
## Drawing a wordcloud
# %pip install wordcloud
# %pip install nltk
# %pip install pandas
# %pip install numpy
# %pip install konlpy
##################################################################
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import pandas as pd
from konlpy.tag import Hannanum
hannanum = Hannanum()
from wordcloud import WordCloud
from collections import Counter
# Moon Jae-in presidential inauguration speech, UTF-8 text
with open("..\\00.Data\\문재인대통령취임연설문_utf-8.txt", 'r', encoding='utf-8') as f:
    lines = f.readlines()
lines01 = 'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow'
##################################################################
tokenizer = RegexpTokenizer(r'[\w]+')
stop_words = stopwords.words('english')  # needs nltk.download('stopwords') once; cached under nltk_data\corpora\stopwords\
words = lines01.lower()
# words = str(lines[0:5])
tokens = tokenizer.tokenize(words)
stoped_tokens = [i for i in tokens if i not in stop_words]
stoped_tokens2 = [i for i in stoped_tokens if len(i) > 1]
pd.Series(stoped_tokens2).value_counts().head(10)
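The tokenize-and-filter step above can be reproduced with only the standard library; a minimal sketch, with a tiny hardcoded stopword list standing in for nltk's English list:

```python
import re
from collections import Counter

line = ('Chief Justice Roberts, President Carter, President Clinton, '
        'President Bush, President Obama, fellow')

# A tiny hardcoded stopword list stands in for nltk.corpus.stopwords here.
stop_words = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in'}

tokens = re.findall(r'\w+', line.lower())          # same effect as RegexpTokenizer(r'[\w]+')
kept = [t for t in tokens if t not in stop_words and len(t) > 1]
print(Counter(kept).most_common(3))
# → [('president', 4), ('chief', 1), ('justice', 1)]
```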
##################################################################
# lines
# Extract nouns from every line of the speech
temp = []
for i in range(len(lines)):
    temp.append(hannanum.nouns(lines[i]))
temp = list(filter(bool, temp))  # drop empty lists
def flatten(l):
    flatList = []
    for elem in l:
        if type(elem) == list:
            for e in elem:
                flatList.append(e)
        else:
            flatList.append(elem)
    return flatList
word_list = flatten(temp)
word_list=pd.Series([x for x in word_list if len(x)> 1])
word_list.value_counts().head(10)
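The hand-rolled flatten above can also be expressed with itertools.chain; a minimal sketch, using made-up nested noun lists in place of the Hannanum output:

```python
from itertools import chain
from collections import Counter

# Hypothetical per-line output of hannanum.nouns()
nested = [['대통령', '취임'], ['국민'], ['대통령', '국민']]

flat = list(chain.from_iterable(nested))   # same result as flatten(nested)
flat = [w for w in flat if len(w) > 1]     # drop single-character nouns
print(Counter(flat).most_common(2))
# → [('대통령', 2), ('국민', 2)]
```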
# dir(word_list)
##################################################################
# A Hangul-capable font is required; adjust this path to a local .ttf
font_path = 'D://app_src/anaconda/06-font/나눔바른고딕/CJnXlA0w_D7iilTV5nZ2CsjiEBQ.ttf'
wordcloud = WordCloud(
    font_path=font_path,
    width=800,
    height=800,
    background_color="white"
)
count = Counter(stoped_tokens2)
wordcloud = wordcloud.generate_from_frequencies(count)
# Reference (from the WordCloud source): to_array() returns np.array(self.to_image()),
# i.e. the rendered cloud as an nd-array of size (width, height, 3);
# __array__() simply delegates to to_array().
# array = wordcloud.to_array()
import matplotlib.pyplot as plt
# fig = plt.figure(figsize=(10,10))
# plt.imshow(array,interpolation="bilinear")
# plt.show()
# fig.savefig('wordcloud.png')
count = Counter(word_list)
wordcloud = wordcloud.generate_from_frequencies(count)
array = wordcloud.to_array()
fig = plt.figure(figsize=(10,10))
plt.imshow(array, interpolation='bilinear')
plt.show()
fig.savefig('word_list_cloud.png')