Informatization Hot-Word Analysis in Python

Date: 2023-08-29 21:23:02

Tags: fp, word, informatization hot words, python, words, import, txt, data

Environment setup

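# Install the requests library
pip install requests
# Install the bs4 library (BeautifulSoup)
pip install bs4
# Install the jieba library
pip install jieba
# Install the selenium library
pip install selenium
# Install the lxml library
pip install lxml
# Install the matplotlib library
pip install matplotlib
# Install the numpy library
pip install numpy
# Install the Pillow library
pip install Pillow
# Install the wordcloud library
pip install wordcloud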

Tools

chromedriver

Previous versions: chromedriver.storage.googleapis.com/index.html

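This project crawls the recommended news on cnblogs (博客园), extracts the 100 highest-weighted keywords from the headlines using TF-IDF, and renders them as a word-cloud image.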
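To make the keyword-ranking idea concrete, here is a minimal TF-IDF sketch that uses only the standard library. It is not what the project itself runs: the project relies on jieba's `extract_tags`, which scores words against a prebuilt IDF table, so its weights would differ. The function name `top_keywords`, the smoothed IDF formula, and the sample headlines are all assumptions made for illustration.

```python
import math
from collections import Counter


def top_keywords(docs, top_k=3):
    """Rank words by summed TF-IDF over a list of pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)
            # Smoothed IDF (an assumed variant): a word that occurs in
            # every document still gets a positive score.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            scores[word] += tf * idf
    return scores.most_common(top_k)


# Hypothetical, already-segmented headlines (the project gets tokens from jieba).
headlines = [
    ["AI", "chip", "launch"],
    ["AI", "model", "release"],
    ["AI", "chip", "shortage"],
]
print(top_keywords(headlines, top_k=2))
```

Here each headline is treated as one document. A word that recurs across many headlines accumulates score from its term frequency, while the IDF factor gives each occurrence of a rarer word more weight.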

Project code

import requests
from bs4 import BeautifulSoup
import jieba
import jieba.analyse as anls
from selenium import webdriver
from lxml import etree
import re
from time import sleep
import os
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud
from selenium.webdriver.common.keys import Keys


def get_words():
    """Crawl 100 pages of recommended news and append every headline to data/words.txt."""
    url = "https://news.cnblogs.com/n/recommend"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188"
    }
    for i in range(100):
        # Pages are numbered from 1 and passed as a query parameter.
        parm = {
            "page": i + 1
        }
        response = requests.request(method="post", url=url, params=parm, headers=header)
        response.encoding = "utf-8"
        soup = BeautifulSoup(response.text, "lxml")
        # Each news headline sits in an <h2 class="news_entry"> element.
        h2 = soup.find_all('h2', class_='news_entry')
        with open("data/words.txt", "a", encoding="utf-8") as fp:
            for j in h2:
                fp.write(j.find("a").text + "\n")


def words_filter():
    """Segment the headlines, filter the tokens, and save the top 100 TF-IDF keywords."""
    # Load the custom IT vocabulary and the stop-word list.
    jieba.load_userdict("data/THUOCL_IT.txt")
    anls.set_stop_words("data/stopword.txt")
    with open("data/words.txt", "r", encoding="utf-8") as fp:
        text = fp.read()
    # Segment in precise mode (cut_all=False).
    words_list = jieba.lcut(text, cut_all=False)
    # Keep dictionary words longer than one character that are not purely numeric.
    words_filter_list = [word for word in words_list
                         if word in jieba.dt.FREQ and len(word) > 1 and not word.isnumeric()]
    with open("data/words_res.txt", "a", encoding="utf-8") as fp:
        for word in words_filter_list:
            fp.write(word + "\n")
    with open("data/words_res.txt", "r", encoding="utf-8") as fp:
        res = fp.read()
    print("Keywords extracted with TF-IDF:")
    # Write one "word<TAB>weight" line per keyword.
    with open("data/words_ans.txt", "a", encoding="utf-8") as fp:
        for x, w in anls.extract_tags(res, topK=100, withWeight=True):
            fp.write('%s\t%s\n' % (x, w))
            print('%s\t%s' % (x, w))
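

def words_explain():
    """Look up each keyword on Sogou Baike and save its summary to data/words_explain.txt."""
    # Selenium 4 API: the driver path is passed through Service and elements are
    # located with By. Imported here so the function is self-contained.
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    word_list = []
    with open("data/words_ans.txt", "r", encoding="utf-8") as fp:
        for line in fp:
            word_list.append(line.split("\t")[0])
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    chro = webdriver.Chrome(service=Service(executable_path="data/chromedriver.exe"), options=option)
    with open("data/words_explain.txt", "a", encoding="utf-8") as fp:
        for i in word_list:
            chro.get("https://baike.sogou.com")
            sleep(2)
            query = chro.find_element(By.ID, "searchText")
            query.clear()
            query.send_keys(i, Keys.ENTER)
            sleep(2)
            page_text = chro.page_source
            tree = etree.HTML(page_text)
            # Collect every text node under the entry's abstract block.
            divs = tree.xpath('//div[@class="abstract"]//text()')
            # Strip all whitespace from each fragment and join the fragments.
            explain = ''.join(re.sub('\\s', '', j) for j in divs)
            print(i)
            fp.write(i + "\t" + explain + "\n")
    chro.quit()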


def word_cloud():
    """Render the extracted keywords as a word cloud shaped by data/bg.jfif."""
    word_list = []
    with open("data/words_ans.txt", "r", encoding="utf-8") as fp:
        for line in fp:
            word_list.append(line.strip().split("\t")[0])
    text = "/".join(word_list)
    # The image acts as a mask: pure-white areas are left empty.
    maskph = np.array(Image.open('data/bg.jfif'))
    # A font with Chinese glyphs (SimHei) is needed to render the keywords.
    wordcloud = WordCloud(mask=maskph, background_color='white', font_path='data/SimHei.ttf',
                          margin=2).generate(text)

    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    # The output file name 词云 means "word cloud".
    wordcloud.to_file("data/词云.jpg")




if __name__ == '__main__':
    if os.path.exists("data/words.txt"): os.remove("data/words.txt")
    if os.path.exists("data/words_res.txt"): os.remove("data/words_res.txt")
    if os.path.exists("data/words_ans.txt"): os.remove("data/words_ans.txt")
    if os.path.exists("data/words_explain.txt"): os.remove("data/words_explain.txt")
    get_words()
    words_filter()
    words_explain()
    word_cloud()
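
Note that word_cloud() reads only the first column of data/words_ans.txt, so every keyword reaches WordCloud exactly once and the TF-IDF weights do not affect the font sizes. One possible refinement is sketched below under stated assumptions: the helper `parse_weights` is hypothetical and not part of the original project, and the sample lines merely imitate the "word<TAB>weight" format that words_filter() writes.

```python
def parse_weights(lines):
    """Turn 'word<TAB>weight' lines into a {word: weight} dict, skipping bad lines."""
    weights = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # blank or malformed line
        word, raw_weight = parts
        try:
            weights[word] = float(raw_weight)
        except ValueError:
            continue  # the weight column is not a number
    return weights


# Hypothetical sample in the same format as data/words_ans.txt.
sample = ["芯片\t0.52\n", "人工智能\t0.31\n", "\n"]
print(parse_weights(sample))
```

The resulting dict could then be passed to `WordCloud(...).generate_from_frequencies(weights)` in place of `.generate(text)`, so that higher-weighted keywords are drawn larger.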

Source: https://www.cnblogs.com/liyiyang/p/17665866.html
