标签：info openpyxl re 模块 home requests data findall

内容概要

第三方模块的下载及使用
网络爬虫及requests模块
网络爬虫实战爬取二手房信息
自动化办公领域模块openpyxl
练习题及答案

第三方模块的下载

第三方模块就类似与别人写好的模块，我们可以直接拿来使用
但是这种模块我们一般需要进行下载才能在我们的python里面使用(下载好后就等同于内置模块)
下载第三方模块的方式
	1.pip工具
    我们如果电脑中有很多版本的python解释器，那么我们一定要注意pip工具是哪一个版本解释器下的
    不同的版本目录有不同的pip工具，否则容易出现使用A解释器，但是模块下载到了B解释器中
我们也可以为了避免pip冲突，在使用的时候添加对应的版本型号
python27   pip2.7
python38   pip3.8
python36   pip3.6
下载第三方模块的句式
	pip install 模块名
下载第三方模块临时切换仓库
	pip install 模块名-i 仓库地址
下载第三方模块指定版本(不指定默认是最新版)
	pip install 模块名==版本号 -i 仓库地址
    
pip 下载第三方模块如果没有指定仓库地址，默认是在Python.org 国外开始的
所以我们如果遇到下载速度很慢的情况下我们就可以使用国内的镜像下载网站进行下载操作

pycharm 提供的快捷下载方式

1.
我们在pychram 中 import 导入一个我们没有下载的第三方模块
在这个语句的上面有红小灯泡的按钮，点击找到里面的install选项
即可安装第三方模块进行使用了

'''下载第三方模块可能会遇到的问题
1. 报错并且有警告信息：
	WARNING: You are using pip version 20.2.1
	可能是由于你的pip工具版本较低，需要更新
	d:\python38\python.exe -m pip install --upgrade pip
	更新完成后再次执行下载第三方模块的命令即可
2.报错并且含有time out 关键字
	说明当前网络不稳定，只需要重新尝试几次或者切换稳定网络即可解决
3.报错没有关键字
	百度解决！
	有一些模块时需要提前准备一些环境才可以顺利下载
4.下载速度很慢：
		pip默认下载的仓库地址是国外的 python.org
		我们可以切换下载的地址
		pip install 模块名 -i 仓库地址
		pip的仓库地址有很多 百度查询即可
		清华大学 ：https://pypi.tuna.tsinghua.edu.cn/simple/
		阿里云：http://mirrors.aliyun.com/pypi/simple/
		中国科学技术大学 ：http://pypi.mirrors.ustc.edu.cn/simple/
		华中科技大学：http://pypi.hustunique.com/
		豆瓣源：http://pypi.douban.com/simple/
		腾讯源：http://mirrors.cloud.tencent.com/pypi/simple
		华为镜像源：https://repo.huaweicloud.com/repository/pypi/simple/
'''

我们可以看到在模块的后面有提示下载的仓库地址，我们根据仓库地址选择我们想要下载的模块即可解决下载速度慢的问题

网络爬虫模块之requests模块

requests模块可以模拟浏览器发送网络请求
我们使用Pip工具或者是pycharm工具下载好这个模块后
import 导入它就可以正常使用这个模块的功能了

import requests


get() get方法是朝指定网址发送请求并获取网址页面数据(相当于在浏览器地址栏中输入网址并回车访问)

res = requests.get('www.baidu.com')
print(res.content)  # 获取Bytes 类型的网页数据（二进制）
res.enconding = 'utf8' # 指定编码
print(res.text)  # 获取字符串类型的网页数据（默认编码为utf8）

网络爬虫实战之爬取二手房数据信息

import requests
import re
res = requests.get(r"https://sh.lianjia.com/ershoufang/pudong/")
home_info = res.text
home_title_list = re.findall(' data-is_focus="" data-sl="">(.*?)</a>',home_info)
# print(home_title_list)  # 二手房标题
home_addr_name = re.findall('<a href=".*?" target=".*?" data-log_index=".*?" data-el="region">(.*?) </a>',home_info)
# print(home_addr)  # xx地址
home_addr_street = re.findall('<div class="positionInfo"><span class="positionIcon"></span><a href=".*?" target="_.*?" data-log_index=".*?" data-el="region">.*?</a>   -  <a href=".*?" target="_blank">(.*?)</a> </div>',home_info)
# print(home_addr_street)  # xx地址-名称
home_data = re.findall('<div class="houseInfo"><span class="houseIcon"></span>(.*?)</div>',home_info)
# print(home_data)
home_watch = re.findall('<div class="followInfo"><span class="starIcon"></span>(.*?)</div>',home_info)
# print(home_watch)
home_money = re.findall('<div class="totalPrice totalPrice2"><i> </i><span class="">(.*?)</span><i>.*?</i></div>',home_info)
home_limit_money = re.findall('<div class="unitPrice" data-hid=".*?" data-rid=".*?" data-price=".*?"><span>(.*?)</span></div>',home_info)


all_home_data = zip(home_title_list,home_addr_name,home_addr_street,home_data,home_watch,home_money,home_limit_money)
with open('all_home_info.txt,','w',encoding='utf8') as f:
    for i in all_home_data:
        f.write("""
                房屋标题:%s
                小区名称:%s
                街道名称:%s
                详细信息:%s
                关注程度:%s
                房屋总价:%s万
                房屋平米单价:%s\n
                """ % i)

自动化办公领域openpyxl模块

1.excel文件后缀名问题
03 版本之前 文件名后缀为.xls
03 版本之后 文件名后缀为.xlsx
2.操作excel表格的第三方模块
	xlwt往表格中写入数据、wlrd从表格中读取数据
    	兼容所有版本的excel文件
    openpyxl 是最近几年比较火的操作excel表格的模块
    	03版本之前的兼容性比较差
 # 也有许多别的模块可以操作excel并且功能更加强大>>>>>>pandas

3.openpyxl 模块操作
官方文档：https://openpyxl.readthedocs.io/en/stable/tutorial.html
    上面有openpyxl的使用说明
from openpyxl import Workbook
# 创建excel文件
变量名 = Workbook()
# 保存excel文件
变量名.save('文件名.xlsx')
# 在一个excel文件中创建多个工作簿（sheet）
home_info_excel = Workbook()
home_sheet1 = home_info_excel.create_sheet('就是牛')
home_sheet2 = home_info_excel.create_sheet('还是牛')
home_sheet3 = home_info_excel.create_sheet('牛上天了')
home_info_excel.save(r'text.xlsx')

我们可以看到excel文件本身就会有一个sheet工作簿
所以我们新增的工作簿都在他的后面
我们也可以通过在creat_sheet('工作簿名称',0) 加参数来控制我们新建工作簿的位置
后面的数字为索引位置

我们也可以二次修改工作簿名称：
原有工作簿.title = '想要修改的工作簿名字'
# 切记，需要先关闭excel在操作，不然会报错
工作簿.sheet_properties.tabColor = "1072BA"  # 修改工作簿的颜色


填写表格文件中数据的方式：
1.
工作簿.append([1,2,3,4]) # 表头字段  每行1,2,3,4
工作簿.append([1,2,3,4])  
工作簿.append([1,2,3,4])
工作簿.append([1,2,3,4])
工作簿.append([1,2,3,4])
工作簿.append([1,2,3,4])
2.
工作簿['坐标'] = 212  # 在当前工作簿指定坐标上写212数据
3.
工作簿.cell(row=1, column=1, value='212')  # row 为行 column 为列 也相当于坐标
# 相当于在第一行第一列写数据212
填写数学公式：
工作簿['坐标'] = '=sum(坐标：坐标)'  

# openpyxl主要用于数据的写入 至于后续的表单操作它并不是很擅长 如果想做需要更高级的模块pandas
pandas模块
import requests
import re
import pandas
from openpyxl import Workbook

res = requests.get(r"https://sh.lianjia.com/ershoufang/pudong/")
home_info = res.text
home_title_list = re.findall(' data-is_focus="" data-sl="">(.*?)</a>', home_info)
# print(home_title_list)  # 二手房标题
home_addr_name = re.findall('<a href=".*?" target=".*?" data-log_index=".*?" data-el="region">(.*?) </a>', home_info)
# print(home_addr)  # xx地址
home_addr_street = re.findall(
    '<div class="positionInfo"><span class="positionIcon"></span><a href=".*?" target="_.*?" data-log_index=".*?" data-el="region">.*?</a>   -  <a href=".*?" target="_blank">(.*?)</a> </div>',
    home_info)
# print(home_addr_street)  # xx地址-名称
home_data = re.findall('<div class="houseInfo"><span class="houseIcon"></span>(.*?)</div>', home_info)
# print(home_data)
home_watch = re.findall('<div class="followInfo"><span class="starIcon"></span>(.*?)</div>', home_info)
# print(home_watch)
home_money = re.findall('<div class="totalPrice totalPrice2"><i> </i><span class="">(.*?)</span><i>.*?</i></div>',
                        home_info)
home_limit_money = re.findall(
    '<div class="unitPrice" data-hid=".*?" data-rid=".*?" data-price=".*?"><span>(.*?)</span></div>', home_info)

info_dict = {'房屋简介': home_title_list,
             '小区地址': home_addr_name,
             '所在街道': home_addr_street,
             '房屋详细信息': home_data,
             '关注度': home_watch,
             '房屋总价': home_money,
             '房屋平米单价': home_limit_money
             }
df_info = pandas.DataFrame(info_dict)
df_info.to_excel(r'home_info.xlsx')
excel软件正常可以打开操作的数据集在10万左右 一旦数据集过大 软件操作几乎无效

练习题及答案

爬取二手房指定页数的数据
import requests
import re

from openpyxl import Workbook
def get_pudong_home_info(x):
    home_data_file = Workbook()
    home_data_file_sheet = home_data_file.create_sheet(rf'{x}页房屋信息',0)
    home_data_file_sheet.append(['房屋信息','小区地址','所在街道','详细信息','受关注程度','房屋总价','房屋平米单价'])
    for num in range(1,(x+1)):
        res = requests.get(rf"https://sh.lianjia.com/ershoufang/pudong/pg{num}/")
        home_info = res.text
        home_title_list = re.findall(' data-is_focus="" data-sl="">(.*?)</a>', home_info)
        home_addr_name = re.findall('<a href=".*?" target=".*?" data-log_index=".*?" data-el="region">(.*?) </a>',home_info)
        home_addr_street = re.findall(
            '<div class="positionInfo"><span class="positionIcon"></span><a href=".*?" target="_.*?" data-log_index=".*?" data-el="region">.*?</a>   -  <a href=".*?" target="_blank">(.*?)</a> </div>',
            home_info)
        home_data = re.findall('<div class="houseInfo"><span class="houseIcon"></span>(.*?)</div>', home_info)
        home_watch = re.findall('<div class="followInfo"><span class="starIcon"></span>(.*?)</div>', home_info)
        home_money = re.findall('<div class="totalPrice totalPrice2"><i> </i><span class="">(.*?)</span><i>.*?</i></div>',home_info)
        home_limit_money = re.findall('<div class="unitPrice" data-hid=".*?" data-rid=".*?" data-price=".*?"><span>(.*?)</span></div>', home_info)
        a = zip(home_title_list, home_addr_name, home_addr_street, home_data, home_watch, home_money, home_limit_money)
        for i in a:
            home_data_file_sheet.append(i)
    home_data_file.save('home_info_openpy.xlsx')
    print(f'已生成{x}页的二手房信息~')

get_pudong_home_info(10)

标签：info,openpyxl,re,模块,home,requests,data,findall
From： https://www.cnblogs.com/ddsuifeng/p/16829669.html

requests模块/openpyxl模块/简单爬虫实战

内容概要

第三方模块的下载

pycharm 提供的快捷下载方式

网络爬虫模块之requests模块

网络爬虫实战之爬取二手房数据信息

自动化办公领域openpyxl模块

练习题及答案

相关文章

赞助商

阅读排行