Python Web Scraping Beginner Notes (Casual Notes, Continuously Updated)

Tags: notes, word, get, python, self, scraping, url, print, resp

`Preparation: install the libraries`

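pip3 install requests

pip3 install beautifulsoup4

apt-get install python3-lxml

(On systems without apt, pip3 install lxml works as well. The parse module imported by the image-scraping script is urllib.parse from the standard library, so it needs no separate install.)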
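
Before running the scripts, it can help to confirm that the libraries are importable. The following is only a sketch: `missing_modules` is a hypothetical helper name, and it relies on the standard library's `importlib.util.find_spec`, which returns `None` for a top-level module that cannot be found.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == '__main__':
    # Import names used by the scripts in these notes
    # (bs4 is the import name of the beautifulsoup4 package).
    missing = missing_modules(['requests', 'bs4', 'lxml'])
    if missing:
        print('Missing:', ', '.join(missing))
    else:
        print('All libraries are installed.')
```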

1. Getting the pages (links) reachable from baidu.com

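import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.baidu.com')  # request the Baidu home page
print(resp.status_code)  # print the HTTP status code of the response
print(resp.content)  # print the raw source of the fetched page (bytes)

bsobj = BeautifulSoup(resp.content, 'lxml')  # parse the source into a BeautifulSoup object for easy querying
a_list = bsobj.find_all('a')  # get every <a> tag object in the page
for a in a_list:  # iterate over the tags
    print(a.get('href'))  # print the tag's href attribute, i.e. the address the link points to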

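Result: the script prints the response status, the page source, and then the href of each <a> tag, one per line.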
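
The href values collected this way are not always usable as they are: some tags have no href at all, some values are relative paths, and some are `javascript:` pseudo-links. A minimal sketch of cleaning them up, assuming the standard library's `urllib.parse.urljoin` is acceptable for resolving relative links; `absolute_links` is a hypothetical helper name and the sample values are made up:

```python
from urllib.parse import urljoin, urlparse


def absolute_links(hrefs, base_url):
    """Resolve href values against base_url, keeping unique http(s) links in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # None (tag without href) or empty string
            continue
        href = href.strip()
        if not href or href.startswith('#'):  # same-page anchors
            continue
        url = urljoin(base_url, href)
        if urlparse(url).scheme not in ('http', 'https'):
            continue  # drops javascript:, mailto:, and similar
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Made-up href values of the kinds a.get('href') can return
sample = ['/more/', None, '//news.baidu.com', 'javascript:;',
          'https://www.baidu.com/more/', '#top']
for link in absolute_links(sample, 'https://www.baidu.com'):
    print(link)
```

With the script above, the input would be `[a.get('href') for a in a_list]` and the base URL would be the page that was requested.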

Saving the data to a txt file

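import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.baidu.com')  # request the Baidu home page
print(resp.status_code)  # print the HTTP status code of the response
print(resp.content)  # print the raw source of the fetched page (bytes)

bsobj = BeautifulSoup(resp.content, 'lxml')  # parse the source into a BeautifulSoup object for easy querying
a_list = bsobj.find_all('a')  # get every <a> tag object in the page
text = ''  # start with an empty string
for a in a_list:
    href = a.get('href')  # the tag's href attribute, i.e. the address the link points to
    if href:  # a.get returns None for tags without href; skip those to avoid a TypeError
        text += href + '\n'  # append the link followed by a newline
with open('url.txt', 'w', encoding='utf-8') as f:  # open 'url.txt' in the current directory for writing; it is created if missing
    f.write(text)  # write the collected links to the file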

Result:

url.txt is created in the current directory, with one link per line.
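
Building the text by repeated string concatenation works for a handful of links. As an alternative sketch, the links can be collected first, de-duplicated, and written in one go; `save_links` is a hypothetical helper name, not something from the original script:

```python
def save_links(hrefs, path):
    """Write the unique, non-empty href values to path, one per line.

    Returns the number of links written.
    """
    # dict.fromkeys drops repeats while keeping first-seen order;
    # the filter drops None and empty strings.
    links = list(dict.fromkeys(h for h in hrefs if h))
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(links))
        if links:
            f.write('\n')  # end the file with a newline
    return len(links)
```

With the script above, a call might look like `save_links([a.get('href') for a in a_list], 'url.txt')`.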

2. Scraping images from the web

# -*- coding:utf8 -*-
import requests
import re
from urllib import parse
import os

class BaiduImageSpider(object):
    def __init__(self):
        self.url = 'https://image.baidu.com/search/flip?tn=baiduimage&word={}'
        self.headers = {'User-Agent':'Mozilla/4.0'}

    # Fetch the result page and download the images found in it
    def get_image(self,url,word):
        # Use the requests module to get the response object
        res = requests.get(url,headers=self.headers)
        # Set the encoding used to decode the response body
        res.encoding = "utf-8"
        # Get the HTML page
        html = res.text
        print(html)
        # Parse with a regular expression
        pattern = re.compile('"hoverURL":"(.*?)"',re.S)
        # List of image URLs
        img_link_list = pattern.findall(html)
        print(img_link_list)

        # Directory in which the images are saved
        directory = 'C:/Users/Administrator/Desktop/image/{}/'.format(word)
        # Create the directory if it does not exist (a common idiom)
        if not os.path.exists(directory):
            os.makedirs(directory)

        # Counter used in the file names
        i = 1
        for img_link in img_link_list:
            # Skip empty hoverURL values; requests.get('') would raise an error
            if not img_link:
                continue
            filename = '{}{}_{}.jpg'.format(directory, word, i)
            self.save_image(img_link,filename)
            i += 1

    # Download a single image
    def save_image(self,img_link,filename):
        html = requests.get(url=img_link,headers=self.headers).content
        with open(filename,'wb') as f:
            f.write(html)
        print(filename,'downloaded successfully')

    # Entry point
    def run(self):
        word = input("Whose photos do you want? ")
        # Percent-encode the keyword so it is safe to put in the URL
        word_parse = parse.quote(word)
        url = self.url.format(word_parse)
        self.get_image(url,word)

if __name__ == '__main__':
    spider = BaiduImageSpider()
    spider.run()
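
The two string-handling steps of the spider, encoding the keyword into the URL and pulling the `hoverURL` values out of the page, can be tried without any network access. This is a sketch under assumptions: the function names are hypothetical, and `sample_html` is a made-up fragment shaped like the text the pattern targets, not real Baidu output.

```python
import re
from urllib.parse import quote, unquote

SEARCH_URL = 'https://image.baidu.com/search/flip?tn=baiduimage&word={}'
HOVER_URL_RE = re.compile(r'"hoverURL":"(.*?)"', re.S)


def build_search_url(word):
    """Percent-encode the keyword and place it in the search URL."""
    return SEARCH_URL.format(quote(word))


def extract_image_links(html):
    """Return the non-empty hoverURL values found in html, without repeats."""
    matches = HOVER_URL_RE.findall(html)
    return list(dict.fromkeys(link for link in matches if link))


# Made-up fragment; the middle entry has an empty hoverURL on purpose
sample_html = ('{"hoverURL":"https://example.com/1.jpg","x":1},'
               '{"hoverURL":"","x":2},'
               '{"hoverURL":"https://example.com/2.jpg","x":3}')
print(build_search_url('猫'))
print(extract_image_links(sample_html))
```

Filtering out empty matches matters because an empty string passed to `requests.get` raises an error instead of downloading anything.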

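Result: each downloaded image is saved in the image/{word}/ directory as {word}_1.jpg, {word}_2.jpg, and so on, and a success message is printed for each file.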
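
The download step writes whatever the server returns, so an error page would also be saved as a .jpg. A more defensive variant might check the status code and content type first. This is only a sketch: `download_image` and its `get` parameter are hypothetical, with `get` standing in for `requests.get` so the function can be tried without network access.

```python
def download_image(img_link, filename, get, headers=None, timeout=10):
    """Fetch img_link with `get` and save the body to filename.

    `get` is any callable with the shape of requests.get (an assumption
    made for this sketch). Returns True if a file was written.
    """
    try:
        res = get(img_link, headers=headers, timeout=timeout)
    except Exception as exc:
        # With requests, network failures are requests.RequestException subclasses.
        print(filename, 'failed:', exc)
        return False
    content_type = res.headers.get('Content-Type', '')
    if res.status_code != 200 or not content_type.startswith('image/'):
        print(filename, 'skipped: status', res.status_code, 'type', content_type)
        return False
    with open(filename, 'wb') as f:
        f.write(res.content)
    return True
```

Inside the spider, the call would look like `download_image(img_link, filename, requests.get, headers=self.headers)`. The timeout keeps one stalled image from blocking the whole run.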

References: https://blog.csdn.net/aaronjny/article/details/77945329
            http://c.biancheng.net/python_spider/crawl-photo.html


Just a quick look tonight; I'll continue tomorrow.

From: https://www.cnblogs.com/qwertyyuiop/p/16705789.html
