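Getting Reacquainted with Regular Expressions

Tags: getting reacquainted, matching, re, regex, base, result, http, findall

Reference 1: https://tool.oschina.net/uploads/apidocs/jquery/regexp.html

Prerequisites

Regex matching

A regular expression matches by pattern rather than by exact comparison: a single pattern can describe many different strings.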

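Character classes [ ]

[a] matches the letter a
[ab] matches a or b
[abc] matches a, b, or c

[a-z] matches any single lowercase letter
[A-Z] matches any single uppercase letter
[0-9] matches any single digit
[a-zA-Z] matches any single letter
[a-zA-Z0-9] matches any single letter or digit

[b][c]  matches bc
[a][bc] matches ab or ac; equivalent to ab|ac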
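As a quick check of the last rule, the sketch below compares [a][bc] with the alternation ab|ac. The sample strings are made up for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Made-up candidates: only "ab" and "ac" should be accepted.
samples = ["ab", "ac", "ad", "bc"]

# Each bracket pair consumes exactly one character from its set.
class_hits = [s for s in samples if re.fullmatch(r"[a][bc]", s)]
# The alternation spells out the same two strings explicitly.
alt_hits = [s for s in samples if re.fullmatch(r"ab|ac", s)]

print(class_hits)  # ['ab', 'ac']
print(alt_hits)    # ['ab', 'ac']
```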

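Repetition counts

{m} repeats the preceding expression exactly m times
[a-z][a-z][a-z][a-z][a-z]  matches 5 lowercase letters
[a-z]{5} matches 5 lowercase letters

{m,n} repeats the preceding expression m to n times
[a-z]{2,4} matches 2 to 4 lowercase letters

{m,} repeats the preceding expression at least m times
[a-z]{4,} matches 4 or more lowercase letters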
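The three quantifier forms can be tried side by side. This is a small sketch with invented sample words; it also shows that the bounds must be written with a comma and no spaces, since otherwise Python's re treats the braces as literal text.

```python
import re

words = ["ab", "abcd", "abcde", "abcdefg"]  # invented samples

exactly_five = [w for w in words if re.fullmatch(r"[a-z]{5}", w)]
two_to_four = [w for w in words if re.fullmatch(r"[a-z]{2,4}", w)]
at_least_four = [w for w in words if re.fullmatch(r"[a-z]{4,}", w)]

print(exactly_five)   # ['abcde']
print(two_to_four)    # ['ab', 'abcd']
print(at_least_four)  # ['abcd', 'abcde', 'abcdefg']

# With a space inside the braces the quantifier is no longer recognised:
# the pattern now means "one letter followed by the literal text {2, 4}".
spaced_on_letters = re.fullmatch(r"[a-z]{2, 4}", "abcd")
spaced_on_literal = re.fullmatch(r"[a-z]{2, 4}", "a{2, 4}")
print(spaced_on_letters)              # None
print(spaced_on_literal is not None)  # True
```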

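Optional ?

Makes the preceding expression optional (zero or one occurrence)
[a-z]?
is equivalent to
[a-z]{0,1}

-?[1-9]  matches a single digit from 1 to 9 with an optional leading minus sign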
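A short sketch of the optional minus sign, using a handful of hand-picked inputs and re.fullmatch so that nothing may surround the digit:

```python
import re

pattern = r"-?[1-9]"

# Map each hand-picked input to whether the whole string fits the pattern.
results = {s: bool(re.fullmatch(pattern, s)) for s in ["7", "-7", "--7", "0", "12"]}
print(results)
# {'7': True, '-7': True, '--7': False, '0': False, '12': False}
```

Only one minus sign is allowed, 0 is outside [1-9], and "12" fails because the pattern describes a single digit.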

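Any character except a newline .

In Python, . matches every character except \n (it does match \r); pass re.S to make it match \n as well
It is rarely used alone; it is usually combined with other constructs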
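The behaviour of . around line endings is easy to verify. The sketch below assumes a made-up string with a Windows-style \r\n line ending:

```python
import re

text = "line1\r\nline2"  # made-up sample with a \r\n line ending

# By default . stops at \n but happily consumes \r.
without_flag = re.findall(r".+", text)
# With re.S (re.DOTALL) the newline is matched too.
with_flag = re.findall(r".+", text, re.S)

print(without_flag)  # ['line1\r', 'line2']
print(with_flag)     # ['line1\r\nline2']
```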

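Zero or more repetitions of the preceding expression *

Equivalent to {0,}

An important combination .*?

Non-greedy (lazy): each match is as short as possible, so findall tends to return several short results

Matches any character except a newline, any number of times, non-greedy

An important combination .*

Greedy: each match is as long as possible, so findall usually returns one long result

Matches any character except a newline, any number of times, greedy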
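The difference between the two combinations shows up as soon as the delimiter occurs more than once. A minimal sketch with an invented sentence containing two quoted words:

```python
import re

text = 'say "hello" and "bye"'  # invented sample

# Greedy: runs from the first quote to the last quote on the line.
greedy = re.findall(r'".*"', text)
# Lazy: stops at the nearest closing quote, so each quoted word is separate.
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"hello" and "bye"']
print(lazy)    # ['"hello"', '"bye"']
```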

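One or more repetitions +

Equivalent to {1,}

An important combination .+?

Non-greedy (lazy)

Matches any character except a newline, at least once, non-greedy

An important combination .+

Greedy

Matches any character except a newline, at least once, greedy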

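Alternation |

Example: match a mobile phone number or a QQ number
1[3-9][0-9]{9}|[1-9][0-9]{4,10}
Do not put spaces around | or inside {4,10}: whitespace in a pattern is matched literally

Capture groups ( )

Whatever the parenthesized part matches is returned separately
(1[3-9][0-9]{9})|([1-9][0-9]{4,10})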
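Because each alternative sits in its own group, the group that is not None tells you which branch matched. The helper name classify and the sample numbers below are hypothetical, written only to illustrate this:

```python
import re

# Group 1: 11-digit phone number starting with 1[3-9].
# Group 2: QQ number, 5 to 11 digits with no leading zero.
pattern = re.compile(r"(1[3-9][0-9]{9})|([1-9][0-9]{4,10})")


def classify(number):
    """Return 'phone', 'qq', or None for a candidate string (illustrative helper)."""
    m = pattern.fullmatch(number)
    if m is None:
        return None
    # Only the group belonging to the branch that matched is filled in.
    return "phone" if m.group(1) is not None else "qq"


print(classify("13812345678"))  # phone
print(classify("10001"))        # qq
print(classify("012345"))       # None
```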

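Start and end anchors ^ $

^  matches at the start of the string
$  matches at the end of the string

Rarely needed when scraping, because the target is usually a fragment inside a larger page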
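Anchors matter more for validation than for extraction. In this sketch, is_phone is a hypothetical helper that reuses the phone pattern from the alternation example; note that $ in Python still tolerates a single trailing newline.

```python
import re


def is_phone(text):
    # ^ and $ pin the pattern to the start and end of the string.
    return re.search(r"^1[3-9][0-9]{9}$", text) is not None


anchored_plain = is_phone("13812345678")
anchored_prefixed = is_phone("tel:13812345678")
# Without anchors the same digits are found inside the longer string.
unanchored = re.search(r"1[3-9][0-9]{9}", "tel:13812345678").group()

print(anchored_plain)     # True
print(anchored_prefixed)  # False
print(unanchored)         # 13812345678
```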

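Any digit \d

\d  is equivalent to [0-9] (for str patterns Python also accepts other Unicode decimal digits unless re.ASCII is set)

Any non-digit \D

\D  is equivalent to [^0-9]

Any letter, digit, or underscore \w

\w  is equivalent to [0-9a-zA-Z_] with re.ASCII; by default it also matches Unicode letters and digits, such as Chinese characters

Any character other than a letter, digit, or underscore \W

\W  is equivalent to [^0-9a-zA-Z_]

Whitespace and non-whitespace

\s  matches any whitespace character: space, \t, \n, \r, \f, \v
\S  matches any non-whitespace character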
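The ASCII equivalences above hold exactly only when re.ASCII is set. The sketch below uses a made-up string containing an accented letter and two full-width digits (written as escapes) to show the default Unicode behaviour next to the ASCII one.

```python
import re

# Made-up sample: ASCII word, accented word, two full-width digits (U+FF14, U+FF12).
text = "user_01 caf\u00e9 \uff14\uff12"

unicode_words = re.findall(r"\w+", text)          # default: Unicode-aware
ascii_words = re.findall(r"\w+", text, re.ASCII)  # restricted to [0-9a-zA-Z_]

unicode_digits = re.findall(r"\d", "a1\uff14")    # \d accepts the full-width digit
ascii_digits = re.findall(r"[0-9]", "a1\uff14")   # the explicit class does not

whitespace = re.findall(r"\s", "a b\tc\r\n")

print(unicode_words)  # ['user_01', 'café', '４２']
print(ascii_words)    # ['user_01', 'caf']
print(whitespace)     # [' ', '\t', '\r', '\n']
```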

The Python re module

search: find the first match

# Returns as soon as a match is found and ignores the rest of the string
In [2]: import re

In [3]: text = 'a233abv2s'

In [4]: re.search('[a-z]', text)
Out[4]: <re.Match object; span=(0, 1), match='a'>

In [5]: re.search('[a-z]{2}', text)
Out[5]: <re.Match object; span=(4, 6), match='ab'>

In [6]: re.search('[a-z]{2,}', text)
Out[6]: <re.Match object; span=(4, 7), match='abv'>

match

# The match must start at position 0, otherwise it fails; similar to search with a leading ^
In [15]: import re

In [16]: re.search(r'\d', '123abc')
Out[16]: <re.Match object; span=(0, 1), match='1'>

In [17]: re.match(r'\d', '123abc')
Out[17]: <re.Match object; span=(0, 1), match='1'>

In [18]: re.search(r'\d', 'abc123')
Out[18]: <re.Match object; span=(3, 4), match='1'>

In [19]: [re.match(r'\d', 'abc123')]
Out[19]: [None]

In [20]: [re.search(r'^\d', 'abc123')]
Out[20]: [None]

findall: find every match and return a list

In [29]: import re

In [30]: re.findall(r'\d', '4567ghjkk')
Out[30]: ['4', '5', '6', '7']

In [31]: re.findall(r'\d', 'python')
Out[31]: []

In [32]: re.findall(r'\d{2}', '4567ghjkk')
Out[32]: ['45', '67']

In [39]: base_str = '<div>I am an HTML tag</div><div>I am a div tag</div><div></div>'

In [40]: re.findall('<div>.*?</div>', base_str)
Out[40]: ['<div>I am an HTML tag</div>', '<div>I am a div tag</div>', '<div></div>']

In [41]: re.findall('<div>.*</div>', base_str)
Out[41]: ['<div>I am an HTML tag</div><div>I am a div tag</div><div></div>']

In [42]: re.findall('<div>.+?</div>', base_str)
Out[42]: ['<div>I am an HTML tag</div>', '<div>I am a div tag</div>']

In [43]: re.findall('<div>.+</div>', base_str)
Out[43]: ['<div>I am an HTML tag</div><div>I am a div tag</div><div></div>']

In [44]: # With a capture group, findall returns only what the parentheses matched

In [45]: re.findall('<div>(.*?)</div>', base_str)
Out[45]: ['I am an HTML tag', 'I am a div tag', '']

In [46]: re.findall('<div>(.+?)</div>', base_str)
Out[46]: ['I am an HTML tag', 'I am a div tag']

In [47]: # With several groups, each result is a tuple with one item per group

In [48]: re.findall('(<div>(.+?)</div>)', base_str)
Out[48]: [('<div>I am an HTML tag</div>', 'I am an HTML tag'), ('<div>I am a div tag</div>', 'I am a div tag')]

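Matching across newlines re.S

When writing scrapers it is usually worth passing re.S (short for re.DOTALL) as the third argument, flags, so that . also matches newlines

# Example 1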
import re


base_html = """
<a href="http://www.baidu.com">百度</a>
<a href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</a>
"""

result = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html)
print(result)  # [('http://www.baidu.com', '百度'), ('http://www.taobao.com', '淘宝'), ('http://3.cn', '京东')]

# Example 2: without re.S, links whose text spans two lines are missed
import re


base_html = """
<a href="http://www.baidu.com">百度</a>
<a href="http://www.taobao.com">淘
宝</a>
<a href="http://3.cn">京
东</a>
"""

result = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html)
print(result)  # [('http://www.baidu.com', '百度')]

# Example 3: add re.S so that . also matches newlines; the captured newlines then have to be cleaned up afterwards
import re


base_html = """
<a href="http://www.baidu.com">百度</a>
<a href="http://www.taobao.com">淘
宝</a>
<a href="http://3.cn">京
东</a>
"""

result = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html, re.S)  # note the re.S flag
print(result)  # [('http://www.baidu.com', '百度'), ('http://www.taobao.com', '淘\n宝'), ('http://3.cn', '京\n东')]
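One possible way to do the follow-up cleaning is to strip whitespace from the captured titles with re.sub. This is only a sketch: removing all whitespace suits these short Chinese titles, but for text that contains meaningful spaces you would more likely replace runs of whitespace with a single space.

```python
import re

base_html = """
<a href="http://www.taobao.com">淘
宝</a>
<a href="http://3.cn">京
东</a>
"""

pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html, re.S)
# Drop every whitespace character (including \n) from each captured title.
cleaned = [(link, re.sub(r"\s+", "", title)) for link, title in pairs]

print(cleaned)  # [('http://www.taobao.com', '淘宝'), ('http://3.cn', '京东')]
```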

Case-insensitive matching re.I

# Option 1: without re.I, list both cases in a character class
import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

result = re.findall(r'<[aA] href="(.*?)">(.*?)</[aA]>', base_html)
print(result)  # [('http://www.baidu.com', '百度'), ('http://www.taobao.com', '淘宝'), ('http://3.cn', '京东')]

# Option 2: with re.I
import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

result = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html, re.I)  # re.I makes matching case-insensitive; without it matching is case-sensitive
print(result)  # [('http://www.baidu.com', '百度'), ('http://www.taobao.com', '淘宝'), ('http://3.cn', '京东')]

Newlines plus case-insensitivity re.S | re.I

Flags can be combined with the bitwise OR operator |

import re


base_html = """
<A href="http://www.baidu.com">百
度</A>
<A href="http://www.taobao.com">淘
宝</a>
<a href="http://3.cn">京
东</A>
"""

result = re.findall(r'<a href="(.*?)">(.*?)</a>', base_html, re.S | re.I)  # combine flags with |
print(result)  # [('http://www.baidu.com', '百\n度'), ('http://www.taobao.com', '淘\n宝'), ('http://3.cn', '京\n东')]

finditer: like findall, but returns an iterator of Match objects

import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

result = re.finditer(r'<a href="(.*?)">(.*?)</a>', base_html, re.S|re.I)
print(result)  # <callable_iterator object at 0x0000016D16F90AC0>

for line in result:
    print(line, type(line))

# <re.Match object; span=(1, 38), match='<A href="http://www.baidu.com">百度</A>'> <class 're.Match'>
# <re.Match object; span=(39, 77), match='<A href="http://www.taobao.com">淘宝</a>'> <class 're.Match'>
# <re.Match object; span=(78, 106), match='<a href="http://3.cn">京东</A>'> <class 're.Match'>

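Getting values out of a re.Match object

# group()        returns the whole match
# group(index)   index 0 (the default) is the whole match; index n returns the n-th capture group
# groups()       returns all capture groups as a tuple; usually the most convenient, then index into it as needed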

import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

result = re.finditer(r'<a href="(.*?)">(.*?)</a>', base_html, re.S|re.I)

item = next(result)  # take the first Match object from the iterator
# print(item.group()) # <A href="http://www.baidu.com">百度</A>
# print('group(index)', item.group(0)) # group(index) <A href="http://www.baidu.com">百度</A>
# print('group(index>0)', item.group(1))  # group(index>0) http://www.baidu.com
print("groups", item.groups())  # groups ('http://www.baidu.com', '百度')
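Besides group() and groups(), a Match object also reports where the match sits in the input. The sketch below uses re.search on a short made-up snippet; keep in mind that search returns None when nothing matches, so real code should check before calling these methods.

```python
import re

snippet = '<p><a href="http://3.cn">JD</a></p>'  # made-up input
m = re.search(r'<a href="(.*?)">(.*?)</a>', snippet)

if m is not None:
    print(m.group(0))            # <a href="http://3.cn">JD</a>
    print(m.group(1, 2))         # ('http://3.cn', 'JD')
    print(m.span())              # (3, 31)
    print(m.start(1), m.end(1))  # 12 23
```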

Naming a capture group ?P

# Define a name
(?P<name>...)

# Read it back
match_object.group('name')  # the name given in the pattern

import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

result = re.finditer(r'<a href="(?P<link>.*?)">(?P<title>.*?)</a>', base_html, re.S|re.I) 

for line in result:
    title = line.group('title')
    link = line.group('link')
    print(f"{title}'s official site is {link}")


# 百度's official site is http://www.baidu.com
# 淘宝's official site is http://www.taobao.com
# 京东's official site is http://3.cn


# finditer returns an iterator, which has no group() method; to read only the first match, use re.search, which returns a single Match object
result = re.search(r'<a href="(?P<link>.*?)">(?P<title>.*?)</a>', base_html, re.S | re.I)

print(result.group('title'))
print(result.group('link'))

# 百度
# http://www.baidu.com
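Named groups pair naturally with Match.groupdict(), which returns all named groups as a dictionary. A small sketch with made-up HTML (the English titles are illustrative only):

```python
import re

pattern = re.compile(r'<a href="(?P<link>.*?)">(?P<title>.*?)</a>', re.S | re.I)
html = '<A href="http://www.baidu.com">Baidu</A><a href="http://3.cn">JD</a>'  # made-up input

# One dict per match, keyed by the group names from the pattern.
records = [m.groupdict() for m in pattern.finditer(html)]

print(records)
# [{'link': 'http://www.baidu.com', 'title': 'Baidu'}, {'link': 'http://3.cn', 'title': 'JD'}]
```

Dictionaries like these are convenient to pass on to csv.DictWriter or to serialize as JSON.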

split: split a string by a pattern and return a list

The result may contain empty strings, which you have to filter out yourself

In [1]: import re

In [2]: re.split(r'\s', 'abc\r13\nasb')
Out[2]: ['abc', '13', 'asb']

In [3]: re.split(r'\d', 'abc13asb')
Out[3]: ['abc', '', 'asb']

In [4]: re.split('[a-z]', 'abc13asb')
Out[4]: ['', '', '', '13', '', '', '']

In [5]: re.split(r'\S', 'abc\r13\nasb')
Out[5]: ['', '', '', '\r', '', '\n', '', '', '']

# To collect just the whitespace characters themselves, findall is the better tool
In [8]: re.findall(r'\s', 'abc\rmsd\n')
Out[8]: ['\r', '\n']

In [9]: # The third argument, maxsplit, limits the number of splits

In [10]: re.split(r'\d', 'abc13asb34bn', maxsplit=2)
Out[10]: ['abc', '', 'asb34bn']
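Two common ways to deal with the empty strings are filtering the result or splitting on a run of separators with +. A sketch using the same sample strings as above:

```python
import re

parts = re.split(r"\d", "abc13asb")
filtered = [p for p in parts if p]          # drop empty strings afterwards
by_runs = re.split(r"\d+", "abc13asb")      # treat "13" as one separator
limited = re.split(r"\d+", "abc13asb34bn", maxsplit=1)

print(parts)     # ['abc', '', 'asb']
print(filtered)  # ['abc', 'asb']
print(by_runs)   # ['abc', 'asb']
print(limited)   # ['abc', 'asb34bn']
```

Splitting on \d+ still produces an empty string when the input starts or ends with a separator, so the filter remains useful in that case.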

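Compiling a pattern compile

Use case: the same pattern has to be applied to many strings. Compiling it once avoids parsing the pattern repeatedly and keeps the code tidy; if every pattern is used only once, compiling makes no practical difference.

# Demo 1
In [16]: import re

    # the argument is a regular expression
In [17]: pattern = re.compile(r'\d')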

In [18]: pattern.findall('3j43k4j3k4j14j3k4jk43')
Out[18]: ['3', '4', '3', '4', '3', '4', '1', '4', '3', '4', '4', '3']
    
In [19]: type(pattern)
Out[19]: re.Pattern
    
# Every function covered so far is also available as a method of the compiled pattern
pattern.search()
pattern.match()
pattern.findall()
pattern.finditer()

# Demo 2: flags can be passed to compile as well
import re


base_html = """
<A href="http://www.baidu.com">百度</A>
<A href="http://www.taobao.com">淘宝</a>
<a href="http://3.cn">京东</A>
"""

pattern = re.compile(r'<a href="(?P<link>.*?)">(?P<title>.*?)</a>', re.S|re.I)
result = pattern.finditer(base_html)

for line in result:
    title = line.group('title')
    link = line.group('link')
    print(f"{title}'s official site is {link}")

# 百度's official site is http://www.baidu.com
# 淘宝's official site is http://www.taobao.com
# 京东's official site is http://3.cn
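The typical pay-off of compile is applying one pattern to many inputs. A minimal sketch with invented log lines:

```python
import re

digit_pattern = re.compile(r"\d+")  # compiled once, reused for every line

lines = ["order 12 shipped", "no numbers here", "items: 3, 45"]  # invented samples
numbers = [digit_pattern.findall(line) for line in lines]
masked = [digit_pattern.sub("#", line) for line in lines]

print(numbers)  # [['12'], [], ['3', '45']]
print(masked)   # ['order # shipped', 'no numbers here', 'items: #, #']
```

The re module also keeps an internal cache of recently used patterns, so for a handful of patterns the gain is mostly in readability rather than speed.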

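Regex in practice

Advice for scraping:

Prefer tools such as XPath or bs4 for extracting data

When those tools are awkward for the job, fall back on regular expressions; data embedded in a page's JavaScript code is also a good fit for regex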

import re
import base64
from pathlib import Path
from threading import Lock
from concurrent.futures import ThreadPoolExecutor

import requests
from fake_useragent import UserAgent
from openpyxl import Workbook


# Build request headers with a random User-Agent
def build_headers():
    return {'User-Agent': UserAgent().random}


# Send the request; return the page text, or None if the request fails
def get_request(url):
    try:
        response = requests.get(url=url, headers=build_headers(), timeout=10)
    except requests.RequestException:
        return None
    response.encoding = 'utf-8'
    if response.ok:
        return response.text
    return None


BASE_DIR = Path(__file__).parent
result_path = BASE_DIR / 'fund_results'
result_path.mkdir(parents=True, exist_ok=True)
wb = Workbook()
ws = wb.active
ws.title = 'All results'
ws.append(['Product name', 'Manager', 'Risk rating', 'Minimum subscription amount', 'Disclosure info'])
ws_lock = Lock()  # callbacks run in worker threads, so guard writes to the sheet

# One compiled pattern per column: th1..th5 hold the product name, manager,
# risk rating, minimum subscription amount, and disclosure info
FIELD_PATTERNS = [
    re.compile(rf'<span class="th{i}" value=".*?">(.*?)</span>', re.S)
    for i in range(1, 6)
]


# The base URL is stored base64-encoded; decode it before use
def decode_base_url():
    base_url = b'aHR0cDovL3d3dy5jcy5lY2l0aWMuY29tL25ld3NpdGUvY3B6eC9qcmNweHhncy96Z2NwL2luZGV4'
    return base64.b64decode(base_url).decode()


# Extract the rows from one page and append them to the worksheet
def fetch_data(future):
    content = future.result()
    if content is None:  # the request failed, nothing to parse
        return
    columns = [pattern.findall(content) for pattern in FIELD_PATTERNS]
    with ws_lock:
        for row in zip(*columns):
            ws.append(list(row))


# Entry point
def main():
    base_url = decode_base_url()
    with ThreadPoolExecutor(10) as pool:
        for page in range(1, 104):
            # Page 1 has no suffix; page n (n > 1) is "<base>_<n-1>.html"
            url = base_url if page == 1 else f'{base_url}_{page - 1}.html'
            pool.submit(get_request, url).add_done_callback(fetch_data)

    # Leaving the with-block waits for all tasks, so every row is written before saving
    wb.save(str(result_path / 'results.xlsx'))


if __name__ == '__main__':
    main()
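The extraction step can be checked offline before any request is sent. The sketch below is an assumption-laden stand-in: SAMPLE_HTML is hypothetical markup modelled only on the span class="th1" to "th5" patterns in the script, not on the real page, and parse_rows is an illustrative helper name.

```python
import re

# Hypothetical markup: two products, five columns each (th1..th5).
SAMPLE_HTML = """
<ul>
  <li><span class="th1" value="1">Fund A</span><span class="th2" value="1">Manager A</span>
      <span class="th3" value="1">R3</span><span class="th4" value="1">1000</span>
      <span class="th5" value="1">Notice A</span></li>
  <li><span class="th1" value="2">Fund B</span><span class="th2" value="2">Manager B</span>
      <span class="th3" value="2">R2</span><span class="th4" value="2">5000</span>
      <span class="th5" value="2">Notice B</span></li>
</ul>
"""

# One compiled pattern per column, built from the class name.
FIELD_PATTERNS = [
    re.compile(rf'<span class="th{i}" value=".*?">(.*?)</span>', re.S)
    for i in range(1, 6)
]


def parse_rows(html):
    """Return one tuple per product by zipping the five column lists together."""
    columns = [pattern.findall(html) for pattern in FIELD_PATTERNS]
    return [tuple(cell.strip() for cell in row) for row in zip(*columns)]


rows = parse_rows(SAMPLE_HTML)
print(rows)
# [('Fund A', 'Manager A', 'R3', '1000', 'Notice A'), ('Fund B', 'Manager B', 'R2', '5000', 'Notice B')]
```

Because zip stops at the shortest column, a page where one field is missing for some product would silently misalign or drop rows; comparing the column lengths first is a cheap safeguard.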

From: https://www.cnblogs.com/ccsvip/p/18097643
