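Being able to pull well-formed strings out of a large mass of characters is one of the basic skills in Python. It matters most when scraping web pages with Python, where the way you extract strings largely determines whether the scrape succeeds and how efficient it is. Here I share a method that extracts URL data with just three lines of code.
Below is the source data. The code that follows assumes it has been assigned to a variable named wd: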
"<div style='display:none'><a href='../../../n30888572/n31109385/n31125884/index.html'></a>\
<a href='../../../n30888572/n31109385/n31125884/index_31131703_2.html'></a><a href='../../../n30\
888572/n31109385/n31125884/index_31131703_3.html'></a><a href='../../../n30888572/n31109385/n3112\
5884/index_31131703_4.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_5.\
html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_6.html'></a><a href='../\
../../n30888572/n31109385/n31125884/index_31131703_7.html'></a><a href='../../../n30888572/n31109\
385/n31125884/index_31131703_8.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31\
131703_9.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_10.html'></a><a \
href='../../../n30888572/n31109385/n31125884/index_31131703_11.html'></a><a href='../../../n30888\
572/n31109385/n31125884/index_31131703_12.html'></a><a href='../../../n30888572/n31109385/n311258\
84/index_31131703_13.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_14.\
html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_15.html'></a><a href='..\
/../../n30888572/n31109385/n31125884/index_31131703_16.html'></a><a href='../../../n30888572/n311\
09385/n31125884/index_31131703_17.html'></a><a href='../../../n30888572/n31109385/n31125884/index\
_31131703_18.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_19.html'></\
a><a href='../../../n30888572/n31109385/n31125884/index_31131703_20.html'></a><a href='../../../n\
30888572/n31109385/n31125884/index_31131703_21.html'></a><a href='../../../n30888572/n31109385/n3\
1125884/index_31131703_22.html'></a><a href='../../../n30888572/n31109385/n31125884/index_3113170\
3_23.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_24.html'></a><a hre\
f='../../../n30888572/n31109385/n31125884/index_31131703_25.html'></a><a href='../../../n30888572\
/n31109385/n31125884/index_31131703_26.html'></a><a href='../../../n30888572/n31109385/n31125884/\
index_31131703_27.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_28.htm\
l'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_29.html'></a><a href='../..\
/../n30888572/n31109385/n31125884/index_31131703_30.html'></a><a href='../../../n30888572/n311093\
85/n31125884/index_31131703_31.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31\
131703_32.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_33.html'></a><\
a href='../../../n30888572/n31109385/n31125884/index_31131703_34.html'></a><a href='../../../n308\
88572/n31109385/n31125884/index_31131703_35.html'></a><a href='../../../n30888572/n31109385/n3112\
5884/index_31131703_36.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_3\
7.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_38.html'></a><a href='\
../../../n30888572/n31109385/n31125884/index_31131703_39.html'></a><a href='../../../n30888572/n3\
1109385/n31125884/index_31131703_40.html'></a><a href='../../../n30888572/n31109385/n31125884/ind\
ex_31131703_41.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_42.html'>\
</a><a href='../../../n30888572/n31109385/n31125884/index_31131703_43.html'></a><a href='../../..\
/n30888572/n31109385/n31125884/index_31131703_44.html'></a><a href='../../../n30888572/n31109385/\
n31125884/index_31131703_45.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131\
703_46.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_47.html'></a><a h\
ref='../../../n30888572/n31109385/n31125884/index_31131703_48.html'></a><a href='../../../n308885\
72/n31109385/n31125884/index_31131703_49.html'></a><a href='../../../n30888572/n31109385/n3112588\
4/index_31131703_50.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_51.h\
tml'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_52.html'></a><a href='../\
../../n30888572/n31109385/n31125884/index_31131703_53.html'></a><a href='../../../n30888572/n3110\
9385/n31125884/index_31131703_54.html'></a><a href='../../../n30888572/n31109385/n31125884/index_\
31131703_55.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_56.html'></a\
><a href='../../../n30888572/n31109385/n31125884/index_31131703_57.html'></a><a href='../../../n3\
0888572/n31109385/n31125884/index_31131703_58.html'></a><a href='../../../n30888572/n31109385/n31\
125884/index_31131703_59.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703\
_60.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_61.html'></a><a href\
='../../../n30888572/n31109385/n31125884/index_31131703_62.html'></a><a href='../../../n30888572/\
n31109385/n31125884/index_31131703_63.html'></a><a href='../../../n30888572/n31109385/n31125884/i\
ndex_31131703_64.html'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_65.html\
'></a><a href='../../../n30888572/n31109385/n31125884/index_31131703_66.html'></a><a href='../../\
../n30888572/n31109385/n31125884/index_31131703_67.html'></a></div>"
We need to extract every piece of data in the source that has the following format:
<a href='../../../n30888572/n31109385/n31125884/index_31131703_52.html'></a>
The key to extracting strings cleanly is choosing the split point. Here we use <a as the delimiter:
ff = wd.split('<a')
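To make the effect of the split concrete, here is a minimal sketch run on a shortened sample. The variable name sample and its two short links are hypothetical stand-ins for the full data source above, not part of the original data.

```python
# Hypothetical, shortened stand-in for the full data source (wd) above.
sample = ("<div style='display:none'>"
          "<a href='a/index.html'></a>"
          "<a href='a/index_2.html'></a></div>")

# str.split removes the delimiter itself, so no piece starts with '<a'.
parts = sample.split('<a')

for p in parts:
    print(repr(p))
```

The first piece is whatever came before the first <a (here the opening div tag), and the last piece still carries the trailing </div>. Every piece after the first has lost its leading <a, which is why the loop has to put it back.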
Loop over the pieces and print each extracted string:
for i in ff:
    # split() removed the '<a' delimiter, so put it back in front of each piece
    nf = '<a' + i
    print(nf)
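Run the code to print the extracted results (to save space, a few dozen lines in the middle of the output are omitted). The first and last lines do not match the target format. Handle them yourself as a further step, or leave a note in the comments.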
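One possible way to handle the first and last lines is sketched below. It is only an illustration under stated assumptions: the function name extract_links and the shortened sample string are hypothetical, and the sketch assumes every link in the data ends with </a>, as it does in the source above.

```python
def extract_links(html):
    """Split on '<a', rebuild each anchor, and trim anything after '</a>'."""
    links = []
    # [1:] skips the piece before the first '<a' (the opening <div ...> tag).
    for piece in html.split('<a')[1:]:
        end = piece.find('</a>')
        if end == -1:
            continue  # no closing tag in this piece, so it is not a full link
        # Keep the text up to and including '</a>', dropping e.g. '</div>'.
        links.append('<a' + piece[:end + len('</a>')])
    return links


# Hypothetical, shortened stand-in for the full data source (wd) above.
sample = ("<div style='display:none'>"
          "<a href='a/index.html'></a>"
          "<a href='a/index_2.html'></a></div>")

for link in extract_links(sample):
    print(link)
```

Note that splitting on <a also splits on any other tag whose name begins with "a". That is not a problem for this data, which contains only div and a tags, but it is worth checking before reusing the approach on other pages.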