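Building a visual interface was too much trouble, so I skipped it and used plain file processing instead. Two files are needed:
1. readUrl.txt holds the text that contains the URLs to be resolved
2. newUrl.txt holds the resolved URLs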
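To make the input format concrete, here is a minimal sketch that writes a sample readUrl.txt and reads it back. The sample text, the example.com/example.org addresses, and the helper names `write_sample` and `read_all` are all assumptions for illustration; the original post does not show the file's contents.

```python
import os
import tempfile

# Hypothetical sample input: free text with URLs embedded in it.
SAMPLE_TEXT = (
    "Check this out https://example.com/item?id=1 great deal\n"
    "Also see http://example.org/page\n"
)


def write_sample(path, text=SAMPLE_TEXT):
    """Write the sample text to `path` as UTF-8 (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_all(path):
    """Read the whole file back as a single UTF-8 string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # A temporary directory is used so no real file is overwritten.
    path = os.path.join(tempfile.mkdtemp(), "readUrl.txt")
    write_sample(path)
    print(read_all(path))
```

The input does not need any particular layout; the URLs only have to appear somewhere in the text.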
Code example:
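import re

import requests

# Read the raw text that contains the URLs to be resolved.
with open("readUrl.txt", "r", encoding="utf-8") as file:
    strList = file.read()

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

# Raw string so that \( and \) reach the regex engine unchanged.
rep = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
listUrl = re.findall(rep, strList)

# Remove duplicates while keeping the order of first appearance.
list_not_dup = list()
for i in listUrl:
    if i not in list_not_dup:
        list_not_dup.append(i)

for item in list_not_dup:
    print(item)

# Request each URL, take the final address after redirects,
# and drop everything from the first "?" onward.
strUrl = ""
for item in list_not_dup:
    # headers must be passed by keyword; the second positional argument is params.
    final_url = requests.get(item, headers=headers).url
    result = final_url.split("?")
    strUrl += result[0] + "\n"

# Write one resolved URL per line.
with open("newUrl.txt", "w", encoding="utf-8") as file:
    file.write(strUrl)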
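The de-duplication and query-string steps in the script can also be written with the standard library alone. The following is a sketch of that alternative, not the author's method; the helper names `dedupe_keep_order` and `strip_query` are hypothetical, and the example.com addresses are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each.

    dict keys preserve insertion order, so this keeps the original ordering.
    """
    return list(dict.fromkeys(items))


def strip_query(url):
    """Return `url` without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


if __name__ == "__main__":
    urls = [
        "https://example.com/a/b?x=1&y=2",
        "https://example.com/c",
        "https://example.com/a/b?x=1&y=2",
    ]
    for url in dedupe_keep_order(urls):
        print(strip_query(url))
```

One difference from `split("?")[0]`: `strip_query` also removes a `#fragment` when the URL has no query string, which the plain split would leave in place.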
Recommended regex for extracting web page URLs
"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
From: https://blog.51cto.com/laoshifu/5950775
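As a quick check of what the pattern matches, here is a small sketch that runs it over two sample strings. The `find_urls` helper and the sample text are assumptions for illustration only.

```python
import re

# Same pattern as above, written as a raw string so the backslashes
# are passed to the regex engine unchanged.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def find_urls(text):
    """Return every substring of `text` that the pattern matches, in order."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "first https://example.com/a?x=1 second http://example.org/b/c third"
    print(find_urls(sample))
    # → ['https://example.com/a?x=1', 'http://example.org/b/c']

    # Trailing punctuation is part of the match.
    print(find_urls("see https://example.com/a."))
    # → ['https://example.com/a.']
```

Inside the character class, `$-_` is a range running from `$` to `_`, so it also covers `/`, `?`, `=`, `:`, `,` and `.`. A match therefore stops at whitespace but keeps a trailing comma or period; if the source text ends sentences right after a URL, something like `url.rstrip(".,")` may be needed afterwards.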