arxiv
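I. About arxiv.py
arxiv.py is a Python wrapper for the arXiv API.
arXiv API: https://info.arxiv.org/help/api/index.html
arXiv is a project of the Cornell University Library that provides open access to more than one million articles in physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics.
Installation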
pip install arxiv
Importing in Python
import arxiv
II. Usage examples
1. Fetching results
import arxiv
# Construct the default API client.
client = arxiv.Client()
# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query="quantum",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
results = client.results(search)
# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])
# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)
# Search for the paper with ID "1605.08386v1".
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse the client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
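The advanced query above combines field prefixes (au: for author, ti: for title) with AND. Such strings can also be assembled in code. The following is only a minimal sketch: build_query and FIELD_PREFIXES are hypothetical names, not part of arxiv.py, and the table covers just a few of the prefixes described in the arXiv API User Manual.

```python
# Hypothetical helper (not part of arxiv.py): joins field-prefixed terms with AND.
FIELD_PREFIXES = {
    "title": "ti",
    "author": "au",
    "abstract": "abs",
    "category": "cat",
    "all": "all",
}


def build_query(**terms: str) -> str:
    """Build a query string such as 'au:del_maestro AND ti:checkerboard'."""
    parts = []
    for field, value in terms.items():
        if field not in FIELD_PREFIXES:
            raise ValueError(f"unsupported field: {field!r}")
        # Quote multi-word values so they are treated as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{FIELD_PREFIXES[field]}:{value}")
    return " AND ".join(parts)


query = build_query(author="del_maestro", title="checkerboard")
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string can be passed as arxiv.Search(query=query).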
2. Downloading papers
To download the PDF of the paper with ID "1605.08386v1", run a Search and then call Result.download_pdf():
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")
The same interface can be used to download a .tar.gz archive of the paper's source:
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
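The examples that pass dirpath="./mydir" probably need that directory to exist before the download starts. The sketch below assumes the download methods do not create it themselves, which I have not verified for every version. download_pdf_to and demo are hypothetical names, not part of arxiv.py.

```python
import os


def download_pdf_to(paper, dirpath: str, filename: str):
    """Create dirpath if needed, then delegate to paper.download_pdf().

    `paper` is expected to be an arxiv.Result; any object exposing a compatible
    download_pdf(dirpath=..., filename=...) method works as well.
    """
    os.makedirs(dirpath, exist_ok=True)  # no error if the directory already exists
    # Pass through whatever download_pdf returns.
    return paper.download_pdf(dirpath=dirpath, filename=filename)


def demo() -> None:
    """Network-dependent usage; requires `pip install arxiv`."""
    import arxiv

    paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
    print(download_pdf_to(paper, "./mydir", "downloaded-paper.pdf"))
```

Calling demo() performs the actual download. The same pattern should apply to download_source.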
3. Fetching results with a custom client
import arxiv
big_slow_client = arxiv.Client(
    page_size=1000,
    delay_seconds=10.0,
    num_retries=5,
)
# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
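To get a feel for the trade-off between page_size and delay_seconds, here is a rough back-of-the-envelope calculation. It assumes the client waits delay_seconds between consecutive page requests and ignores retries and network time. estimate_requests is a hypothetical helper, not part of arxiv.py.

```python
import math
from typing import Tuple


def estimate_requests(
    total_results: int, page_size: int, delay_seconds: float
) -> Tuple[int, float]:
    """Return (number of page requests, minimum seconds spent waiting between them)."""
    requests = math.ceil(total_results / page_size)
    # Assumption: one wait between each pair of consecutive requests,
    # so n requests incur n - 1 waits.
    waiting = max(requests - 1, 0) * delay_seconds
    return requests, waiting


# Settings of big_slow_client above, for 2500 results.
print(estimate_requests(2500, page_size=1000, delay_seconds=10.0))  # (3, 20.0)
# A smaller page size with a shorter delay, for comparison.
print(estimate_requests(2500, page_size=100, delay_seconds=3.0))  # (25, 72.0)
```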
4. Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
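Setting the root logger to DEBUG also surfaces messages from urllib3, as the output above shows. If only the package's own messages are of interest, one option is to raise verbosity for the "arxiv" logger alone. This is a sketch that assumes the package logs under names starting with "arxiv", as the "arxiv.arxiv" prefix in the output above suggests.

```python
import logging

# Keep other loggers (for example urllib3) at WARNING...
logging.basicConfig(level=logging.WARNING)
# ...and enable DEBUG only for loggers in the "arxiv" hierarchy.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)
```

Child loggers such as "arxiv.arxiv" inherit the level from "arxiv" unless they set their own.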
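III. Type reference
1. Client
Client specifies a reusable strategy for fetching results from the arXiv API.
For most use cases, the default client is sufficient.
The client configuration specifies pagination and retry logic.
Reusing a client lets successive API calls share the same connection pool and ensures that they respect the rate limit you have set.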
2. Search
Search specifies a search of the arXiv database.
Use Client.results to obtain a generator of Result objects.
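Because Client.results returns a generator, results can be consumed lazily and iteration can stop early. The sketch below uses itertools.islice for this. take is a hypothetical helper, and fake_results is a stand-in that yields placeholder titles so the example runs without network access.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(results: Iterable[T], n: int) -> List[T]:
    """Return at most n items, consuming no more of the iterable than needed."""
    return list(islice(results, n))


def fake_results() -> Iterator[str]:
    # Stand-in for client.results(search); yields placeholder titles forever.
    i = 0
    while True:
        yield f"paper-{i}"
        i += 1


print(take(fake_results(), 3))  # ['paper-0', 'paper-1', 'paper-2']
```

With a real client this would read take(client.results(search), 3). Setting max_results on the Search remains the more direct way to cap a query.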
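3. Result
The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading its content.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.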
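As an illustration of the metadata a Result carries, the sketch below formats a one-line summary from a few attributes: title, authors (each with a name), published, and pdf_url, as I understand them from the package. summarize is a hypothetical helper, and the example object holds made-up values so that the code runs offline.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def summarize(result) -> str:
    """Format one line of metadata from an arxiv.Result-like object."""
    authors = ", ".join(author.name for author in result.authors)
    return f"{result.published:%Y-%m-%d} | {result.title} | {authors} | {result.pdf_url}"


# Stand-in object with made-up values, not a real paper.
example = SimpleNamespace(
    title="An Example Title",
    authors=[SimpleNamespace(name="A. Author"), SimpleNamespace(name="B. Author")],
    published=datetime(2020, 1, 2, tzinfo=timezone.utc),
    pdf_url="https://example.org/paper.pdf",
)
print(summarize(example))
```

With a real client, summarize(r) can be called on each r yielded by client.results(search).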
2024-03-28 (Thursday)
From: https://blog.csdn.net/lovechris00/article/details/137098632