Practical tips for Python's requests library


Python's requests is arguably the most widely used and most pleasant HTTP library in the Python ecosystem. Compared with the older urllib and httplib modules, requests offers a friendlier API and a much gentler learning curve.

show me the code

Making an HTTP request with requests takes only a couple of lines:

import requests

url = "https://www.baidu.com"
resp = requests.get(url)

While providing these high-level conveniences, requests also exposes some useful request hooks that help us deal with problems arising during a request.

request hooks

When calling a third-party API, we usually need to check whether the response is valid, e.g. whether it returned a 4xx or 5xx error. We could do it like this:

import requests

url = "https://www.baidu.com"
resp = requests.get(url)

if resp.status_code >= 400:
    # handle the error here
    ...

Alternatively, we can use the raise_for_status method provided by requests:

import requests

url = "https://www.baidu.com"
resp = requests.get(url)

resp.raise_for_status()

In the example above, raise_for_status() raises an exception whenever the status code is 4xx or 5xx. Compared with the first example, using raise_for_status() is more elegant, but it brings another problem: we can't realistically call resp.raise_for_status() at every call site. Fortunately, requests provides a hook interface that lets us solve this once and for all:

import requests

http = requests.Session()

assert_status_hook = lambda response, *args, **kwargs: response.raise_for_status()
http.hooks["response"] = [assert_status_hook]

url = "https://www.baidu.com"
http.get(url)

With that, the code looks much cleaner.

Setting a base_url for requests

Suppose we are calling an API service such as api.example.com; we would have to spell out the domain plus the URI on every request, e.g.:

requests.get("https://api.example.com/api/v1/user")
requests.get("https://api.example.com/api/v1/goods")

If you don't want to repeat the host every time, you can use BaseUrlSession:

from requests_toolbelt import sessions
# note the trailing slash on base_url and the relative (no leading slash) paths:
# BaseUrlSession joins URLs with urljoin, so "/user" would discard "/api/v1"
http = sessions.BaseUrlSession(base_url="https://api.example.com/api/v1/")
http.get("user")
http.get("goods")

Note that requests_toolbelt is not part of requests itself, so it has to be installed separately with pip.
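Under the hood, BaseUrlSession joins the base URL and the path with the standard library's urljoin, which has a classic gotcha worth knowing: a path with a leading slash replaces the base URL's path entirely. A quick sketch with plain urljoin:

```python
from urllib.parse import urljoin

base = "https://api.example.com/api/v1/"

# a relative path is appended to the base path
print(urljoin(base, "user"))   # https://api.example.com/api/v1/user

# a leading slash discards the base path entirely
print(urljoin(base, "/user"))  # https://api.example.com/user
```

This is why the base_url should end with a slash and the per-request paths should not start with one.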

Setting timeouts

The official requests documentation recommends setting a timeout in all production code that uses the library. Without one, if the remote server blocks, our program will hang and wait for a result indefinitely, which can stall the whole system and seriously hurt its availability.

Adding a timeout in requests is trivial:

requests.get('https://github.com/', timeout=0.001)
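Beyond a single number, timeout also accepts a (connect, read) tuple to bound the connection and read phases separately, and timeouts surface as catchable exceptions. A minimal sketch (the URL is only an example, and the request is wrapped in a function so nothing is fetched at import time):

```python
import requests

def fetch(url):
    try:
        # 3.05 s to establish the connection, 10 s to wait for data
        return requests.get(url, timeout=(3.05, 10))
    except requests.exceptions.Timeout:
        # ConnectTimeout and ReadTimeout are both subclasses of Timeout
        return None
```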

But this still leaves a trap: what if a new teammate forgets to add the timeout somewhere in the code...

Transport Adapters

Fortunately, requests provides transport adapters, which let us attach a default timeout to every request made through a session:

from requests.adapters import HTTPAdapter

DEFAULT_TIMEOUT = 5 # seconds

class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        # allow the default to be overridden at construction time
        self.timeout = kwargs.pop("timeout", DEFAULT_TIMEOUT)
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # fall back to the adapter's default unless a timeout was passed explicitly
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)

We can then use it like this:

import requests

http = requests.Session()

# mount the adapter for both http and https
adapter = TimeoutHTTPAdapter(timeout=2.5)
http.mount("https://", adapter)
http.mount("http://", adapter)

# uses the default 2.5 s timeout
response = http.get("https://www.baidu.com/")

# override with a 10-second timeout for this request
response = http.get("https://www.baidu.com/", timeout=10)

Problem solved!

Retries

When writing a crawler, an unstable network often makes individual requests fail, so we need a retry mechanism. We can attach a retry strategy to every request through an HTTPAdapter:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    method_whitelist=["HEAD", "GET", "OPTIONS"]  # renamed to allowed_methods in urllib3 >= 1.26
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

response = http.get("https://www.baidu.com")

Retry takes three parameters here, with the following meanings:

1. total
total is the maximum number of retries; once it is exceeded and the request still fails, a `urllib3.exceptions.MaxRetryError` exception is raised.

2. status_forcelist
status_forcelist lists the status codes that trigger a retry; any other error code bypasses the retry mechanism.

3. method_whitelist
method_whitelist lists the HTTP methods that may be retried; other methods, such as POST, skip the retry policy. This is because POST is not idempotent, so blindly retrying it could produce nondeterministic results.

There is one more parameter, backoff_factor, which sets the wait time before each retry. An increasing sequence is usually recommended; urllib3 computes the sleep as:

{backoff factor} * (2 ** ({number of retries so far} - 1))

For example, with backoff_factor=1 the retry intervals are 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256; with backoff_factor=2 they are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512; with backoff_factor=10 they are 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560.
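The formula can be sketched as a small helper to preview the wait times for a given factor (note that urllib3 also caps each individual sleep, at 120 seconds by default):

```python
def backoff_delays(backoff_factor, retries):
    """Sleep before the n-th retry: backoff_factor * 2 ** (n - 1)."""
    return [backoff_factor * (2 ** (n - 1)) for n in range(retries)]

print(backoff_delays(1, 5))   # [0.5, 1, 2, 4, 8]
print(backoff_delays(10, 5))  # [5.0, 10, 20, 40, 80]
```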

Putting it all together

Combining the retry adapter with the timeout adapter is also straightforward:

retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
http.mount("https://", TimeoutHTTPAdapter(max_retries=retries))
http.mount("http://", TimeoutHTTPAdapter(max_retries=retries))
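As a self-contained sketch (restating the TimeoutHTTPAdapter from the timeout section, and leaving out the method whitelist to stay compatible across urllib3 versions):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_TIMEOUT = 5  # seconds

class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.timeout = kwargs.pop("timeout", DEFAULT_TIMEOUT)
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)

retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
adapter = TimeoutHTTPAdapter(timeout=2.5, max_retries=retries)

http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
```

Every request through this session now gets both a default 2.5 s timeout and the retry policy.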

Debugging HTTP requests

We can print extra debugging information by raising the debuglevel of the standard library's http.client, e.g.:

import requests
import http

http.client.HTTPConnection.debuglevel = 1

requests.get("https://www.baidu.com/")

which produces output like:

send: b'GET / HTTP/1.1\r\nHost: www.baidu.com\r\nUser-Agent: python-requests/2.25.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
header: Connection: keep-alive
header: Content-Encoding: gzip
header: Content-Type: text/html
header: Date: Wed, 13 Jan 2021 09:26:44 GMT
header: Last-Modified: Mon, 23 Jan 2017 13:23:46 GMT
header: Pragma: no-cache
header: Server: bfe/1.0.8.18
header: Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
header: Transfer-Encoding: chunked

If you also want to dump the request parameters and body, you can do this:

import requests
from requests_toolbelt.utils import dump

def logging_hook(response, *args, **kwargs):
    data = dump.dump_all(response)
    print(data.decode('utf-8'))

http = requests.Session()
http.hooks["response"] = [logging_hook]

http.get("https://api.openaq.org/v1/cities", params={"country": "BA"})

which produces output like:

< GET /v1/cities?country=BA HTTP/1.1
< Host: api.openaq.org
< User-Agent: python-requests/2.25.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<

> HTTP/1.1 200 OK
> Content-Type: application/json; charset=utf-8
> Transfer-Encoding: chunked
> Connection: keep-alive
> access-control-allow-credentials: true
> access-control-allow-headers: Authorization, Content-Type, If-None-Match
> access-control-allow-methods: GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS
> access-control-allow-origin: *
> access-control-expose-headers: WWW-Authenticate, Server-Authorization
> access-control-max-age: 86400
> Cache-Control: no-cache
> Content-Encoding: gzip
> Date: Wed, 13 Jan 2021 09:27:53 GMT
> Vary: origin,accept-encoding
> X-Cache: Miss from cloudfront
> Via: 1.1 65866bb6c20ad09669a6cfc294087ec0.cloudfront.net (CloudFront)
> X-Amz-Cf-Pop: NRT57-C2
> X-Amz-Cf-Id: pSZOGLVzQyAh_muhd6MFg7YXzjZhwsc6HOc0cQsgmxwF6hdCX3usSA==
>
{"meta":{"name":"openaq-api","license":"CC BY 4.0","website":"https://docs.openaq.org/","page":1,"limit":100,"found":10},"results":[{"country":"BA","name":"Goražde","city":"Goražde","count":70797,"locations":1},{"country":"BA","name":"Ilijaš","city":"Ilijaš","count":2912,"locations":1},{"country":"BA","name":"Jajce","city":"Jajce","count":62562,"locations":1},{"country":"BA","name":"Kakanj","city":"Kakanj","count":5637,"locations":1},{"country":"BA","name":"Lukavac","city":"Lukavac","count":149534,"locations":1},{"country":"BA","name":"N/A","city":"N/A","count":17428,"locations":1},{"country":"BA","name":"Sarajevo","city":"Sarajevo","count":493627,"locations":8},{"country":"BA","name":"Tuzla","city":"Tuzla","count":413909,"locations":3},{"country":"BA","name":"Zenica","city":"Zenica","count":233517,"locations":4},{"country":"BA","name":"Živinice","city":"Živinice","count":136137,"locations":1}]}

For more usage, see: https://toolbelt.readthedocs.io/en/latest/dumputils.html

Testing

With the responses library, we can easily mock the response of an HTTP request made through requests, e.g.:

import unittest
import requests
import responses


class TestAPI(unittest.TestCase):
    @responses.activate  # intercept HTTP calls within this method
    def test_simple(self):
        response_data = {
                "id": "ch_1GH8so2eZvKYlo2CSMeAfRqt",
                "object": "charge",
                "customer": {"id": "cu_1GGwoc2eZvKYlo2CL2m31GRn", "object": "customer"},
            }
        # mock the Stripe API
        responses.add(
            responses.GET,
            "https://api.stripe.com/v1/charges",
            json=response_data,
        )

        response = requests.get("https://api.stripe.com/v1/charges")
        self.assertEqual(response.json(), response_data)

If the request doesn't match any of the registered mocks, a ConnectionError is raised, e.g.:

class TestAPI(unittest.TestCase):
    @responses.activate
    def test_simple(self):
        responses.add(responses.GET, "https://api.stripe.com/v1/charges")
        response = requests.get("https://invalid-request.com")

This raises:
requests.exceptions.ConnectionError: Connection refused by Responses - the call doesn't match any registered mock.

Request:
- GET https://invalid-request.com/

Available matches:
- GET https://api.stripe.com/v1/charges

Changing the User-Agent

Servers often inspect the client's User-Agent to decide whether a request comes from a crawler, so we can change the UA our requests are sent with, like this:

import requests
http = requests.Session()
http.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
})
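The session-level header applies to every request made through the session, but a single request can still override it with its own headers. This can be verified offline via prepare_request (the URL below is just a placeholder):

```python
import requests

http = requests.Session()
http.headers.update({"User-Agent": "my-crawler/1.0"})

# per-request headers are merged over the session defaults and win on conflict
req = requests.Request("GET", "https://example.com",
                       headers={"User-Agent": "my-crawler/2.0"})
prepped = http.prepare_request(req)
print(prepped.headers["User-Agent"])  # my-crawler/2.0
```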