Topic: evading anti-spider measures. What is an anti-spider system? See # Anti-spider (反爬虫) #

Today we take a quick detour and look at how to evade anti-spider measures. The anti-spider strategies you will typically run into on a website boil down to the following:

  1. Limit the request rate per IP; drop the connection once the limit is exceeded.
  2. Limit the request rate per UA (User-Agent); drop the connection once the limit is exceeded.
  3. Limit the request rate per Session; drop the connection once the limit is exceeded.
  4. Limit the request rate per Cookie; drop the connection once the limit is exceeded.
Notes:
Against the IP limit: slow the crawler down by adding a sleep() before each request, or keep rotating proxy IPs.
Against the UA limit: prepare a large pool of User-Agent strings and pick one at random for each request.
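The sleep() idea can be sketched as a tiny throttle helper (a minimal sketch; the function name and delay bounds here are made up for illustration):

```python
import random
import time

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep for a random interval so requests arrive at an irregular, slower pace."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling polite_delay() before every requests.get(...) keeps the request rate under the site's threshold, and the random jitter avoids the perfectly regular timing that also gives crawlers away.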

Today we will write a download module that handles points 1 and 2. Don't be scared, it really is simple.

Modules used
  • requests (HTTP requests)
  • re (regular expressions)
  • random (random selection)
Approach
  1. Grab valid proxy IPs from a site that publishes them;
  2. Crawl with the local IP first; once the local IP is blocked, switch to a proxy IP;
  3. After a proxy IP fails six times, switch to the next proxy IP.
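The switching policy in steps 2 and 3 can be sketched independently of any real network code (a hypothetical helper for illustration: `fetch` is any callable that returns a response or raises on failure, and `None` means a direct request from the local IP):

```python
import random

def fetch_with_failover(fetch, proxies, max_retries=6):
    """Try the local IP first; after max_retries failures, rotate through proxy IPs."""
    # phase 1: direct requests with the local IP
    for _ in range(max_retries):
        try:
            return fetch(None)
        except Exception:
            continue
    # phase 2: the local IP looks blocked, so try random proxies
    for _ in range(max_retries):
        try:
            return fetch(random.choice(proxies))
        except Exception:
            continue
    raise RuntimeError('all attempts failed')
```

The real module below follows the same shape, with the retry counter threaded through a recursive get() instead of two loops.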

Let's get to it ヽ(●-`Д´-)ノ


Step 1: Check requests' default UA
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.13.0"
  },
  "json": null,
  "origin": "220.200.59.163",
  "url": "http://httpbin.org/post"
}

As you can see, our program requests pages with the UA python-requests/2.13.0, whereas a normal browser request carries a UA like: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0

[Screenshot: a normal browser User-Agent]

If the anti-spider system runs that check, our crawler is caught red-handed and shot on the spot.

So we have to forge the UA of a normal browser, and, to knock out point 2 (the per-UA rate limit) in one go, we also need a large pool of forged UAs so that every request can pick one at random and muddy the waters.

Step 2: Build a UA pool and request with a random UA

To convince the anti-spider system that we are a real browser, and to cope with sites that rate-limit a single User-Agent, we do two things:

  • Build a UA pool
  • Pick a random UA per request

A Google search turns up piles of UA strings; here are the ones we will use for our pool:

"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

Next, let's write an HTTP class. Talk is cheap, show me the code:

# -*- coding:utf-8 -*-
# -----------------------
# Imports
# -----------------------
# HTTP requests
import requests
# regular expressions (used in the next step)
import re
# random selection
import random


# -----------------------
# Class definitions
# -----------------------
# HTTP helper class
class HTTP:

    # constructor
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
            'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
        ]

    # HTTP GET
    # @param url: target URL
    def get(self, url):
        # request headers: pick a random User-Agent to pose as a browser
        headers = {'User-Agent': random.choice(self.user_agents)}
        # send the request and collect the response
        response = requests.get(url, headers=headers)
        # return the result
        return response

Now give it a try: request http://httpbin.org/get again, and you will see the browser UA has changed. 🙂

OK, one task remains: beating the per-IP rate limit.

Step 3: Build a proxy IP pool and switch strategies

First we need a site that publishes proxy IPs; 快代理 (kuaidaili.com) is a good one.

# -*- coding:utf-8 -*-
# -----------------------
# Imports
# -----------------------
# HTTP requests
import requests
# regular expressions
import re


# -----------------------
# Build the pool
# -----------------------
# proxy IP pool
ip_pool = []

# walk the pages that publish proxy IPs
for page in range(1, 10 + 1):
    # page HTML
    html = requests.get("http://www.kuaidaili.com/proxylist/" + str(page)).text
    # extract (IP, port) pairs (re.S makes '.' match newlines; findall returns a list)
    data = re.findall(r'data-title="IP">(.*?)</td>\s+<td data-title="PORT">(.*?)</td>', html, re.S)

    # add them to the pool
    for ip in data:
        ip_pool.append(ip[0] + ':' + ip[1])

print(ip_pool)

Output:

[Screenshot: the scraped proxy IP pool]

Wonderful! Now let's merge that code into the HTTP class, tidy things up, and add proxy support:

# -*- coding:utf-8 -*-
# -----------------------
# Imports
# -----------------------
# HTTP requests
import requests
# regular expressions
import re
# random selection
import random


# -----------------------
# Class definitions
# -----------------------
# HTTP helper class
class HTTP:

    # constructor
    def __init__(self):
        # initialize the IP pool
        self.__init_ips()
        # initialize the UA pool
        self.__init_user_agents()

    # initialize the IP pool
    def __init_ips(self):
        # IP pool
        self.ips = []

        # walk the pages that publish proxy IPs
        for page in range(1, 10 + 1):
            # page HTML
            html = requests.get("http://www.kuaidaili.com/proxylist/" + str(page)).text
            # extract (IP, port) pairs (re.S makes '.' match newlines; findall returns a list)
            data = re.findall(r'data-title="IP">(.*?)</td>\s+<td data-title="PORT">(.*?)</td>', html, re.S)

            # add them to the pool
            for ip in data:
                self.ips.append(ip[0] + ':' + ip[1])

        # debug log
        # print(self.ips)

    # initialize the UA pool
    def __init_user_agents(self):
        # UA pool
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
            'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
        ]

        # debug log
        # print(self.user_agents)

    # HTTP GET
    # @param url: target URL
    # @param proxy: proxy switch (None = direct request)
    def get(self, url, proxy=None):
        # request headers: pick a random User-Agent to pose as a browser
        headers = {'User-Agent': random.choice(self.user_agents)}

        # decide whether to go through a proxy
        if proxy is None:
            ### direct request
            # send the request and collect the response
            response = requests.get(url, headers=headers)
            # return the result
            return response
        else:
            ### proxied request
            # pick a random proxy IP
            IP = str(random.choice(self.ips)).strip()
            # build the proxies mapping
            proxies = {'http': IP}
            # send the request through the proxy
            response = requests.get(url, headers=headers, proxies=proxies)
            # return the result
            return response

if __name__ == '__main__':
    # instantiate the class
    instance = HTTP()

    # print the response headers
    try:
        print(instance.get('http://mzitu.com', True).headers)
    except Exception:
        # error hint
        print('ERROR OCCURRED! MAYBE THE PROXY FAILED.')

Output:

{'Date': 'Fri, 26 May 2017 12:11:56 GMT', 'Content-Type': 'text/html; charset=iso-8859-1', 'Content-Length': '380', 'Connection': 'Keep-Alive', 'Keep-Alive': 'timeout=5, max=100'}

Now let's decide when a proxy is actually needed: after how many failures do we switch to proxy crawling, and after how many more do we give the proxy up again? Let's rework the code like this:

# -*- coding:utf-8 -*-
# -----------------------
# Imports
# -----------------------
# HTTP requests
import requests
# regular expressions
import re
# random selection
import random
# time (for retry delays)
import time


# -----------------------
# Class definitions
# -----------------------
# HTTP helper class
class HTTP:

    # constructor
    def __init__(self):
        # initialize the IP pool
        self.__init_ips()
        # initialize the UA pool
        self.__init_user_agents()
        # initialize the configuration
        self.__init_configs()

    # initialize the IP pool
    def __init_ips(self):
        # toggle: use a cached IP pool (True) or scrape a fresh one (False)
        if True:
            ### use the cached pool
            self.ips = [u'218.56.132.158:8080', u'114.99.5.199:9000', u'117.90.7.211:9000', u'121.232.147.234:9000', u'121.15.170.171:8080', u'116.226.69.141:9797', u'119.254.84.90:80', u'117.90.2.76:9000', u'121.232.147.236:9000', u'124.193.85.88:8080', u'60.160.128.10:9999', u'221.230.7.54:9000', u'117.90.2.68:9000', u'117.90.2.73:9000', u'14.146.147.140:8998', u'211.138.187.121:8998', u'121.232.144.140:9000', u'113.242.141.33:8998', u'183.27.218.108:8998', u'222.181.24.36:8998', u'121.232.144.230:9000', u'117.90.3.164:9000', u'121.232.146.130:9000', u'121.232.144.154:9000', u'221.239.81.83:8118', u'121.232.145.16:9000', u'182.87.39.44:9000', u'125.67.72.196:9000', u'183.196.168.194:9000', u'111.13.7.42:80', u'113.200.245.158:9999', u'171.217.112.245:9797', u'120.204.85.29:3128', u'121.232.147.184:9000', u'14.109.115.85:8998', u'222.188.96.181:8998', u'125.93.149.47:9000', u'121.232.144.143:9000', u'121.232.146.170:9000', u'182.85.60.90:9000', u'45.221.216.27:8080', u'121.232.144.47:9000', u'125.85.187.194:8998', u'117.90.5.0:9000', u'121.232.145.174:9000', u'125.117.133.215:9000', u'119.135.185.99:9797', u'121.232.144.249:9000', u'117.90.2.36:9000', u'123.7.177.20:9999', u'122.193.14.106:82', u'183.240.87.229:8080', u'218.104.148.157:8080', u'117.90.1.254:9000', u'117.90.3.119:9000', u'222.221.24.21:8998', u'121.232.145.116:9000', u'121.232.145.145:9000', u'112.33.7.9:8081', u'117.90.3.12:9000', u'117.90.1.35:9000', u'110.73.11.196:8123', u'117.28.242.189:80', u'219.149.59.250:9797', u'43.241.225.147:3128', u'121.232.145.200:9000', u'171.92.53.59:9000', u'171.215.226.193:9000', u'121.232.145.53:9000', u'202.194.44.245:8998', u'121.232.147.211:9000', u'58.67.159.50:80', u'61.160.254.23:23', u'121.232.144.150:9000', u'117.90.2.163:9000', u'202.206.244.12:8998', u'121.232.144.116:9000', u'163.125.159.236:8888', u'117.90.5.187:9000', u'122.226.62.90:3128', u'117.90.2.112:9000', u'121.232.144.241:9000', u'121.232.144.253:9000', u'211.68.236.82:8998', 
u'113.140.25.4:81', u'59.78.47.184:1080', u'117.90.3.209:9000', u'122.96.59.107:83', u'121.232.144.235:9000', u'117.90.2.100:9000', u'117.90.0.200:9000', u'121.232.144.77:9000', u'27.46.32.128:9797', u'122.96.59.105:83', u'121.232.144.138:9000', u'60.178.13.13:8081', u'125.67.72.106:9000', u'222.85.127.130:9797', u'183.31.248.158:9999', u'117.90.7.85:9000']
        else:
            ### scrape a fresh pool
            # IP pool
            self.ips = []

            # walk the pages that publish proxy IPs
            for page in range(1, 10 + 1):
                # page HTML
                html = requests.get("http://www.kuaidaili.com/proxylist/" + str(page)).text
                # extract (IP, port) pairs (re.S makes '.' match newlines; findall returns a list)
                data = re.findall(r'data-title="IP">(.*?)</td>\s+<td data-title="PORT">(.*?)</td>', html, re.S)

                # add them to the pool
                for ip in data:
                    self.ips.append(ip[0] + ':' + ip[1])

        # debug log
        # print(self.ips)

    # initialize the UA pool
    def __init_user_agents(self):
        # UA pool
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
            'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
        ]

        # debug log
        # print(self.user_agents)

    # initialize the configuration
    def __init_configs(self):
        # seconds to wait between retries
        self.retry_interval = 10

    # HTTP GET
    # @param url: target URL
    # @param timeout: request timeout in seconds
    # @param proxy: proxy switch (None = direct request)
    # @param retry: remaining retries
    def get(self, url, timeout=3, proxy=None, retry=6):
        # request headers: pick a random User-Agent to pose as a browser
        headers = {'User-Agent': random.choice(self.user_agents)}

        # decide whether to go through a proxy
        if proxy is None:
            ### direct request
            try:
                # send the request and collect the response
                response = requests.get(url, headers=headers, timeout=timeout)
                # return the result
                return response
            except Exception:
                # retries left?
                if retry > 0:
                    # debug log
                    print('Request failed; retrying in {0}s. - Retries left: {1}'.format(self.retry_interval, retry))
                    # wait a moment
                    time.sleep(self.retry_interval)
                    # recurse with one fewer retry
                    return self.get(url, timeout, proxy, retry - 1)
                else:
                    # debug log
                    print('Switching to a proxy')
                    # wait a moment
                    time.sleep(self.retry_interval)
                    # recurse with the proxy enabled
                    return self.get(url, timeout, True)
        else:
            ### proxied request
            # pick a random proxy IP
            IP = str(random.choice(self.ips)).strip()
            # build the proxies mapping
            proxies = {'http': IP}
            # send the request through the proxy
            response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            # return the result
            return response

if __name__ == '__main__':
    # instantiate the class
    instance = HTTP()

    # print the response headers
    try:
        print(instance.get('http://mzitu.com', 3, True).headers)
    except Exception as e:
        # print the exception
        print(e)
        # error hint
        print('ERROR OCCURRED! MAYBE THE PROXY FAILED.')

In this version the code gains a request timeout (timeout) and an error-retry count (retry=6: after six failures we switch to a proxy).

Next step: once the proxy has also failed six times, we drop the proxy again. Straight to the code:

# -*- coding:utf-8 -*-
# -----------------------
# Imports
# -----------------------
# HTTP requests
import requests
# regular expressions
import re
# random selection
import random
# time (for retry delays)
import time


# -----------------------
# Class definitions
# -----------------------
# HTTP helper class
class HTTP:

    # constructor
    def __init__(self):
        # initialize the IP pool
        self.__init_ips()
        # initialize the UA pool
        self.__init_user_agents()
        # initialize the configuration
        self.__init_configs()

    # initialize the IP pool
    def __init_ips(self):
        # toggle: use a cached IP pool (True) or scrape a fresh one (False)
        if True:
            ### use the cached pool
            self.ips = [u'218.56.132.158:8080', u'114.99.5.199:9000', u'117.90.7.211:9000', u'121.232.147.234:9000', u'121.15.170.171:8080', u'116.226.69.141:9797', u'119.254.84.90:80', u'117.90.2.76:9000', u'121.232.147.236:9000', u'124.193.85.88:8080', u'60.160.128.10:9999', u'221.230.7.54:9000', u'117.90.2.68:9000', u'117.90.2.73:9000', u'14.146.147.140:8998', u'211.138.187.121:8998', u'121.232.144.140:9000', u'113.242.141.33:8998', u'183.27.218.108:8998', u'222.181.24.36:8998', u'121.232.144.230:9000', u'117.90.3.164:9000', u'121.232.146.130:9000', u'121.232.144.154:9000', u'221.239.81.83:8118', u'121.232.145.16:9000', u'182.87.39.44:9000', u'125.67.72.196:9000', u'183.196.168.194:9000', u'111.13.7.42:80', u'113.200.245.158:9999', u'171.217.112.245:9797', u'120.204.85.29:3128', u'121.232.147.184:9000', u'14.109.115.85:8998', u'222.188.96.181:8998', u'125.93.149.47:9000', u'121.232.144.143:9000', u'121.232.146.170:9000', u'182.85.60.90:9000', u'45.221.216.27:8080', u'121.232.144.47:9000', u'125.85.187.194:8998', u'117.90.5.0:9000', u'121.232.145.174:9000', u'125.117.133.215:9000', u'119.135.185.99:9797', u'121.232.144.249:9000', u'117.90.2.36:9000', u'123.7.177.20:9999', u'122.193.14.106:82', u'183.240.87.229:8080', u'218.104.148.157:8080', u'117.90.1.254:9000', u'117.90.3.119:9000', u'222.221.24.21:8998', u'121.232.145.116:9000', u'121.232.145.145:9000', u'112.33.7.9:8081', u'117.90.3.12:9000', u'117.90.1.35:9000', u'110.73.11.196:8123', u'117.28.242.189:80', u'219.149.59.250:9797', u'43.241.225.147:3128', u'121.232.145.200:9000', u'171.92.53.59:9000', u'171.215.226.193:9000', u'121.232.145.53:9000', u'202.194.44.245:8998', u'121.232.147.211:9000', u'58.67.159.50:80', u'61.160.254.23:23', u'121.232.144.150:9000', u'117.90.2.163:9000', u'202.206.244.12:8998', u'121.232.144.116:9000', u'163.125.159.236:8888', u'117.90.5.187:9000', u'122.226.62.90:3128', u'117.90.2.112:9000', u'121.232.144.241:9000', u'121.232.144.253:9000', u'211.68.236.82:8998', 
u'113.140.25.4:81', u'59.78.47.184:1080', u'117.90.3.209:9000', u'122.96.59.107:83', u'121.232.144.235:9000', u'117.90.2.100:9000', u'117.90.0.200:9000', u'121.232.144.77:9000', u'27.46.32.128:9797', u'122.96.59.105:83', u'121.232.144.138:9000', u'60.178.13.13:8081', u'125.67.72.106:9000', u'222.85.127.130:9797', u'183.31.248.158:9999', u'117.90.7.85:9000']
        else:
            ### scrape a fresh pool
            # IP pool
            self.ips = []

            # walk the pages that publish proxy IPs
            for page in range(1, 10 + 1):
                # page HTML
                html = requests.get("http://www.kuaidaili.com/proxylist/" + str(page)).text
                # extract (IP, port) pairs (re.S makes '.' match newlines; findall returns a list)
                data = re.findall(r'data-title="IP">(.*?)</td>\s+<td data-title="PORT">(.*?)</td>', html, re.S)

                # add them to the pool
                for ip in data:
                    self.ips.append(ip[0] + ':' + ip[1])

        # debug log
        # print(self.ips)

    # initialize the UA pool
    def __init_user_agents(self):
        # UA pool
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
            'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
        ]

        # debug log
        # print(self.user_agents)

    # initialize the configuration
    def __init_configs(self):
        # seconds to wait between retries
        self.retry_interval = 3

    # HTTP GET
    # @param url: target URL
    # @param timeout: request timeout in seconds
    # @param proxy: proxy switch (None = direct request)
    # @param retry: remaining retries
    def get(self, url, timeout=3, proxy=None, retry=6):
        # request headers: pick a random User-Agent to pose as a browser
        headers = {'User-Agent': random.choice(self.user_agents)}

        # decide whether to go through a proxy
        if proxy is None:
            ### direct request
            try:
                # send the request and collect the response
                response = requests.get(url, headers=headers, timeout=timeout)
                # return the result
                return response
            except Exception:
                # retries left?
                if retry > 0:
                    # debug log
                    print('Request failed; retrying in {0}s. - Retries left: {1}'.format(self.retry_interval, retry))
                    # wait a moment
                    time.sleep(self.retry_interval)
                    # recurse with one fewer retry
                    return self.get(url, timeout, proxy, retry - 1)
                else:
                    # debug log
                    print('Switching to a proxy')
                    # wait a moment
                    time.sleep(self.retry_interval)
                    # recurse with the proxy enabled
                    return self.get(url, timeout, True)
        else:
            ### proxied request
            try:
                # pick a random proxy IP
                IP = str(random.choice(self.ips)).strip()
                # build the proxies mapping
                proxies = {'http': IP}
                # send the request through the proxy
                response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
                # return the result
                return response
            except Exception:
                # retries left?
                if retry > 0:
                    # wait a moment
                    time.sleep(self.retry_interval)
                    # pick a different proxy IP
                    IP = str(random.choice(self.ips)).strip()
                    proxy = {'http': IP}
                    # debug log
                    print('RECHANGE PROXY!')
                    print('Request failed; switching proxy, retrying in {0}s. - Retries left: {1}'.format(self.retry_interval, retry))
                    print('Current proxy: {0}'.format(proxy))
                    # recurse with one fewer retry
                    return self.get(url, timeout, proxy, retry - 1)
                else:
                    # debug log
                    print('The proxies are no good either! Dropping the proxy. :(')
                    # recurse without a proxy
                    return self.get(url)

if __name__ == '__main__':
    # instantiate the class
    instance = HTTP()

    # print the response headers
    try:
        # print(instance.get('http://mzitu.com', 3, True).headers)
        print(instance.get('http://www.google.com.hk/', 3).headers)
    except Exception as e:
        # print the exception
        print(e)
        # error hint
        print('ERROR OCCURRED! MAYBE THE PROXY FAILED.')

At this point we have a reasonably robust download module. Of course, there is still plenty to do if we want to polish it:

  • Check whether the URL is disallowed by the site's robots.txt
  • Check whether the response status indicates a server error
  • Limit crawl depth so we don't fall into a spider trap
  • ····
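For the first bullet, Python's standard library already ships a robots.txt parser; a minimal sketch (the example rules and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In a real crawler you would call rp.set_url('http://example.com/robots.txt'); rp.read()
# Here we parse an example rule set directly:
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'http://example.com/index.html'))  # True: allowed
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False: disallowed
```

A crawler that checks can_fetch() before each get() call stays on the right side of the site's stated rules.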

That's all for today. Bye!