常用工具

10 个爬虫工程师必备的工具：https://mp.weixin.qq.com/s/rpzXGjttpcNGubAy50CWpA

正则

写爬虫，怎么可以不会正则呢？https://mp.weixin.qq.com/s/BLma5UWhc1rtHL4GkG7s7Q

XPath

简单的标签搜索：http://www.spbeen.com/p/e4a032a4-3cf0-467b-9199-098240925504
通过ID和Class检索：http://www.spbeen.com/p/4fef7500-1f9c-471b-a56e-af741bc16012
逐层检索和全局检索：http://www.spbeen.com/p/bb16e09d-511f-4728-af49-752ced909ec1

aiohttp

高阶虫术，异步请求术：https://mp.weixin.qq.com/s/hqsH2KBCnzlqnWkwpQjIMQ
如何让你写的爬虫速度像坐火箭一样快【并发请求】：https://mp.weixin.qq.com/s/ZDsoEVY0T2GI9E34dkfUZw

Ajax

爬虫神器！用它可以实时处理和保存 Ajax 数据：https://mp.weixin.qq.com/s/Nw3ZoBHa4f4Ew8XEfuyY1g
https://github.com/wendux/Ajax-hook

智能解析库

爬虫智能解析库 Readability 和 Newspaper 的用法：https://mp.weixin.qq.com/s/RpN6RrKUSkWYbSRqkJcWJA

自动化

自动化实战专辑：https://mp.weixin.qq.com/s/t2LIf4-EQxitO2-e1vNZbQ

手机自动化

我是如何实现 100% 自动签到领取京豆的：https://mp.weixin.qq.com/s/95UOiXKl9JeVVeWSY2s-Vw
带你用 Python 实现自动化群控（入门篇）：https://mp.weixin.qq.com/s/g-AocWwh3nsTy3ZLJhVcBA
Python刷微信未读消息：https://mp.weixin.qq.com/s/6Xp6VVw_LrTQHfcKD1nq5A

Web自动化

Web自动化：https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzU1OTI0NjI1NQ==&action=getalbum&album_id=1470572376236032003&subscene=1&scenenote=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzU1OTI0NjI1NQ%3D%3D%26mid%3D2247486205%26idx%3D1%26sn%3D33dd05fc416daf1cc041bc76b53b733a%26chksm%3Dfc1b743dcb6cfd2b3dd1f1585bf0c3a4175e39a6dfdc7fe8a5792a405c50da2d7569fb361454%26mpshare%3D1%26scene%3D1%26srcid%3D0817aKYNEhS2ZlJZyOUnmdUC%26sharer_sharetime%3D1597637292071%26sharer_shareid%3D0ec39b35be59b92de20373a195d487ff%26key%3Dbd016e65e56767012b2c2ed7a4448fe5607970f3fa591f830d45fc39e529c26c601e4391d36f9bc73c9b888528acb0ecd179dcde2efa9e31bf36eabce34435cefc065774cf28443584e812db214cf809e285a1fadc2c79c3a6ede275d0d249ef828302b54e4d8dec811f0dc9b545673071d4b5007ea36021c1a45662cbd39bdb%26ascene%3D1%26uin%3DMjI4MDEzNjMzOQ%253D%253D%26devicetype%3DiMac%2BMacmini8%252C1%2BOSX%2BOSX%2B10.14.6%2Bbuild(18G84)%26version%3D12040293%26nettype%3DWIFI%26lang%3Dzh_CN%26fontScale%3D100%26exportkey%3DA98S15lgIJPWdY9Hmmi3PDg%253D%26pass_ticket%3DvNH6GlFAsFEwxO1JD5BTnSJEQlDFJ0hjHKHPUBVLRPF%252BvhDnbQNXePCVQqHy3K9f%26fontgear%3D2.000000#wechat_redirect

代理、抓包

mitmproxy：https://mp.weixin.qq.com/s/MQgzLCFKd_HBdbZC-Aeosg
Appium：https://www.cnblogs.com/fnng/p/4540731.html

Selenium、Pyppeteer

新神器 Pyppeteer 绕过淘宝更简单！：https://mp.weixin.qq.com/s/Iz-DY1UrSfVFRFh5CyHl3Q
一日一技：如何为 Pyppeteer 设置带有权限验证的代理？：https://mp.weixin.qq.com/s/9fxsBwdUBvAmrMHqk3u-Lg
一日一技：如何正确移除Selenium中window.navigator.webdriver的值：https://mp.weixin.qq.com/s/TqL3OawPe9zW_nneyXvefQ
在Pyppeteer中正确隐藏window.navigator.webdriver：https://mp.weixin.qq.com/s/QVkUABGT7nHd0CTB7Lgp5w
好用的 Puppeteer 辅助工具 Puppeteer Recorder：https://segmentfault.com/a/1190000016073329?utm_medium=referral&utm_source=tuicool
Docker 中运行 Pyppeteer 的那些坑：https://mp.weixin.qq.com/s/zPeVqRJZyV5BPBmaFBykNA

Airtest

Airtest：https://airtest.netease.com
官方博客、使用方法：https://airtest.netease.com/blog/index.html
官方文档：https://airtest.netease.com/docs/cn/1_quick_start.html
图像识别技术来做自动化测试和编写爬虫：https://mp.weixin.qq.com/s/kWMrJ2e9pZLxiTMpo2gv7A
使用Airtest超快速开发App爬虫：https://www.kingname.info/2019/01/19/use-airtest/
win巧用bat文件做Airtest脚本的“批量运行”：https://mp.weixin.qq.com/s/1YlUuiQCmMGb5_64S-si3Q

各种爬虫

10 行代码完成抖音热门视频的爬取

https://mp.weixin.qq.com/s/-GlgX5ODy8yYTKslU762og

微信公众号爬虫（Scrapy、Flask、Echarts、Elasticsearch）

https://mp.weixin.qq.com/s/Beyuv_izDAOVBFvXpW66kA
Github：https://github.com/wonderfulsuccess/weixin_crawler
作者制作的简单工具文档：https://shimo.im/docs/E1IjqOy2cYkPRlZd

50行Python代码，教你获取公众号全部文章：https://mp.weixin.qq.com/s/nkW2sYLcdsNTYTkk-4BeLA
怎么保存公众号历史文章合集到本地的？：https://mp.weixin.qq.com/s/4G1icyWiWPDtAhFPNEfzyA
微信公众号文章爬虫：https://juejin.im/post/5cdf64f76fb9a07ee4633266
Python爬取指定微信公众号所有文章：https://mp.weixin.qq.com/s/ZjqqagGugoR9VMoLOhqYtg

框架

pyspider

文档：http://docs.pyspider.org/en/latest/
中文文档：http://www.pyspider.cn
中文问答：https://segmentfault.com/t/pyspider

资料：
https://cuiqingcai.com/2652.html
https://moshuqi.github.io/2016/08/12/Python%E7%88%AC%E8%99%AB-PySpider%E6%A1%86%E6%9E%B6/

Scrapy

Scrapy文档：https://scrapy-chs.readthedocs.io/zh_CN/0.24/
Scrapy中的Request：https://mp.weixin.qq.com/s/pVIVmRC3sKvQbGT_2pbSbg

ScrapydClient

打包代码为egg文件，再上传到远程Scrapyd

# 安装
pip3 install scrapyd-client

安装完成后有个scrapyd-deploy命令可以用

Scrapyrt(轻量级部署管理工具)

Scrapyrt 为 Scrapy 提供了一个调度的 HTTP 接口，有了它我们不需要再执行 Scrapy 命令而是通过请求一个 HTTP 接口即可调度 Scrapy 任务，Scrapyrt 比 Scrapyd 轻量级，如果不需要分布式多任务的话可以简单使用 Scrapyrt 实现远程 Scrapy 任务的调度。

# 安装
pip3 install scrapyrt

GitHub：https://github.com/scrapinghub/scrapyrt
官方文档：http://scrapyrt.readthedocs.io

Scrapyd(分布式部署管理工具)

接收egg文件并部署运行，配置密码访问用nginx反代。
https://scrapyd.readthedocs.io/en/stable/
https://github.com/scrapy/scrapyd

# 安装
pip3 install scrapyd

# scrapyd模板配置（默认不存在）
mkdir /etc/scrapyd
vim /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10    # 每个 CPU 最多运行 10 个Scrapy Job
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0   # 外网访问，nginx反代改为127.0.0.1
http_port   = 6800       # 端口
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

**scrapyd系统服务：**https://github.com/scrapy/scrapyd/issues/217

scrapyd-api(对Scrapyd的API再一次封装)

# 安装
pip3 install python-scrapyd-api

可以在本地不通过curl而是执行python语句的方式来调用API
https://github.com/djm/python-scrapyd-api

Scrapy-Redis(分布式爬取)

Github：https://github.com/rolando/scrapy-redis
https://mp.weixin.qq.com/s/ljgEMwRJB2_bbdb37p1UQw

Scrapy-Redis-Cluster(Redis集群版)

Github：https://github.com/thsheep/scrapy_redis_cluster

Gerapy

崔大神开发的可视化爬虫框架。
https://github.com/Gerapy/Gerapy

Crawlab（推荐）

一款支持多语言多框架的可视化爬虫管理后台。用python开发，之后转用go开发。
github：https://github.com/tikazyq/crawlab
文档：https://tikazyq.github.io/crawlab-docs/

WebCollector

介绍：https://www.oschina.net/p/webcollector-python
该框架有Python版和Java版。
https://github.com/CrawlScript/WebCollector-Python

专题

网络爬虫过程中5种网页去重方法简介

https://mp.weixin.qq.com/s/PwAiUZsGDP7Kaau6O2-yqQ

猿人学Python爬虫专题

https://www.yuanrenxue.com/crawler/why-write-python-crawler.html

爬虫从入门到放弃专题

https://piaosanlang.gitbooks.io/spiders/content/
布隆过滤器：https://www.cnblogs.com/tonglin0325/p/7043886.html

Crossin的编程教室爬虫实战

https://crossincode.com/school/course/2/

爬取实时变化的 WebSocket 数据

基于：WebSockets；websocket-client
https://mp.weixin.qq.com/s/fuS3uDvAWOQBQNetLqzO-g
https://github.com/asyncins/aiowebsocket

Google图片爬虫

https://mp.weixin.qq.com/s/nkNo4vvUNCX5YMmB-OsNnQ
https://github.com/hardikvasa/google-images-download

「从0到1」Python爬虫专题完结版

https://mp.weixin.qq.com/s/BUZhmh-3qIe2HCpZrY4Zig
github：https://github.com/pythonchannel/spider_works

Python3网络爬虫开发实战（崔庆才）

https://germey.gitbooks.io/python3webspider/content/

这可能是你见过的最全的网络爬虫干货总结！

分布式爬虫

一文搞懂分布式进程爬虫：https://mp.weixin.qq.com/s/7N6fRAq0tRiuVZeXxgbwLA
如何简单高效地部署和监控分布式爬虫项目：https://mp.weixin.qq.com/s/wh-Ok_iF2LbzKdQMMi-T_A

健壮高效的爬虫

催庆才：https://www.bilibili.com/video/av34379204/

如何免费创建云端爬虫集群

https://mp.weixin.qq.com/s/PSpMVrJV1iOvkLeGc0VZkQ
github：https://github.com/my8100/scrapydweb

用 Scrapyd 打造个人化的爬虫部署管理控制台

此内容是付费的，微信登录
https://juejin.im/book/5bb5d3fa6fb9a05d2a1d819a

智能化解析

https://mp.weixin.qq.com/s/RNRwq9e5HCZtLv3vVonAWQ

找到加密参数

教你一步步扣代码解出你需要找到的加密参数：https://mp.weixin.qq.com/s/2pXa7LQa6HA9pjJE5WnTFw
**安卓从开发到逆向：**公众号：爬虫工程师之家

App逆向

把 Android App 逆向分为几步？三步：https://mp.weixin.qq.com/s/EDeIPDn5yUlerYKCXclkSg
手把手教你搭建完美的Android搞机/逆向环境：https://mp.weixin.qq.com/s/31-w6Hjl1ZvBc2htQfujoQ

App Hack（XP框架）

VirtualXposed框架：https://vxposed.com/
安卓从开发到逆向（五）小白也看得懂的Xposed框架入门：https://mp.weixin.qq.com/s/xfWPGl4ulyEcPfAN1AfWkg

爬虫常见问题

字体反扒

随机User-Agent

一行代码搞定 Scrapy 随机 User-Agent 设置：https://mp.weixin.qq.com/s/6yppLN2c-X3_1Q26_HEOAA

爬虫断了（断点续传）

https://mp.weixin.qq.com/s/k3-yd6TTnCe_uPDH4OxouQ

JS加密、参数破解、Cookie

绕开登陆和访问频率控制

https://mp.weixin.qq.com/s/MatKhJcnDjt6Bcq4BUs_kQ
Python模拟登录淘宝：https://mp.weixin.qq.com/s/VUnY5vFToaxdi6mUFplKKQ

Token破解

美团美食小爬虫(补) | 美团_token参数获取详解：https://mp.weixin.qq.com/s/jY-3RrUdMvHGI3ND7EJWcA

脱壳、解包、反编译

安卓常见脱壳方法：https://mp.weixin.qq.com/s/FcX6C-3mXckmRe0J-TTsfA

其他

实际工作中的 Python 爬虫项目是这样写的：https://mp.weixin.qq.com/s/wBM0yrQzL1IOZzJuCxNPkA
《Python3 反爬虫原理与绕过实战》配套代码：https://github.com/asyncins/antispider

Python爬虫、自动化