Temporarily Migrating a VPS That Crawlers Pushed Out of Memory

The VPS out-of-memory problem

Over the past month or two, whenever I looked through the VPS logs I kept finding Out of memory errors in the kernel log, and nearly all of the killed processes were php-fpm:

root@zoserver:~# cat /var/log/kern.log
Dec 15 20:11:43 zoserver kernel: [55751339.090508] Out of memory in UB 1253: OOM killed process 32239 (php-fpm) score 0 vm:56336kB, rss:29832kB, swap:0kB
Dec 15 20:11:56 zoserver kernel: [55751352.643620] Out of memory in UB 1253: OOM killed process 32238 (php-fpm) score 0 vm:55580kB, rss:29444kB, swap:0kB
Dec 15 20:11:57 zoserver kernel: [55751353.609602] Out of memory in UB 1253: OOM killed process 32242 (php-fpm) score 0 vm:56088kB, rss:29800kB, swap:0kB
Dec 15 20:12:23 zoserver kernel: [55751379.072308] Out of memory in UB 1253: OOM killed process 32240 (php-fpm) score 0 vm:55496kB, rss:29520kB, swap:0kB
Dec 15 20:12:45 zoserver kernel: [55751401.084746] Out of memory in UB 1253: OOM killed process 32225 (php-fpm) score 0 vm:55848kB, rss:29564kB, swap:0kB
Dec 15 20:13:22 zoserver kernel: [55751438.326072] Out of memory in UB 1253: OOM killed process 32266 (php-fpm) score 0 vm:56008kB, rss:29880kB, swap:0kB
Dec 15 20:13:36 zoserver kernel: [55751452.087637] Out of memory in UB 1253: OOM killed process 32278 (php-fpm) score 0 vm:55328kB, rss:29356kB, swap:0kB
Dec 15 20:13:37 zoserver kernel: [55751453.035146] Out of memory in UB 1253: OOM killed process 32241 (php-fpm) score 0 vm:55752kB, rss:29784kB, swap:0kB

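The log shows that each php-fpm process has an rss of close to 30 MB. I had already set pm.max_children to 8 in the php-fpm.conf of my LNMP setup, so under a burst of simultaneous requests php-fpm alone can take up to 8 × 30 MB = 240 MB. Add MySQL, BTSync and the other programs that also need memory, and this 256 MB VPS I bought from HighSpeedWeb simply cannot cope, so the Out of memory errors are hardly surprising.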
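The arithmetic above can be written down as a quick budget check. This is only a sketch: the 30 MB per-worker figure comes from the rss values in the kernel log, while the 100 MB reserved for MySQL, BTSync and everything else is a made-up illustrative number, not a measurement from this server.

```python
def worst_case_mb(max_children, rss_mb_per_worker):
    """Memory used if every php-fpm worker is busy at the same time."""
    return max_children * rss_mb_per_worker


def safe_max_children(total_mb, reserved_mb, rss_mb_per_worker):
    """Largest pm.max_children that fits in what is left for php-fpm."""
    return max((total_mb - reserved_mb) // rss_mb_per_worker, 1)


# 8 workers at roughly 30 MB each, as in the post
print(worst_case_mb(8, 30))

# reserved_mb=100 is a hypothetical allowance for the other services
print(safe_max_children(256, 100, 30))
```

With these assumed numbers the budget only leaves room for about five workers, which is why eight busy workers are enough to trigger the OOM killer.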

To find the cause, I checked the nginx access log for the moments when memory ran out:

root@zoserver:~# more /home/wwwlogs/zohead.log
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/tcpkill-nfs/ HTTP/1.1" 200
13304 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/newifi-mini-openwrt/ HTTP/1.1" 200
18841 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/category/technology/linux/ubuntu/ HTTP/1.1" 200
11921 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/category/technology/phone/ HTTP/1.1" 200
12800 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/category/technology/ HTTP/1.1" 200
14862 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/category/technology/android/ HTTP/1.1" 200
15127 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/zerotier-container/ HTTP/1.1" 200
16323 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/tag/bash/ HTTP/1.1" 200
11221 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/tag/ssh/ HTTP/1.1" 200
11266 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"

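This is clearly the work of a not very friendly crawler. There are far too many entries to list here, but counting them shows that SMTBot made several hundred requests within a dozen or so seconds, well beyond what this VPS can handle.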
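The counting mentioned above could be done along these lines. It is a rough sketch rather than the command actually used: it assumes each record starts with the client IP and a bracketed timestamp, as in the excerpt, and it skips the wrapped continuation lines. The function name and the 15-second window are illustrative choices.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches only the first line of a record: IP, "- -", [timestamp], opening quote
LINE_RE = re.compile(r'^(\S+) - - \[([^\]]+)\] "')


def peak_requests(lines, window_seconds=15):
    """Return {ip: largest number of requests seen inside any time window}."""
    times = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation line of a wrapped record
        stamp = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
        times[m.group(1)].append(stamp)

    window = timedelta(seconds=window_seconds)
    peaks = {}
    for ip, stamps in times.items():
        stamps.sort()
        start = 0
        best = 0
        for end, stamp in enumerate(stamps):
            # Shrink the window from the left until it spans at most `window`
            while stamp - stamps[start] > window:
                start += 1
            best = max(best, end - start + 1)
        peaks[ip] = best
    return peaks


sample = [
    '64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/tcpkill-nfs/ HTTP/1.1" 200',
    '13304 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"',
    '64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/newifi-mini-openwrt/ HTTP/1.1" 200',
    '64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/tag/bash/ HTTP/1.1" 200',
]
print(peak_requests(sample))
```

Run against the full log file (for example `peak_requests(open(path))`), any IP whose peak runs into the hundreds stands out immediately.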

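Going through the logs I also found that various beginner-grade practice crawlers keep hitting the WordPress blog. Their telltale sign is that a single client cycles through many different User-Agent strings: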

root@zoserver:~# more /home/wwwlogs/zohead.log
138.197.19.145 - - [17/Dec/2016:08:28:49 +0800] "GET /robots.txt HTTP/1.1" 200
145 - "-" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:00 +0800] "GET /wp-login.php HTTP/1.1" 200
2464 - "http://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:01 +0800] "GET /archives/category/technology/network-tech/https-ssl/ HTTP/1.1" 200
8604 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] "GET /archives/category/travel/ HTTP/1.1" 502
166 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] "GET /archives/tag/video/ HTTP/1.1" 200
11459 - "https://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] "GET /guestbook/ HTTP/1.1" 200
9962 - "https://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14"
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] "GET /archives/tasker-shell/ HTTP/1.1" 200
13094 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] "GET /archives/category/technology/ HTTP/1.1" 200
13717 - "https://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14"
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] "GET /archives/tag/android/ HTTP/1.1" 200
13879 - "https://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] "GET /archives/category/technology/android/ HTTP/1.1" 200
13842 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:05 +0800] "GET /archives/author/admin/ HTTP/1.1" 200
13716 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0"

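These small crawlers are just as relentless and put essentially no delay between requests. At least this one appears to have fetched robots.txt, so it makes sense to add some restrictions in robots.txt and in the nginx configuration.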
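The pattern described above, one IP address rotating through several User-Agent strings, is easy to flag automatically. The sketch below assumes the log layout shown in the excerpts (the status code may be followed by a line break before the byte count); the function names and the threshold of three agents are illustrative assumptions.

```python
import re
from collections import defaultdict

# IP, timestamp, request, status, then bytes, one more field, referer, User-Agent.
# \s+ after the status lets the pattern span a wrapped record.
RECORD_RE = re.compile(
    r'(\d{1,3}(?:\.\d{1,3}){3}) - - \[[^\]]+\] "[^"]*" \d{3}\s+'
    r'\S+ \S+ "[^"]*" "([^"]*)"'
)


def agents_per_ip(log_text):
    """Map each client IP to the set of User-Agent strings it sent."""
    seen = defaultdict(set)
    for ip, agent in RECORD_RE.findall(log_text):
        seen[ip].add(agent)
    return seen


def suspicious_ips(log_text, threshold=3):
    """IPs that used at least `threshold` different User-Agent strings."""
    return sorted(
        ip for ip, agents in agents_per_ip(log_text).items()
        if len(agents) >= threshold
    )


SAMPLE = '''\
138.197.19.145 - - [17/Dec/2016:08:28:49 +0800] "GET /robots.txt HTTP/1.1" 200
145 - "-" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:00 +0800] "GET /wp-login.php HTTP/1.1" 200
2464 - "http://zohead.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] "GET /archives/category/travel/ HTTP/1.1" 502
166 - "https://zohead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/tcpkill-nfs/ HTTP/1.1" 200
13304 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
'''
print(suspicious_ips(SAMPLE))
```

A shared proxy or NAT gateway can also produce several agents from one address, so a hit here is a hint to look closer rather than proof of a crawler.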

Countermeasures

Updating robots.txt

First I fleshed out the robots.txt file I had never paid much attention to and added some restrictions, roughly as follows:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*?replytocom=*
Crawl-delay: 30
Sitemap: https://zohead.com/sitemap.xml

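This bars all crawlers from some WordPress internal directories and from replytocom comment-reply URLs, and adds a Crawl-delay of 30 seconds to keep the request volume down.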
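One way to sanity-check rules like these is Python's standard urllib.robotparser, which is roughly what a well-behaved crawler written in Python would consult. This is a small sketch, not part of the original setup. As far as I know that parser only does plain prefix matching, so the wildcard replytocom rule is deliberately left out of the checks; SMTBot is used here purely as an example agent name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*?replytocom=*
Crawl-delay: 30
Sitemap: https://zohead.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Internal WordPress directories should be off limits, posts should not
admin_allowed = parser.can_fetch("SMTBot", "https://zohead.com/wp-admin/")
post_allowed = parser.can_fetch("SMTBot", "https://zohead.com/archives/tcpkill-nfs/")

# Seconds a polite crawler is asked to wait between requests
delay = parser.crawl_delay("SMTBot")

print(admin_allowed, post_allowed, delay)
```

None of this is enforced by the server; it only describes what a crawler that chooses to honor the file would do.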

Updating the nginx configuration

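Not every crawler reads and obeys the robots protocol, and search giants such as Google and Baidu have explicitly said they do not support the Crawl-delay directive added above. So the nginx configuration still has to be changed to limit the request rate directly: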

root@zoserver:~# more /usr/local/nginx/conf/nginx.conf
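http {
	# Shared-memory zone of 60 MB keyed by $anti_spider, 200 requests per minute per key
	limit_req_zone $anti_spider zone=anti_spider:60m rate=200r/m;
}

server {
	# Let up to 5 requests above the rate through without delay; reject the rest
	limit_req zone=anti_spider burst=5 nodelay;
	# The key is the User-Agent header, so clients sending the same User-Agent share one limit
	set $anti_spider $http_user_agent;
}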

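The above is only a short excerpt of the nginx changes (in the real file the server block sits inside http). limit_req_zone caps requests at 200 per minute for each distinct User-Agent, and burst=5 with nodelay lets up to 5 requests beyond that rate through immediately; anything more is rejected, by default with a 503.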
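To make the effect of rate=200r/m and burst=5 more concrete, here is a simplified model of the leaky-bucket accounting that limit_req is documented to use. It is my own approximation for illustration, not nginx code: the class name is made up, time is passed in as milliseconds, and the bookkeeping in thousandths of a request is an assumption about how the real module rounds.

```python
class LimitReqModel:
    """Toy model of one limit_req key with nodelay."""

    def __init__(self, rate_per_minute, burst):
        # Everything is tracked in thousandths of a request
        self.rate = rate_per_minute * 1000 // 60   # drained per second
        self.burst = burst * 1000
        self.excess = 0
        self.last_ms = None

    def allow(self, now_ms):
        if self.last_ms is None:
            # First request for this key always passes
            self.last_ms = now_ms
            return True
        elapsed = max(now_ms - self.last_ms, 0)
        # Drain what the rate allows, then add the new request
        excess = max(self.excess - self.rate * elapsed // 1000 + 1000, 0)
        if excess > self.burst:
            return False  # over the burst allowance: rejected
        self.excess = excess
        self.last_ms = now_ms
        return True


bucket = LimitReqModel(rate_per_minute=200, burst=5)

# Ten requests arriving in the same millisecond
results = [bucket.allow(0) for _ in range(10)]
# One more request a full second later, after the bucket has drained a bit
later = bucket.allow(1000)

print(results, later)
```

In this model a sudden volley gets one request plus the burst of five through and the rest are refused, which matches the mix of 200 and 503 responses one would expect to see in the access log when a crawler floods the site.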

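After these two changes the out-of-memory errors in the VPS logs did seem to become less frequent, but the relief did not last. A few days later, when I checked the kernel log and nginx access log again, a notorious visitor had shown up, and its frequent requests were still driving the VPS out of memory: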

root@zoserver:~# more /home/wwwlogs/zohead.log
42.236.99.242 - - [09/Jan/2017:05:41:44 +0800] "GET /archives/tag/keepassdroid/?wpmp_switcher=mobile HTTP/1.1" 503
608 - "https://m.zohead.com/archives/tag/keepassdroid/?wpmp_switcher=mobile" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
42.236.99.178 - - [09/Jan/2017:05:41:44 +0800] "GET /archives/easymoney-to-feidee/?lang=en&replytocom=641replytocom=641replytocom=641replytocom=640replytocom=641replytocom=641replytocom=640replytocom=640&wpmp_switcher=desktop HTTP/1.1" 503
608 - "https://zohead.com/archives/easymoney-to-feidee/?lang=en&replytocom=641replytocom=641replytocom=641replytocom=640replytocom=641replytocom=641replytocom=640replytocom=640&wpmp_switcher=desktop" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
42.236.99.230 - - [09/Jan/2017:05:41:45 +0800] "GET / HTTP/1.1" 301
178 - "http://www.zohead.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
42.236.99.194 - - [09/Jan/2017:05:41:47 +0800] "GET /archives/qiniu-https-tamper/?lang=en&replytocom=2190 HTTP/1.1" 200
12300 - "https://zohead.com/archives/qiniu-https-tamper/?lang=en&replytocom=2190" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
42.236.99.206 - - [09/Jan/2017:05:41:48 +0800] "GET /archives/category/technology/linux/page/3/?wpmp_switcher=true HTTP/1.1" 503  
608 - "https://m.zohead.com/archives/category/technology/linux/page/3/?wpmp_switcher=true" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
180.153.236.19 - - [09/Jan/2017:05:41:48 +0800] "GET /archives/category/technology/cplusplus/ HTTP/1.1" 503
608 - "https://m.zohead.com/archives/category/technology/cplusplus/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
42.236.99.154 - - [09/Jan/2017:05:41:48 +0800] "GET /comments/feed/?lang=en HTTP/1.1" 503
608 - "http://zohead.com/comments/feed/?lang=en" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
-
180.153.236.165 - - [09/Jan/2017:05:41:49 +0800] "GET /archives/tag/start-stop-daemon/?lang=en&wpmp_switcher=mobile HTTP/1.1" 200
8502 - "http://zohead.com/archives/tag/start-stop-daemon/?lang=en&wpmp_switcher=mobile" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"

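This is only a tiny part of the blog's access log for that period. 360 Search's 360Spider crawler keeps hitting the blog, and judging by the addresses its server fleet is sizable. The list of 360Spider IP addresses is published on its official site: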

https://www.so.com/help/spider_ip.html
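Given such a list, checking whether a logged address belongs to the crawler takes only a few lines with Python's ipaddress module. The two /24 ranges below are placeholders guessed from addresses that appear in the log excerpt, not the official list, which would have to be copied from the page above.

```python
import ipaddress

# Hypothetical ranges for illustration only; replace with the published list
EXAMPLE_RANGES = [
    ipaddress.ip_network("42.236.99.0/24"),
    ipaddress.ip_network("180.153.236.0/24"),
]


def in_ranges(ip, ranges=EXAMPLE_RANGES):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


print(in_ranges("42.236.99.242"))   # an address from the 360Spider excerpt
print(in_ranges("64.79.85.205"))    # the SMTBot address seen earlier
```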

What makes it worse, the logs show that the 360 crawler never requested robots.txt at all, so directives like Crawl-delay have no chance of taking effect.
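A check of this kind can be sketched as follows: collect every User-Agent seen in the log and subtract those that requested /robots.txt at least once. The log layout is assumed to match the excerpts above, and the function name is illustrative. Grouping by User-Agent suits 360Spider, which uses one agent string across many IPs, but it would not work for the agent-rotating crawlers shown earlier.

```python
import re

# Request path, status, bytes, one more field, referer, User-Agent.
# \s+ after the status lets the pattern span a wrapped record.
RECORD_RE = re.compile(
    r'\[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]*" \d{3}\s+\S+ \S+ "[^"]*" "([^"]*)"'
)


def agents_without_robots(log_text):
    """User-Agents that appear in the log but never fetched /robots.txt."""
    fetched, all_agents = set(), set()
    for path, agent in RECORD_RE.findall(log_text):
        all_agents.add(agent)
        if path.split("?", 1)[0] == "/robots.txt":
            fetched.add(agent)
    return all_agents - fetched


SAMPLE = '''\
138.197.19.145 - - [17/Dec/2016:08:28:49 +0800] "GET /robots.txt HTTP/1.1" 200
145 - "-" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
-
42.236.99.230 - - [09/Jan/2017:05:41:45 +0800] "GET / HTTP/1.1" 301
178 - "http://www.zohead.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider"
'''
print(agents_without_robots(SAMPLE))
```

The result is only meaningful over a log that covers a long enough period, since a crawler may cache robots.txt for days between fetches.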

Migrating the server

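After watching for about a month, I found that nearly all of the out-of-memory incidents on the VPS are now caused by the 360 crawler, with the occasional small crawler that ignores the rules and floods the site. A VPS with 256 MB of memory is not a long-term solution in any case, so I decided to move the blog to another server.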

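I first looked at HighSpeedWeb's current plans, but the OpenVZ and KVM plans with 512 MB of memory or more are not particularly cheap right now. After comparing options I decided to move the blog to the IBM Bluemix container platform for the time being, because Bluemix containers seem to allow a fairly generous amount of burst memory. And after several months of use, the platform has been reasonably stable for me, apart from its somewhat opaque billing.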

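The A record for the blog's domain has already been changed, so the page you are reading now is served from a Bluemix container. Also, on several recent occasions ISPs injected ad code into the blog when it was served over HTTP, so I disabled HTTP altogether and the blog can now only be reached over HTTPS. The only small regret is leaving behind the HighSpeedWeb server, which has been remarkably stable: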

root@zoserver:~# uptime
 21:29:27 up 325 days,  6:16,  1 user,  load average: 0.00, 0.00, 0.00

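The HighSpeedWeb VPS has been running stably for almost a year. Ever since I accidentally rebooted it at the last renewal, DNSPod monitoring has hardly raised an alert. So if the Bluemix container misbehaves in the meantime, I can quickly switch back to HighSpeedWeb. Above all I hope the Bluemix container stays as stable as possible. Happy Lantern Festival, everyone.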

Comments on "Temporarily Migrating a VPS That Crawlers Pushed Out of Memory"

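      1. Thanks. So you had already replied that quickly; I never received your notification email…
        I created a docker container yesterday following that article of yours. But Bluemix has somehow billed 2.09 dollars, even though my docker container was only given 256M RAM.
        Also, is the free 20G of external storage mentioned on Bluemix extra? That is, not included in the docker container's 16G (256M) or 30G (512M).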

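        1. Bluemix (SoftLayer) servers do not allow SMTP to send mail over port 25. I had not changed that before; it should work now.
          Have you already verified a credit card? I removed my card after verification, and so far I have never gone over the limit.
          External storage is separate from the space that comes with the Docker container; it is Docker's persistent data volume.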

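          1. Yes, I got the email. I opened a ticket and IBM has not replied yet. About baseimage-docker: this is my first time with docker and it is making my head spin. Do I need to use RUN in the dockerfile to install nginx+php+mysql, and then put the web directory and the mysql database directory on external storage?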

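    1. It depends on the container provider. With something like Bluemix, data is generally only affected when you delete and recreate the container instance;
      with something like Arukas, which does not support external storage at all and rebuilds the container on every start, data cannot be kept;
      I do not use external storage on Bluemix.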

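  1. Against malicious, frequent crawling by junk crawlers like these, you could try the approach in the article [Nginx: limiting per-IP concurrent connections/speed to slow down junk spider crawlers] to avoid the awkward out-of-memory situation! The UA and IP of such crawlers change frequently, so limiting concurrency and speed directly in nginx is the most effective approach!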
