<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Soul Of Free Loop &#187; Crawlers</title>
	<atom:link href="https://zohead.com/archives/tag/spider/feed/" rel="self" type="application/rss+xml" />
	<link>https://zohead.com</link>
	<description>Uranus Zhou&#039;s Blog</description>
	<lastBuildDate>Sat, 19 Jul 2025 15:42:46 +0000</lastBuildDate>
	<language>zh-CN</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.8</generator>
	<item>
		<title>Temporarily Migrating a VPS That Crawlers Drove Out of Memory</title>
		<link>https://zohead.com/archives/vps-anti-spider/</link>
		<comments>https://zohead.com/archives/vps-anti-spider/#comments</comments>
		<pubDate>Sat, 11 Feb 2017 14:44:57 +0000</pubDate>
		<dc:creator><![CDATA[Uranus Zhou]]></dc:creator>
				<category><![CDATA[Hosting]]></category>
		<category><![CDATA[360Spider]]></category>
		<category><![CDATA[Bluemix]]></category>
		<category><![CDATA[HighSpeedWeb]]></category>
		<category><![CDATA[nginx]]></category>
		<category><![CDATA[robots]]></category>
		<category><![CDATA[VPS]]></category>
		<category><![CDATA[Out of memory]]></category>
		<category><![CDATA[Containers]]></category>
		<category><![CDATA[Crawlers]]></category>

		<guid isPermaLink="false">https://zohead.com/?p=1356</guid>
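		<description><![CDATA[The VPS out-of-memory problem. Over the past month or two, whenever I checked the VPS logs, I kept finding Out of memory errors in the kernel log, almost all of them triggered by php-fpm: The log shows that each php-fpm process has an rss of nearly 30 MB. I had already set pm.max_children to 8 in the php-fpm.conf of my LNMP setup, so when many requests arrive at once, php-fpm alone can take up to about 240 MB. Add MySQL, BTSync and other programs that also need memory, and this 256 MB VPS that I bought from HighSpeedWeb  [&#8230;]]]></description>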
				<content:encoded><![CDATA[<h2 id="vps-out-of-memory">The VPS out-of-memory problem</h2>
<p>Over the past month or two, whenever I checked the VPS logs, I kept finding Out of memory errors in the kernel log, almost all of them triggered by <code>php-fpm</code>:</p>
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# cat /var/log/kern.log
Dec 15 20:11:43 zoserver kernel: [55751339.090508] Out of memory in UB 1253: OOM killed process 32239 (php-fpm) score 0 vm:56336kB, rss:29832kB, swap:0kB
Dec 15 20:11:56 zoserver kernel: [55751352.643620] Out of memory in UB 1253: OOM killed process 32238 (php-fpm) score 0 vm:55580kB, rss:29444kB, swap:0kB
Dec 15 20:11:57 zoserver kernel: [55751353.609602] Out of memory in UB 1253: OOM killed process 32242 (php-fpm) score 0 vm:56088kB, rss:29800kB, swap:0kB
Dec 15 20:12:23 zoserver kernel: [55751379.072308] Out of memory in UB 1253: OOM killed process 32240 (php-fpm) score 0 vm:55496kB, rss:29520kB, swap:0kB
Dec 15 20:12:45 zoserver kernel: [55751401.084746] Out of memory in UB 1253: OOM killed process 32225 (php-fpm) score 0 vm:55848kB, rss:29564kB, swap:0kB
Dec 15 20:13:22 zoserver kernel: [55751438.326072] Out of memory in UB 1253: OOM killed process 32266 (php-fpm) score 0 vm:56008kB, rss:29880kB, swap:0kB
Dec 15 20:13:36 zoserver kernel: [55751452.087637] Out of memory in UB 1253: OOM killed process 32278 (php-fpm) score 0 vm:55328kB, rss:29356kB, swap:0kB
Dec 15 20:13:37 zoserver kernel: [55751453.035146] Out of memory in UB 1253: OOM killed process 32241 (php-fpm) score 0 vm:55752kB, rss:29784kB, swap:0kB
</pre>
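<p>The log shows that each <code>php-fpm</code> process has a resident set size (rss) of nearly 30 MB. I had already set <code>pm.max_children</code> to 8 in the <code>php-fpm.conf</code> of my LNMP setup, so when many requests arrive at once, <code>php-fpm</code> alone can take up to about 240 MB (8 × 30 MB). Add MySQL, BTSync and other programs that also need memory, and this 256 MB VPS, which I bought from <a href="https://zohead.com/archives/blog-hswvps/">HighSpeedWeb</a>, simply cannot cope, so the Out of memory errors are no surprise.</p>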
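<p>The estimate above can be reproduced with a small script. This is only a rough sketch and was not part of my original troubleshooting: the function name <code>worst_case_mb</code> and the regular expression are my own assumptions, written against the OpenVZ-style OOM lines shown in the kernel log, and the sample contains just two of those lines.</p>
<pre class="brush: python; title: ; notranslate">
import re

# Assumed pattern for the OpenVZ-style OOM lines shown in the kernel log.
OOM_LINE = re.compile(r'OOM killed process (\d+) \((\S+)\) .* rss:(\d+)kB')


def worst_case_mb(log_text, max_children):
    """Mean rss of the killed processes times max_children, in MB."""
    rss_kb = [int(m.group(3)) for m in OOM_LINE.finditer(log_text)]
    if not rss_kb:
        return 0.0
    mean_kb = sum(rss_kb) / len(rss_kb)
    return mean_kb * max_children / 1024


# Two lines copied from the kernel log excerpt.
SAMPLE = '''\
Dec 15 20:11:43 zoserver kernel: [55751339.090508] Out of memory in UB 1253: OOM killed process 32239 (php-fpm) score 0 vm:56336kB, rss:29832kB, swap:0kB
Dec 15 20:11:56 zoserver kernel: [55751352.643620] Out of memory in UB 1253: OOM killed process 32238 (php-fpm) score 0 vm:55580kB, rss:29444kB, swap:0kB
'''

if __name__ == '__main__':
    # Worst case for pm.max_children = 8
    print(round(worst_case_mb(SAMPLE, 8), 1))
</pre>
<p>For these two lines the result is about 231.5 MB, which leaves almost nothing for MySQL and the other services on a 256 MB VPS.</p>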
<p>To find the cause, I decided to look at the nginx access log for the moments when memory ran out:</p>
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# more /home/wwwlogs/zohead.log
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] &quot;GET /archives/tcpkill-nfs/ HTTP/1.1&quot; 200
13304 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] &quot;GET /archives/newifi-mini-openwrt/ HTTP/1.1&quot; 200
18841 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] &quot;GET /archives/category/technology/linux/ubuntu/ HTTP/1.1&quot; 200
11921 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] &quot;GET /archives/category/technology/phone/ HTTP/1.1&quot; 200
12800 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] &quot;GET /archives/category/technology/ HTTP/1.1&quot; 200
14862 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] &quot;GET /archives/category/technology/android/ HTTP/1.1&quot; 200
15127 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] &quot;GET /archives/zerotier-container/ HTTP/1.1&quot; 200
16323 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] &quot;GET /archives/tag/bash/ HTTP/1.1&quot; 200
11221 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] &quot;GET /archives/tag/ssh/ HTTP/1.1&quot; 200
11266 - &quot;-&quot; &quot;Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)&quot;
</pre>
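<p>This is clearly the work of a not very friendly crawler. There are far too many log entries to list here, but counting them shows that SMTBot sent several hundred requests within a dozen or so seconds, well beyond what this VPS can handle.</p>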
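<p>The counting can be done in a few lines. The following is a minimal sketch rather than the exact command I used: it assumes each log entry starts with a line in the layout shown in the excerpt (client IP, timestamp, request line), and <code>requests_per_second</code> is a hypothetical helper name.</p>
<pre class="brush: python; title: ; notranslate">
import re
from collections import Counter

# Assumed layout of the first line of each log entry, as in the excerpt.
ENTRY = re.compile(r'^(\S+) - - \[(.+?)\] "(\S+) (\S+)')


def requests_per_second(log_text):
    """Count requests per (client IP, timestamp) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m:
            ip, when, _method, _path = m.groups()
            counts[(ip, when)] += 1
    return counts


# Three entries copied from the log excerpt (continuation lines included).
SAMPLE = '''\
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/tcpkill-nfs/ HTTP/1.1" 200
13304 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:43 +0800] "GET /archives/newifi-mini-openwrt/ HTTP/1.1" 200
18841 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
-
64.79.85.205 - - [15/Dec/2016:20:11:44 +0800] "GET /archives/category/technology/android/ HTTP/1.1" 200
15127 - "-" "Mozilla/5.0 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
'''

if __name__ == '__main__':
    for (ip, when), n in requests_per_second(SAMPLE).most_common():
        print(ip, when, n)
</pre>
<p>Run against the full log, the busiest (IP, second) pairs float to the top, which makes a crawler like this one easy to spot.</p>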
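<p>While going through the logs I also found that all sorts of beginner, practice-project crawlers keep fetching my WordPress blog. Their telltale sign is that a single client rotates through many different User-Agent strings:</p>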
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# more /home/wwwlogs/zohead.log
138.197.19.145 - - [17/Dec/2016:08:28:49 +0800] &quot;GET /robots.txt HTTP/1.1&quot; 200
145 - &quot;-&quot; &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:00 +0800] &quot;GET /wp-login.php HTTP/1.1&quot; 200
2464 - &quot;http://zohead.com&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:01 +0800] &quot;GET /archives/category/technology/network-tech/https-ssl/ HTTP/1.1&quot; 200
8604 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] &quot;GET /archives/category/travel/ HTTP/1.1&quot; 502
166 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] &quot;GET /archives/tag/video/ HTTP/1.1&quot; 200
11459 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] &quot;GET /guestbook/ HTTP/1.1&quot; 200
9962 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:03 +0800] &quot;GET /archives/tasker-shell/ HTTP/1.1&quot; 200
13094 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] &quot;GET /archives/category/technology/ HTTP/1.1&quot; 200
13717 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] &quot;GET /archives/tag/android/ HTTP/1.1&quot; 200
13879 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:04 +0800] &quot;GET /archives/category/technology/android/ HTTP/1.1&quot; 200
13842 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36&quot;
-
138.197.19.145 - - [17/Dec/2016:08:29:05 +0800] &quot;GET /archives/author/admin/ HTTP/1.1&quot; 200
13716 - &quot;https://zohead.com&quot; &quot;Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0&quot;
</pre>
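<p>These small crawlers show no mercy either: they put practically no delay between requests. Fortunately they do appear to read <code>robots.txt</code>, so it is worth adding some restrictions both in <code>robots.txt</code> and in the nginx configuration.</p>
<h2 id="treatment">Countermeasures</h2>
<h3 id="mod-robots-txt">Updating robots.txt</h3>
<p>First I tidied up the <code>robots.txt</code> file, which I had never paid much attention to, and added a few restrictions. It now looks roughly like this:</p>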
<pre class="brush: plain; title: ; notranslate">
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*?replytocom=*
Crawl-delay: 30
Sitemap: https://zohead.com/sitemap.xml
</pre>
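<p>This bars all crawlers from several internal WordPress directories and adds a <code>Crawl-delay</code> of 30 seconds to keep well-behaved crawlers from sending too many requests.</p>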
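<p>To see how a compliant crawler would read these rules, here is a small sketch using Python's standard <code>urllib.robotparser</code> module. It is an illustration only: <code>ExampleBot</code> is a made-up crawler name, and as far as I know this parser only does simple prefix matching, so it may not treat the wildcard <code>replytocom</code> rule the way the large search engines do.</p>
<pre class="brush: python; title: ; notranslate">
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt shown above.
ROBOTS_TXT = '''\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*?replytocom=*
Crawl-delay: 30
Sitemap: https://zohead.com/sitemap.xml
'''


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == '__main__':
    # ExampleBot is a hypothetical crawler name used only for illustration.
    print(rules.can_fetch('ExampleBot', 'https://zohead.com/wp-admin/'))
    print(rules.can_fetch('ExampleBot', 'https://zohead.com/archives/vps-anti-spider/'))
    print(rules.crawl_delay('ExampleBot'))
</pre>
<p>A crawler that honours the file should refuse the first URL, accept the second, and wait 30 seconds between requests.</p>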
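<h3 id="mod-nginx-config">Updating the nginx configuration</h3>
<p>Not every crawler reads and obeys the robots protocol, and search giants such as Google and Baidu have explicitly stated that they do not support the <code>Crawl-delay</code> directive added above, so I still had to change the nginx configuration to limit the request rate directly:</p>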
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# more /usr/local/nginx/conf/nginx.conf
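http {
	# 60 MB shared zone keyed by $anti_spider, limited to 200 requests per minute per key
	limit_req_zone $anti_spider zone=anti_spider:60m rate=200r/m;
}

server {
	# Tolerate bursts of up to 5 extra requests; reject anything beyond that immediately
	limit_req zone=anti_spider burst=5 nodelay;
	# Use the User-Agent header as the rate-limiting key
	set $anti_spider $http_user_agent;
}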
</pre>
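<p>The above is only a brief excerpt of the changes to the nginx server configuration. <code>limit_req_zone</code> caps each key (here the User-Agent string) at 200 requests per minute, and <code>burst=5</code> lets up to 5 requests beyond that rate be served immediately before nginx starts rejecting the rest.</p>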
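<p>The effect of <code>rate</code> and <code>burst</code> is easier to see in a small simulation. What follows is a simplified model of the leaky-bucket idea behind <code>limit_req</code> with <code>nodelay</code>, not nginx's actual implementation (which, among other things, works in milliseconds and in shared memory); the function name <code>simulate_limit_req</code> is my own.</p>
<pre class="brush: python; title: ; notranslate">
def simulate_limit_req(arrivals, rate_per_min=200, burst=5):
    """Return one True (served) or False (rejected) per arrival time.

    arrivals: request times in seconds, in ascending order, for ONE key
    (in the configuration above the key is the User-Agent string).
    """
    rate_per_sec = rate_per_min / 60.0
    excess = 0.0   # requests currently queued above the steady rate
    last = None    # time of the last accepted request
    decisions = []
    for t in arrivals:
        if last is None:
            new_excess = 0.0
        else:
            # The bucket drains at the configured rate, then this request adds 1.
            drained = excess - rate_per_sec * (t - last)
            new_excess = max(drained, 0.0) + 1.0
        if new_excess > burst:
            decisions.append(False)   # over the burst allowance: rejected
        else:
            decisions.append(True)
            excess = new_excess
            last = t
    return decisions


if __name__ == '__main__':
    # Ten requests arriving at the same instant from one key.
    print(simulate_limit_req([0.0] * 10))
    # One request per second stays far below 200 requests per minute.
    print(simulate_limit_req([0.0, 1.0, 2.0, 3.0]))
</pre>
<p>In this model ten simultaneous requests yield six served and four rejected (the first request plus the burst of 5), while one request per second is never rejected. Note that because the key is the User-Agent, every client that sends the same User-Agent string shares one bucket.</p>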
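<p>After these two changes the Out of memory errors in the VPS logs did seem to become less frequent, but the relief did not last. A few days later, when I checked the kernel log and the nginx access log again, I found that a notorious visitor had turned up, and its constant requests were still driving the VPS out of memory:</p>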
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# more /home/wwwlogs/zohead.log
42.236.99.242 - - [09/Jan/2017:05:41:44 +0800] &quot;GET /archives/tag/keepassdroid/?wpmp_switcher=mobile HTTP/1.1&quot; 503
608 - &quot;https://m.zohead.com/archives/tag/keepassdroid/?wpmp_switcher=mobile&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
42.236.99.178 - - [09/Jan/2017:05:41:44 +0800] &quot;GET /archives/easymoney-to-feidee/?lang=en&amp;replytocom=641replytocom=641replytocom=641replytocom=640replytocom=641replytocom=641replytocom=640replytocom=640&amp;wpmp_switcher=desktop HTTP/1.1&quot; 503
608 - &quot;https://zohead.com/archives/easymoney-to-feidee/?lang=en&amp;replytocom=641replytocom=641replytocom=641replytocom=640replytocom=641replytocom=641replytocom=640replytocom=640&amp;wpmp_switcher=desktop&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
42.236.99.230 - - [09/Jan/2017:05:41:45 +0800] &quot;GET / HTTP/1.1&quot; 301
178 - &quot;http://www.zohead.com/&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
42.236.99.194 - - [09/Jan/2017:05:41:47 +0800] &quot;GET /archives/qiniu-https-tamper/?lang=en&amp;replytocom=2190 HTTP/1.1&quot; 200
12300 - &quot;https://zohead.com/archives/qiniu-https-tamper/?lang=en&amp;replytocom=2190&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
42.236.99.206 - - [09/Jan/2017:05:41:48 +0800] &quot;GET /archives/category/technology/linux/page/3/?wpmp_switcher=true HTTP/1.1&quot; 503  
608 - &quot;https://m.zohead.com/archives/category/technology/linux/page/3/?wpmp_switcher=true&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
180.153.236.19 - - [09/Jan/2017:05:41:48 +0800] &quot;GET /archives/category/technology/cplusplus/ HTTP/1.1&quot; 503
608 - &quot;https://m.zohead.com/archives/category/technology/cplusplus/&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
42.236.99.154 - - [09/Jan/2017:05:41:48 +0800] &quot;GET /comments/feed/?lang=en HTTP/1.1&quot; 503
608 - &quot;http://zohead.com/comments/feed/?lang=en&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
-
180.153.236.165 - - [09/Jan/2017:05:41:49 +0800] &quot;GET /archives/tag/start-stop-daemon/?lang=en&amp;wpmp_switcher=mobile HTTP/1.1&quot; 200
8502 - &quot;http://zohead.com/archives/tag/start-stop-daemon/?lang=en&amp;wpmp_switcher=mobile&quot; &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider&quot;
</pre>
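<p>The above is only a tiny fraction of the access log for that period. 360Spider, the crawler of 360 Search, was hitting the blog non-stop, and 360 evidently runs it from a very large fleet of machines. The list of 360Spider IP addresses is published on the official site:</p>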
<p><a href="https://www.so.com/help/spider_ip.html">https://www.so.com/help/spider_ip.html</a></p>
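<p>What the logs revealed as the worst part is that the 360 crawler never fetched <code>robots.txt</code> at all, so directives such as <code>Crawl-delay</code> had no chance of taking effect.</p>
<h2 id="migrate-server">Migrating the server</h2>
<p>After about a month of observation, nearly all of the out-of-memory incidents on the VPS turned out to be caused by the 360 crawler, with the occasional small crawler flooding the site with requests as well. Since a VPS with 256 MB of memory was never going to be a long-term solution anyway, I decided to move the blog to another server.</p>
<p>I first looked at HighSpeedWeb's current <a href="https://billing.highspeedweb.net/cart.php?gid=19">plans</a>, but the OpenVZ and KVM plans with 512 MB of memory or more are no longer cheap. After weighing the options, I decided to move the blog to the IBM <a href="https://zohead.com/archives/ibm-bluemix-docker/">Bluemix</a> container platform for the time being, because Bluemix containers seem to allow a fairly generous amount of burst memory. Besides, after several months of use, the platform has been reasonably stable for me, apart from its unclear billing.</p>
<p>The A record of the blog's domain has already been changed, so the page you are reading now is served from a Bluemix container. In addition, ISPs had recently injected ad code into the HTTP version of the blog several times, so I disabled HTTP support altogether, and the blog can now only be reached over HTTPS. The only pity is leaving behind the HighSpeedWeb server, which had been remarkably stable:</p>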
<pre class="brush: bash; title: ; notranslate">
root@zoserver:~# uptime
 21:29:27 up 325 days,  6:16,  1 user,  load average: 0.00, 0.00, 0.00
</pre>
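<p>The HighSpeedWeb VPS has been running steadily for nearly a year. Since I accidentally rebooted it during the last renewal, DNSPod monitoring has essentially never raised another alert. So if the Bluemix container misbehaves during this period, I can quickly switch back to HighSpeedWeb. Still, I hope the Bluemix container stays as stable as possible. Happy Lantern Festival, everyone.</p>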
]]></content:encoded>
			<wfw:commentRss>https://zohead.com/archives/vps-anti-spider/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>
