Scraping Data on Every Chinese Attraction from TripAdvisor (猫途鹰)

Results First

  • Scraped all the Chinese reviews of the 34,237 attractions on 🦉 TripAdvisor China (猫途鹰), 25,165 reviews in total

    😝 I'm fairly confident the attraction list itself is as complete as it gets (it took a lot of effort);

    ☹️ I spot-checked a few attractions and their reviews were basically complete too. But I have no IP pool, so the anti-scraping measures eventually blocked me 😭 Still, I believe the code itself is fine; in theory it can collect every Chinese review

Source code: see OliveDS-GitHub

Accessing My Database

  • IP: 47.93.238.102
  • Port: 3306
  • MySQL user: dsjk
  • MySQL password: dsjk

  • Database: scrapy_sies

  • Table: sites_all_url_copy4 (all attractions in China)
  • Table: sites_all_review (all Chinese attraction reviews)

Site Analysis

I originally assumed that the China -> Attractions page

https://www.tripadvisor.cn/Attractions-g294211-Activities-China.html

would contain a master list that could be scraped directly. It turns out there is no such list, and the category data it does show is far from complete: only 296 attractions in total (presumably just the highlights). So there is no way to get every attraction in one pass.

The plan I settled on instead is to start from the China -> Destinations page

https://www.tripadvisor.cn/Tourism-g294211-China-Vacations.html

scrape every destination in China (ordered by ranking), and then scrape all the local attractions under each destination. Taking Qingdao as an example, that would be this page: https://www.tripadvisor.cn/Attractions-g297458-Activities-Qingdao_Shandong.html#ATTRACTION_SORT_WRAPPER (also ordered by ranking).

This approach should yield reasonably complete data.

Scraping All Destination Links

The link we end up needing is this one:

https://www.tripadvisor.cn/TourismChildrenAjax?geo=294211&offset=11&desktop=true

Under the "Popular Destinations" tab there is a ranking of China's top destinations. The list only shows part of its content at first, but clicking "See more popular destinations in China" keeps loading more. I tried it, and it seems to go on for a great many pages, apparently covering every destination.

Opening Chrome's developer tools and clicking "See more popular destinations in China" reveals the request the page sends; new content is loaded by changing the offset value in this request:

https://www.tripadvisor.cn/TourismChildrenAjax?geo=294211&offset=1&desktop=true

Opening this link directly in the browser shows that with offset=1 it loads the destinations ranked 7th and below, so changing offset retrieves all destinations, while ranks 1-6 are on the original page.

So the task is: first scrape the links of every destination in China, and then scrape the attractions on each destination page.
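Before wiring this into Scrapy, a quick sanity check with requests (not part of the final spider) confirms the idea. This is only a sketch and assumes the endpoint answers a plain GET the way it does in the browser; TripAdvisor may well require extra headers or cookies:

# Standalone sanity check: fetch one page of the destination ranking via the
# Ajax endpoint and count the destination links it returns.
import requests
from lxml import html

AJAX_URL = 'https://www.tripadvisor.cn/TourismChildrenAjax?geo=294211&offset={offset}&desktop=true'

resp = requests.get(AJAX_URL.format(offset=1),
                    headers={'User-Agent': 'Mozilla/5.0'})
tree = html.fromstring(resp.text)
links = tree.xpath('//a/@href')
print(len(links), 'links, e.g.', links[:3])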

Getting Started

Installing Scrapy

(Python 3 & pip3 are already installed)

pip3 install scrapy

Once installed, go to the desired directory and run

scrapy startproject maotuying

to create a new project.

Configuring Items

In maotuying->maotuying->items.py, define the fields to be stored through the pipeline (i.e., the objects the spider collects).

Starting with the first-level spider (destination HTML):

class MaotuyingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    des_name = scrapy.Field()
    des_html = scrapy.Field()
    pass

Creating the Spider

scrapy genspider sites www.tripadvisor.cn

Here sites is the name of the spider and www.tripadvisor.cn is the domain name.

Domain names usually omit the www prefix, but TripAdvisor's all include it. A quick look at the forums turned up:

The “www.” convention is redundant, old fashioned and … (IMO) … ugly. Most places who use “www.example.com" will also have a server at “example.com”, and one will redirect to the other as appropriate.

Sure enough, typing http://tripadvisor.cn in the browser redirects straight to https://www.tripadvisor.cn.

Note: the "s" in HTTPS stands for secure; it is HTTP transported over SSL encryption and therefore safer (likewise, sites that use HTTPS automatically upgrade plain HTTP requests to HTTPS).

After creating the spider, open sites.py and change start_urls from the default http://www.tripadvisor.cn to:

class SitesSpider(scrapy.Spider):
    name = 'sites'
    allowed_domains = ['www.tripadvisor.cn']
    start_urls = ['https://www.tripadvisor.cn/Tourism-g294211-China-Vacations.html']

    def parse(self, response):
        pass

Inspecting the Required Page Elements

Done with Chrome Developer Tools.

First Crawl Attempt

The first crawl attempt came back with an empty response. Debugging with scrapy shell showed:

% scrapy shell 'https://www.tripadvisor.cn/Tourism-g294211-China-Vacations.html' --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1065289e8>
[s] item {}
[s] request <GET https://www.tripadvisor.cn/Tourism-g294211-China-Vacations.html>
[s] settings <scrapy.settings.Settings object at 0x106528748>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: view(response)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-c68f42c3ba27> in <module>
----> 1 view(response)

/usr/local/lib/python3.7/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
68 from scrapy.http import HtmlResponse, TextResponse
69 # XXX: this implementation is a bit dirty and could be improved
---> 70 body = response.body
71 if isinstance(response, HtmlResponse):
72 if b'<base' not in body:

AttributeError: 'NoneType' object has no attribute 'body'

I found an answer here:

https://stackoverflow.com/questions/43973898/scrapy-shell-return-without-response

Following it, set this in settings.py:

ROBOTSTXT_OBEY = False

With that change, the scrapy shell session works as expected:

In [4]: response.xpath('//div[@class="popularCities"]/a/@href').extract()
Out[4]:
['/Tourism-g294217-Hong_Kong-Vacations.html',
'/Tourism-g294212-Beijing-Vacations.html',
'/Tourism-g308272-Shanghai-Vacations.html',
'/Tourism-g297463-Chengdu_Sichuan-Vacations.html',
'/Tourism-g297415-Shenzhen_Guangdong-Vacations.html',
'/Tourism-g298557-Xi_an_Shaanxi-Vacations.html']

The Real Crawl

Crawl Logic

  1. Scrape the URLs of all attractions in China and store them in a MySQL database
  2. Scrape the Chinese reviews of every attraction, one by one

Scraping the URLs of All Attractions in China

This breaks down into:

  1. The start_requests method builds the initial (destination) requests

    The {offset} value has to be filled in to request every destination page

  2. The parse_des method parses the destination URLs

  3. The parse_sites_firstpage method parses how many pages a destination's attraction list has and builds the URLs for the next round of attraction-list requests

    Again, {pages} has to be filled in

  4. parse_sites_1page parses each attraction's name and URL and stores them in MySQL

The code:

# -*- coding: utf-8 -*-
import scrapy
from maotuying.items import MaotuyingItem


class SitesSpider(scrapy.Spider):
    name = 'sites'
    allowed_domains = ['www.tripadvisor.cn']
    # start_urls = ['http://www.tripadvisor.cn/']
    '''
    Two-level spider: first collect the URLs of all destinations, then collect all attractions under each one.
    parse_des extracts the destination links.
    parse_sites_firstpage works out how many attraction-list pages a destination has.
    parse_sites_1page scrapes each attraction's name and URL and hands them to the pipeline.
    '''
    custom_settings = {
        # Allow duplicate requests: the first attraction-list page is requested once for its page count
        # and once more for its attractions. Without this, the first page of each destination fails.
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'ITEM_PIPELINES': {'maotuying.pipelines.MaotuyingPipeline': 300, }
    }

    # Normally start_urls would be enough, but here the URLs must be constructed,
    # so a dedicated method builds the list of requests to be issued one by one.
    def start_requests(self):
        requests = []
        url = '''https://www.tripadvisor.cn/TourismChildrenAjax?geo=294211&offset={offset}&desktop=true'''
        for i in range(212):  # found by trial: offsets 0-211; looping until no new destinations appear would also work
            request = scrapy.Request(url.format(offset=i), callback=self.parse_des)
            # print(request)
            # print("start_requests", i)
            requests.append(request)
        return requests

    def parse_des(self, response):  # parse the destination links
        des_s = response.xpath('//a/@href').extract()  # all <a> hrefs in the Ajax response, i.e. all destinations
        print("this is parse_des")
        for des in des_s:
            des_parts = des.split('-')  # e.g. /Tourism-g298559-Hangzhou_Zhejiang-Vacations.html
            print(des_parts[2])
            # https://www.tripadvisor.cn/Attractions-g298559-Activities-Hangzhou_Zhejiang.html
            url = 'https://www.tripadvisor.cn/Attractions-' + des_parts[1] + '-Activities-' + des_parts[2] + '.html'
            print("parse_des", url)
            yield scrapy.Request(url, self.parse_sites_firstpage)

    def parse_sites_firstpage(self, response):  # parse how many attraction-list pages a destination has
        # url = 'https://www.tripadvisor.cn/Attractions-' + des_parts[1] + '-Activities-oa' + str(i*30) + '-' + des_parts[2] + '.html'
        desurl = response.xpath('//li[@data-element=".masthead-dropdown-attractions"]/a/@href').extract()  # attractions link in the masthead dropdown
        # desurl = response.url
        # e.g. /Attractions-g294217-Activities-Hong_Kong.html
        print(desurl[0])
        des_parts = desurl[0].split('-')
        # https://www.tripadvisor.cn/Attractions-g294217-Activities-Hong_Kong.html
        # url = 'https://www.tripadvisor.cn/Attractions-' + des_parts[1] + '-Activities-' + des_parts[3]  # + '.html'
        # print("only parse page 1", url)
        try:
            page_s = response.xpath('//div[@class="pageNumbers"]/a/@data-page-number').extract()  # all page numbers on the page
            pages = int(page_s[-1])  # the last page number is the largest
        except:
            pages = 1
        print("this is parse_sites_pages")  # , pages
        for i in range(pages):
            # print("this is i in range(pages)", i)
            url = 'https://www.tripadvisor.cn/Attractions-' + des_parts[1] + '-Activities-oa' + str(i * 30) + '-' + des_parts[3]  # + '.html'
            print("this is i in range(pages)", i, url)
            yield scrapy.Request(url, self.parse_sites_1page)

    def parse_sites_1page(self, response):
        site_s = response.xpath('//div[@class="listing_title "]/a/@href').extract()  # links of all attraction listings on the page
        sites_items = []
        print("this is parse_sites_1page", site_s)
        for site in site_s:
            # e.g. /Attraction_Review-g298559-d386917-Reviews-West_Lake_Xi_Hu-Hangzhou_Zhejiang.html
            #      /Attraction_Review-g297431-d501536-Reviews-Cangyan_Mountain-Shijiazhuang_Hebei.html
            site_parts = site.split('-')
            # if url is not None:
            url = 'https://www.tripadvisor.cn' + site
            print(site_parts[4], url)

            # hand the data to the database pipeline
            item = MaotuyingItem()
            item['site_name'] = site_parts[4]
            item['site_url'] = url
            # TypeError: can only concatenate str (not "MaotuyingItem") to str
            # print("this is parse_sites's item" + item['site_url'])  # check the item was created
            sites_items.append(item)
            # yield scrapy.Request(url, self.parse_site)
        return sites_items  # must return, not pass, otherwise only the first iteration would be processed
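With the items and pipelines shown further below in place, this spider is launched from the project root in the usual way:

scrapy crawl sites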

Scraping the Chinese Reviews of Every Attraction

The steps are:

  1. The start_requests method reads the attraction URLs from MySQL and builds the initial requests

  2. parse_sites_firstpage parses how many review pages an attraction has and builds the URLs of the review pages to request next

    Again, {pages} in the URL has to be filled in

  3. The parse_site_reviews method parses the review details and stores them in the database

The code:

# -*- coding: utf-8 -*-
import scrapy
import pymysql
from maotuying.items import MaotuyingReviewItem


class SitedSpider(scrapy.Spider):
    name = 'sited'
    allowed_domains = ['www.tripadvisor.cn']
    # start_urls = ['http://www.tripadvisor.cn/']

    custom_settings = {
        # Allow duplicate requests: the first review page is requested once for its page count
        # and once more for its reviews. Without this, the first page fails.
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'ITEM_PIPELINES': {'maotuying.pipelines.MaotuyingReviewPipeline': 300, }
    }

    def start_requests(self):
        # fetch the URLs of all attractions from the database
        self.connect = pymysql.connect(
            host='localhost',
            port=3306,
            user='dsjk',
            password='dsjk',
            database='scrapy_sies',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)
        sql = "select site_url from sites_all_url_copy4"  # table: sites_all_url_copy4
        self.cursor.execute(sql)
        results = self.cursor.fetchall()
        print(len(results), " sites total.")
        requests = []
        i = 0
        for url in results:
            # i += 1
            # print(url[0])
            request = scrapy.Request(url[0], callback=self.parse_sites_firstpage)
            requests.append(request)
            # if i > 100:
            #     break
        return requests

    def parse_sites_firstpage(self, response):  # parse how many review pages an attraction has
        desurl = response.url
        # print(desurl)  # e.g. https://www.tripadvisor.cn/Attraction_Review-g298557-d1801584-Reviews---Xiangzi_Temple-Xi_an_Shaanxi.html
        des_parts = desurl.split('-')
        try:
            pages = response.xpath('//div[@class="pageNumbers"]/a/@data-page-number').extract()  # all page numbers on the page
            # print("pages is", pages)
            page = int(pages[-1])
            # print("transfer pages to int", pages)
        except:
            page = 1
        for i in range(page):
            url = des_parts[0] + '-' + des_parts[1] + '-' + des_parts[2] + '-' + des_parts[3] + '-or' + str(i * 10) + '-' + des_parts[4] + '-' + des_parts[5]  # + '.html'
            # print("this is i in range(pages)", i, url)
            yield scrapy.Request(url, self.parse_site_reviews)

    def parse_site_reviews(self, response):  # parse the review details
        desurl = response.url
        # print(desurl)  # e.g. https://www.tripadvisor.cn/Attraction_Review-g298557-d1801584-Reviews---Xiangzi_Temple-Xi_an_Shaanxi.html
        des_parts = desurl.split('-')

        review_quote = response.xpath('//span[@class="noQuotes"]').xpath('string(.)').extract()
        review_detail = response.xpath('//p[@class="partial_entry"]').xpath('string(.)').extract()
        review_user = response.xpath('//div[@class="info_text"]/div[1]').xpath('string(.)').extract()  # div[1] selects the first div
        review_time = response.xpath('//span[@class="ratingDate"]/@title').extract()
        review_url = response.xpath('//div[@class="quote"]/a/@href').extract()
        review_items = []
        # print("this is parse_site_reviews", desurl)
        for i in range(len(review_quote)):
            item = MaotuyingReviewItem()
            item['review_site'] = des_parts[-2]  # second-to-last part is the attraction name
            try:
                url_r = review_url[i]
            except:
                url_r = ""
            item['review_url'] = url_r
            item['review_quote'] = review_quote[i]
            item['review_detail'] = review_detail[i]
            item['review_user'] = review_user[i]
            item['review_time'] = review_time[i]
            review_items.append(item)
        return review_items  # must return, not pass, otherwise only the first iteration would be processed
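Once the attraction URLs are in the database, the review spider is started the same way:

scrapy crawl sited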

Connecting to MySQL

I followed this tutorial:

https://www.accordbox.com/blog/scrapy-tutorial-9-how-use-scrapy-item/

Note that you also need to run

pip3 install mysqlclient

before db_connect() will work.

In models.py, QuoteDB needs to be replaced with a model for your own table, and the database name and fields in the test code also need to be adjusted to match your own database. In my case, the test code and its result were:

from sqlalchemy.orm import sessionmaker
from maotuying.models import SitesDB, db_connect, create_table

engine = db_connect()
create_table(engine)
Session = sessionmaker(bind=engine)

session = Session()
quotedb = SitesDB()
quotedb.sites_name = "test name"
quotedb.sites_url = "test url"

try:
    session.add(quotedb)
    session.commit()

    # query again
    obj = session.query(SitesDB).first()
    print(obj.sites_name, obj.sites_url)
except:
    session.rollback()
    raise
finally:
    session.close()

# printed output: test name test url
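For reference, the models.py behind this test might look roughly like the following. This is only a sketch under my assumptions: the tutorial's SQLAlchemy setup is kept, QuoteDB is swapped for a SitesDB model, the table name sites_trial and the connection string are illustrative, and the column names come from the test code above:

# models.py (sketch, not the exact file used): SQLAlchemy model for the attractions table.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


def db_connect():
    # mysqlclient ("mysql://" dialect); credentials/host as used elsewhere in this post
    return create_engine('mysql://dsjk:dsjk@localhost:3306/scrapy_sies?charset=utf8')


def create_table(engine):
    Base.metadata.create_all(engine)


class SitesDB(Base):
    __tablename__ = 'sites_trial'  # illustrative table name

    id = Column(Integer, primary_key=True)
    sites_name = Column(String(100))
    sites_url = Column(String(300))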

Running this test program raised a library ImportError:

ImportError: dlopen(/usr/local/lib/python3.7/site-packages/MySQLdb/_mysql.cpython-37m-darwin.so, 2): Library not loaded: libcrypto.1.0.0.dylib
Referenced from: /usr/local/lib/python3.7/site-packages/MySQLdb/_mysql.cpython-37m-darwin.so
Reason: image not found
% sudo ln -s /usr/local/mysql-8.0.15-macos10.14-x86_64/lib/libcrypto.1.0.0.dylib /usr/local/lib/libcrypto.1.0.0.dylib

My fix was to symlink each required library into /usr/local/lib/ like this. If there is a better solution, please let me know.

Alternatively, you can connect using the pymysql package from Prof. Zhang's course slides (you have to create the tables and write the INSERT statements yourself, which is a bit more work).

A table can be created with a statement like this (example):

create table sites_trial(
    id INT NOT NULL AUTO_INCREMENT,
    site_name VARCHAR(100) DEFAULT NULL,
    site_url VARCHAR(300) DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
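The review table used by the pipeline further below can be created the same way. A sketch matching the fields the review pipeline inserts (the column lengths are my own guesses; TEXT for review_detail is an assumption, since reviews can be long):

create table sites_all_review_copy2(
    id INT NOT NULL AUTO_INCREMENT,
    review_site VARCHAR(200) DEFAULT NULL,
    review_url VARCHAR(300) DEFAULT NULL,
    review_quote VARCHAR(300) DEFAULT NULL,
    review_detail TEXT,
    review_user VARCHAR(200) DEFAULT NULL,
    review_time VARCHAR(100) DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;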

In the end, my items.py looks like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MaotuyingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    site_name = scrapy.Field()
    site_url = scrapy.Field()


class MaotuyingReviewItem(scrapy.Item):
    review_site = scrapy.Field()
    review_url = scrapy.Field()
    review_quote = scrapy.Field()
    review_detail = scrapy.Field()
    review_user = scrapy.Field()
    review_time = scrapy.Field()
    # review_site, review_url, review_quote, review_detail, review_user, review_time

And pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

'''
class MaotuyingPipeline(object):
    def process_item(self, item, spider):
        with open("sites.txt", 'a') as fp:
            fp.write(item['site_url'] + '\n')  # item['site_name'].encode("utf8") + ' ' +
        # return item
'''
import pymysql


class MaotuyingPipeline(object):

    siteInsert = '''insert into sites_all_url_copy4(site_name,site_url) values ('{site_name}','{site_url}')'''

    def process_item(self, item, spider):
        """Save deals in the database.

        This method is called for every item pipeline component.
        """
        sqlinsert = self.siteInsert.format(
            site_name=pymysql.escape_string(item['site_name']),
            site_url=pymysql.escape_string(item['site_url'])
        )

        self.cursor.execute(sqlinsert)
        return item

    def open_spider(self, spider):
        self.connect = pymysql.connect(
            host='47.93.238.102',
            port=3306,
            user='dsjk',
            password='dsjk',
            database='scrapy_sies',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()


class MaotuyingReviewPipeline(object):

    reviewInsert = '''insert into sites_all_review_copy2(review_site,review_url,review_quote,review_detail,review_user,review_time) values ('{review_site}','{review_url}','{review_quote}','{review_detail}','{review_user}','{review_time}')'''

    def process_item(self, item, spider):
        """Save deals in the database.

        This method is called for every item pipeline component.
        """
        sqlinsert = self.reviewInsert.format(
            # site_name=pymysql.escape_string(item['site_name']),
            # site_url=pymysql.escape_string(item['site_url'])
            review_site=pymysql.escape_string(item['review_site']),
            review_url=pymysql.escape_string(item['review_url']),
            review_quote=pymysql.escape_string(item['review_quote']),
            review_detail=pymysql.escape_string(item['review_detail']),
            review_user=pymysql.escape_string(item['review_user']),
            review_time=pymysql.escape_string(item['review_time']),
        )

        self.cursor.execute(sqlinsert)
        return item

    def open_spider(self, spider):
        self.connect = pymysql.connect(
            host='47.93.238.102',
            port=3306,
            user='dsjk',
            password='dsjk',
            database='scrapy_sies',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()

MySQL Database Server

Alibaba Cloud ECS Server

Since the database has to be accessible to my classmates, I needed a MySQL server. Using the student discount I rented an ECS cloud server for 9.5 CNY/month, which can be reached remotely via its public IP:

sudo ssh 47.93.238.102

Right after getting the ECS instance, run:

sudo apt-get update
apt update

I ran into a lot of errors installing MySQL Server by following Alibaba Cloud's official tutorial (https://yq.aliyun.com/articles/654980); in the end, simply installing it through apt worked:

sudo apt-get install mysql-server
sudo service mysql start  # note: on Linux the service is mysql, not mysqld!!!
ps -ef | grep mysql | grep -v grep  # confirms the service is running
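The later steps use the mysql> prompt; on a stock Ubuntu install you can usually get there (and optionally harden the setup first) with something like the following, though your image may differ:

sudo mysql_secure_installation  # optional: set a root password, drop the test DB, etc.
sudo mysql                      # on Ubuntu, root connects via auth_socket by default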

Configuring Security Group Rules

The ECS security group has to be configured to allow public access to port 3306, as shown in the screenshot below.

Database Migration

I initially stored everything in my local MySQL instance, so it had to be migrated to the ECS server.

Dump the database to a .sql file:

mysqldump -u dsjk -p scrapy_sies>scrapy_sies.sql

Upload the file to the ECS server:

scp /Users/oliveds/Documents/web/maotuying/scrapy_sies.sql root@47.93.238.102:/home/oliveds/

This command is run on the local machine: the first argument is the source (source IP omitted; just path + filename), the second is the destination IP (i.e., the ECS IP) and destination path.

Sure enough, the file shows up on the ECS command line:

root@...:/home/oliveds# ls
mysql57-community-release-el7-11.noarch.rpm scrapy_sies.sql

Restoring the Database on ECS

create database scrapy_sies;           # create a database with the original name
use scrapy_sies;                       # switch to it
source /home/oliveds/scrapy_sies.sql   # restore from the dump

Configuring MySQL for Remote Access

Creating a New User

First, create a new user for classmates and teachers to use:

mysql> create user dsjk identified by 'dsjk';

The first dsjk is the username, the second is the password.

Granting Privileges

mysql> grant all privileges on scrapy_sies.* to dsjk@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
mysql> exit;

Here all grants every privilege, including SELECT, INSERT, UPDATE and so on; you can also grant only a subset.

scrapy_sies is the database name; this grant covers all tables in that database only (.* means all of them). Using *.* instead would give the user access to every database.

The % in @'%' means dsjk may connect from any IP. To restrict access to a single IP, replace '%' with that address (xxx.xxx.xxx.xxx or localhost), for example:
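For instance, to allow access from one machine only (203.0.113.5 is just a placeholder address), mirror the same two steps with that host:

mysql> create user dsjk@'203.0.113.5' identified by 'dsjk';
mysql> grant all privileges on scrapy_sies.* to dsjk@'203.0.113.5';
mysql> flush privileges;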

You also need to let MySQL listen on all interfaces:

vim /etc/mysql/mysql.conf.d/mysqld.cnf
# comment out the bind-address line, or change it to 0.0.0.0 (keep scrolling; I missed it the first time)

Then reboot the ECS instance.

Remote Access with Navicat

Navicat is a convenient way to reach the remote database: install Navicat for MySQL, create a new connection, and enter the IP (the address of the machine running MySQL, which you can check with ifconfig), the port (usually 3306), and the username and password of the newly created any-IP user (dsjk).

After Open Connection, you can browse the contents of the scrapy_sies database that the user was granted access to.
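You can also verify the remote connection from Python with pymysql, using the credentials listed at the top of this post (just a quick check, not part of the project):

# Quick connectivity check against the remote MySQL server.
import pymysql

conn = pymysql.connect(host='47.93.238.102', port=3306,
                       user='dsjk', password='dsjk',
                       database='scrapy_sies', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('select count(*) from sites_all_url_copy4')
    print(cursor.fetchone())  # number of attraction URLs stored
conn.close()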

Side Notes

Crawling Duplicate Pages

My spider first extracts the page numbers from a destination's attraction list (i.e., it parses the first page of the list), and then parses every page of that list, so the first page ends up being requested twice. Scrapy deduplicates requests by default, which is why I kept missing the 30 attractions on the first page of every destination 😭

This problem cost me a whole day. It was genuinely hard to track down, and being new and disorganized didn't help; I stayed up until 3 a.m. chasing the bug… to no avail 😡 Only later did I realize the requests were being deduplicated. So silly 😞

custom_settings = {
    # Allow duplicate requests: the first attraction-list page is parsed once for its page count
    # and once more for its attractions. Without this, the first page of each destination fails.
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}

As shown above, this configures the spider to allow duplicate requests.
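An alternative I did not use here, which avoids turning off deduplication for the whole spider, is to mark only the repeated request as exempt, e.g. yield scrapy.Request(url, self.parse_sites_1page, dont_filter=True); every other URL then still goes through the normal dupefilter.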

Reading Chrome's Cache

After getting blocked by the anti-scraping measures I still had code to write, but the site was no longer reachable, so for a while I had to work from cached pages.

Go to chrome://chrome-urls/, select chrome://cache, and you can find the previously cached pages there.

Making Good Use of a Spider's Custom Settings

When different spiders need different configuration, put it in each spider's custom_settings rather than in the project's settings.py so they don't interfere with each other. For example, when the two spiders use different pipelines:

custom_settings = {
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    'ITEM_PIPELINES': {'maotuying.pipelines.MaotuyingReviewPipeline': 300, }
}

IP Proxies

Setting request headers goes a long way toward avoiding anti-scraping measures, but with a large enough volume of requests you still get banned easily.

At that point an IP pool for rotating addresses gives much more peace of mind. Of course, few of the free proxy pools actually work; it's better to buy a proxy service.
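For reference, a minimal header setup in a spider's custom_settings might look like this. It is only a sketch; the User-Agent string and header values are examples, not the ones I actually used:

custom_settings = {
    # Send browser-like headers with every request.
    'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    },
}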

See this tutorial for how to configure an IP proxy in Scrapy:

https://blog.csdn.net/u010978757/article/details/83409571
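The tutorial covers the details; the core idea is a downloader middleware that attaches a proxy to each request. A bare-bones sketch (the middleware name and proxy address are placeholders; a real setup would rotate through a pool of addresses):

# middlewares.py (sketch): attach a proxy to every outgoing request.
PROXY = 'http://203.0.113.10:8000'  # placeholder address


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = PROXY

# enable it in settings.py (or a spider's custom_settings):
# DOWNLOADER_MIDDLEWARES = {'maotuying.middlewares.RandomProxyMiddleware': 543}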

Results

Table: sites_all_url_copy4 (all attractions in China)

Table: sites_all_review (all Chinese attraction reviews)
