
An Application of the pyspider Crawler

Author:  Published: 2015-03-27 20:30:34
Tags: crawler
  • 1. To store the crawled data locally, first create a MySQL database (named test here) and, inside it, a table douban_db for the movie records. The DDL is shown below, followed by a short Python sketch for creating the table programmatically:

    DROP TABLE IF EXISTS `douban_db`;
    CREATE TABLE `douban_db` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `url` varchar(100) NOT NULL,
      `title` varchar(100),
      `direct` varchar(30),
      `performer` varchar(300),
      `type` varchar(30),
      `district` varchar(20) NOT NULL,
      `language` varchar(30),
      `date` varchar(30),
      `time` varchar(30),
      `alias` varchar(20) NOT NULL,
      `score` varchar(30),
      `comments` varchar(300),
      `scenario` varchar(300),
      `IMDb` varchar(30),
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
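
    If you prefer to run the DDL from Python rather than the mysql command-line client, a minimal sketch using mysql.connector (the same driver used in step 3) might look like the following; the host, credentials, and DDL file name are placeholders for your own environment:

    # -*- coding: utf-8 -*-
    # Sketch only: run the DDL above via mysql.connector instead of the mysql client.
    # Host, credentials and the DDL file name are placeholders.
    import mysql.connector

    ddl = open('douban_db.sql').read()   # the DROP/CREATE statements shown above
    cnx = mysql.connector.connect(user='root', password='root', host='127.0.0.1')
    cur = cnx.cursor()
    cur.execute('CREATE DATABASE IF NOT EXISTS `test` DEFAULT CHARACTER SET utf8')
    cur.execute('USE `test`')
    for _ in cur.execute(ddl, multi=True):   # the file contains two statements
        pass
    cnx.commit()
    cur.close()
    cnx.close()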

    2. When you crawl with the open-source framework pyspider, the results are stored by default in the SQLite database result.db. To make the data easier to work with, we store it in MySQL instead. That means overriding the on_result method so that it instantiates and calls the SQL class we implement ourselves (see step 3). A complete example:

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2015-03-20 09:46:20
    # Project: fly_spider
    
    import re
    from pyspider.database.mysql.mysqldb import SQL
    from pyspider.libs.base_handler import *
    
    
    class Handler(BaseHandler):
    
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
        }
    
        crawl_config = {
            'headers' : headers,
            'timeout' : 100
        }
        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('http://movie.douban.com/tag/', callback=self.index_page)
    
        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                if re.match(r'http://movie\.douban\.com/tag/\w+', each.attr.href, re.U):
                    self.crawl(each.attr.href, callback=self.list_page)  
        @config(age=10*24*60*60, priority=2)                
        def list_page(self, response):
            for each in response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div > table tr.item > td > div.pl2 > a').items():
                self.crawl(each.attr.href, priority=9, callback=self.detail_page)  
    
        @config(priority=3)
        def detail_page(self, response):
            return {
                'url': response.url,
                'title': response.doc('html > body > #wrapper > #content > h3 > span').text(),
                'direct': ','.join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
                'performer': ','.join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
                'type': ','.join(x.text() for x in response.doc('span[property="v:genre"]').items()),
                # 'district': ''.join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
                # 'language': ''.join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
                'date': ','.join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
                'time': ','.join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
                # 'alias': ''.join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
                'score': response.doc('.rating_num').text(),
                'comments': response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h4 > i').text(),
                'scenario': response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
                'IMDb': ''.join(x.text() for x in response.doc('span[href]').items()),
            }
    
        def on_result(self, result):
            if not result or not result['title']:
                return
            sql = SQL()
            sql.replace('douban_db',**result)    
    

    A few notes about the code above:
    a. To keep the server from detecting the crawler and banning our IP (which shows up as HTTP 403 Forbidden responses), we attach HTTP headers to every request so that it looks like an ordinary browser visit:

        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
        }
    
        crawl_config = {
            'headers' : headers,
            'timeout' : 100
        }

    b. @every(minutes=24 * 60) means on_start is scheduled to run once a day;
    @config(age=10 * 24 * 60 * 60) means a crawled page is treated as valid for 10 days and will not be re-fetched before it expires. The annotated snippet below restates the two decorated methods from the handler above.
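
    For clarity, here are the same two decorated methods from the handler above, with the scheduling behaviour spelled out in comments (the body of index_page is elided):

        @every(minutes=24 * 60)            # the scheduler runs on_start once every 24 hours
        def on_start(self):
            self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)     # a page fetched by index_page is considered valid
        def index_page(self, response):    # for 10 days and is not re-crawled before expiry
            pass                           # body as shown in the full handler above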

    c. The next point is important: overriding the on_result method. This works like polymorphism: when a task returns, pyspider calls on_result, which by default writes the result into SQLite. To insert the data into MySQL instead, we override it like this:

    
        def on_result(self, result):
            if not result or not result['title']:
                return
            sql = SQL()
            sql.replace('douban_db', **result)
    

    Note that the guard if not result or not result['title']: matters: on_result also fires for callbacks that return nothing, so without the check the method raises an error complaining that result is undefined (None).
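
    A slightly more defensive variant (my own sketch, not from the original project) also tolerates a result dict that lacks the 'title' key:

        def on_result(self, result):
            # on_result is called for every callback, including index_page and
            # list_page, which return None; only detail_page results carry 'title'.
            if not result or not result.get('title'):
                return
            sql = SQL()
            sql.replace('douban_db', **result)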

    3. The code above instantiates and calls our own SQL class, imported with from pyspider.database.mysql.mysqldb import SQL, so that module must exist at that path. Save the following file as mysqldb.py under the pyspider/pyspider/database/mysql/ directory:

    from six import itervalues
    import mysql.connector
    from mysql.connector import errorcode
    from datetime import date, datetime, timedelta


    class SQL:

        username = 'root'            # database user name
        password = 'root'            # database password
        database = 'test'            # database name
        host = '172.30.25.231'       # database host address
        connection = ''
        connect = True
        placeholder = '%s'

        def __init__(self):
            if self.connect:
                SQL.connect(self)

        def escape(self, string):
            return '`%s`' % string

        def connect(self):
            config = {
                'user': SQL.username,
                'password': SQL.password,
                'host': SQL.host
            }
            if SQL.database is not None:
                config['database'] = SQL.database

            try:
                cnx = mysql.connector.connect(**config)
                SQL.connection = cnx
                return True
            except mysql.connector.Error as err:
                if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                    print('The credentials you provided are not correct.')
                elif err.errno == errorcode.ER_BAD_DB_ERROR:
                    print('The database you provided does not exist.')
                else:
                    print('Something went wrong: {}'.format(err))
                return False

        def replace(self, tablename=None, **values):
            if SQL.connection == '':
                print('Please connect first')
                return False

            tablename = self.escape(tablename)
            if values:
                _keys = ', '.join(self.escape(k) for k in values)
                _values = ', '.join([self.placeholder, ] * len(values))
                sql_query = 'REPLACE INTO %s (%s) VALUES (%s)' % (tablename, _keys, _values)
            else:
                sql_query = 'REPLACE INTO %s DEFAULT VALUES' % tablename

            cur = SQL.connection.cursor()
            try:
                if values:
                    cur.execute(sql_query, list(itervalues(values)))
                else:
                    cur.execute(sql_query)
                SQL.connection.commit()
                return True
            except mysql.connector.Error as err:
                print('An error occurred: {}'.format(err))
                return False
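
    To sanity-check the helper outside pyspider, you can instantiate it directly and push one hand-written record into the table; a minimal sketch (the sample values below are made up):

    # Standalone smoke test for the SQL helper (run outside pyspider).
    # The record is made-up sample data, only meant to confirm connectivity.
    from pyspider.database.mysql.mysqldb import SQL

    if __name__ == '__main__':
        sql = SQL()
        ok = sql.replace('douban_db',
                         url='http://movie.douban.com/subject/0000000/',
                         title='test title',
                         score='9.0')
        print('replace succeeded: {}'.format(ok))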
    

    Tutorial: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
    Test environment: http://demo.pyspider.org/
