Install the scrapy package:
pip install scrapy
The install may fail partway through with an error; on Python 3 you need to download the Twisted dependency package manually.


Download address: https://pypi.org/simple/twisted/
or: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

After downloading, place Twisted-19.2.1-cp37-cp37m-win_amd64.whl on the desktop and install it:

pip install C:\Users\Administrator\Desktop\Twisted-19.2.1-cp37-cp37m-win_amd64.whl

Then run pip install scrapy again; output like the following means all the dependency packages are installed:

D:\>pip install scrapy
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: scrapy in d:\python\lib\site-packages (1.6.0)
Requirement already satisfied: Twisted>=13.1.0 in d:\python\lib\site-packages (from scrapy) (19.2.1)
Requirement already satisfied: parsel>=1.5 in d:\python\lib\site-packages (from scrapy) (1.5.1)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python\lib\site-packages (from scrapy) (2.0.5)
Requirement already satisfied: w3lib>=1.17.0 in d:\python\lib\site-packages (from scrapy) (1.20.0)
Requirement already satisfied: queuelib in d:\python\lib\site-packages (from scrapy) (1.5.0)
Requirement already satisfied: cssselect>=0.9 in d:\python\lib\site-packages (from scrapy) (1.0.3)
Requirement already satisfied: pyOpenSSL in d:\python\lib\site-packages (from scrapy) (19.0.0)
Requirement already satisfied: lxml in d:\python\lib\site-packages (from scrapy) (4.3.4)
Requirement already satisfied: service-identity in d:\python\lib\site-packages (from scrapy) (18.1.0)
Requirement already satisfied: six>=1.5.2 in d:\python\lib\site-packages (from scrapy) (1.12.0)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.0.0)
Requirement already satisfied: zope.interface>=4.4.2 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (4.6.0)
Requirement already satisfied: attrs>=17.4.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.1.0)
Requirement already satisfied: PyHamcrest>=1.9.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (1.9.0)
Requirement already satisfied: constantly>=15.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (15.1.0)
Requirement already satisfied: incremental>=16.10.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (17.5.0)
Requirement already satisfied: Automat>=0.3.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (0.7.0)
Requirement already satisfied: cryptography>=2.3 in d:\python\lib\site-packages (from pyOpenSSL->scrapy) (2.7)
Requirement already satisfied: pyasn1-modules in d:\python\lib\site-packages (from service-identity->scrapy) (0.2.5)
Requirement already satisfied: pyasn1 in d:\python\lib\site-packages (from service-identity->scrapy) (0.4.5)
Requirement already satisfied: idna>=2.5 in d:\python\lib\site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy) (2.8)
Requirement already satisfied: setuptools in d:\python\lib\site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (40.8.0)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (0.24.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (1.12.3)
Requirement already satisfied: pycparser in d:\python\lib\site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.3->pyOpenSSL->scrapy) (2.19)

If the command completes without errors, scrapy has been installed successfully.

A scrapy project can be created in any directory on any drive.
First switch to, e.g., drive D:, then run scrapy startproject Tencent

This tells scrapy to create a project named Tencent; a Tencent folder is generated under D:.


The project configuration files are in D:\Tencent\Tencent.


This directory contains the main files of the scrapy framework:

  1. items.py: defines the items to crawl; state your targets explicitly, such as job title, work location, etc. (you edit this yourself)

  2. middlewares.py: spider and downloader middleware; rarely needed, and the generated defaults are usable as-is (no edits needed):
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class TencentSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TencentDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
  3. pipelines.py: the pipeline file; modify it to control the storage format for scraped data (you configure this yourself):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.csv", "w", encoding='utf8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
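Each item passing through process_item is serialized as one JSON line; that serialization step can be tried standalone (the item values below are made up for illustration):

```python
import json

# A scraped item as the pipeline would receive it (hypothetical values)
item = {"positionName": "22989-Serverless前端架构师", "worklocation": "深圳"}

# ensure_ascii=False keeps the Chinese characters readable in the output file
line = json.dumps(item, ensure_ascii=False) + ",\n"
```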
  4. settings.py: turns individual settings on or off; most are commented out (disabled) by default. For example, the pipeline must be enabled by uncommenting the ITEM_PIPELINES setting.
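For this project the uncommented ITEM_PIPELINES block in settings.py would read as follows; the priority value 300 is the default from the generated template:

```python
# settings.py: register the project's pipeline; lower values run earlier (0-1000)
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}
```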

Create the spider file: scrapy genspider tencent "tencent.com"

The spider file we need to edit is tencent.py inside the spiders folder.


tencent.py:

# -*- coding: utf-8 -*-
import json

import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['tencent.com']
    baseurl = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    offset = 1
    start_urls = [baseurl.format(offset)]

    def parse(self, response):
        job_items = json.loads(response.body.decode())['Data']['Posts']

        for job_item in job_items:
            item = TencentItem()
            item['positionName'] = job_item["RecruitPostName"]
            item['positionLink'] = job_item["PostURL"] + job_item["PostId"]
            item['positionType'] = job_item["Responsibility"]
            item['worklocation'] = job_item["LocationName"]
            item['publishTime'] = job_item["LastUpdateTime"]
            yield item

        # Paginate: request the next page until a fixed page limit is reached
        # (a limit of 10 pages is assumed here)
        if self.offset < 10:
            self.offset += 1
            yield scrapy.Request(self.baseurl.format(self.offset), callback=self.parse)
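The parse() method decodes the JSON body of the API response before iterating over the posts; that decoding step can be reproduced standalone (the response body below is a made-up minimal example of the API's nesting, not real data):

```python
import json

# Minimal made-up response body with the same 'Data' -> 'Posts' nesting
# that the spider relies on
body = b'{"Data": {"Posts": [{"RecruitPostName": "Test Engineer", "LocationName": "Shenzhen"}]}}'

posts = json.loads(body.decode())['Data']['Posts']
first_title = posts[0]["RecruitPostName"]
```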

Run the spider: scrapy crawl tencent

D:\Tencent\Tencent\spiders>scrapy crawl tencent
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Tencent)
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-07-05 13:56:28 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tencent', 'NEWSPIDER_MODULE': 'Tencent.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tencent.spiders']}
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet Password: 1f4cb6e4d1fc4caa
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentPipeline']
2019-07-05 13:56:28 [scrapy.core.engine] INFO: Spider opened
2019-07-05 13:56:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-05 13:56:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from 
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013579229106176',
 'positionName': '22989-Serverless前端架构师',
 'positionType': '负责腾讯 Serverless 平台战略目标规划、整体平台产品能力设计;\n'
                 '负责探索前端技术与 Serverless 的结合落地,包括不限于腾讯大前端架构建设,公共组件的设计, '
                 'Serverless 的前端应用场景落地;\n'
                 '负责分析 Serverless 客户复杂应用场景的具体实现(小程序,Node.js);\n'
                 '负责 Serverless 场景中 Node.js 以及微信小程序相关生态建设。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from 
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013576054018048',
 'positionName': '22989-语音通信研发工程师(深圳)',
 'positionType': '负责腾讯云通信号码保护、企业总机、呼叫中心、融合通信产品开发;\n'
                 '负责融合通信PaaS平台的构建和优化;\n'
                 '负责通话质量分析和调优;',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from 
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231766955960606721123176695596060672',
 'positionName': '18435-合规反洗钱岗',
 'positionType': '1、根据反洗钱法律法规及监管规定的要求,完善落实反洗钱工作,指导各业务部门、分支机构开展反洗钱工作,支 持反洗钱监管沟通及监管报告反馈工作;\n'
                 '2、制定与完善内部反洗钱配套制度与流程,推动公司反洗钱标准化及流程化建设;\n'
                 '3、熟悉监管部门各项反洗钱政策制度要求,能就日常产品业务及合同及时进行反洗钱合规评审;\n'
                 '4、开展对各业务部门、分支机构的反洗钱合规自查工作,跟进缺陷问题;\n'
                 '5、根据反洗钱法律法规及监管规定的更新情况,及时对各业务部门进行法规解读,并追踪落实;\n'
                 '6、重点项目的跟进及推动工作;\n'
                 '7、领导交办的其他工作。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳总部'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from 
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231779032200683521123177903220068352',
 'positionName': '25927-游戏测试项目经理',
 'positionType': '负责项目计划和迭代计划的制定、跟进和总结回顾,推动产品需求、运营需求和技术需求的落地执行,排除障碍,确保交付时间和质量;\n'
                 '负责跟合作有关部门和团队对接,确保内部外部团队高效协同工作;\n'
                 '不断优化项目流程规范;,及时发现并跟踪解决项目问题,有效管理项目风险。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳总部'}
Due to limited space, only part of the output is shown here.

Summary:
Steps for writing a scrapy project:
scrapy startproject XXXX
scrapy genspider xxxx "xxx.com"
Edit items.py to specify the data to extract
Edit xxxx.py under the spiders folder to handle requests and responses and to extract the data (yield item)
Edit pipelines.py to process the item data returned by the spider, e.g. persist it locally
Edit settings.py to enable the pipeline component, ITEM_PIPELINES = {...}, and any other related settings
Run the spider

Article source: republished from the internet, 2019-07-05, "Setting up the scrapy crawler framework".
