
How to Download Python Files with Scrapy

I. Project background

Our earlier article [A First Look at Using the Scrapy Crawler Framework](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247484881&idx=1&sn=5d205c3315927845fed5aa4dfbb4f4da&chksm=e93ae956de4d604052e6d18ca10fc081f32cd8479a11420cd13fe20bbb963044b13d55b15390&scene=21#wechat_redirect) covered the basic principles of how the Scrapy framework runs. We then showed how to scrape text data with Scrapy in [Scraping Douban Books with Scrapy + MySQL + MongoDB for Simple Data Analysis](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247484898&idx=1&sn=763a73b7d4b7c991d1aeb2ceb389b686&chksm=e93ae965de4d6073da55c6db07bfe142c1d18ca744dae33214a2dba8940db348616e256a7e50&scene=21#wechat_redirect), and how to scrape images in [Scraping Beauty Photos from a Website with Scrapy](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247486610&idx=1&sn=e05d207e965d7bcc0507a195f25da2b9&chksm=e93ae015de4d69031ae847bf5f12adef61e82d263aa8366e9533a58c7011b6396b4a05051cea&scene=21#wechat_redirect). This time we show how to download files with Scrapy.

The target page for this crawl is: https://matplotlib.org/2.0.2/examples/index.html

II. Implementation

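1. Create the project
   》》Create the project: scrapy startproject matplot_file
   》》Enter the project directory: cd matplot_file
   》》Generate the spider: scrapy genspider mat matplotlib.org
   》》Run the spider (once the code in step 2 is in place): scrapy crawl mat -o mat_file.json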

2. Crawl the data
  》》Parse the data
  》》Store the data

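# -*- coding: utf-8 -*-
import scrapy
from matplot_file.items import MatplotFileItem


class MatSpider(scrapy.Spider):
    name = 'mat'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        # Select every top-level li element (one per example category)
        for lis in response.xpath('//*[@id="matplotlib-examples"]/div/ul/li'):
            # Iterate over the nested li elements (one per example)
            for li in lis.xpath('.//ul/li'):
                # Extract the relative link to the example page
                url = li.xpath('.//a/@href').get()
                # Skip entries that contain no link
                if url is None:
                    continue
                # Resolve the link against the current page URL
                url = response.urljoin(url)
                # Request the example page
                yield scrapy.Request(url, callback=self.parse_html)

    # Parse an example page
    def parse_html(self, response):
        # Extract the link to the downloadable source file
        href = response.xpath('//div[@class="section"]/p/a/@href').get()
        # Nothing to download if the page has no such link
        if href is None:
            return
        # Resolve the link against the current page URL
        url = response.urljoin(href)
        # Log the file URL to the console
        self.logger.info(url)
        # Create the item
        matfile = MatplotFileItem()
        # FilesPipeline expects a list of URLs in file_urls
        matfile['file_urls'] = [url]
        # Hand the item over to the pipeline
        yield matfile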

[Note] The code above belongs in mat.py.
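The spider calls response.urljoin twice to turn relative hrefs into absolute URLs. The sketch below illustrates that step with the standard library's `urljoin`, which behaves similarly for pages without a `<base>` tag. The two hrefs and the `resolve` helper are assumptions made up for illustration, not values taken from the crawl.

```python
from urllib.parse import urljoin

INDEX_URL = "https://matplotlib.org/2.0.2/examples/index.html"

# Hypothetical hrefs, shaped like the relative links the spider extracts.
detail_href = "animation/animate_decay.html"  # assumed link on the index page
source_href = "animate_decay.py"              # assumed link on the example page


def resolve(base_url, href):
    """Return an absolute URL, or None when the page had no matching link."""
    if not href:
        return None
    return urljoin(base_url, href)


# Index page -> example page -> source file, mirroring parse and parse_html.
detail_url = resolve(INDEX_URL, detail_href)
source_url = resolve(detail_url, source_href)

print(detail_url)
print(source_url)
```

Each relative link is resolved against the URL of the page it was found on, which is why the second call uses the example page URL rather than the index URL.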

# -*- coding: utf-8 -*-
BOT_NAME = 'matplot_file'


SPIDER_MODULES = ['matplot_file.spiders']
NEWSPIDER_MODULE = 'matplot_file.spiders'




# Enable the built-in FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
# Directory where downloaded files are saved
FILES_STORE = 'mat_file'
# Do not consult robots.txt before crawling
ROBOTSTXT_OBEY = False


[Note] The code above belongs in settings.py.
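With these settings, FilesPipeline saves each download under FILES_STORE in a `full/` subfolder, using a SHA-1 hash of the URL as the file name, so the original script names are lost. One possible remedy is to subclass FilesPipeline and override its `file_path` method. The sketch below shows only the path logic such an override might return; `readable_path` and `keep_parts` are hypothetical names, not part of Scrapy.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def readable_path(url, keep_parts=2):
    """Build a relative storage path from the last `keep_parts` URL segments."""
    parts = PurePosixPath(urlparse(url).path).parts
    # Drop the leading "/" that PurePosixPath reports for absolute paths.
    segments = [part for part in parts if part != "/"]
    return "/".join(segments[-keep_parts:])


example_url = "https://matplotlib.org/2.0.2/examples/animation/animate_decay.py"
print(readable_path(example_url))
```

Keeping the parent folder in the path is a design choice: it helps avoid collisions when two categories contain scripts with the same file name.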

import scrapy




class MatplotFileItem(scrapy.Item):
    # URLs of the files to download (read by FilesPipeline)
    file_urls = scrapy.Field()
    # Download results (filled in by FilesPipeline)
    files = scrapy.Field()


[Note] The code above belongs in items.py.
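After a download finishes, FilesPipeline fills the `files` field with one dictionary per file, including its `url`, its `path` relative to FILES_STORE, and a `checksum`. Because the spider is run with `-o mat_file.json`, these records end up in the JSON feed. The sketch below maps each URL to its local path; the sample record is made up to show the shape of the data, and `url_to_local_path` is a hypothetical helper.

```python
import json

# Made-up sample shaped like one record of mat_file.json; the path and
# checksum values are placeholders, not real crawl output.
sample_feed = """
[
  {
    "file_urls": ["https://matplotlib.org/2.0.2/examples/animation/animate_decay.py"],
    "files": [
      {
        "url": "https://matplotlib.org/2.0.2/examples/animation/animate_decay.py",
        "path": "full/0123abcd.py",
        "checksum": "d41d8cd9"
      }
    ]
  }
]
"""


def url_to_local_path(records, store="mat_file"):
    """Map each downloaded URL to its location under the FILES_STORE folder."""
    mapping = {}
    for record in records:
        # Items whose download failed may carry an empty or missing "files" list.
        for info in record.get("files", []):
            mapping[info["url"]] = f"{store}/{info['path']}"
    return mapping


mapping = url_to_local_path(json.loads(sample_feed))
for url, path in mapping.items():
    print(url, "->", path)
```

To use it on a real run, replace `json.loads(sample_feed)` with the parsed contents of mat_file.json.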
