2
votes

Comment se débarrasser des liens en double lors de l'exploration d'un site Web à l'aide de Python Scrapy?

J'ai le code suivant qui explore l'adresse de site Web donnée, mais le problème est qu'il duplique l'URL lors de l'exploration. J'ai besoin d'une liste unique et complète d'URL accessibles depuis la page d'accueil du site Web.

Aidez-moi à le faire.

C:\Users\Carthaginian\Desktop\projectLink\crawler\crawler\spiders>python stacklink.py
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.7.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2feebff3115b2d5b
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: b27fd364782f9b57
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-22 14:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.025426,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 695429),
 'log_count/INFO': 19,
 'start_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 670003)}
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com> (referer: None)
[parse] url: https://www.copperpodip.com
[+] link: https://www.copperpodip.com/due-diligence
----------Full Link: https://www.copperpodip.com/due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
----------Full Link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/leadership
----------Full Link: https://www.copperpodip.com/leadership
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
----------Full Link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
----------Full Link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/prior-art-search
----------Full Link: https://www.copperpodip.com/prior-art-search
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
----------Full Link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
----------Full Link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/patent-monetization
----------Full Link: https://www.copperpodip.com/patent-monetization
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
----------Full Link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/privacy-policy
----------Full Link: https://www.copperpodip.com/privacy-policy
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/ip-news
----------Full Link: https://www.copperpodip.com/ip-news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com
----------Full Link: https://www.copperpodip.com
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/contact-us
----------Full Link: https://www.copperpodip.com/contact-us
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/blog
----------Full Link: https://www.copperpodip.com/blog
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.linkedin.com/company/copperpod-ip
----------Full Link: https://www.linkedin.com/company/copperpod-ip
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.linkedin.com': <GET https://www.linkedin.com/company/copperpod-ip>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/source-code-review
----------Full Link: https://www.copperpodip.com/source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/request-for-samples
----------Full Link: https://www.copperpodip.com/request-for-samples
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
----------Full Link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-due-diligence
----------Full Link: https://www.copperpodip.com/case-study-due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
----------Full Link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siliconindiamagazine.com': <GET https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/reverse-engineering
----------Full Link: https://www.copperpodip.com/reverse-engineering
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/careers
----------Full Link: https://www.copperpodip.com/careers
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/infringement-claim-charts
----------Full Link: https://www.copperpodip.com/infringement-claim-charts
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-infringement-analysis
----------Full Link: https://www.copperpodip.com/case-study-infringement-analysis
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
----------Full Link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-source-code-review
----------Full Link: https://www.copperpodip.com/case-study-source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
----------Full Link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/due-diligence> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/due-diligence
[+] link: https://www.facebook.com/copperpodip/
----------Full Link: https://www.facebook.com/copperpodip/
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/copperpodip/>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/due-diligence>
{'url': 'https://www.copperpodip.com/due-diligence'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
[+] link: https://www.copperpodip.com/blog/date/2017-03
----------Full Link: https://www.copperpodip.com/blog/date/2017-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/emergingtech
----------Full Link: https://www.copperpodip.com/blog/tag/emergingtech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-09
----------Full Link: https://www.copperpodip.com/blog/date/2018-09
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-02
----------Full Link: https://www.copperpodip.com/blog/date/2018-02
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/itc
----------Full Link: https://www.copperpodip.com/blog/tag/itc
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/intel
----------Full Link: https://www.copperpodip.com/blog/tag/intel
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/iot
----------Full Link: https://www.copperpodip.com/blog/tag/iot
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/nokia
----------Full Link: https://www.copperpodip.com/blog/tag/nokia
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fintech
----------Full Link: https://www.copperpodip.com/blog/tag/fintech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/patents
----------Full Link: https://www.copperpodip.com/blog/tag/patents
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/uber
----------Full Link: https://www.copperpodip.com/blog/tag/uber
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/home%20automation
----------Full Link: https://www.copperpodip.com/blog/tag/home%20automation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/duediligence
----------Full Link: https://www.copperpodip.com/blog/tag/duediligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fake%20news
----------Full Link: https://www.copperpodip.com/blog/tag/fake%20news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/paypal
----------Full Link: https://www.copperpodip.com/blog/tag/paypal
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/virtualreality
----------Full Link: https://www.copperpodip.com/blog/tag/virtualreality
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
----------Full Link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/trademarks
----------Full Link: https://www.copperpodip.com/blog/tag/trademarks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/qualcomm
----------Full Link: https://www.copperpodip.com/blog/tag/qualcomm
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/Apple
----------Full Link: https://www.copperpodip.com/blog/tag/Apple
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/5g
----------Full Link: https://www.copperpodip.com/blog/tag/5g
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/code%20review
----------Full Link: https://www.copperpodip.com/blog/tag/code%20review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/licensing
----------Full Link: https://www.copperpodip.com/blog/tag/licensing
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/internet%20of%20things
----------Full Link: https://www.copperpodip.com/blog/tag/internet%20of%20things
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-03
----------Full Link: https://www.copperpodip.com/blog/date/2018-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/technology
----------Full Link: https://www.copperpodip.com/blog/tag/technology

Exécutez simplement ce code tel quel est. sortie du code ci-dessus

Sortie de l'exécution du code:

############################################################################################
import scrapy
urlset = set()
class MySpider(scrapy.Spider):

    name = "MySpider"

    def __init__(self, allowed_domains=None, start_urls=None):
        super().__init__()

        if allowed_domains is None:
            self.allowed_domains = []
        else:
            self.allowed_domains = allowed_domains
        if start_urls is None:
            self.start_urls = []
        else:
            self.start_urls = start_urls  

    def parse(self, response):
        print('[parse] url:', response.url)
        # extract all links from page
        all_links = response.xpath('*//a/@href').extract()
        all_links = set(all_links)
        all_links = list(all_links)
        # iterate over links
        for link in all_links:
            if("https:" in link or "http:" in link):
                    if(link not in urlset):
                        print('[+] link:', link)

                        full_link = response.urljoin(link)
                        urlset.add(full_link)
                        print("----------Full Link: "+full_link)
                        request = response.follow(full_link, callback=self.parse)
                        yield request
                        yield {'url': response.url}                        



    # def print_this_link(self, response):
    #     print('[print_this_link] url:', response.url)
    #     title = response.xpath('//title/text()').get() # get() will replace extract() in the future
    #     # text = response.xpath('//body/text()').get()
    #     yield {'url': response.url, 'title': title}


# --- run without creating project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'file://C:/Tmp1/output.csv', # 
})
c.crawl(MySpider)
c.crawl(MySpider, allowed_domains=["copperpodip.com"], start_urls=["https://www.copperpodip.com"])
c.start()

python scrapy

4 commentaires

Utilisez une liste. si l'URL n'est pas dans la liste liée: liste liée.append (url)

L'ensemble python résout le même objectif que la liste de liens

Avez-vous vraiment besoin de créer un ensemble (), bien que vous ayez déjà vos liens dans votre all_links ? Vous pouvez simplement exécuter if ("https:" pas dans le lien ou "http:" pas dans le lien): et supprimer le reste comme all_links.remove (lien)

J'utilisais en fait la propriété de set pour supprimer la visite en double de scrapy mais maintenant le problème est résolu grâce

3 Réponses :

2
votes

Je ne connais pas du tout scrapy, mais ne pouvez-vous pas utiliser une liste (ou un ensemble, c'est plus simple) et vérifier s'il existe déjà un enregistrement du même lien?

all_links = set(all_links)
all_links = list(all_links)

Edit: vous semblez déjà utiliser un ensemble, que vous changez pour une liste juste après:

link_list = list
if link not in link_list :
   link_list.append(link)

1 commentaires

Exactement @Oka. Une liste c'est mieux

1
votes

Cela fonctionnera car Scrapy traitera les URL en double pour vous:

def parse(self, response):
    yield {'url': response.url}     
    print('[parse] url:', response.url)
    # extract all links from page
    all_links = response.xpath('*//a/@href').extract()
    # iterate over links
    for link in all_links:
        if("https:" in link or "http:" in link):
            print('[+] link:', link)
            full_link = response.urljoin(link)
            print("----------Full Link: "+full_link)
            request = response.follow(full_link, callback=self.parse)
            yield request

1 commentaires

peut-être devriez-vous également expliquer un peu, car la vôtre est une meilleure réponse.

2
votes

scrapy devrait automatiquement éviter de revoir les URL précédemment visitées (en utilisant classe dupefilter ). Ce que vous voulez faire ici n'est pas tout à fait clair pour moi, mais je pense que vous voulez explorer le site Web et trouver tous les liens? Dans ce cas, vous devez déplacer votre deuxième rendement ( yield {'url': response.url} ) plus tôt dans votre fonction d'analyse.

Je pense que ce qui suit vous donne ce que vous voulez:

scrapy runspider scrapy_test.py -o test.json

si je l'exécute en tant que:

import scrapy

class MySpider(scrapy.Spider):

    name = "copperpodip"
    start_urls = ["https://copperpodip.com"]
    allowed_domains = ["copperpodip.com"]

    def parse(self, response):
        yield {'url': response.url}
        for link in response.xpath('*//a/@href').getall():
            yield response.follow(link, self.parse)

alors le fichier json résultant ne contient aucun lien en double.

2 commentaires

J'ai une question que si j'écris webpagetext = response.xpath ('// body / text ()'). Getall () et que j'applique une expression régulière pour supprimer les espaces vides dans le texte, cela s'affiche une erreur.

Ok - posez peut-être une autre question expliquant clairement ce que vous essayez de faire exactement et l'erreur que vous obtenez?