J'ai le code suivant qui explore l'adresse de site Web donnée, mais le problème est qu'il duplique l'URL lors de l'exploration. J'ai besoin d'une liste unique et complète d'URL accessibles depuis la page d'accueil du site Web.
Aidez-moi à le faire.
C:\Users\Carthaginian\Desktop\projectLink\crawler\crawler\spiders>python stacklink.py 2019-08-22 14:40:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot) 2019-08-22 14:40:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.7.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0 2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'} 2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2feebff3115b2d5b 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines: [] 2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened 2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'} 2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: b27fd364782f9b57 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines: [] 2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened 2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024 2019-08-22 14:40:17 [scrapy.core.engine] INFO: Closing spider (finished) 2019-08-22 14:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'elapsed_time_seconds': 0.025426, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 695429), 'log_count/INFO': 19, 'start_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 670003)} 2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider closed (finished) 2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com> (referer: None) [parse] url: https://www.copperpodip.com [+] link: https://www.copperpodip.com/due-diligence ----------Full Link: https://www.copperpodip.com/due-diligence 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks ----------Full Link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/leadership ----------Full Link: https://www.copperpodip.com/leadership 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses ----------Full Link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting ----------Full Link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/prior-art-search ----------Full Link: https://www.copperpodip.com/prior-art-search 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator ----------Full Link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation ----------Full Link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/patent-monetization ----------Full Link: https://www.copperpodip.com/patent-monetization 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator ----------Full Link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/privacy-policy ----------Full Link: https://www.copperpodip.com/privacy-policy 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/ip-news ----------Full Link: https://www.copperpodip.com/ip-news 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com ----------Full Link: https://www.copperpodip.com 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/contact-us ----------Full Link: https://www.copperpodip.com/contact-us 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/blog ----------Full Link: https://www.copperpodip.com/blog 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.linkedin.com/company/copperpod-ip ----------Full Link: https://www.linkedin.com/company/copperpod-ip 2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.linkedin.com': <GET https://www.linkedin.com/company/copperpod-ip> 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/source-code-review ----------Full Link: https://www.copperpodip.com/source-code-review 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/request-for-samples ----------Full Link: https://www.copperpodip.com/request-for-samples 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court ----------Full Link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/case-study-due-diligence ----------Full Link: https://www.copperpodip.com/case-study-due-diligence 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28 ----------Full Link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28 2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siliconindiamagazine.com': <GET https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28> 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/reverse-engineering ----------Full Link: https://www.copperpodip.com/reverse-engineering 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/careers ----------Full Link: https://www.copperpodip.com/careers 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/infringement-claim-charts ----------Full Link: https://www.copperpodip.com/infringement-claim-charts 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/case-study-infringement-analysis ----------Full Link: https://www.copperpodip.com/case-study-infringement-analysis 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security ----------Full Link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/case-study-source-code-review ----------Full Link: https://www.copperpodip.com/case-study-source-code-review 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} [+] link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components ----------Full Link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com> {'url': 'https://www.copperpodip.com'} 2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/due-diligence> (referer: https://www.copperpodip.com) [parse] url: https://www.copperpodip.com/due-diligence [+] link: https://www.facebook.com/copperpodip/ ----------Full Link: https://www.facebook.com/copperpodip/ 2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/copperpodip/> 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/due-diligence> {'url': 'https://www.copperpodip.com/due-diligence'} 2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> (referer: https://www.copperpodip.com) [parse] url: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses [+] link: https://www.copperpodip.com/blog/date/2017-03 ----------Full Link: https://www.copperpodip.com/blog/date/2017-03 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/emergingtech ----------Full Link: https://www.copperpodip.com/blog/tag/emergingtech 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/date/2018-09 ----------Full Link: https://www.copperpodip.com/blog/date/2018-09 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/date/2018-02 ----------Full Link: https://www.copperpodip.com/blog/date/2018-02 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/itc ----------Full Link: https://www.copperpodip.com/blog/tag/itc 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/intel ----------Full Link: https://www.copperpodip.com/blog/tag/intel 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/iot ----------Full Link: https://www.copperpodip.com/blog/tag/iot 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/nokia ----------Full Link: https://www.copperpodip.com/blog/tag/nokia 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/fintech ----------Full Link: https://www.copperpodip.com/blog/tag/fintech 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/patents ----------Full Link: https://www.copperpodip.com/blog/tag/patents 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/uber ----------Full Link: https://www.copperpodip.com/blog/tag/uber 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/home%20automation ----------Full Link: https://www.copperpodip.com/blog/tag/home%20automation 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/duediligence ----------Full Link: https://www.copperpodip.com/blog/tag/duediligence 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/fake%20news ----------Full Link: https://www.copperpodip.com/blog/tag/fake%20news 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/paypal ----------Full Link: https://www.copperpodip.com/blog/tag/paypal 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/virtualreality ----------Full Link: https://www.copperpodip.com/blog/tag/virtualreality 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/author/Arjunvir-Singh ----------Full Link: https://www.copperpodip.com/blog/author/Arjunvir-Singh 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/trademarks ----------Full Link: https://www.copperpodip.com/blog/tag/trademarks 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/qualcomm ----------Full Link: https://www.copperpodip.com/blog/tag/qualcomm 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/Apple ----------Full Link: https://www.copperpodip.com/blog/tag/Apple 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/5g ----------Full Link: https://www.copperpodip.com/blog/tag/5g 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/code%20review ----------Full Link: https://www.copperpodip.com/blog/tag/code%20review 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/licensing ----------Full Link: https://www.copperpodip.com/blog/tag/licensing 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/internet%20of%20things ----------Full Link: https://www.copperpodip.com/blog/tag/internet%20of%20things 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/date/2018-03 ----------Full Link: https://www.copperpodip.com/blog/date/2018-03 2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> {'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'} [+] link: https://www.copperpodip.com/blog/tag/technology ----------Full Link: https://www.copperpodip.com/blog/tag/technology
Exécutez simplement ce code tel quel est. sortie du code ci-dessus
Sortie de l'exécution du code:
############################################################################################ import scrapy urlset = set() class MySpider(scrapy.Spider): name = "MySpider" def __init__(self, allowed_domains=None, start_urls=None): super().__init__() if allowed_domains is None: self.allowed_domains = [] else: self.allowed_domains = allowed_domains if start_urls is None: self.start_urls = [] else: self.start_urls = start_urls def parse(self, response): print('[parse] url:', response.url) # extract all links from page all_links = response.xpath('*//a/@href').extract() all_links = set(all_links) all_links = list(all_links) # iterate over links for link in all_links: if("https:" in link or "http:" in link): if(link not in urlset): print('[+] link:', link) full_link = response.urljoin(link) urlset.add(full_link) print("----------Full Link: "+full_link) request = response.follow(full_link, callback=self.parse) yield request yield {'url': response.url} # def print_this_link(self, response): # print('[print_this_link] url:', response.url) # title = response.xpath('//title/text()').get() # get() will replace extract() in the future # # text = response.xpath('//body/text()').get() # yield {'url': response.url, 'title': title} # --- run without creating project and save in `output.csv` --- from scrapy.crawler import CrawlerProcess c = CrawlerProcess({ 'USER_AGENT': 'Mozilla/5.0', # save in file as CSV, JSON or XML 'FEED_FORMAT': 'csv', # csv, json, xml 'FEED_URI': 'file://C:/Tmp1/output.csv', # }) c.crawl(MySpider) c.crawl(MySpider, allowed_domains=["copperpodip.com"], start_urls=["https://www.copperpodip.com"]) c.start()
3 Réponses :
Je ne connais pas du tout scrapy, mais ne pouvez-vous pas utiliser une liste (ou un ensemble, c'est plus simple) et vérifier s'il existe déjà un enregistrement du même lien?
all_links = set(all_links) all_links = list(all_links)
Edit: vous semblez déjà utiliser un ensemble, que vous changez pour une liste juste après:
link_list = list if link not in link_list : link_list.append(link)
Exactement @Oka. Une liste c'est mieux
Cela fonctionnera car Scrapy traitera les URL en double pour vous:
def parse(self, response): yield {'url': response.url} print('[parse] url:', response.url) # extract all links from page all_links = response.xpath('*//a/@href').extract() # iterate over links for link in all_links: if("https:" in link or "http:" in link): print('[+] link:', link) full_link = response.urljoin(link) print("----------Full Link: "+full_link) request = response.follow(full_link, callback=self.parse) yield request
peut-être devriez-vous également expliquer un peu, car la vôtre est une meilleure réponse.
scrapy devrait automatiquement éviter de revoir les URL précédemment visitées (en utilisant classe dupefilter ). Ce que vous voulez faire ici n'est pas tout à fait clair pour moi, mais je pense que vous voulez explorer le site Web et trouver tous les liens? Dans ce cas, vous devez déplacer votre deuxième rendement ( yield {'url': response.url}
) plus tôt dans votre fonction d'analyse.
Je pense que ce qui suit vous donne ce que vous voulez:
scrapy runspider scrapy_test.py -o test.json
si je l'exécute en tant que:
import scrapy class MySpider(scrapy.Spider): name = "copperpodip" start_urls = ["https://copperpodip.com"] allowed_domains = ["copperpodip.com"] def parse(self, response): yield {'url': response.url} for link in response.xpath('*//a/@href').getall(): yield response.follow(link, self.parse)
alors le fichier json résultant ne contient aucun lien en double.
J'ai une question que si j'écris webpagetext = response.xpath ('// body / text ()'). Getall ()
et que j'applique une expression régulière pour supprimer les espaces vides dans le texte, cela s'affiche une erreur.
Ok - posez peut-être une autre question expliquant clairement ce que vous essayez de faire exactement et l'erreur que vous obtenez?
Utilisez une liste.
si l'URL n'est pas dans la liste liée: liste liée.append (url)
L'ensemble python résout le même objectif que la liste de liens
Avez-vous vraiment besoin de créer un ensemble (), bien que vous ayez déjà vos liens dans votre
all_links
? Vous pouvez simplement exécuterif ("https:" pas dans le lien ou "http:" pas dans le lien):
et supprimer le reste commeall_links.remove (lien)
J'utilisais en fait la propriété de set pour supprimer la visite en double de scrapy mais maintenant le problème est résolu grâce