I was trying to scrape Amazon.com with Python's requests and BeautifulSoup libraries, but I ran into problems. I know I could use Selenium; I tried it and it worked. I'm still curious why this happened and whether there is a fix. Here is my code:
import requests
from bs4 import BeautifulSoup

# Searching python on Amazon
url = "https://www.amazon.com/s?k=Python"

# Deceiving Amazon that I am trying to reach them from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Trying to get the element I need but prints "None"
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))
Thanks in advance.
3 Answers:
Using selenium.webdriver gives you a real browser to work with. For example, the code below uses the Google Chrome web driver, then gets the HTML of the results page from driver.page_source.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
from bs4 import BeautifulSoup as Soup

options = ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = Chrome(options=options)
driver.get("https://www.amazon.com/s?k=Python")
html = driver.page_source  # HTML after the browser has loaded the page
driver.quit()  # close the browser once the HTML is captured
soup = Soup(html, "html.parser")  # name the parser explicitly
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))

Output:

<div class="s-main-slot s-result-list s-search-results sg-row">
<div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1593279280" data-component-id="6" data-component-type="s-search-result" data-index="0" data-uuid="c5f5837a-1f2e-4243-a520-a1936aac014e"><div class="sg-col-inner">
... etc.

For setup, see the Selenium Python installation instructions.
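Once the container is in the HTML, the individual results can be read from it. Here is a minimal sketch, assuming each result keeps the data-component-type="s-search-result" and data-asin attributes visible in the output above. SAMPLE_HTML is a shortened, hypothetical stand-in for driver.page_source, extract_results is an illustrative helper name, and the second title is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for driver.page_source.
# Only the attribute names and the two ASINs come from the output above.
SAMPLE_HTML = """
<div class="s-main-slot s-result-list s-search-results sg-row">
  <div class="s-result-item s-asin" data-asin="1593279280" data-component-type="s-search-result" data-index="0">
    <h2><a href="#"><span>Python Crash Course, 2nd Edition</span></a></h2>
  </div>
  <div class="s-result-item s-asin" data-asin="1449355730" data-component-type="s-search-result" data-index="1">
    <h2><a href="#"><span>Placeholder title</span></a></h2>
  </div>
</div>
"""


def extract_results(html):
    """Return a list of {'asin', 'title'} dicts, one per search result."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Match on the data attribute rather than on the long class string
    for item in soup.find_all("div", attrs={"data-component-type": "s-search-result"}):
        title_tag = item.find("h2")
        results.append({
            "asin": item.get("data-asin"),
            "title": title_tag.get_text(strip=True) if title_tag else None,
        })
    return results


results = extract_results(SAMPLE_HTML)
for result in results:
    print(result["asin"], result["title"])
```

With Selenium, you would pass driver.page_source to extract_results instead of SAMPLE_HTML. Amazon's markup changes over time, so the attribute names may need adjusting.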
Change the parser to lxml and it should work.
<div class="s-main-slot s-result-list s-search-results sg-row"> <div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1593279280" data-component-type="s-search-result" data-index="0" data-uuid="ae6080d7-b07e-4558-b38f-613931584787"><div class="sg-col-inner"> <span cel_widget_id="MAIN-SEARCH_RESULTS" class="celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results"> <div class="s-include-content-margin s-border-bottom s-latency-cf-section"> <div class="a-section a-spacing-medium"> <div class="sg-row"> <div class="a-section a-spacing-micro s-min-height-small"> <a class="a-link-normal" href="/gp/bestsellers/books/285856/ref=sr_bs_0_285856_1"> <span class="rush-component" data-component-props='{"badgeType":"best-seller","asin":"1593279280"}' data-component-type="s-status-badge-component"> <div class="a-row a-badge-region"><span aria-labelledby="1593279280-best-seller-label 1593279280-best-seller-supplementary" class="a-badge" data-a-badge-supplementary-position="right" data-a-badge-type="status" id="1593279280-best-seller" tabindex="0"><span aria-hidden="true" class="a-badge-label" data-a-badge-color="sx-orange" id="1593279280-best-seller-label"><span class="a-badge-label-inner a-text-ellipsis"> <span class="a-badge-text" data-a-badge-color="sx-cloud">Best Seller</span> </span></span><span aria-hidden="true" class="a-badge-supplementary-text a-text-ellipsis" id="1593279280-best-seller-supplementary">in Python Programming</span></span></div> </span> </a> </div> </div> <div class="sg-row"> <div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner"> <div class="a-section a-spacing-none"> <span class="rush-component" data-component-type="s-product-image"> <a class="a-link-normal" 
href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&keywords=Python&qid=1592423942&sr=8-1"> <div class="a-section aok-relative s-image-fixed-height"> <img alt="Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming" class="s-image" data-image-index="0" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" src="https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY545_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY654_QL65_.jpg 3x"/> </div> </a> </span> </div> </div></div> <div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28"><div class="sg-col-inner"> <div class="sg-row"> <div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-32 sg-col-12-of-20 sg-col-12-of-36 sg-col sg-col-12-of-24 sg-col-12-of-28"><div class="sg-col-inner"> <div class="a-section a-spacing-none"> <h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2"> <a class="a-link-normal a-text-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&keywords=Python&qid=1592423942&sr=8-1"> <span class="a-size-medium a-color-base a-text-normal" dir="auto">Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming</span> </a> </h2> <div class="a-row a-size-base a-color-secondary"><span class="a-size-base" dir="auto">by </span> <a class="a-size-base a-link-normal" href="/Eric-Matthes/e/B01DPU378I?ref=sr_ntt_srch_lnk_1&qid=1592423942&sr=8-1"> Eric Matthes </a> <span class="a-letter-space"></span><span class="a-size-base a-color-secondary" dir="auto"> | </span><span 
class="a-letter-space"></span><span class="a-size-base a-color-secondary a-text-normal" dir="auto">May 3, 2019</span></div> </div> <div class="a-section a-spacing-none a-spacing-top-micro"> <div class="a-row a-size-small"> <span aria-label="4.6 out of 5 stars"> <span class="a-declarative" data-a-popover='{"max-width":"700","closeButton":false,"position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=1593279280&ref=acr_search__popover&contextId=search"}' data-action="a-popover"> <a class="a-popover-trigger a-declarative" href="javascript:void(0)"><i class="a-icon a-icon-star-small a-star-small-4-5 aok-align-bottom"><span class="a-icon-alt">4.6 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></a> </span> </span> <span aria-label="555"> <a class="a-link-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&keywords=Python&qid=1592423942&sr=8-1#customerReviews"> <span class="a-size-base" dir="auto">555</span> </a> </span> </div> </div> </div></div> </div> <div class="sg-row"> <div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner"> <div class="a-section a-spacing-none a-spacing-top-small"> <div class="a-row a-size-base a-color-base"> <a class="a-size-base a-link-normal a-text-bold" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&keywords=Python&qid=1592423942&sr=8-1"> Paperback </a> </div><div class="a-row a-size-base a-color-base"><div class="a-row"> <a class="a-size-base a-link-normal a-text-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&keywords=Python&qid=1592423942&sr=8-1"> <span class="a-price" data-a-color="base" data-a-size="l"><span class="a-offscreen">$22.99</span><span aria-hidden="true"><span class="a-price-symbol">$</span><span class="a-price-whole">22<span class="a-price-decimal">.</span></span><span 
class="a-price-fraction">99</span></span></span> <span class="a-price a-text-price" data-a-color="secondary" data-a-size="b" data-a-strike="true"><span class="a-offscreen">$39.95</span><span aria-hidden="true">$39.95</span></span> </a> </div></div><div class="a-row a-size-small a-color-secondary"><span dir="auto">Get 3 for the price of 2</span></div> </div> <div class="a-section a-spacing-none a-spacing-top-micro"> <div class="a-row a-size-base a-color-secondary s-align-children-center"><span class="a-size-small a-color-secondary" dir="auto">Ships to United Kingdom</span></div> </div> <div class="a-section a-spacing-none a-spacing-top-mini"> <div class="a-row a-size-base a-color-secondary"><span class="a-size-base a-color-secondary" dir="auto">More Buying Choices</span><br/><span class="a-color-base" dir="auto">$22.82</span><span class="a-letter-space"></span> <a class="a-link-normal" href="/gp/offer-listing/1593279280/ref=sr_1_1?keywords=Python&qid=1592423942&sr=8-1&dchild=1"> (39 used & new offers) </a> </div> </div> <div class="a-section a-spacing-none a-spacing-top-mini"> <div class="a-row"><div class="a-row a-spacing-mini"><hr aria-hidden="true" class="a-spacing-mini a-divider-normal"/><div class="a-row a-size-base a-color-base"> <a class="a-size-base a-link-normal a-text-bold" href="/Python-Crash-Course-Eric-Matthes-ebook/dp/B07J4521M3/ref=sr_1_1?keywords=Python&qid=1592423942&sr=8-1"> Kindle </a> </div><div class="a-row a-size-base a-color-base"><div class="a-row"> <a class="a-size-base a-link-normal a-text-normal" href="/Python-Crash-Course-Eric-Matthes-ebook/dp/B07J4521M3/ref=sr_1_1?keywords=Python&qid=1592423942&sr=8-1"> <span class="a-price" data-a-color="base" data-a-size="l"><span class="a-offscreen">$23.99</span><span aria-hidden="true"><span class="a-price-symbol">$</span><span class="a-price-whole">23<span class="a-price-decimal">.</span></span><span class="a-price-fraction">99</span></span></span> <span class="a-price a-text-price" 
data-a-color="secondary" data-a-size="b" data-a-strike="true"><span class="a-offscreen">$39.95</span><span aria-hidden="true">$39.95</span></span> </a> </div></div></div></div> </div> </div></div> <div class="sg-col-4-of-12 sg-col-8-of-28 sg-col-4-of-16 sg-col-8-of-32 sg-col sg-col-8-of-20 sg-col-8-of-36 sg-col-8-of-24"><div class="sg-col-inner"> </div></div> </div> <div class="sg-row"> <div class="sg-col-20-of-24 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-8-of-12 sg-col-12-of-16 sg-col-24-of-28"><div class="sg-col-inner"> </div></div> </div> <div class="sg-row"> <div class="sg-col-20-of-24 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-8-of-12 sg-col-12-of-16 sg-col-24-of-28"><div class="sg-col-inner"> </div></div> </div> </div></div> </div> </div> </div> </span> </div></div> <div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1449355730" data-component-type="s-search-result" data-index="1" data-uuid="047b9c10-2a93-4895-97f7-83778651c3f6"><div class="sg-col-inner"> <span cel_widget_id="MAIN-SEARCH_RESULTS" class="celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results"> <div class="s-include-content-margin s-border-bottom s-latency-cf-section"> <div class="a-section a-spacing-medium"> <div class="sg-row"> <div class="a-section a-spacing-micro s-min-height-small"> <a class="a-link-normal" href="/gp/bestsellers/books/132561011/ref=sr_bs_1_132561011_1"> <span class="rush-component" data-component-props='{"badgeType":"best-seller","asin":"1449355730"}' data-component-type="s-status-badge-component"> <div class="a-row a-badge-region"><span aria-labelledby="1449355730-best-seller-label 1449355730-best-seller-supplementary" class="a-badge" data-a-badge-supplementary-position="right" data-a-badge-type="status" id="1449355730-best-seller" tabindex="0"><span aria-hidden="true" class="a-badge-label" 
data-a-badge-color="sx-orange" id="1449355730-best-seller-label"><span class="a-badge-label-inner a-text-ellipsis"> <span class="a-badge-text" data-a-badge-color="sx-cloud">Best Seller</span> </span></span><span aria-hidden="true" class="a-badge-supplementary-text a-text-ellipsis" id="1449355730-best-seller-supplementary">in Functional Software Programming</span></span></div> </span> </a> </div> </div>
The HTML above is the output on my console, produced with this code:

import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/s?k=Python"

# Deceiving Amazon that I am trying to reach them from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")  # lxml instead of html.parser

# Look up the search results container
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))
@eusoubrasileiro You are probably getting a captcha page. Try adding the HTTP header 'Accept-Language': 'en-US, en;q=0.5' in addition to User-Agent.
Add that to your answer, you're right! Also make it clear to @MátéVeres that it didn't work at first for you either.
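The header suggestion from the comments above could be sketched as follows, together with a rough check for whether the response is a captcha page instead of search results. The helper names build_headers and looks_like_captcha are made up for illustration. The marker strings are an assumption about what Amazon's captcha page contains; they are not taken from this thread and may change.

```python
# Headers suggested in the comments: User-Agent plus Accept-Language
def build_headers(user_agent):
    return {
        "User-Agent": user_agent,
        "Accept-Language": "en-US, en;q=0.5",
    }


# Assumption: these strings appear on the captcha page. Verify them against
# a real response before relying on this check.
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
)


def looks_like_captcha(html):
    """Return True if the HTML contains any of the assumed captcha markers."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


headers = build_headers(
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2228.0 Safari/537.36"
)

# Usage with requests (left commented out so the sketch runs offline):
# page = requests.get("https://www.amazon.com/s?k=Python", headers=headers)
# if looks_like_captcha(page.text):
#     print("Got a captcha page, not search results")
```

If soup.find(...) prints None, running looks_like_captcha on page.text is a quick way to tell a captcha page apart from a parser problem.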
A good solution for this using requests and BeautifulSoup is:
import requests
from bs4 import BeautifulSoup as bs

# Browser-like request headers, including the session cookies as a raw header
headers = {
    'authority': 'www.amazon.com',
    'cache-control': 'max-age=0',
    'rtt': '300',
    'downlink': '1.35',
    'ect': '3g',
    'sec-ch-ua': '"Google Chrome"; v="83"',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'aws-priv=eyJ2IjoxLCJldSI6MCwic3QiOjB9; session-id=139-7350741-1081713; ubid-main=135-9894765-6184621; lc-main=en_US; s_fid=0A4730DDD06B62E4-1DB478AB62143F35; regStatus=pre-register; x-main=hd2N9IEBuVL7il1dbkhEEHTQSf4Q7uviwjc2eikr0hRGGOyI2RYIiRsk3GvDKLSx; at-main=Atza|IwEBIJdoAZ4Y6j2IIGvC29t1ha634aK-p2kAl8rHhQRCSGMSU_nwQvM6fakAbYEjpVLPU4Jj0TwKvX70d6QnlouKPh0QwpHJG8rHUNVb-gmhS9shHM8fCJk45r1XW2FOSpLoM1iAO9kYIpOoW2M5We9xfdqlLuQBB-D5fQeO5Vqew4RnHesPNZuF4DQNlcqL7wrGjDY1JQKzlzARfATAuwaCy4jMD5bNmxpcWtTgNGrTtLpGv1Y-4Mnx2axxQYFgwpRNv_sPNZrMAfHdU7MX67HbyPyV3V21KAl8QNl0xE-lNl3myxnfyWH68Z5D-j501S7HWzkKxopy3SfGuwwZTjSVSVlnH4RmTwvEnW8W3tndcX6X1ETysYYXmO7TudIjtq7aUZqPBJe_MViePcWL3OV4q2b5; sess-at-main="TjcvTeXAA2dP6HOMGcG/n+Cdkr+peDBlNMOvfBz6oE0="; sst-main=Sst1|PQGR5AF9x4yS-iMft3B9aBzJC8v-e4M1kmB_3KS0pxtVTj1cH8hl3fajgigt6xEYhan-kUJuY5KNbteBgbiyDIRCs4ISve5MdRhDdoy7XKrVD1g5McZTyvdwYLfbTJbTUov51hOyPcE8BKpFL1bGpJiiJbZ0TV7Pyc6tkndogjneZATDErc4U08WE4LwPJxCiF-I-7Av4-JEfwH1ZQ81mz6rqy-K1o6bCMRRZ8kWuzrl0wobKsr4Sz0-m1K0waguIewhXNm4V4DLe8mn-_6I8_k9p9v3NiFRpp04v0Ptzw8V1ARo2U18t5f2nx54EXwHzvzOQlpeBVY2U0WpXDcKsU3C8Q; session-id-time=2082787201l; i18n-prefs=USD; x-wl-uid=1MwJyD7dRnGiVdHw1PKiwmoNP9S/0xy+3KAKCJl2fM5VOthLzEW3dzyeW4zdKAepcIxkXpJFkxWcafUXXcS0MeSyLyFoBkl3xnNPLiRK0Rq33AHw0gL3W1FDBUn9OcakOzJGVGKZRc5E=; s_vn=1614974634531%26vn%3D4; s_nr=1590823888871-Repeat; s_vnum=2022823888872%26vn%3D1; s_dslv=1590823888874; sp-cdn="L5Z9:FR"; session-token=3AIPjoIrP8ITt1e/KXLZGSlnOPpirrWotNpCpCEfNRCY9mCfAV169URMcAX8XECtxt/qJujUn66Oyz8KIFDMieNmSdzEKA0K8I4AqbzplslzVGtZ6rNg+XsX/Bdc3hxnB7tUqQhrbrtVUncdzUMN1c95vhL7p+AEog3iiDkhLch0VO+Sl8HkAdZ/63xrp0stAaUsYo1GgsOFGI8+3wJUp4CHrJnoj/0lqjCJCpgXTZfxJcfWy9KarcGAPkno+fuMQqMoShJdi8R+DZ9XmIMib1bsLwXnerZa; csm-hit=tb:GVY0F2K4G05TXW59KB9M+s-GVY0F2K4G05TXW59KB9M|1592424615451&t:1592424615452&adb:adblk_yes',
}

# Query string: search for "Python"
params = (
    ('k', 'Python'),
    ('ref', 'nb_sb_noss'),
)

response = requests.get('https://www.amazon.com/s', headers=headers, params=params)
soup = bs(response.text, 'lxml')
print(soup.find('div', class_='s-main-slot s-result-list s-search-results sg-row'))
Careful, you shouldn't pass cookies like that. Use a session instead to manage them for you.
Do you copy the headers from somewhere in your browser's network tab, or do you generate them? @ahmadfaraz
@politicalscientist I usually use a session, but yes, sometimes I copy the headers from the browser.
@AhmadFaraz, thanks! I'll read up on session (never used it before).
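For anyone else new to sessions, here is a minimal sketch of the approach suggested in the comments above, assuming the requests library. A requests.Session stores the cookies a server sets and sends them back on later requests, so no cookie string has to be pasted into the headers. The helper name make_session and the choice of headers are illustrative, not taken from the answer.

```python
import requests


def make_session(user_agent):
    """Create a session whose default headers are sent with every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent,
        "Accept-Language": "en-US, en;q=0.5",
    })
    return session


session = make_session(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
)

# Network calls are left commented out so the sketch runs offline:
# session.get("https://www.amazon.com/")  # cookies from the response land in session.cookies
# response = session.get("https://www.amazon.com/s", params={"k": "Python"})
# print(response.status_code, len(session.cookies))
```

Whether a first visit to the home page is enough to avoid the captcha is not something this thread confirms. The sketch only shows how the cookie handling moves from a hand-copied header into the session.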
The div may be filled in by JS, which does not run when you use requests + bs4. You will need a browser.
Thank you very much.
Confirmed!
soup.select("[class='s-main-slot s-result-list s-search-results sg-row']") doesn't work either.
I tried the lxml option from the answer below; it doesn't work. Anyone else seeing the same?
lxml doesn't work for me either.
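When the lookup returns nothing, it helps to check what page actually came back before changing parsers. Here is a small sketch, assuming BeautifulSoup with its built-in html.parser. It uses the class selector div.s-main-slot, which matches regardless of the order or number of other classes, unlike an exact match on the full class string. Both sample pages and the helper name describe_page are hypothetical, including the "Robot Check" title.

```python
from bs4 import BeautifulSoup


def describe_page(html):
    """Return (page title, whether the results container is present)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None
    # Class selector: matches any div that has the s-main-slot class
    container = soup.select_one("div.s-main-slot")
    return title, container is not None


# Hypothetical responses: one with the container, one without it
RESULTS_PAGE = (
    "<html><head><title>Amazon.com : Python</title></head><body>"
    '<div class="sg-row s-main-slot s-result-list s-search-results"></div>'
    "</body></html>"
)
BLOCKED_PAGE = (
    "<html><head><title>Robot Check</title></head><body>"
    "<p>No results container here</p></body></html>"
)

print(describe_page(RESULTS_PAGE))
print(describe_page(BLOCKED_PAGE))
```

With a live request you would call describe_page(page.text). If the title is not the search page's title, the server sent a different page, and no parser or selector change will find the div.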