I'm trying to scrape addresses out of some HTML elements using the BeautifulSoup library. My intention is to grab each address up to the last "County". The problem I'm facing is that "County" appears twice in every address, so I can't get my script to work.
Sources of the three addresses:

    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
    </div>
    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
    </div>
    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
    </div>

This is how the items appear in there:

    ['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']
    ['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']
    ['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']

Expected output:

    Business Address: 39829 County Road 452 Leesburg , FL 32788
    Business Address: 28 County Road 884 Rainsville , AL 35986
    Business Address: 650 County Road 375 Jarrell , TX 76537

What I've tried so far:

    from bs4 import BeautifulSoup

    html = """
    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
    </div>
    """

    soup = BeautifulSoup(html,"lxml")
    address = []
    for i in soup.select_one(".bizgrid_hdr_address"):
        if not i.string:continue
        if 'County' in i.string.strip():break
        address.append(i.string.strip())
    print(' '.join(address).strip())
Unfortunately, the attempt above only produces "Business Address:" because it hits the first "County" and breaks out of the loop, whereas my goal is to grab everything up to the last "County".
How can I capture the desired portion of the address?
3 Answers:
I haven't tested the code, but the idea is to use a flag. The first "County" encountered sets the flag to 1, and the second one breaks out of the loop.
    ...
    soup = BeautifulSoup(html,"lxml")
    address = []
    flag = 0
    for i in soup.select_one(".bizgrid_hdr_address"):
        if not i.string: continue
        # Second "County" seen: stop before appending it.
        if 'County' in i.string.strip() and flag:
            break
        # First "County" seen: remember it and keep going.
        if 'County' in i.string.strip():
            flag = 1
        address.append(i.string.strip())
    print(' '.join(address).strip())
What if there is only a single "County", located near the end of a long address? I would still like to get the address up to that "County".
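One way to cover both cases raised in this comment (one "County" or several) is to collect all the text pieces first and then cut at the position of the last piece containing "County", instead of counting occurrences with a flag. The following is only a sketch under some assumptions: the markup is a trimmed copy of the question's HTML, the second block is a made-up address with a single "County", the helper name address_before_last_county is hypothetical, and "html.parser" is used in place of "lxml" just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup. The second block is a made-up address that
# contains "County" only once, to exercise the single-occurrence case.
html = """
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a href="/fl/lake/leesburg">Leesburg</a>, <a href="/fl">FL</a> 32788<br><a href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a href="http://www.ecosciencesllc.com/">Eco Sciences, LLC Website</a></div>
</div>
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
12 Main Street<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_before_last_county(div):
    """Join the text pieces of `div` that come before the last 'County' piece."""
    # stripped_strings yields every non-empty text node, already stripped.
    parts = list(div.stripped_strings)
    county_positions = [i for i, part in enumerate(parts) if "County" in part]
    if county_positions:
        # Keep everything strictly before the last piece mentioning "County".
        parts = parts[:county_positions[-1]]
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
results = [address_before_last_county(div)
           for div in soup.select(".bizgrid_hdr_address")]
for line in results:
    print(line)
```

Because the cut is made at the last match rather than the second one, the same function should behave the same way whether "County" shows up once, twice, or more often in a block.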
You don't need that semicolon in flag = 0;
Also, in Python it's good practice to put continue and break on their own indented lines.
@MITHU are there cases with 1, 2 or more occurrences of "County"?
[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
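The code that produced the list above is not included here. As a rough illustration only, pairs of this shape could be built by reading fixed positions from each block's text pieces. This sketch assumes every block follows the same layout (label, street, city, comma, state, ZIP), uses a trimmed copy of the question's markup, and introduces a hypothetical helper named street_and_city; it is not necessarily how the original output was obtained.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two of the address blocks from the question.
html = """
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a>
</div>
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def street_and_city(div):
    """Return (street, 'City, ST') from one address block."""
    # Assumed layout of the pieces: [label, street, city, ",", state, zip, ...]
    parts = list(div.stripped_strings)
    street, city, state = parts[1], parts[2], parts[4]
    return street, f"{city}, {state}"


soup = BeautifulSoup(html, "html.parser")
pairs = [street_and_city(div) for div in soup.select(".bizgrid_hdr_address")]
print(pairs)
```

Position-based indexing like this is fragile: a block with an extra line (a suite number, say) would shift every index, so it is only reasonable when the markup is known to be uniform.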
I don't know if this will work for a larger chunk of the HTML, but the word "Website" appears in the text of each website link, so you can filter on that.
For example:

    from bs4 import BeautifulSoup

    html = """<div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
    </div>
    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
    </div>
    <div class="col_biz bizgrid_hdr_address">
        <strong>Business Address:</strong><br>
        28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
    </div>
    """

    output = []
    for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
        for item in div:
            # Keep only children that carry a single non-blank string.
            if item.string and item.string.strip():
                text = item.string.strip()
                # Drop the trailing "... Website" link.
                if "Website" in text:
                    continue
                output.append(text)

    # Each address contributes exactly 7 pieces, so regroup in chunks of 7.
    addresses = [output[i:i+7] for i in range(0, len(output), 7)]
    for address in addresses:
        print(" ".join(address).replace(" ,", ","))

This gets you:

    Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
    Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
    Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
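The "Website" filter in this answer relies on every address producing exactly 7 text pieces so the flat list can be regrouped. A possible variation, sketched here under assumptions, builds one line per block directly so no chunk size is needed. The markup is a trimmed copy of two blocks from the question, the function name address_lines and its skip_word parameter are hypothetical, and "html.parser" stands in for "lxml".

```python
from bs4 import BeautifulSoup

# Trimmed copy of two blocks: one with a website link, one without.
html = """
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_lines(markup, skip_word="Website"):
    """Return one joined address string per address block."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for div in soup.select(".bizgrid_hdr_address"):
        # Collect this block's text pieces, dropping the website link text.
        parts = [text for text in div.stripped_strings if skip_word not in text]
        # Tidy the stray space that joining leaves before the comma.
        lines.append(" ".join(parts).replace(" ,", ","))
    return lines


lines = address_lines(html)
for line in lines:
    print(line)
```

Grouping per block means an address with more or fewer pieces would not throw off the ones after it, which the fixed chunk size of 7 cannot guarantee.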
Try soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')
Either you didn't read the description or you didn't understand what I was after, @JaSON. Thanks anyway.