I'm trying to scrape addresses from some HTML elements using the BeautifulSoup library. My intention is to fetch each address up to the last County. The problem I'm facing is that every address contains County twice, so I can't get my script to work.
I've tried so far:
from bs4 import BeautifulSoup
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string: continue
    if 'County' in i.string.strip(): break
    address.append(i.string.strip())
print(' '.join(address).strip())
Expected output:
Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537
This is how the strings appear inside each element:
['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']
['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']
['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']
The sources of the three addresses:
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
Unfortunately, the attempt above only produces Business Address: because it hits the first County and breaks out of the loop, whereas my goal is to stop at the last County.
How can I capture the desired portion of the address?
3 Answers:
I haven't tested this code, but the idea is to use a flag of some sort. The first County encountered sets the flag to 1, and the second one breaks out of the loop.
...
soup = BeautifulSoup(html,"lxml")
address = []
flag = 0
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string: continue
    if 'County' in i.string.strip() and flag:
        break
    if 'County' in i.string.strip():
        flag = 1
    address.append(i.string.strip())
print(' '.join(address).strip())
What if there is only a single County, located near the end of a long address? I still want to get the address up to that County.
You don't need that semicolon in flag = 0; also, in Python, it's good practice to put continue and break on their own indented lines.
@MITHU are there cases with 1, 2, or more occurrences of County?
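One way to avoid depending on how many times County appears is to collect the strings first and then cut at the last one that contains the keyword. The following is only a sketch under assumptions: the helper name up_to_last_county and its keyword parameter are hypothetical, the input is the list of strings shown in the question, and returning everything when no County is found is an arbitrary choice.

```python
def up_to_last_county(parts, keyword="County"):
    """Return the non-empty strings that precede the last one containing `keyword`."""
    # Drop None, empty, and whitespace-only entries; strip the rest.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    # Positions of every string that mentions the keyword.
    hits = [i for i, p in enumerate(cleaned) if keyword in p]
    if not hits:
        # Assumption: with no keyword at all, keep the whole list.
        return cleaned
    # Slice up to (but not including) the last occurrence.
    return cleaned[:hits[-1]]


# The strings of the first element, as listed in the question.
parts = ['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',',
         'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']

print(' '.join(up_to_last_county(parts)))
# Business Address: 39829 County Road 452 Leesburg , FL 32788
```

Because the cut is made at the last match rather than the second one, an address with a single County (or with three) should be handled the same way.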
[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
I'm not sure whether this will work for a larger chunk of the HTML, but the word Website appears in each website anchor, so you can filter on that.
With that filter, the output is:
Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
Here is the code that produces it:
from bs4 import BeautifulSoup
html = """<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
"""
output = []
for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
    for item in div:
        if item.string and item.string.strip():
            text = item.string.strip()
            if "Website" in text:
                continue
            output.append(text)
addresses = [output[i:i+7] for i in range(0, len(output), 7)]
for address in addresses:
    print(" ".join(address).replace(" ,", ","))
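The chunking by 7 above relies on every element yielding exactly seven strings. A possible variation, sketched here with assumptions, builds one address per div instead, so the count per element no longer matters. The function name addresses_per_div is hypothetical, the markup is a trimmed copy of the HTML above (title attributes removed, two elements only), and the built-in "html.parser" is swapped in for "lxml" purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup above (assumption: attributes removed for brevity).
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def addresses_per_div(markup):
    """Return one joined address string per .bizgrid_hdr_address element."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for div in soup.select(".bizgrid_hdr_address"):
        # stripped_strings yields every non-blank text node, already stripped.
        parts = [s for s in div.stripped_strings if "Website" not in s]
        results.append(" ".join(parts).replace(" ,", ","))
    return results


for line in addresses_per_div(html):
    print(line)
# Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
# Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
```

Like the code above, this keeps the trailing county name; it only drops the website link text.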
Try
soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')
Either you didn't read the description or you didn't understand what I was looking for, @JaSON. Thanks anyway.