0
votes

Can't grab desired portion of address out of long ones

I'm trying to scrape addresses from some HTML elements using the BeautifulSoup library. My intention is to grab each address up to the last County. The problem is that County appears twice in every address, so I can't get my script to work.

The sources of the three addresses:

<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>

This is what the items inside each block look like:

['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']

['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']

['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']

Expected output:

Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537

I've tried so far:

from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip():break
    address.append(i.string.strip())
print(' '.join(address).strip())

Unfortunately, the above attempt only produces Business Address: because it hits the first County and breaks out of the loop, whereas my goal is to stop at the last County.

How can I capture the desired portion of the address?

python python-3.x web-scraping beautifulsoup

2 comments

Try soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')

Either you didn't read the description or you didn't understand what I'm after, @JaSON. Thanks anyway.

3 Answers:

1
votes

I haven't tested this code, but the idea is to use a flag. The first County encountered sets the flag to 1, and the second one breaks out of the loop.

...
soup = BeautifulSoup(html,"lxml")
address = []

flag = 0
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip() and flag:
        break
    if 'County' in i.string.strip(): 
        flag = 1
    address.append(i.string.strip())
print(' '.join(address).strip())
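
A variation on the same idea, as a minimal sketch rather than a tested solution for the real site: collect the strings first, then cut at the last one containing County. This assumes the county name is always the last item that mentions County, and it also covers a block where County appears only once. The helper name address_up_to_last_county is made up for illustration, the HTML is a trimmed copy of one block from the question, and "html.parser" is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one block from the question (title attributes removed).
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a href="/fl/lake/leesburg">Leesburg</a>, <a href="/fl">FL</a> 32788<br><a href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a href="http://www.ecosciencesllc.com/">Eco Sciences, LLC Website</a></div>
</div>
"""


def address_up_to_last_county(div):
    """Hypothetical helper: join the strings that precede the last 'County' item."""
    # Non-empty strings of the direct children, in document order.
    parts = [c.string.strip() for c in div.children if c.string and c.string.strip()]
    # Positions of every item that mentions County; the last one is the cut point.
    county_positions = [i for i, part in enumerate(parts) if "County" in part]
    if county_positions:
        parts = parts[:county_positions[-1]]
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="bizgrid_hdr_address")
result = address_up_to_last_county(div)
print(result)
```

Because the cut point is found after the whole block has been read, no counter or flag is needed, and the number of County occurrences before the last one does not matter.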

3 comments

What if there is only one County, located near the end of the long address? I'd still like to get the address up to that County.

You don't need that semicolon in flag = 0; Also, in Python it's good practice to put continue and break on their own indented lines.

@MITHU are there cases with 1, 2 or more occurrences of County?

1
votes

[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
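
The code that produced this output is not included above. As a rough sketch only, and not necessarily the answerer's approach, pairs like these could be built by taking the text right after the first br as the street and the first two anchors as city and state. The helper name street_and_city is hypothetical, the HTML is a trimmed copy of two blocks from the question, and "html.parser" is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copies of two blocks from the question (title attributes removed).
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def street_and_city(div):
    """Hypothetical helper: return (street, 'City, ST') for one address block."""
    # The street line is the text node that follows the first <br>.
    street = div.find("br").next_sibling.strip()
    # The first two anchors hold the city and the state, in that order.
    city, state = (a.get_text(strip=True) for a in div.find_all("a")[:2])
    return street, f"{city}, {state}"


soup = BeautifulSoup(html, "html.parser")
pairs = [street_and_city(d) for d in soup.find_all("div", class_="bizgrid_hdr_address")]
print(pairs)
```

This relies on the position of the elements rather than on the word County, so it only holds while every block keeps the same layout.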

0 comments

1
votes

I'm not sure whether this will work on a larger chunk of the HTML, but the word Website appears in each business-link anchor, so you can filter on that.

For example, the code below prints:

Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County

The code:

from bs4 import BeautifulSoup

html = """<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
"""
output = []
for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
    for item in div:
        if item.string and item.string.strip():
            text = item.string.strip()
            if "Website" in text:
                continue
            output.append(text)

addresses = [output[i:i+7] for i in range(0, len(output), 7)]
for address in addresses:
    print(" ".join(address).replace(" ,", ","))
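
The slicing step above assumes every block contributes exactly seven strings. As a hedged sketch of an alternative, the address can be assembled per div, so a block with more or fewer items does not shift the others. The helper name div_to_address is made up for illustration, the HTML is a trimmed copy of two blocks from the question, and "html.parser" is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copies of two blocks from the question (title attributes removed).
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def div_to_address(div):
    """Hypothetical helper: build one address line from a single block."""
    parts = []
    for item in div.children:
        text = item.string.strip() if item.string else ""
        # Skip empty strings and the trailing "... Website" link.
        if text and "Website" not in text:
            parts.append(text)
    # Close the gap that joining leaves before the comma.
    return " ".join(parts).replace(" ,", ",")


soup = BeautifulSoup(html, "html.parser")
addresses = [div_to_address(d) for d in soup.find_all("div", class_="bizgrid_hdr_address")]
for address in addresses:
    print(address)
```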

0 comments