1
votes

How to find a span whose class contains a specific substring in Python

I am scraping data with BeautifulSoup. I want to use a call like this:

bs4.find_all('span', {'class': 'XXX'})

I need to get {time, title, month} into a single df. The spans must be selected by the substring "calendar-date" in the class attribute.

The extracted HTML looks like this:

<td><span class="calendar-date-2">11:50 PM </span></td>,
<tr><td>
<div title="ABC"></div>
</td></tr>
<span>SEP</span>

<td><span class="calendar-date-1">12:00 PM </span></td>,
<tr><td>
<div title="CDE"></div>
</td></tr>
<span>OCT</span>

<td><span class="calendar-date-3">12:10 PM </span></td>,
<tr><td>
<div title="FGH"></div>
</td></tr>
<span>NOV</span>

However, passing a fixed class value to find_all requires the class attribute to match exactly.

I don't know how to write the code for a partial match.

python html date calendar beautifulsoup

0 comments

3 Answers:

0
votes

You can get all the span tags, then filter them on their class list:

spans = [s for s in bs4.find_all('span') if any(c.startswith('calendar-date') for c in s.get('class', []))]
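
The same filter can also be handed to find_all directly, because class_ accepts a function. This is only a minimal sketch: the HTML snippet is made up, and is_calendar_date is a hypothetical helper name, not something from the question.

```python
from bs4 import BeautifulSoup

# Made-up sample: one matching span, one span with another class, one without a class.
html = """
<span class="calendar-date-2">11:50 PM </span>
<span class="other">ignored</span>
<span>SEP</span>
"""

def is_calendar_date(css_class):
    # Hypothetical helper: receives a class value, or None when the tag has no class.
    return css_class is not None and css_class.startswith("calendar-date")

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", class_=is_calendar_date)

# Strip the trailing space that the sample markup carries inside the span.
times = [s.get_text(strip=True) for s in spans]
print(times)
```

This keeps the selection logic in one named function, which may be easier to extend than a long list comprehension.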

0 comments

0
votes

You can use a regex: pass a compiled pattern as class_ to find_all.

Output:

11:50 PM 
ABC
SEP
12:00 PM 
CDE
OCT
12:10 PM 
FGH
NOV

Code:

import re
from bs4 import BeautifulSoup

html = """<td><span class="calendar-date-2">11:50 PM </span></td>,
<tr><td>
<div title="ABC"></div>
</td></tr>
<span>SEP</span>

<td><span class="calendar-date-1">12:00 PM </span></td>,
<tr><td>
<div title="CDE"></div>
</td></tr>
<span>OCT</span>

<td><span class="calendar-date-3">12:10 PM </span></td>,
<tr><td>
<div title="FGH"></div>
</td></tr>
<span>NOV</span>"""

soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all("span", class_=re.compile(r"^calendar-date-\d+")):
    print(span.text)
    print(span.find_previous('td').find_next('div')['title'])
    print(span.find_next('span').text)
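
Since the question asks for time, title and month in a single df, the printed values could instead be collected as one record per matching span. The following is a sketch under the assumption that the page always follows the span, div with title, span order of the sample; extract_rows and rows are illustrative names, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return one {time, title, month} dict per calendar-date span."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for span in soup.find_all("span", class_=re.compile(r"^calendar-date-\d+")):
        rows.append({
            "time": span.get_text(strip=True),
            # Assumption: the first <div> after the span carries the title.
            "title": span.find_next("div")["title"],
            # Assumption: the first <span> after the matched one holds the month.
            "month": span.find_next("span").get_text(strip=True),
        })
    return rows

# Shortened copy of the sample markup from the question.
sample = """<td><span class="calendar-date-2">11:50 PM </span></td>,
<tr><td>
<div title="ABC"></div>
</td></tr>
<span>SEP</span>

<td><span class="calendar-date-1">12:00 PM </span></td>,
<tr><td>
<div title="CDE"></div>
</td></tr>
<span>OCT</span>"""

rows = extract_rows(sample)
print(rows)
```

If pandas is installed, pd.DataFrame(rows) should turn this list into a df with the columns time, title and month.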

0 comments

1
votes

Try a CSS attribute selector instead of a regex.

from bs4 import BeautifulSoup

datahtml = """<td><span class="calendar-date-2">11:50 PM </span></td>,
<tr><td>
<div title="ABC"></div>
</td></tr>
<span>SEP</span>

<td><span class="calendar-date-1">12:00 PM </span></td>,
<tr><td>
<div title="CDE"></div>
</td></tr>
<span>OCT</span>

<td><span class="calendar-date-3">12:10 PM </span></td>,
<tr><td>
<div title="FGH"></div>
</td></tr>
<span>NOV</span>"""

soup = BeautifulSoup(datahtml, "html.parser")
for span in soup.select("[class^='calendar-date-']"):
    print(span.text)
    print(span.find_previous('td').find_next('div')['title'])
    print(span.find_next('span').text)
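
One caveat: [class^=...] is compared against the whole class attribute string, so it can miss a tag whose calendar-date class is not listed first. A hedged sketch with a made-up two-class span (the class name highlight is invented for illustration), comparing it with the *= (contains) attribute selector:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second span lists another class before calendar-date-1.
html = """
<span class="calendar-date-2">11:50 PM </span>
<span class="highlight calendar-date-1">12:00 PM </span>
<span>OCT</span>
"""

soup = BeautifulSoup(html, "html.parser")

# ^= matches only when the attribute value starts with the given text.
starts_with = soup.select("span[class^='calendar-date-']")
# *= matches when the given text appears anywhere in the attribute value.
contains = soup.select("span[class*='calendar-date-']")

print(len(starts_with), len(contains))
```

For the sample in the question both selectors behave the same, because every target span has a single class.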

0 comments