3
votes

Gratter du texte dans les balises h3 et p à l'aide de BeautifulSoup Python

J'ai de l'expérience avec python, BeautifulSoup mais je suis impatient de récupérer des données d'un site Web et de les stocker sous forme de fichier csv. Un seul échantillon de données dont j'ai besoin est codé comme suit (une seule ligne de données).

<a href="https://www.stanford.edu/">
    Stanford University
   </a>

Je souhaite obtenir les liens et le nom avec h3 ainsi que le texte à l'intérieur

(je peux le faire mais pas la première partie) Cependant, avec mon code, je ne peux obtenir Stanford que si je suis find_all (class _ = 'colleges')

Mon code

    import requests
from bs4 import BeautifulSoup



page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')


college_name_list = soup.find(class_='college')
college_name_list_items = college_name_list.find_all('a')
for college_name in college_name_list_items:
    print(college_name.prettify())

Sortie

       ...body and not nested divs...       
            <h3 class="college">
                <span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a>
                </h3>
                <div class="he-mod" data-block="paragraph-9"></div>
                <p class="school-location">Stanford, CA</p>
             ...body and not nested divs...
        <h3 id="MIT" class="college">
        <span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a>
        </h3>
        <div class="he-mod" data-block="paragraph-14"></div>
        <p class="school-location">Cambridge, MA</p>
...body and not nested divs... 
    <h3 id="Berkeley" class="college">
    <span class="num">3.</span> <a href="https://www.berkeley.edu/">University of California Berkeley</a>
    </h3>
    <div class="he-mod" data-block="paragraph-19"></div>
    <p class="school-location">Berkeley, CA</p>
...body and not nested divs...

Je souhaite obtenir les autres collèges aussi avec le même class = college mais des identifiants différents

S'il vous plaît, aidez-moi à les obtenir; je peux organiser le .csv moi-même.

Site Web source à supprimer si vous pouvez me dire à quelle division / classe ou autre chose je devrais me tourner!

python beautifulsoup web-crawler

2 commentaires

Vous vouliez probablement utiliser find_all au lieu de find lorsque vous interrogez pour la classe college - essayez d'abord de changer cela.

Non, cela ne fonctionne pas même après avoir changé u / KunduK a fourni une excellente solution.

3 Réponses :

0
votes

Veuillez essayer ce code:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')college_name_list = soup.find_all(class_='college')
college_name_list_items =[]
for i in college_name_list:
    college_name_list_items.append(i.find_all('a'))


for college_name in college_name_list_items:
    print(college_name)

0 commentaires

1
votes

Essayez d'utiliser find_all avec la balise

, puis recherchez puis extrayez le texte et

 href  code > valeur. import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name=[]
college_name_url=[]
college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name_url.append(college.find('a')['href'])
    college_name.append(college.find('a').text)
df = pd.DataFrame({"college_name":college_name,"college_name_url":college_name_url})
df.to_csv('college_name.csv')
  Sortie: au format liste. 
 ['https://www.stanford.edu/', 'Stanford University', 'https://web.mit.edu/', 'Massachusetts Institute of Technology (MIT)', 'https://www.berkeley.edu/', 'University of California Berkeley', 'https://www.harvard.edu/', 'Harvard University', 'https://www.princeton.edu/', 'Princeton University', '/pennsylvania-education/carnegie-mellon-university-online/', 'Carnegie Mellon University', 'https://www.utexas.edu/', 'The University of Texas at Austin', 'https://www.cornell.edu/', 'Cornell University', 'https://www.ucla.edu/', 'University of California, Los Angeles (UCLA)', '/california-education/university-southern-california-online/', 'University of Southern California', 'https://www.caltech.edu/', 'California Institute of Technology (Caltech)', 'https://www.utoronto.ca/', 'University of Toronto', 'https://illinois.edu/', 'University of Illinois at Urbana-Champaign', 'https://ucsd.edu/', 'University of California in San Diego', 'https://www.umich.edu/', 'University of Michigan', 'https://www.umd.edu/', 'University of Maryland, College Park', 'https://www.ethz.ch/en.html', 'Swiss Federal Institute of Technology', 'https://www.technion.ac.il/en/home-2/', 'Technion-Israel Institute of Technology', 'https://www.osu.edu/', 'Ohio State University', 'https://english.tau.ac.il/', 'Tel Aviv University', '/indiana-education/purdue-university-online/', 'Purdue University', 'https://www.gatech.edu/', 'Georgia Institute of Technology', 'https://www.cam.ac.uk/', 'University of Cambridge', 'https://www.ntu.edu.tw/english/', 'National Taiwan University', 'http://ac.cs.tsinghua.edu.cn', 'Tsinghua University', 'https://www.imperial.ac.uk/', 'The Imperial College of Science, Technology, and Medicine', 'https://www.kau.edu.sa/home_english.aspx', 'King Abdulaziz University', 'https://www.tum.de/en/homepage/', 'Technical University Munich', 'https://uci.edu/', 'University of California, Irvine', 'https://www.ucdavis.edu/', 'University of California, Davis', 'https://www.columbia.edu/', 'Columbia University', '/online-colleges/arizona-state-university-online/', 'Arizona State University', 'https://www.ntu.edu.sg/Pages/home.aspx', 'Nanyang Technological University', 'https://www.ox.ac.uk/', 'University of Oxford', '/online-colleges/northwestern-university-online/', 'Northwestern University', 'https://www.epfl.ch/en/home/', 'Swiss Federal Institute of Technology Lausanne', 'https://www.nyu.edu/', 'New York University', 'https://www.kau.edu.sa/home_english.aspx', 'The Chinese University of Hong Kong', '/north-carolina-education/university-north-carolina-online/', 'University of North Carolina at Chapel Hill', 'https://www.ust.hk/', 'The Hong Kong University of Science and Technology', 'https://twin-cities.umn.edu/', 'University of Minnesota, Twin Cities', 'https://www.zju.edu.cn/english/', 'Zhejiang University', 'https://www.ugr.es/en/', 'University of Granada', 'https://www.ucl.ac.uk/', 'University College London', 'https://www.cityu.edu.hk/', 'City University of Hong Kong', 'https://www.ubc.ca/', 'University of British Columbia', 'https://www.nd.edu/', 'University of Notre Dame', 'http://www.nus.edu.sg/', 'The National University of Singapore', 'http://en.sjtu.edu.cn/', 'Shanghai Jiao Tong University', 'https://www.yale.edu/', 'Yale University', 'https://www.washington.edu/', 'University of Washington', '/north-carolina-education/duke-university-online/', 'Duke University', 'https://www.colorado.edu/', 'University of Colorado at Boulder', 'https://www.ku.dk/english/', 'University of Copenhagen', 'https://www.ucsb.edu/', 'University of California, Santa Barbara', 'https://www.manchester.ac.uk/', 'University of Manchester', 'https://newbrunswick.rutgers.edu/', 'Rutgers University', 'https://www.rice.edu/', 'Rice University', 'https://www.kuleuven.be/english/', 'KU Leuven', 'https://www.utah.edu/', 'University of Utah', 'https://msu.edu/', 'Michigan State University', 'https://www.tamu.edu/', 'Texas A&M University', 'http://english.pku.edu.cn/', 'Peking University', 'https://www.psu.edu/', 'Pennsylvania State University - University Park', 'https://www.udel.edu/', 'University of Delaware', 'http://en.xjtu.edu.cn/', 'Xian Jiao Tong University', 'http://english.hust.edu.cn/', 'Huazhong University of Science and Technology', 'http://en.hit.edu.cn/', 'Harbin Institute of Technology', 'https://www.sfu.ca/', 'Simon Fraser University', 'https://www.polyu.edu.hk/web/en/home/', 'The Hong Kong Polytechnic University', 'https://www.tue.nl/en/', 'Eindhoven University of Technology', 'https://www.nctu.edu.tw/index.php/en', 'National Chiao Tung University', 'https://en.xidian.edu.cn/', 'Xidian University', 'https://www.ujaen.es/serv/vicint/home/index', 'University of Jaen', 'https://www.kaust.edu.sa/en', 'King Abdullah University of Science and Technology', 'https://www.jhu.edu/', 'Johns Hopkins University', 'https://www.upenn.edu/', 'University of Pennsylvania', 'https://www.wisc.edu/', 'University of Wisconsin', 'https://www.ed.ac.uk/home', 'The University of Edinburgh', 'https://www.mcgill.ca/', 'McGill University', 'https://www.bristol.ac.uk/', 'University of Bristol', 'https://new.huji.ac.il/en', 'The Hebrew University of Jerusalem', 'https://www.ugent.be/en', 'Ghent University', 'https://www.brown.edu/', 'Brown University', 'https://www.weizmann.ac.il/pages/', 'Weizmann Institute of Science', 'https://www.unsw.edu.au/', 'University of New South Wales', 'https://www.ualberta.ca/', 'University of Alberta', 'https://www.southampton.ac.uk/', 'University of Southampton', 'https://www.dtu.dk/english', 'Technical University of Denmark', 'https://en.uniroma1.it/', 'Sapienza University of Rome', 'https://en.ustc.edu.cn/', 'The University of Science and Technology of China', 'https://www.uic.edu/', 'University of Illinois at Chicago', 'https://www.hku.hk/', 'University of Hong Kong', 'https://uwaterloo.ca/', 'University of Waterloo', 'https://www.kaist.edu/html/en/', 'Korea Advanced Institute of Science and Technology', 'https://www.uh.edu/', 'University of Houston', 'http://en.dlut.edu.cn/', 'Dalian University of Technology', 'https://en.whu.edu.cn/', 'Wuhan University', '/online-colleges/new-jersey-institute-technology-online/', 'New Jersey Institute of Technology']
 
  Cependant, vous pouvez utiliser  pandas  code >  dataframe  et importez toutes les données au format  csv . 
  pour installer des pandas, vous pouvez simplement exécuter via la ligne de commande. 
 
   pip installer des pandas 
  Et utilisez le code ci-dessous. 
 import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')

college_name=[]

college_name_list = soup.find_all('h3',class_='college')
for college in college_name_list:
  if college.find('a'):
    college_name.append(college.find('a')['href'])
    college_name.append(college.find('a').text)

print(college_name)
  Votre fichier csv sera comme ça.

1 commentaires

Y a-t-il un hasard pour que je puisse également obtenir quelques lignes sur le collège (disponible sur la page Web aussi) et l'imprimer dans le csv merci

0
votes

Vous devez obtenir h3 avec class = "college":

import requests

list_colleges = {}

result = requests.get('https://www.stanford.edu/')
if (result.status_code == 200):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(result.content)
    colleges = soup.findAll('h3', {'class': 'colleges'})
    for college in colleges:
        id_college = college.get('id')
        if not (id_college is None):
            list_colleges[id] = college # Store the inner html

0 commentaires