I have some experience with Python and BeautifulSoup, and I want to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (one row of data).
<a href="https://www.stanford.edu/"> Stanford University </a>
I want to get the link and the name inside each h3, as well as the text that follows it
(I can do the latter, but not the first part). However, with my code I only get Stanford, even if I use find_all(class_='colleges')
My code
import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name_list = soup.find(class_='college')
college_name_list_items = college_name_list.find_all('a')
for college_name in college_name_list_items:
    print(college_name.prettify())
Output
...body and not nested divs...
<h3 class="college">
  <span class="num">1.</span> <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
...body and not nested divs...
<h3 id="MIT" class="college">
  <span class="num">2.</span> <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a>
</h3>
<div class="he-mod" data-block="paragraph-14"></div>
<p class="school-location">Cambridge, MA</p>
...body and not nested divs...
<h3 id="Berkeley" class="college">
  <span class="num">3.</span> <a href="https://www.berkeley.edu/">University of California Berkeley</a>
</h3>
<div class="he-mod" data-block="paragraph-19"></div>
<p class="school-location">Berkeley, CA</p>
...body and not nested divs...
I want to get the other colleges too; they have the same class="college" but different ids.
Please help me get them; I can organize the .csv myself.
The source website to scrape is the one in the code, if you can tell me which div, class, or other element I should target!
3 Answers:
Please try this code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
# find_all returns every element with class "college", not just the first one
college_name_list = soup.find_all(class_='college')
college_name_list_items = []
for i in college_name_list:
    # Collect the <a> tags found inside each heading
    college_name_list_items.append(i.find_all('a'))
for college_name in college_name_list_items:
    print(college_name)
Try using find_all with the h3 tag, then find the a tag inside each heading and extract its text and href value. You can also load the results into a pandas DataFrame and export all the data in CSV format. To install pandas, simply run the following from the command line:
pip install pandas
Then use the code below.
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name = []
college_name_url = []
# Every college heading on the page is an <h3 class="college">
college_name_list = soup.find_all('h3', class_='college')
for college in college_name_list:
    # Skip headings that contain no link
    if college.find('a'):
        college_name_url.append(college.find('a')['href'])
        college_name.append(college.find('a').text)
# Build a two-column table and write it to college_name.csv
df = pd.DataFrame({"college_name": college_name, "college_name_url": college_name_url})
df.to_csv('college_name.csv')
Output of the list version (the code that prints college_name), in list format:
['https://www.stanford.edu/', 'Stanford University', 'https://web.mit.edu/', 'Massachusetts Institute of Technology (MIT)', 'https://www.berkeley.edu/', 'University of California Berkeley', 'https://www.harvard.edu/', 'Harvard University', 'https://www.princeton.edu/', 'Princeton University', '/pennsylvania-education/carnegie-mellon-university-online/', 'Carnegie Mellon University', 'https://www.utexas.edu/', 'The University of Texas at Austin', 'https://www.cornell.edu/', 'Cornell University', 'https://www.ucla.edu/', 'University of California, Los Angeles (UCLA)', '/california-education/university-southern-california-online/', 'University of Southern California', 'https://www.caltech.edu/', 'California Institute of Technology (Caltech)', 'https://www.utoronto.ca/', 'University of Toronto', 'https://illinois.edu/', 'University of Illinois at Urbana-Champaign', 'https://ucsd.edu/', 'University of California in San Diego', 'https://www.umich.edu/', 'University of Michigan', 'https://www.umd.edu/', 'University of Maryland, College Park', 'https://www.ethz.ch/en.html', 'Swiss Federal Institute of Technology', 'https://www.technion.ac.il/en/home-2/', 'Technion-Israel Institute of Technology', 'https://www.osu.edu/', 'Ohio State University', 'https://english.tau.ac.il/', 'Tel Aviv University', '/indiana-education/purdue-university-online/', 'Purdue University', 'https://www.gatech.edu/', 'Georgia Institute of Technology', 'https://www.cam.ac.uk/', 'University of Cambridge', 'https://www.ntu.edu.tw/english/', 'National Taiwan University', 'http://ac.cs.tsinghua.edu.cn', 'Tsinghua University', 'https://www.imperial.ac.uk/', 'The Imperial College of Science, Technology, and Medicine', 'https://www.kau.edu.sa/home_english.aspx', 'King Abdulaziz University', 'https://www.tum.de/en/homepage/', 'Technical University Munich', 'https://uci.edu/', 'University of California, Irvine', 'https://www.ucdavis.edu/', 'University of California, Davis', 'https://www.columbia.edu/', 
'Columbia University', '/online-colleges/arizona-state-university-online/', 'Arizona State University', 'https://www.ntu.edu.sg/Pages/home.aspx', 'Nanyang Technological University', 'https://www.ox.ac.uk/', 'University of Oxford', '/online-colleges/northwestern-university-online/', 'Northwestern University', 'https://www.epfl.ch/en/home/', 'Swiss Federal Institute of Technology Lausanne', 'https://www.nyu.edu/', 'New York University', 'https://www.kau.edu.sa/home_english.aspx', 'The Chinese University of Hong Kong', '/north-carolina-education/university-north-carolina-online/', 'University of North Carolina at Chapel Hill', 'https://www.ust.hk/', 'The Hong Kong University of Science and Technology', 'https://twin-cities.umn.edu/', 'University of Minnesota, Twin Cities', 'https://www.zju.edu.cn/english/', 'Zhejiang University', 'https://www.ugr.es/en/', 'University of Granada', 'https://www.ucl.ac.uk/', 'University College London', 'https://www.cityu.edu.hk/', 'City University of Hong Kong', 'https://www.ubc.ca/', 'University of British Columbia', 'https://www.nd.edu/', 'University of Notre Dame', 'http://www.nus.edu.sg/', 'The National University of Singapore', 'http://en.sjtu.edu.cn/', 'Shanghai Jiao Tong University', 'https://www.yale.edu/', 'Yale University', 'https://www.washington.edu/', 'University of Washington', '/north-carolina-education/duke-university-online/', 'Duke University', 'https://www.colorado.edu/', 'University of Colorado at Boulder', 'https://www.ku.dk/english/', 'University of Copenhagen', 'https://www.ucsb.edu/', 'University of California, Santa Barbara', 'https://www.manchester.ac.uk/', 'University of Manchester', 'https://newbrunswick.rutgers.edu/', 'Rutgers University', 'https://www.rice.edu/', 'Rice University', 'https://www.kuleuven.be/english/', 'KU Leuven', 'https://www.utah.edu/', 'University of Utah', 'https://msu.edu/', 'Michigan State University', 'https://www.tamu.edu/', 'Texas A&M University', 'http://english.pku.edu.cn/', 
'Peking University', 'https://www.psu.edu/', 'Pennsylvania State University - University Park', 'https://www.udel.edu/', 'University of Delaware', 'http://en.xjtu.edu.cn/', 'Xian Jiao Tong University', 'http://english.hust.edu.cn/', 'Huazhong University of Science and Technology', 'http://en.hit.edu.cn/', 'Harbin Institute of Technology', 'https://www.sfu.ca/', 'Simon Fraser University', 'https://www.polyu.edu.hk/web/en/home/', 'The Hong Kong Polytechnic University', 'https://www.tue.nl/en/', 'Eindhoven University of Technology', 'https://www.nctu.edu.tw/index.php/en', 'National Chiao Tung University', 'https://en.xidian.edu.cn/', 'Xidian University', 'https://www.ujaen.es/serv/vicint/home/index', 'University of Jaen', 'https://www.kaust.edu.sa/en', 'King Abdullah University of Science and Technology', 'https://www.jhu.edu/', 'Johns Hopkins University', 'https://www.upenn.edu/', 'University of Pennsylvania', 'https://www.wisc.edu/', 'University of Wisconsin', 'https://www.ed.ac.uk/home', 'The University of Edinburgh', 'https://www.mcgill.ca/', 'McGill University', 'https://www.bristol.ac.uk/', 'University of Bristol', 'https://new.huji.ac.il/en', 'The Hebrew University of Jerusalem', 'https://www.ugent.be/en', 'Ghent University', 'https://www.brown.edu/', 'Brown University', 'https://www.weizmann.ac.il/pages/', 'Weizmann Institute of Science', 'https://www.unsw.edu.au/', 'University of New South Wales', 'https://www.ualberta.ca/', 'University of Alberta', 'https://www.southampton.ac.uk/', 'University of Southampton', 'https://www.dtu.dk/english', 'Technical University of Denmark', 'https://en.uniroma1.it/', 'Sapienza University of Rome', 'https://en.ustc.edu.cn/', 'The University of Science and Technology of China', 'https://www.uic.edu/', 'University of Illinois at Chicago', 'https://www.hku.hk/', 'University of Hong Kong', 'https://uwaterloo.ca/', 'University of Waterloo', 'https://www.kaist.edu/html/en/', 'Korea Advanced Institute of Science and Technology', 
'https://www.uh.edu/', 'University of Houston', 'http://en.dlut.edu.cn/', 'Dalian University of Technology', 'https://en.whu.edu.cn/', 'Wuhan University', '/online-colleges/new-jersey-institute-technology-online/', 'New Jersey Institute of Technology']
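The list above mixes absolute URLs with site-relative paths such as '/pennsylvania-education/carnegie-mellon-university-online/'. If full URLs are wanted in the CSV, one option is to resolve each href against the page address. This is a minimal sketch, not part of the original answer; BASE_URL and the two sample hrefs are copied from this thread, and the variable names are only illustrative.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question's code).
BASE_URL = 'https://thebestschools.org/features/best-computer-science-programs-in-the-world/'

# Two hrefs copied from the output above: one absolute, one site-relative.
hrefs = [
    'https://www.stanford.edu/',
    '/pennsylvania-education/carnegie-mellon-university-online/',
]

# urljoin keeps absolute URLs unchanged and resolves relative paths against BASE_URL.
absolute = [urljoin(BASE_URL, href) for href in hrefs]
for url in absolute:
    print(url)
```

The same urljoin call could be applied to each href inside the scraping loop before appending it.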
The version below skips pandas: it collects each href and name into a single list and prints it, which produces the list output shown above.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
soup = BeautifulSoup(page.text, 'html.parser')
college_name = []
college_name_list = soup.find_all('h3', class_='college')
for college in college_name_list:
    # Skip headings that contain no link
    if college.find('a'):
        # Append the link first, then the name, into one flat list
        college_name.append(college.find('a')['href'])
        college_name.append(college.find('a').text)
print(college_name)
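The CSV file written by the pandas version, college_name.csv, will look like this: one row per college, with the default index column followed by the college_name and college_name_url columns.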
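To preview a similar layout without pandas or a network call, here is a small sketch using the standard library's csv module. It is an alternative to the answer's df.to_csv, not the answer's own method: colleges_to_csv is a hypothetical helper name, the two sample rows are copied from the output in this thread, and unlike df.to_csv with default arguments it writes no index column.

```python
import csv
import io


def colleges_to_csv(names, urls):
    """Return CSV text with a header row and one row per (name, url) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['college_name', 'college_name_url'])
    writer.writerows(zip(names, urls))
    return buffer.getvalue()


# Sample values copied from the scraped output in this thread.
csv_text = colleges_to_csv(
    ['Stanford University', 'Massachusetts Institute of Technology (MIT)'],
    ['https://www.stanford.edu/', 'https://web.mit.edu/'],
)
print(csv_text)
```

To write a real file instead of a string, the same writer can be pointed at a file opened with open('college_name.csv', 'w', newline='').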
Is there any chance I could also get a few lines about each college (also available on the web page) and write them to the CSV? Thanks.
You need to get the h3 elements with class="college":
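One possible way to pick up extra text for each college is to walk from the heading to its following siblings. The sketch below does this only for the school-location paragraph, because that is the only extra element visible in the HTML fragment posted in the question; how the longer description paragraphs are marked up is not shown in this thread, so they are left out. The html string is copied from the question, and rows is an illustrative variable name.

```python
from bs4 import BeautifulSoup

# Fragment copied from the page structure shown in the question.
html = """
<h3 class="college"> <span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a> </h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
<h3 id="MIT" class="college"> <span class="num">2.</span>
  <a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a> </h3>
<div class="he-mod" data-block="paragraph-14"></div>
<p class="school-location">Cambridge, MA</p>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for heading in soup.find_all('h3', class_='college'):
    link = heading.find('a')
    # First later sibling that is a <p class="school-location">, if any.
    location = heading.find_next_sibling('p', class_='school-location')
    rows.append({
        'name': link.get_text(strip=True) if link else None,
        'url': link['href'] if link else None,
        'location': location.get_text(strip=True) if location else None,
    })
print(rows)
```

One caveat with this approach: if a heading had no location paragraph of its own, find_next_sibling would return the next college's paragraph, so the result is worth spot-checking on the real page.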
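import requests
from bs4 import BeautifulSoup

list_colleges = {}
# Fetch the ranking page from the question, not an individual university site
result = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
if result.status_code == 200:
    soup = BeautifulSoup(result.content, 'html.parser')
    # The class on the page is 'college' (singular)
    colleges = soup.find_all('h3', {'class': 'college'})
    for college in colleges:
        id_college = college.get('id')
        # Headings without an id (such as the first one in the question) are skipped
        if id_college is not None:
            list_colleges[id_college] = college  # Store the inner html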
You probably meant to use find_all instead of find when querying for the class college; try changing that first.
No, it does not work even after changing that. u/KunduK provided an excellent solution.
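A likely reason the bare swap does not work: find_all returns a ResultSet (a list of tags), and a ResultSet has no find_all method of its own, so the next line of the question's code raises AttributeError. The sketch below reproduces this on a trimmed version of the HTML from the question and then loops over the headings instead; swap_failed is an illustrative name, and this is offered as a probable explanation rather than a confirmed diagnosis of the asker's run.

```python
from bs4 import BeautifulSoup

# Trimmed version of the fragment posted in the question.
html = """
<h3 class="college"><a href="https://www.stanford.edu/">Stanford University</a></h3>
<h3 id="MIT" class="college"><a href="https://web.mit.edu/">Massachusetts Institute of Technology (MIT)</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')

headings = soup.find_all(class_='college')  # a list-like ResultSet, not a single Tag

try:
    headings.find_all('a')  # what the question's code does after only swapping find for find_all
    swap_failed = False
except AttributeError:
    swap_failed = True

# Calling find_all on each heading in turn works.
links = [a for heading in headings for a in heading.find_all('a')]
names = [a.get_text(strip=True) for a in links]
print(swap_failed, names)
```

This is the same per-heading loop that the accepted answers use, which is why they return every college rather than only the first.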