The idea behind web scraping is this: gather data quickly and replicably from websites by taking advantage of the fact that web pages are structured documents. While pages are individually idiosyncratic, they tend to be internally consistent. This means that web scraping is always a bespoke affair; you can't build a scraper that will simply work in a generalizable way for all pages. However, there are general principles that, once rehearsed, will let you develop a scraper for a given website without too much effort.
The scraper we're going to build today downloads key information about DUSP faculty from the DUSP website's 'people' list. We're going to scrape information about affiliation, navigate through weblinks, and download photos. Along the way, we'll pick up some neat tricks: naming downloaded photos consistently and dealing with missing and inaccessible information.
To do this, we'll be using a couple of Python packages. The first is bs4, or Beautiful Soup 4. This is an HTML and XML parser which takes a downloaded web page and gives us objects and methods for navigating its structure intuitively. It's a very, very standard tool for use in web scraping applications. We'll also be using wget to download files, and requests to request webpages.
As such, the first thing you'll need to do is install these packages. Assuming you've created and activated your virtual environment, you'll want to install them using pip.
pip install requests wget bs4
import requests
from pprint import pprint
from bs4 import BeautifulSoup
import wget
Having imported our modules, we want to provide the base URL of the page we're trying to scrape: in this case, the DUSP people list, available at https://dusp.mit.edu/people. We then pass this URL to the requests package's get method, which downloads the content of the page.
base_url = 'http://dusp.mit.edu/people'
base_page = requests.get(base_url)
print(base_page)
<Response [200]>
A 200 code in the response tells you the request was successful. Beautiful Soup will now allow us to parse the downloaded HTML, rendering it navigable through BeautifulSoup's fairly rich array of methods for working with a page's Document Object Model, or DOM. We do this by creating a new object of the BeautifulSoup class, which takes the page's content as its input, along with an argument that identifies which parser to use (here, we use html.parser).
soup = BeautifulSoup(base_page.content, 'html.parser')
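As an aside, it's worth guarding against failed requests before you parse anything. requests exposes both a numeric status_code attribute and a raise_for_status() method; a minimal sketch of this kind of check (nothing later in this walkthrough depends on it) might look like this:

# Stop immediately if the server returned an error code (4xx or 5xx),
# rather than parsing an error page as though it were the people list.
base_page.raise_for_status()

# Or, equivalently, test the numeric code by hand.
if base_page.status_code != 200:
    print(f'Unexpected status: {base_page.status_code}')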
We can now use a web browser to do some digging: inspecting the elements in the webpage, we find that each 'person' is held in a div with the class row-people. As such, we can tell BeautifulSoup to find all instances of div tags with the class row-people. We print only the first person to save space.
people = soup.find_all('div', class_='row-people')
pprint(people[0])
<div class="views-row views-row-1 views-row-odd views-row-first row-disc-IDG row-people">
<div> <span><div class="bull"></div></span> </div>
<div class="views-field views-field-field-user-picture"> <div class="field-content"><img alt="" src="http://dusp.mit.edu/sites/dusp.mit.edu/files/styles/profile_pic/public/user/pictures/cherie.jpg?itok=tqCmbLrz"/></div> </div>
<div class="views-field views-field-name"> <span class="field-content"><a class="username" href="/faculty/cherie-abbanat" title="View user profile.">Cherie Abbanat</a></span> </div>
<div class="views-field views-field-field-position-and-title-1"> <div class="field-content">Lecturer of International Development and Urban Studies</div> </div>
<div class="views-field views-field-field-position-and-title-2"> <div class="field-content"></div> </div>
<div class="views-field views-field-field-other-division"> <div class="field-content"></div> </div> </div>
Now that we have found all elements with the class row-people, we can access each element's components using the element class's methods. For example, we can get all of the text contained within the div by using the get_text() method.
pprint(people[0].get_text())
('\n'
' \n'
' \n'
' Cherie Abbanat \n'
' Lecturer of International Development and Urban Studies \n'
' \n'
' ')
The text is surrounded by a number of line breaks, which are represented by the \n characters; this is how you say 'new line!' in ASCII character encoding. As such, it's often a good idea to use the strip() method to remove newline and space characters before and after a given chunk of text.
pprint(people[0].get_text().strip())
'Cherie Abbanat \n Lecturer of International Development and Urban Studies'
Better! But we don't want all text: we want to be careful and save specifically those pieces of information that are useful to us. Again, we can examine the document structure and notice that…

- Names are held in a tags with the class username.
- Links to individual faculty pages are stored in the href references in those same a tags.
- Positions are held in divs with the classes views-field-field-position-and-title-1 and ...-2, respectively.
- Other affiliations are held in a div with the class views-field-field-other-division.

Using this knowledge, we can iterate over the people list we produced above, finding each of those classes within each faculty div. This looks like this:
for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    print(name, href, pos_1, pos_2, other)
Cherie Abbanat /faculty/cherie-abbanat Lecturer of International Development and Urban Studies
Paul Altidor /faculty/paul-altidor Visiting Lecturer of International Development and Planning
Mariana Arcaya /faculty/mariana-arcaya Associate Professor of Urban Planning and Public Health Associate Department Head
But this is only a summary page: much of the information about each faculty member is stored on their faculty page. Using a combination of requests and BeautifulSoup we can navigate to each faculty page as we encounter it and draw additional data.
We grabbed everything we need to do this above: we stored the path to the faculty page in the href variable. This is a relative path, meaning that it doesn't include a web address, only a path relative to the current page. We can therefore append the relative path to the base_url; this happens on the line reading ...BeautifulSoup(requests.get(base_url + href).content....
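If you'd rather not splice URLs together by hand, Python's standard library includes urllib.parse.urljoin, which resolves a relative path against a base URL for you. A quick sketch of the same idea (we stick with simple concatenation below):

from urllib.parse import urljoin

# Resolves '/faculty/cherie-abbanat' against the base URL to produce
# 'http://dusp.mit.edu/faculty/cherie-abbanat'
full_url = urljoin('http://dusp.mit.edu', '/faculty/cherie-abbanat')
print(full_url)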
On each faculty member's page we can, for example, pull their biography, which we determine is located in a div with the class pane-user-field-bio.
base_url = 'http://dusp.mit.edu'

for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    if href:
        person_soup = BeautifulSoup(requests.get(base_url + href).content, 'html.parser')
        bio = person_soup.find('div', class_='pane-user-field-bio').get_text()
        print(name, href, pos_1, pos_2, other, bio)
Cherie Abbanat /faculty/cherie-abbanat Lecturer of International Development and Urban Studies
Cherie is a lecturer at DUSP and in the Department of Architecture where she has been teaching for over fifteen years. Cherie lectures on policy, non-profit management, post-disaster rebuilding in New Orleans and Haiti, and the need for grassroots initiatives.
As a practitioner, Cherie joined Haiti Projects Inc., a 501 (c)3 non-profit, in 2013 as its CEO to transform Haiti Projects from a fledgling non-profit into a growing social enterprise. Cherie successfully turned Haiti Projects around financially and the non-profit is ready to grow. Haiti Projects boasts 4 employees in the US and close to 90 employees in Haiti. Haiti Projects operates a women's sewing cooperative, a women's health clinic that focuses on family planning, health and hygiene, and a community library. With support from the Kellogg Foundation, Haiti Projects plans to build a new community multi-purpose center in 2015.
As a Haitian American, Cherie is passionate about empowering women in Haiti and throughout the world.
Cherie has her undergraduate degree in Economics and French from Simmons College and her Masters with a concentration in Environmental Policy and Planning from the Massachusetts Institute of Technology (MCP '97).
Paul Altidor /faculty/paul-altidor Visiting Lecturer of International Development and Planning
Paul Getty Altidor was named Ambassador Extraordinary and Plenipotentiary of Haiti to the United States in 2012; he presented his credentials to then-President Barack Obama on May 2. He served in this capacity 2019.
Ambassador Altidor spent his early years where he was born in Jérémie, Haiti. He attended primary school in Jérémie then studied at the Centre d’Etudes Secondaire in Port au Prince. His family later moved to Boston where he completed his secondary education. Ambassador Altidor received his undergraduate degree from Boston College. He earned an MS degree from MIT in Urban Studies and Planning (2004) and also pursued graduate studies in law and economics in France.
Ambassador Altidor has an extensive private sector background. As a management consultant, he counseled firms in different countries on corporate governance and responsibility. In the aftermath of the 2010 earthquake, Ambassador Altidor led a team of professors and researchers from the Massachusetts Institute of Technology’s Community Innovators Lab down to Haiti. At the request of Haitian authorities, the team provided guidance to reconstruction officials on housing policy and financing.
Prior to the earthquake in Haiti, Ambassador Altidor worked at the International Finance Corporation (IFC) where he advised governments on infrastructure projects and public-private partnerships. He has also worked for the World Bank. Ambassador Altidor has taught at Ecole Supérieure Catholique de Droit de Jérémie (ESCDROJ), a law school in his native town of Jérémie, and he is a frequent speaker at universities in Haiti and the United States.
Mariana Arcaya /faculty/mariana-arcaya Associate Professor of Urban Planning and Public Health Associate Department Head
Mariana Arcaya is a social epidemiologist and urban planner whose work explores dynamic relationships between geographic contexts, particularly neighborhoods, and health. Mariana conducts scholarly and policy-relevant research in two main areas: 1) bi-directional relationships between place and health, including how health considerations shape socioeconomic outcomes for individuals and communities, and, 2) applied and translational research on the social determinants of health, particularly health risk factors shaped by urban policy and planning decisions.
Prior to coming to MIT in 2015, Mariana served as a post-doctoral fellow at the Harvard Center for Population and Development Studies. She holds a Doctorate of Science from the Harvard School of Public Health, and a Master of City Planning from MIT's Department of Urban Studies & Planning. Her professional work experience includes instituting and managing a Public Health Division within Metropolitan Boston’s regional planning agency, as well as designing and overseeing the implementation of healthy urban planning strategies under a federally funded Community Transformation Grant.
We can extend this by locating the fields that contain, for example, faculty bios, offices, emails, and interests. Here, we introduce a slight complication: we test for the existence of a given field before trying to extract its text. This is because if we try to call get_text() on a field that doesn't exist, Python will throw an error and our script will stop executing (find() returns None when it can't locate an element, and None has no get_text() method). When you're writing scrapers this is a very common approach: while websites are structured, not all pages will contain all of the information, nor will they always represent the information they do contain in a consistent manner. As such, much of the work of scraping is introducing tests that keep your script from failing when it encounters an exception.
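One way to avoid repeating that test over and over is to wrap it in a small helper function. The sketch below is a hypothetical convenience rather than something we use later, but it captures the pattern:

def text_or_none(element):
    # Return the element's stripped text if BeautifulSoup found it, otherwise None.
    if element:
        return element.get_text().strip()
    return None

# Hypothetical usage inside the loop:
# bio = text_or_none(person_soup.find('div', class_='pane-user-field-bio'))

In the loop we actually run, we simply test inline each time: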
for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    if href:
        person_soup = BeautifulSoup(requests.get('http://dusp.mit.edu' + href).content, 'html.parser')
        bio = person_soup.find('div', class_='pane-user-field-bio')
        if bio:
            bio = bio.get_text()
        office = person_soup.find('div', class_='views-field views-field-field-office')
        if office:
            office = office.get_text()
        email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
        if email:
            email = email.get_text()
        interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
        if interests:
            interests = interests.next_sibling.next_sibling.get_text()
        print(name, href, pos_1, pos_2, other, bio, office, email, interests)
Cherie Abbanat /faculty/cherie-abbanat Lecturer of International Development and Urban Studies
Cherie is a lecturer at DUSP and in the Department of Architecture where she has been teaching for over fifteen years. Cherie lectures on policy, non-profit management, post-disaster rebuilding in New Orleans and Haiti, and the need for grassroots initiatives.
As a practitioner, Cherie joined Haiti Projects Inc., a 501 (c)3 non-profit, in 2013 as its CEO to transform Haiti Projects from a fledgling non-profit into a growing social enterprise. Cherie successfully turned Haiti Projects around financially and the non-profit is ready to grow. Haiti Projects boasts 4 employees in the US and close to 90 employees in Haiti. Haiti Projects operates a women's sewing cooperative, a women's health clinic that focuses on family planning, health and hygiene, and a community library. With support from the Kellogg Foundation, Haiti Projects plans to build a new community multi-purpose center in 2015.
As a Haitian American, Cherie is passionate about empowering women in Haiti and throughout the world.
Cherie has her undergraduate degree in Economics and French from Simmons College and her Masters with a concentration in Environmental Policy and Planning from the Massachusetts Institute of Technology (MCP '97).
9-520 abbanat@mit.edu None
Paul Altidor /faculty/paul-altidor Visiting Lecturer of International Development and Planning
Paul Getty Altidor was named Ambassador Extraordinary and Plenipotentiary of Haiti to the United States in 2012; he presented his credentials to then-President Barack Obama on May 2. He served in this capacity 2019.
Ambassador Altidor spent his early years where he was born in Jérémie, Haiti. He attended primary school in Jérémie then studied at the Centre d’Etudes Secondaire in Port au Prince. His family later moved to Boston where he completed his secondary education. Ambassador Altidor received his undergraduate degree from Boston College. He earned an MS degree from MIT in Urban Studies and Planning (2004) and also pursued graduate studies in law and economics in France.
Ambassador Altidor has an extensive private sector background. As a management consultant, he counseled firms in different countries on corporate governance and responsibility. In the aftermath of the 2010 earthquake, Ambassador Altidor led a team of professors and researchers from the Massachusetts Institute of Technology’s Community Innovators Lab down to Haiti. At the request of Haitian authorities, the team provided guidance to reconstruction officials on housing policy and financing.
Prior to the earthquake in Haiti, Ambassador Altidor worked at the International Finance Corporation (IFC) where he advised governments on infrastructure projects and public-private partnerships. He has also worked for the World Bank. Ambassador Altidor has taught at Ecole Supérieure Catholique de Droit de Jérémie (ESCDROJ), a law school in his native town of Jérémie, and he is a frequent speaker at universities in Haiti and the United States.
None altidorp@mit.edu None
Mariana Arcaya /faculty/mariana-arcaya Associate Professor of Urban Planning and Public Health Associate Department Head
Mariana Arcaya is a social epidemiologist and urban planner whose work explores dynamic relationships between geographic contexts, particularly neighborhoods, and health. Mariana conducts scholarly and policy-relevant research in two main areas: 1) bi-directional relationships between place and health, including how health considerations shape socioeconomic outcomes for individuals and communities, and, 2) applied and translational research on the social determinants of health, particularly health risk factors shaped by urban policy and planning decisions.
Prior to coming to MIT in 2015, Mariana served as a post-doctoral fellow at the Harvard Center for Population and Development Studies. She holds a Doctorate of Science from the Harvard School of Public Health, and a Master of City Planning from MIT's Department of Urban Studies & Planning. Her professional work experience includes instituting and managing a Public Health Division within Metropolitan Boston’s regional planning agency, as well as designing and overseeing the implementation of healthy urban planning strategies under a federally funded Community Transformation Grant.
9-426 marcaya@mit.edu Community Planning and Economic Development, Environmental Planning and Management, Healthy Communities and Active Living, Urban Information, Technology, and Media and Analytics
Finally, let's write our scraped data to a CSV. We'll keep it simple and use the stock csv library; the only slight quirk here is that we're using the writeheader() method of the DictWriter object to write field names. This is a standard workflow that you've seen before if you've been following along: we open a file, create a writer object, and use it to write a row for each iteration of some activity (here, each time a faculty member is scraped).
import csv

output_csv = 'faculty.csv'

with open(output_csv, 'w') as f:
    field_names = ['name', 'href', 'pos_1', 'pos_2', 'other_affil', 'bio', 'office', 'email', 'interests']
    writer = csv.DictWriter(f, field_names)
    writer.writeheader()
    for person in people[0:3]:
        name_href = person.find('a', class_='username')
        name = name_href.get_text()
        print(f'Scraping {name}...')
        href = 'http://dusp.mit.edu' + name_href.get('href')
        pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
        pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
        other_affil = person.find('div', class_='views-field-field-other-division').get_text()
        if href:
            person_soup = BeautifulSoup(requests.get(href).content, 'html.parser')
            bio = person_soup.find('div', class_='pane-user-field-bio')
            if bio:
                bio = bio.get_text()
            office = person_soup.find('div', class_='views-field views-field-field-office')
            if office:
                office = office.get_text()
            email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
            if email:
                email = email.get_text()
            interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
            if interests:
                interests = interests.next_sibling.next_sibling.get_text()
            row = {
                'name': name,
                'href': href,
                'pos_1': pos_1,
                'pos_2': pos_2,
                'other_affil': other_affil,
                'bio': bio,
                'office': office,
                'email': email,
                'interests': interests
            }
            writer.writerow(row)
        print("Done!")
Scraping Cherie Abbanat...
Done!
Scraping Paul Altidor...
Done!
Scraping Mariana Arcaya...
Done!
As a final extension, let's download each faculty member's photo! We can use the wget module to download the image, and store the image filename in an additional field in our CSV.
# These two lines are necessary to solve a certificate verification problem.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Inside the scraping loop: find the person's photo (an img tag)...
image_url = person.find('img')
if image_url:
    # ...pull the image's URL from its src attribute...
    image_url = image_url.get('src')
    out_dir = os.getcwd() + '/images/'
    # ...build a consistent filename from the person's name (e.g., cherie_abbanat.jpg)...
    image_file = name.replace(' ', '_').lower().replace('.', '') + '.jpg'
    try:
        # ...and download the photo into the images/ directory.
        wget.download(image_url, out_dir + image_file)
    except Exception as err:
        print(err, '---Could not download faculty photo.')
        image_file = None
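One caveat: the snippet above assumes that an images/ directory already exists in your working directory; wget won't create it for you. If you need to create it, a small preparatory step like this should do it:

import os

# Create the images/ directory if it doesn't already exist.
out_dir = os.getcwd() + '/images/'
os.makedirs(out_dir, exist_ok=True)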
If you run this for all faculty photos, you’ll find that some do not allow you to download them. I’ve chatted with Duncan Kincaid about this and his diagnosis is that the DUSP webpage is powered by some fairly old infrastructure that at this point is run through with inconsistent permissions, etc. Which is good! This is exactly the kind of trouble you can get into while scraping.
import csv
import os
import wget
import time

# Same as above---these prevent certificate verification errors.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

output_csv = 'faculty.csv'

with open(output_csv, 'w') as f:
    field_names = ['name', 'href', 'pos_1', 'pos_2', 'other_affil',
                   'bio', 'office', 'email', 'interests', 'image_file']
    writer = csv.DictWriter(f, field_names)
    writer.writeheader()
    for person in people[0:3]:
        name_href = person.find('a', class_='username')
        name = name_href.get_text()
        print(f'Scraping {name}...')
        href = 'http://dusp.mit.edu' + name_href.get('href')
        image_url = person.find('img')
        if image_url:
            image_url = image_url.get('src')
            out_dir = os.getcwd() + '/images/'
            image_file = name.replace(' ', '_').lower().replace('.', '') + '.jpg'
            try:
                wget.download(image_url, out_dir + image_file)
            except Exception as err:
                print(err, '---Could not download faculty photo.')
                image_file = None
        else:
            image_file = None
        pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
        pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
        other_affil = person.find('div', class_='views-field-field-other-division').get_text()
        if href:
            person_soup = BeautifulSoup(requests.get(href).content, 'html.parser')
            bio = person_soup.find('div', class_='pane-user-field-bio')
            if bio:
                bio = bio.get_text()
            office = person_soup.find('div', class_='views-field views-field-field-office')
            if office:
                office = office.get_text()
            email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
            if email:
                email = email.get_text()
            interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
            if interests:
                interests = interests.next_sibling.next_sibling.get_text()
            row = {
                'name': name,
                'href': href,
                'pos_1': pos_1,
                'pos_2': pos_2,
                'other_affil': other_affil,
                'bio': bio,
                'office': office,
                'email': email,
                'interests': interests,
                'image_file': image_file
            }
            writer.writerow(row)
        time.sleep(1.5)
Scraping Cherie Abbanat...
Scraping Paul Altidor...
Scraping Mariana Arcaya...