Parse HTML Code With Beautiful Soup
Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you would interact with a web page using developer tools. Beautiful Soup exposes a couple of intuitive functions you can use to explore the HTML you received. To get started, use your terminal to install the Beautiful Soup library:
$ pip3 install beautifulsoup4
Then, import the library and create a Beautiful Soup object:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
When you add these two lines of code, you’re creating a Beautiful Soup object that takes the HTML content you scraped earlier as its input. When you instantiate the object, you also instruct Beautiful Soup to use the appropriate parser.
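If you want to experiment with the Beautiful Soup API before wiring it up to a live request, you can build a soup from any HTML string. The snippet below is a constructed stand-in for the downloaded page, not the real Monster markup:

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for the downloaded page.
html = """
<html><body>
  <div id="ResultsContainer">
    <section class="card-content"><h2 class="title">Python Developer</h2></section>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h2').text)  # Python Developer
```

The object you get back behaves the same whether the HTML came from requests or from a local string, which makes this a handy way to test parsing logic offline.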
Find Elements by ID
In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.
Switch back to developer tools and identify the HTML object that contains all of the job postings. Explore by hovering over parts of the page and using right-click to Inspect.
Note: Keep in mind that it’s helpful to periodically switch back to your browser and interactively explore the page using developer tools. This helps you learn how to find the exact elements you’re looking for.
At the time of this writing, the element you’re looking for is a <div> with an id attribute that has the value "ResultsContainer". It has a couple of other attributes as well, but below is the gist of what you’re looking for:

<div id="ResultsContainer">
    <!-- all the job listings -->
</div>
Beautiful Soup allows you to find that specific element easily by its ID:
results = soup.find(id='ResultsContainer')
For easier viewing, you can .prettify() any Beautiful Soup object when you print it out. If you call this method on the results variable that you just assigned above, then you should see all the HTML contained within the <div>:

print(results.prettify())
When you use the element’s ID, you’re able to pick one element out from among the rest of the HTML. This allows you to work with only this specific part of the page’s HTML. It looks like the soup just got a little thinner! However, it’s still quite dense.
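Keep in mind that .find() returns None when no element matches, rather than raising an error. A quick constructed example, using a made-up ID for the missing case:

```python
from bs4 import BeautifulSoup

html = '<div id="ResultsContainer"><p>job listings</p></div>'
soup = BeautifulSoup(html, 'html.parser')

results = soup.find(id='ResultsContainer')
missing = soup.find(id='DoesNotExist')  # hypothetical ID, not on the page
print(results.name)  # div
print(missing)       # None
```

This detail matters later in the tutorial, when a missing element causes an AttributeError.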
Find Elements by HTML Class Name
You’ve seen that every job posting is wrapped in a <section> element with the class card-content. Now you can work with your new Beautiful Soup object called results and select only the job postings. These are, after all, the parts of the HTML that you’re interested in! You can do this in one line of code:

job_elems = results.find_all('section', class_='card-content')
Here, you call .find_all() on a Beautiful Soup object, which returns an iterable containing all the HTML for all the job listings displayed on that page.
Take a look at all of them:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)
That’s already pretty neat, but there’s still a lot of HTML! You’ve seen earlier that your page has descriptive class names on some elements. Let’s pick out only those:
for job_elem in job_elems:
    # Each job_elem is a new BeautifulSoup object.
    # You can use the same methods on it as you did before.
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print()
Great! You’re getting closer and closer to the data you’re actually interested in. Still, there’s a lot going on with all those HTML tags and attributes floating around:
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="4" data-m_impr_j_coc="" data-m_impr_j_jawsid="371676273" data-m_impr_j_jobid="0" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="30.1882" data-m_impr_j_lid="619" data-m_impr_j_long="-95.6732" data-m_impr_j_occid="11838" data-m_impr_j_p="3" data-m_impr_j_postingid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" data-m_impr_j_pvc="4496dab8-a60c-4f02-a2d1-6213320e7213" data-m_impr_s_t="t" data-m_impr_uuid="0b620778-73c7-4550-9db5-df4efad23538" href="https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c" onclick="clickJobTitle('plid=619&pcid=4&poccid=11838','Software Developer',''); clickJobTitleSiteCat('{"events.event48":"true","eVar25":"Python Developer","eVar66":"Monster","eVar67":"JSR2CW","eVar26":"_LanceSoft Inc","eVar31":"Woodlands_WA_","prop24":"2019-07-02T12:00","eVar53":"1500127001001","eVar50":"Aggregated","eVar74":"regular"}')">Python Developer
</a></h2>
<div class="company">
<span class="name">LanceSoft Inc</span>
<ul class="list-inline">
</ul>
</div>
<div class="location">
<span class="name">
Woodlands, WA
</span>
</div>
You’ll see how to narrow down this output in the next section.
Extract Text From HTML Elements
For now, you only want to see the title, company, and location of each job posting. And behold! Beautiful Soup has got you covered. You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:

for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem.text)
    print(company_elem.text)
    print(location_elem.text)
    print()
Run the above code snippet and you’ll see the text content displayed. However, you’ll also get a lot of whitespace. Since you’re now working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text.
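For instance, a scraped title often arrives padded with newlines and spaces. A quick sketch with a made-up raw string:

```python
# A hypothetical raw string, as it might come out of .text
raw = "\n    Python Developer    \n\n"

print(raw.strip())           # remove leading/trailing whitespace
print(raw.strip().lower())   # normalize casing, e.g. for comparisons
```

Any other string method, such as .replace() or .split(), works the same way once you have plain Python strings.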
Note: The web is messy and you can’t rely on a page’s structure to be consistent throughout. Therefore, you’ll more often than not run into errors while parsing HTML.
When you run the above code, you might encounter an AttributeError:

AttributeError: 'NoneType' object has no attribute 'text'
If that’s the case, then take a step back and inspect your previous results. Were there any items with a value of None? You might have noticed that the structure of the page is not entirely uniform. There could be an advertisement in there that displays in a different way than the normal job postings, which may return different results. For this tutorial, you can safely disregard the problematic element and skip over it while parsing the HTML:

for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()
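To see the skip-over pattern in isolation, here’s a constructed snippet with one regular posting and one ad-like section that lacks the title element. The class names mirror the tutorial’s, but the HTML itself is made up:

```python
from bs4 import BeautifulSoup

# One normal posting and one ad-like section without a title element.
html = """
<section class="card-content"><h2 class="title">Python Developer</h2></section>
<section class="card-content"><p>Sponsored content</p></section>
"""
soup = BeautifulSoup(html, 'html.parser')

for job_elem in soup.find_all('section', class_='card-content'):
    if job_elem.find('h2', class_='title') is None:
        # This is the element that would have triggered the AttributeError.
        print(job_elem.prettify())
```

Printing the prettified element is a handy way to see exactly what the odd one out looks like.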
Feel free to explore why one of the elements is returned as None. You can use the conditional statement you wrote above to print() out and inspect the relevant element in more detail. What do you think is going on there?

After you complete the above steps, try running your script again. The results finally look much better:
Python Developer
LanceSoft Inc
Woodlands, WA
Senior Engagement Manager
Zuora
Sydney, NSW
Find Elements by Class Name and Text Content
By now, you’ve cleaned up the list of jobs that you saw on the website. While that’s pretty neat already, you can make your script more useful. However, not all of the job listings seem to be developer jobs that you’d be interested in as a Python developer. So instead of printing out all of the jobs from the page, you’ll first filter them for some keywords.
You know that job titles in the page are kept within <h2> elements. To filter only for specific ones, you can use the string argument:

python_jobs = results.find_all('h2', string='Python Developer')
This code finds all <h2> elements where the contained string matches 'Python Developer' exactly. Note that you’re directly calling the method on your first results variable. If you go ahead and print() the output of the above code snippet to your console, then you might be disappointed because it will probably be empty:

[]
There was definitely a job with that title in the search results, so why is it not showing up? When you use string= like you did above, your program looks for exactly that string. Any differences in capitalization or whitespace will prevent the element from matching. In the next section, you’ll find a way to make the string more general.

Pass a Function to a Beautiful Soup Method
In addition to strings, you can often pass functions as arguments to Beautiful Soup methods. You can change the previous line of code to use a function instead:
python_jobs = results.find_all('h2',
                               string=lambda text: 'python' in text.lower())
Now you’re passing an anonymous function to the string= argument. The lambda function looks at the text of each <h2> element, converts it to lowercase, and checks whether the substring 'python' is found anywhere in there. Now you’ve got a match:

>>> print(len(python_jobs))
1

Your program has found a match!
Note: In case you still don’t get a match, try adapting your search string. The job offers on this page are constantly changing and there might not be a job listed that includes the substring 'python' in its title at the time that you’re working through this tutorial.
Finding specific elements based on their text content is a powerful way to filter your HTML response for the information that you’re looking for. Beautiful Soup allows you to use either exact strings or functions as arguments for filtering text.
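If a lambda gets unwieldy, you can pass a named function instead. The sketch below is a constructed example with made-up titles; the None guard is a defensive touch, since a text filter can receive None for tags without a direct string:

```python
from bs4 import BeautifulSoup

# Constructed titles for illustration.
html = """
<h2>Senior Python Developer</h2>
<h2>Engagement Manager</h2>
"""
soup = BeautifulSoup(html, 'html.parser')

def title_mentions_python(text):
    # The string= filter may pass None for tags without a direct string.
    return text is not None and 'python' in text.lower()

matches = soup.find_all('h2', string=title_mentions_python)
print(len(matches))  # 1
```

A named function also makes the filtering logic reusable and easier to test on its own.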
Extract Attributes From HTML Elements
At this point, your Python script already scrapes the site and filters its HTML for relevant job postings. Well done! However, one thing that’s still missing is the link to apply for a job.
While you were inspecting the page, you found that the link is part of the element that has the title HTML class. The current code strips away the entire link when accessing the .text attribute of its parent element. As you’ve seen before, .text only contains the visible text content of an HTML element. Tags and attributes are not part of that. To get the actual URL, you want to extract one of those attributes instead of discarding it.
Look at the list of filtered results python_jobs that you created above. The URL is contained in the href attribute of the nested <a> tag. Start by fetching the <a> element. Then, extract the value of its href attribute using square-bracket notation:

python_jobs = results.find_all('h2',
                               string=lambda text: "python" in text.lower())

for p_job in python_jobs:
    link = p_job.find('a')['href']
    print(p_job.text.strip())
    print(f"Apply here: {link}\n")
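As an aside on attribute access, square brackets and .get() behave differently when an attribute is missing. The snippet below uses a minimal made-up element mirroring the structure of one job title:

```python
from bs4 import BeautifulSoup

# A minimal snippet mirroring one job title element (URL is made up).
html = '<h2 class="title"><a href="https://example.com/apply">Python Developer</a></h2>'
soup = BeautifulSoup(html, 'html.parser')

a_tag = soup.find('a')
print(a_tag['href'])        # square brackets raise KeyError if the attribute is missing
print(a_tag.get('target'))  # .get() returns None instead
```

If you’re not sure an attribute is present on every element, .get() lets you handle the missing case without a try/except.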
The filtered results will only show links to job opportunities that include python in their title. You can use the same square-bracket notation to extract other HTML attributes as well. A common use case is to fetch the URL of a link, as you did above.

Building the Job Search Tool
If you’ve written the code alongside this tutorial, then you can already run your script as-is. To wrap up your journey into web scraping, you could give the code a final makeover and create a command line interface app that looks for Software Developer jobs in any location you define.
You can check out a command line app version of the code you built in this tutorial at the link below:
If you’re interested in learning how to adapt your script as a command line interface, then check out How to Build Command Line Interfaces in Python With argparse.
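As a rough sketch of what such a command line interface could look like, here’s a minimal argparse setup. The flag names and defaults are illustrative assumptions, not the tutorial’s sample app:

```python
import argparse

def build_parser():
    # Hypothetical flags for illustration; the sample app may differ.
    parser = argparse.ArgumentParser(description='Search for job postings.')
    parser.add_argument('--query', default='Software-Developer',
                        help='Job title to search for')
    parser.add_argument('--location', default='Australia',
                        help='Where to search')
    return parser

# Passing an explicit list here stands in for real command line arguments.
args = build_parser().parse_args(['--query', 'Python-Developer'])
url = f'https://www.monster.com/jobs/search/?q={args.query}&where={args.location}'
print(url)
```

From there, you’d feed the constructed URL into the requests call you wrote at the start of the tutorial.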
Additional Practice
Below is a list of other job boards. These linked pages also return their search results as static HTML responses. To keep practicing your new skills, you can revisit the web scraping process using any or all of the following sites:
Go through this tutorial again from the top using one of these other sites. You’ll see that the structure of each website is different and that you’ll need to re-build the code in a slightly different way to fetch the data you want. This is a great way to practice the concepts that you just learned. While it might make you sweat every so often, your coding skills will be stronger for it!
During your second attempt, you can also explore additional features of Beautiful Soup. Use the documentation as your guidebook and inspiration. Additional practice will help you become more proficient at web scraping using Python, requests, and Beautiful Soup.

Conclusion
Beautiful Soup is packed with useful functionality to parse HTML data. It’s a trusted and helpful companion for your web scraping adventures. Its documentation is comprehensive and relatively user-friendly to get started with. You’ll find that Beautiful Soup will cater to most of your parsing needs, from navigating to advanced searching through the results.
In this tutorial, you’ve learned how to scrape data from the Web using Python, requests, and Beautiful Soup. You built a script that fetches job postings from the Internet and went through the full web scraping process from start to finish.
You learned how to:
- Inspect the HTML structure of your target site with your browser’s developer tools
- Gain insight into how to decipher the data encoded in URLs
- Download the page’s HTML content using Python’s requests library
- Parse the downloaded HTML with Beautiful Soup to extract relevant information
With this general pipeline in mind and powerful libraries in your toolkit, you can go out and see what other websites you can scrape! Have fun, and remember to always be respectful and use your programming skills responsibly.
You can download the source code for the sample script that you built in this tutorial by clicking on the link below: