How to Use Beautiful Soup to Parse Text for NLP Projects

Henry Alpert
6 min read · Mar 5, 2022

The Internet is filled with words, and behind every webpage those words are nestled within a hierarchy of HTML tags. Beautiful Soup is a Python library that can pull text from a webpage using those tags, and once you have the text, you can use it as data for Natural Language Processing or other data science projects.

Before I get to an overview of how it works, I feel compelled to answer the question, Why is the library called Beautiful Soup? According to Wikipedia, “‘tag soup’ is a pejorative for syntactically or structurally incorrect HTML written for a web page,” and this library apparently gets its name from the term “tag soup.”

Pulling Headlines from a News Site

To give a general example of how Beautiful Soup works, I’ll use it to pull all headlines from a fictional news site’s front page. (I’m using a fictional site for this example, because many sites do not allow web scraping.)

Once you have Beautiful Soup installed on your machine (see documentation), spark up a Jupyter notebook and pull the website into it. Note that you can’t feed a URL directly to Beautiful Soup. You need to get the document behind the webpage with requests, and then feed that document into Beautiful Soup using the response’s content attribute along with an HTML parser.
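Here’s a minimal sketch of that two-step pattern. Because the site is fictional, the live request is commented out (the URL is a placeholder), and the same parsing step is demonstrated on an inline HTML string instead:

```python
from bs4 import BeautifulSoup

# The real fetch would look like this (URL is a placeholder):
# import requests
# response = requests.get("https://fictional-news-site.example")
# soup = BeautifulSoup(response.content, "html.parser")

# The same parsing step, shown on an inline HTML string:
html = "<html><body><h3>Five Tips for Tax Season</h3></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h3.text)  # Five Tips for Tax Season
```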

Back in the browser, use the browser’s inspector to get a look at the HTML behind the webpage. The inspector goes by slightly different names depending on the browser, but the keyboard shortcut is Ctrl-Shift-I in Windows-based browsers, and Command-Option-I works in Safari. Inspecting the page, I can see that all headlines on the home page are within an <h3> tag.

Let’s take a look at the first three of them using find_all() (older code may spell it findAll()).
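The call itself is a one-liner. The snippet below is an invented stand-in for the fictional front page, with class names chosen to mirror the output that follows:

```python
from bs4 import BeautifulSoup

html = """
<h3 class="rr562" size="300">Russia's Invasion of Ukraine Enters New Phase<div class="jk500"></div></h3>
<h3 class="rr562" size="300">Big Tech Firm Launching New Product</h3>
<h3 class="rr562" size="400">Five Tips for Tax Season</h3>
"""
soup = BeautifulSoup(html, "html.parser")

first_three = soup.find_all("h3")[:3]
print(first_three)
```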

[<h3 class="rr562" size="300">Russia's Invasion of Ukraine Enters New Phase<div class="jk500"></div></h3>,
<h3 class="rr562" size="300">Big Tech Firm Launching New Product</h3>,
<h3 class="rr562" size="400">Five Tips for Tax Season</h3>]

As you can see, I get all the HTML markup along with the headlines, when I just want the actual headlines. But I can isolate the text with Beautiful Soup’s text attribute (or its get_text() method). Here, I use a for loop to make a list of all the headlines on the home page, and then I’ll display the first four:
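A sketch of that loop, again on an inline stand-in for the front page:

```python
from bs4 import BeautifulSoup

html = """
<h3>Russia's Invasion of Ukraine Enters New Phase</h3>
<h3>Big Tech Firm Launching New Product</h3>
<h3>Five Tips for Tax Season</h3>
<h3>Biden Announces Initiative</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect just the text of each <h3> tag:
headlines = []
for tag in soup.find_all("h3"):
    headlines.append(tag.text)

print(headlines[:4])
```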

["Russia's Invasion of Ukraine Enters New Phase",
'Big Tech Firm Launching New Product',
'Five Tips for Tax Season',
'Biden Announces Initiative']

And that’s it! I grabbed the day’s headlines from this fictional website and made them into a list of strings in Python.

Make a Dataframe Using Beautiful Soup

For a Natural Language Processing project, I used Beautiful Soup in a similar way. My dataset comprised some 22,000 business news articles that appeared on the Reuters newswire in 1987. The articles were provided as a number of SGM files, a format based on SGML, the markup language that both HTML and XML descend from.

(The dataset used in the following examples is called “Reuters-21578, Distribution 1.0.” It can be found here and has been made available for public use. The README file is here.)

Each article could be tagged with one or more of 135 topics, typically about commodities (oat, grain, gold, etc.), but some weren’t tagged with any topic at all. The goal of my project was to develop an NLP classifier for the topic earn, indicating that the article concerned a business’s earnings.

Here, I’m reading one of the datafiles and inspecting the first 1,000 characters.
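The read itself is ordinary file I/O. In the sketch below, a tiny stand-in document is written to disk first so the example is self-contained; the real files have names like reut2-000.sgm (an assumption based on the distribution’s naming scheme):

```python
# Stand-in for one of the dataset's SGM files:
sample = ('<!DOCTYPE lewis SYSTEM "lewis.dtd">\n'
          '<REUTERS TOPICS="YES" NEWID="1">\n'
          '<TOPICS><D>cocoa</D></TOPICS>\n'
          '<TEXT><TITLE>BAHIA COCOA REVIEW</TITLE>\n'
          '<BODY>Showers continued throughout the week.</BODY></TEXT>\n'
          '</REUTERS>\n')

with open("sample.sgm", "w", encoding="latin-1") as f:
    f.write(sample)

# Read the file back and inspect the first 1,000 characters:
with open("sample.sgm", encoding="latin-1") as f:
    data = f.read()

print(data[:1000])
```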

'<!DOCTYPE lewis SYSTEM "lewis.dtd">\n<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">\n<DATE>26-FEB-1987 15:01:01.79</DATE>\n<TOPICS><D>cocoa</D></TOPICS>\n<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>\n<PEOPLE></PEOPLE>\n<ORGS></ORGS>\n<EXCHANGES></EXCHANGES>\n<COMPANIES></COMPANIES>\n<UNKNOWN> \n&#5;&#5;&#5;C T\n&#22;&#22;&#1;f0704&#31;reute\nu f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>\n<TEXT>&#2;\n<TITLE>BAHIA COCOA REVIEW</TITLE>\n<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the sa'

It’s hard to read, but you can recognize some angle-bracketed tags in there, including tags you won’t find in HTML, like <companies>, <topics>, and <exchanges>. This format isn’t HTML but an XML-like markup that allows for custom tags, and Beautiful Soup can parse it too. I’ll use it here to parse the document, and then use Beautiful Soup’s prettify() function to make it more readable. Here, I’ll print the first 300 characters.
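A sketch of the parse-and-prettify step on a small stand-in document. I use Python’s built-in html.parser here; the <html>/<body> wrapper in the output below suggests the original run used the lxml parser, which adds one:

```python
from bs4 import BeautifulSoup

# Stand-in for the raw SGM content read from disk:
data = ('<REUTERS TOPICS="YES" NEWID="1">'
        '<DATE>26-FEB-1987 15:01:01.79</DATE>'
        '<TOPICS><D>cocoa</D></TOPICS>'
        '</REUTERS>')

soup = BeautifulSoup(data, "html.parser")  # tag names come out lowercased
print(soup.prettify()[:300])
```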

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<html>
<body>
<reuters cgisplit="TRAINING-SET" lewissplit="TRAIN" newid="1" oldid="5544" topics="YES">
<date>
26-FEB-1987 15:01:01.79
</date>
<topics>
<d>
cocoa
</d>
</topics>
<places>
<d>
el-salvador
</d>
<d>

Ah, that’s a lot easier to look at. Using this technique and inspecting more of the data, I find that each article is demarcated by a <reuters> tag, topics are under the <topics> tag, and all the article content is nestled in a <text> tag.

To create my dataframe to feed into machine learning models, first I find all the <topics> tags. This gives me a Beautiful Soup ResultSet that resembles a list of tags.
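A sketch of that step, using a small stand-in for several parsed articles:

```python
from bs4 import BeautifulSoup

# Stand-in for a parsed file containing several articles:
data = ('<REUTERS><TOPICS><D>cocoa</D></TOPICS></REUTERS>'
        '<REUTERS><TOPICS></TOPICS></REUTERS>'
        '<REUTERS><TOPICS><D>earn</D></TOPICS></REUTERS>')
soup = BeautifulSoup(data, "html.parser")

# One <topics> tag per article, empty or not:
topics = soup.find_all("topics")
print(topics)
```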

[<topics><d>cocoa</d></topics>,
<topics></topics>,
<topics></topics>,
<topics></topics>,
<topics><d>grain</d><d>wheat</d><d>corn</d><d>barley</d><d>oat</d><d>sorghum</d></topics>,
<topics><d>veg-oil</d><d>linseed</d><d>lin-oil</d><d>soy-oil</d><d>sun-oil</d><d>soybean</d><d>oilseed</d><d>corn</d><d>sunseed</d><d>grain</d><d>sorghum</d><d>wheat</d></topics>,
<topics></topics>,
<topics></topics>,
<topics><d>earn</d></topics>,
<topics><d>acq</d></topics>]

As with the messy HTML and CSS in the first example, I need to get rid of all the XML tags and make a new, clean-looking list. I can use Beautiful Soup’s text attribute to do that.
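One way to build that nested list is to grab the text of each <d> tag inside every <topics> tag; empty <topics> tags naturally become empty lists:

```python
from bs4 import BeautifulSoup

data = ('<TOPICS><D>cocoa</D></TOPICS>'
        '<TOPICS></TOPICS>'
        '<TOPICS><D>grain</D><D>wheat</D></TOPICS>')
soup = BeautifulSoup(data, "html.parser")

# One list of topic strings per article:
topic_lists = [[d.text for d in t.find_all("d")]
               for t in soup.find_all("topics")]
print(topic_lists)  # [['cocoa'], [], ['grain', 'wheat']]
```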

[['cocoa'],
[],
[],
[],
['grain', 'wheat', 'corn', 'barley', 'oat', 'sorghum'],
['veg-oil',
'linseed',
'lin-oil',
'soy-oil',
'sun-oil',
'soybean',
'oilseed',
'corn',
'sunseed',
'grain',
'sorghum',
'wheat'],
[],
[],
['earn'],
['acq']]

Next, I make a function to pull out the earn topic, which is my primary classification target. If an article is tagged earn alone or earn along with any other topic, it’s classified as earn. If it doesn’t include earn, I’m calling it other. And some articles haven’t been assigned any topic. At this point in the project, I’m not sure why, so I’m keeping those and calling them blank.
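A minimal sketch of such a function (the name label_topic is my own):

```python
def label_topic(topic_list):
    """Label an article 'earn', 'other', or 'blank' from its topic list."""
    if not topic_list:           # article was never assigned a topic
        return "blank"
    if "earn" in topic_list:     # earn alone or alongside other topics
        return "earn"
    return "other"

print(label_topic(["earn", "acq"]))  # earn
print(label_topic([]))               # blank
```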

Now, I create my dataframe with the topic column.
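With pandas, that’s essentially one line; the labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical labels from the classification step, one per article:
labels = ["other", "blank", "earn", "other"]
df = pd.DataFrame({"topic": labels})
print(df)
```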


Looking good. For the content of all the articles, I follow a similar process. I find every <text> tag in the Beautiful Soup object, which gives me another ResultSet. Then, I create an empty list and loop through the ResultSet to add each article’s text as a single string.

Now, I can add this list to my dataframe.
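Sketching both steps together on a small stand-in document: get_text() with a separator flattens each article, title and body, into one string, and the new list becomes a column (the topic labels here are made up):

```python
from bs4 import BeautifulSoup
import pandas as pd

data = ('<REUTERS><TEXT><TITLE>BAHIA COCOA REVIEW</TITLE>'
        '<BODY>Showers continued throughout the week.</BODY></TEXT></REUTERS>'
        '<REUTERS><TEXT><TITLE>SECOND STORY</TITLE>'
        '<BODY>More news.</BODY></TEXT></REUTERS>')
soup = BeautifulSoup(data, "html.parser")

# One flattened string per article:
articles = []
for t in soup.find_all("text"):
    articles.append(t.get_text(" ", strip=True))

df = pd.DataFrame({"topic": ["other", "earn"]})  # made-up labels
df["article_text"] = articles
print(df)
```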


This dataframe is the one I was looking to create. Now, I can do the other tasks needed to get the text ready for NLP modeling, such as performing Exploratory Data Analysis, removing stopwords, tokenizing the articles’ text, and so on.

Summary

  • Pull the data into a Beautiful Soup object and parse it
  • Inspect the Beautiful Soup object to find the useful tags
  • Make new Beautiful Soup objects out of the tags
  • Use the text attribute or get_text() function to pull out just the text
  • Use loops to create lists to add to a dataframe
