A quick note to start:
In the following web scraping program, I used:
- Python 3.5.2
- the requests module for Python 3
- the BeautifulSoup4 module for Python 3
- Sublime Text 3
If you want to duplicate the process or code, please feel free :)
Earlier this week I was going over some ideas in my head for fun little experiments I could do with Python. Every time I do one of these, the language opens up just a little more for me, and I grow to love it even more than I already do (which is a lot). This time, I decided I would try my hand at web scraping.
The first thing I noticed was how incredibly easy Python makes the process, given the modules the ecosystem already offers. In this case, BeautifulSoup4, an enormously powerful HTML parsing library that's been around for a while, combined with requests, a module that fetches any URL's page code as plain text, was all I needed to get things going.
This all started when I was trying to organize a bunch of files on my computer. These were mostly media, with some potential for scraping sites like IMDb, iTunes, YouTube and others for metadata that could serve as a basis for organizing these files in a way that makes some sort of sense. Of course, actually doing that wouldn't be my first step - I needed a simpler data set. This led me to a Google Sheet that I helped set up a few years back: link. This Sheet now contains over 800 Twitter handles of game developers, organized by job title, and including their employer at the time of filling it out. What better data set than this to get into web scraping with?
This was the easy part: the "requests" module just takes a URL and, if it gets a response, returns the page's HTML code in text form. Otherwise, it fails and throws an exception. Simple enough.
The Steps
Step 1: requesting the actual data.
import requests

# url holds the address of the page we want to scrape - in my case, a Twitter profile
page_response = requests.get(url)
page_content = page_response.content
Right, so the variable "page_content" now contains the content of the web page we just requested. Cool. This is essentially just the raw HTML source of the page, as plain text. From here, we use BeautifulSoup to mine the contents of the page.
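The snippet above doesn't show what happens when a request fails. As a minimal sketch (my own addition, not code from the original script), assuming you want to skip pages that error out or return a bad status code, you could wrap the call like this:

import requests

def get_page_content(url):
    # Returns the raw page source, or None if the request failed for any reason.
    try:
        page_response = requests.get(url, timeout=10)
        page_response.raise_for_status()  # raise an exception on 4xx / 5xx responses
    except requests.exceptions.RequestException as error:
        print("Could not fetch", url, "-", error)
        return None
    return page_response.content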
Step 2: we have the data, now we parse it to find what we need.
So, I built a small data model for this information as well:
from bs4 import BeautifulSoup

content_soup = BeautifulSoup(page_content, "html.parser")

content = dict()
content["soup"] = content_soup
content["links"] = content_soup.find_all("a")
content["images"] = content_soup.find_all("img")
content["title"] = content_soup.title
content["header"] = content_soup.header
content["body"] = content_soup.body
content["footer"] = content_soup.footer
This builds a nice dictionary for me, which contains all the images, hyperlinks, etc. of the page I'm requesting info from. This doesn't necessarily have to be someone's Twitter page; it can be any page. In the case of my data, Twitter structures its profile pages pretty conveniently for getting someone's home page URL:
spans = content_soup.find_all("span")
for span in spans:
    if span.get("class") is not None:
        if "ProfileHeaderCard-urlText" in span.get("class"):
            if span.a is not None:
                content["homepage_url"] = span.a.get("title")
Great, so now we have someone's homepage URL. As you can see, Twitter provides a nice class name we can search for to retrieve this: "ProfileHeaderCard-urlText". Next step: follow that link, and request the page at the URL we just found!
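The original snippets don't show the request for the homepage itself, so here is a small sketch of that missing step, assuming the page is fetched and parsed exactly like the Twitter profile in steps 1 and 2 (the variable soup below is what the next snippet operates on):

import requests
from bs4 import BeautifulSoup

# Fetch the homepage we just found and parse it the same way as the Twitter profile.
homepage_response = requests.get(content["homepage_url"])
soup = BeautifulSoup(homepage_response.content, "html.parser")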
# "soup" here is the parsed homepage from the snippet above
urls = dict()

links = soup.find_all("a")
for l in links:
    href = l.get("href")
    if href is not None:
        if ".pdf" in href.lower():
            urls["pdf"] = href
        if ".word" in href.lower():
            urls["word"] = href
        if "resume" in href.lower():
            urls["resume"] = href
        if "cv" in href.lower():
            urls["cv"] = href
        if "linkedin" in href.lower():
            urls["linkedin"] = href
        if "facebook" in href.lower():
            urls["facebook"] = href

for k in urls:
    print("\t", k, "\n\t\t", urls[k])
This outputs something like the following (actual output from my own data):
http://www.mattiasvancamp.com
Mattias Van Camp - Technical Artist
    pdf
        projects/ResumeMattiasVanCamp.pdf
    linkedin
        https://www.linkedin.com/in/mattiasvancamp
    resume
        projects/ResumeMattiasVanCamp.doc
As you can see, even this basic scraping, with hardly any fancy checks and no machine learning at all, can extract a lot of information about someone if you just simulate clicking through the links on their profile.
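To make that "clicking through" a little more concrete, here is a rough sketch that ties the three snippets above together into one hypothetical helper function (my own summary, not code from the original script):

import requests
from bs4 import BeautifulSoup

def scrape_profile(twitter_url):
    # Step 1: fetch and parse the Twitter profile page
    profile_soup = BeautifulSoup(requests.get(twitter_url).content, "html.parser")

    # Step 2: pull the homepage URL out of the profile header
    homepage_url = None
    for span in profile_soup.find_all("span", class_="ProfileHeaderCard-urlText"):
        if span.a is not None:
            homepage_url = span.a.get("title")

    # Step 3: follow the homepage link and collect interesting-looking hrefs
    urls = dict()
    if homepage_url is not None:
        homepage_soup = BeautifulSoup(requests.get(homepage_url).content, "html.parser")
        for link in homepage_soup.find_all("a"):
            href = link.get("href")
            if href is None:
                continue
            for keyword in (".pdf", "resume", "cv", "linkedin", "facebook"):
                if keyword in href.lower():
                    urls[keyword] = href
    return urls

Calling this on one of the Twitter URLs from the Sheet would give back a dictionary like the one printed above.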
This leads me to my last point.
Step 3: what can we learn from this?
Well, other than that it might just be a little creepy, there are a few valuable lessons here:
Assuming other web crawlers are likely to work in similar ways, e.g. by looking for certain keywords like "resume", "cv", "linkedin", etc. in the hyperlinks on your home page, you might want to make those links easy for them to find. Taking a lesson from Twitter, you could do something like:
<span class="ResumeLink"><a href="link_to_your_resume.pdf">
Click here for my resume!</a></span>
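As a quick illustration of the flip side (my own example, using the hypothetical "ResumeLink" class from the snippet above), a crawler built like the one in this post could then pick that link up with a single class lookup:

# soup is the parsed home page, just like in the earlier snippets
for span in soup.find_all("span", class_="ResumeLink"):
    if span.a is not None:
        print("Found a resume link:", span.a.get("href"))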
For now, this is as far as I've gotten, but as I continue this foray into web scraping, I plan on seeing just how much information I can extract about someone or something. I still plan to apply this to various forms of media, but for now, because I have a fairly solid data set to work with, people (and their publicly accessible information) will be my test cases.
Stay tuned for more content like this, and if you have any questions or comments, don't hesitate to leave a message!