A quick note to start:
In the following web scraping program, I used:
- Python 3.5.2
- the requests module for Python 3
- the BeautifulSoup4 module for Python 3
- Sublime Text 3
If you want to duplicate the process or code, please feel free :)
Earlier this week I was going over some ideas in my head for fun little experiments I could do with Python. Every time I do one of these, the language opens up just a little more for me, and I grow to love it even more than I already do (which is a lot). This time, I decided I would try my hand at web scraping.
The first thing I noticed was how incredibly easy Python makes the process, given the modules the ecosystem already offers. In this case, BeautifulSoup4, an enormously powerful HTML parsing library that's been around for a while, combined with requests, a module that fetches any URL's page code as plain text, was all I needed to get things going.
This all started when I was trying to organize a bunch of files on my computer. These were mostly media, with some potential for scraping sites like IMDb, iTunes, YouTube and others for metadata that could serve as a basis for organizing these files in a way that makes some sort of sense. Of course, actually doing that wouldn't be my first step - I needed a simpler data set. This led me to a Google Sheet that I helped set up a few years back: link. This Sheet now contains over 800 Twitter handles of game developers, organized by job title, and including their employer at the time of filling it out. What better data set than this to get into web scraping with?
This was the easy part: the "requests" module just takes a URL and, if it gets a response, returns the page's HTML code in text form. Otherwise, it fails and throws an exception. Simple enough.
The Steps
Step 1: requesting the actual data.
import requests

# url holds the address of the page we want to scrape - in my case, a Twitter profile
page_response = requests.get(url)
page_content = page_response.content
Right, so the variable "page_content" now contains the content of the web page we just requested. Cool. This is essentially just the raw HTML source of the page, as plain text. From here, we use BeautifulSoup to mine the contents of the page.
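The snippet above doesn't show what happens when a request fails. As a minimal sketch (my own addition, not code from the original script), assuming you want to skip pages that error out or return a bad status code, you could wrap the call like this:

import requests

def get_page_content(url):
    # Returns the raw page source, or None if the request failed for any reason.
    try:
        page_response = requests.get(url, timeout=10)
        page_response.raise_for_status()  # raise an exception on 4xx / 5xx responses
    except requests.exceptions.RequestException as error:
        print("Could not fetch", url, "-", error)
        return None
    return page_response.content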
Step 2: we have the data, now we parse it to find what we need.
So, I built a small data model for this information as well:
from bs4 import BeautifulSoup

content_soup = BeautifulSoup(page_content, "html.parser")

content = dict()
content["soup"] = content_soup
content["links"] = content_soup.find_all("a")
content["images"] = content_soup.find_all("img")
content["title"] = content_soup.title
content["header"] = content_soup.header
content["body"] = content_soup.body
content["footer"] = content_soup.footer
This builds a nice dictionary for me, which contains all the images, hyperlinks, etc. of the page I'm requesting info from. This doesn't necessarily have to be someone's Twitter page; it can be any page. In the case of my data, Twitter structures its profile pages pretty conveniently for getting someone's home page URL:
spans = content_soup.find_all("span")
for span in spans:
    if span.get("class") is not None:
        if "ProfileHeaderCard-urlText" in span.get("class"):
            if span.a is not None:
                content["homepage_url"] = span.a.get("title")
Great, so now we have someone's homepage URL. As you can see, Twitter provides a nice class name we can search for to retrieve this: "ProfileHeaderCard-urlText". Next step: follow that link, and request the page at the URL we just found!
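The original snippets don't show the request for the homepage itself, so here is a small sketch of that missing step, assuming the page is fetched and parsed exactly like the Twitter profile in steps 1 and 2 (the variable soup below is what the next snippet operates on):

import requests
from bs4 import BeautifulSoup

# Fetch the homepage we just found and parse it the same way as the Twitter profile.
homepage_response = requests.get(content["homepage_url"])
soup = BeautifulSoup(homepage_response.content, "html.parser")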
# "soup" here is the parsed homepage from the snippet above
urls = dict()

links = soup.find_all("a")
for l in links:
    href = l.get("href")
    if href is not None:
        if ".pdf" in href.lower():
            urls["pdf"] = href
        if ".word" in href.lower():
            urls["word"] = href
        if "resume" in href.lower():
            urls["resume"] = href
        if "cv" in href.lower():
            urls["cv"] = href
        if "linkedin" in href.lower():
            urls["linkedin"] = href
        if "facebook" in href.lower():
            urls["facebook"] = href

for k in urls:
    print("\t", k, "\n\t\t", urls[k])
This outputs something like the following (actual output from my own data):
http://www.mattiasvancamp.com
Mattias Van Camp - Technical Artist
    pdf
        projects/ResumeMattiasVanCamp.pdf
    linkedin
        https://www.linkedin.com/in/mattiasvancamp
    resume
        projects/ResumeMattiasVanCamp.doc
As you can see, even this basic scraping, with hardly any fancy checks and no machine learning at all, can extract a lot of information about someone if you just simulate clicking through the links on their profile.
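To make that "clicking through" a little more concrete, here is a rough sketch that ties the three snippets above together into one hypothetical helper function (my own summary, not code from the original script):

import requests
from bs4 import BeautifulSoup

def scrape_profile(twitter_url):
    # Step 1: fetch and parse the Twitter profile page
    profile_soup = BeautifulSoup(requests.get(twitter_url).content, "html.parser")

    # Step 2: pull the homepage URL out of the profile header
    homepage_url = None
    for span in profile_soup.find_all("span", class_="ProfileHeaderCard-urlText"):
        if span.a is not None:
            homepage_url = span.a.get("title")

    # Step 3: follow the homepage link and collect interesting-looking hrefs
    urls = dict()
    if homepage_url is not None:
        homepage_soup = BeautifulSoup(requests.get(homepage_url).content, "html.parser")
        for link in homepage_soup.find_all("a"):
            href = link.get("href")
            if href is None:
                continue
            for keyword in (".pdf", "resume", "cv", "linkedin", "facebook"):
                if keyword in href.lower():
                    urls[keyword] = href
    return urls

Calling this on one of the Twitter URLs from the Sheet would give back a dictionary like the one printed above.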
This leads me to my last point.
Step 3: what can we learn from this?
Well, other than that it might just be a little creepy, there are a few valuable lessons here:
Assuming other web crawlers are likely to work in similar ways, e.g. by looking for certain keywords like "resume", "cv", "linkedin", etc. in the hyperlinks on your home page, you might want to make those links easy for them to find. Taking a lesson from Twitter, you could do something like:
<span class="ResumeLink"><a href="link_to_your_resume.pdf">
Click here for my resume!</a></span>
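As a quick illustration of the flip side (my own example, using the hypothetical "ResumeLink" class from the snippet above), a crawler built like the one in this post could then pick that link up with a single class lookup:

# soup is the parsed home page, just like in the earlier snippets
for span in soup.find_all("span", class_="ResumeLink"):
    if span.a is not None:
        print("Found a resume link:", span.a.get("href"))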
For now, this is as far as I've gotten, but as I continue this foray into web scraping, I plan on seeing just how much information I can extract about someone or something. I still plan to apply this to various forms of media, but for now, because I have a fairly solid data set to work with, people (and their publicly accessible information) will be my test cases.
Stay tuned for more content like this, and if you have any questions or comments, don't hesitate to leave a message!