Digging further into internet data mining: how website structure affects your website's visibility and ease-of-use for web scrapers (and why it should matter to you).
This post is a continuation of my previous excursion into web scraping, which I wrote about here: https://techartworks.blogspot.com/2016/12/an-excursion-into-web-scraping-with.html
In an attempt to take some first, albeit tentative, steps into web scraping with Python this week (making use of the fantastic "requests" and "beautifulsoup4" modules), I ran into a few walls, but also made a good bit of headway. In particular, I gained some insights into ways you could optimize your website's structure to make it friendlier for both human visitors and web scrapers, thereby increasing the chances of your pages popping up when someone googles your name.
Before going any further with this, there are a few notes I would like to make:
- Networks like LinkedIn provide ways to aggregate information way beyond what one person can do, and come with nifty communication features. Using them is highly advised if you want to be visible.
- Algorithms like the one I built require a data set to start from. I'm lucky to have over 800 Twitter handles belonging to a group of people who are all very interested in making themselves visible online. That interest usually goes hand in hand with well-structured websites, so my test data is not ideal. That said, a fair number of these sites did pose challenges that I had to overcome, and I will detail the solutions here.
- Google, LinkedIn, Facebook, Twitter and other websites that provide profile pages (like ArtStation) are very good at SEO. Nothing I can tell you will make your site more discoverable than your profile there. Still, optimizing your own site carries some value too.
- This is not an SEO tutorial. There are plenty of those; google them :)
With that out of the way, let's get into the nitty gritty.
Let's retrace the steps we've taken up to this point:
- We started from a set of approximately 800 game developers; for most of them the data consists of their full name, their employer at the time of writing, and their Twitter handle.
- From these people's Twitter profiles, where possible, we retrieved their homepage URL, which Twitter makes easily accessible if provided.
- From their homepage URL (which we're assuming is their portfolio, or in any case the page they want people to visit if they're interested in hiring them), we were able to mine some basic details about each of them, such as direct links to their Facebook and LinkedIn profiles, their email address, and their resume PDF / DOC if provided.
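The last step above can be sketched roughly as follows. This is not the crawler from the previous post: that one used "requests" and "beautifulsoup4", while this sketch swaps in Python's standard-library html.parser so it runs without extra dependencies, and it works on an inline sample page instead of a downloaded one. The names ProfileLinkParser and mine_homepage, the sample markup and the janedoe.example domain are all made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple e-mail pattern; real-world addresses can be messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class ProfileLinkParser(HTMLParser):
    """Hypothetical helper: keeps the first profile link, e-mail and resume it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = {"facebook": None, "linkedin": None, "email": None, "resume": None}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        lowered = url.lower()
        if lowered.startswith("mailto:"):
            self._keep("email", url[len("mailto:"):].split("?")[0])
        elif "facebook.com/" in lowered:
            self._keep("facebook", url)
        elif "linkedin.com/" in lowered:
            self._keep("linkedin", url)
        elif lowered.endswith((".pdf", ".doc", ".docx")):
            self._keep("resume", url)

    def handle_data(self, data):
        # Also catch addresses written out as plain text.
        match = EMAIL_RE.search(data)
        if match:
            self._keep("email", match.group(0))

    def _keep(self, key, value):
        if self.found[key] is None:
            self.found[key] = value


def mine_homepage(html, base_url):
    parser = ProfileLinkParser(base_url)
    parser.feed(html)
    return parser.found


# Made-up homepage standing in for a downloaded portfolio page.
SAMPLE_HOMEPAGE = """
<html><body>
  <a href="https://www.linkedin.com/in/jane-doe">LinkedIn</a>
  <a href="https://www.facebook.com/jane.doe">Facebook</a>
  <a href="/files/jane_doe_resume.pdf">Resume</a>
  <a href="mailto:jane@example.com">Mail me</a>
</body></html>
"""

details = mine_homepage(SAMPLE_HOMEPAGE, "https://janedoe.example/")
for key, value in details.items():
    print(key, "->", value)
```

In a real run the HTML string would come from an HTTP request to the homepage URL found on the Twitter profile.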
The Next Step
The next step now is to compare how our crawler works with the ones we can find around the web, to try and establish a baseline. Below is a list of good sources of information on this topic:
- https://en.wikipedia.org/wiki/Web_crawler
- http://www.socialhunt.net/blog/extracting-website-data/
- http://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website/
- http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md
If you don't want to go through all of these, the cliff notes are as follows:
- Web crawlers are pattern recognition engines of various levels of complexity and intelligence.
- Web crawlers are generally calibrated to look for certain patterns that follow conventions in order to provide the "spider" with the information it's looking for.
- If we follow conventions, we make it easy for bots to crawl our website, and therefore easier for services like Google to get to our information (which is great for a portfolio).
So, what can we gain from this knowledge, and how can we use it?
This is a little tricky to give a straight answer to, but I've outlined some basics:
Make your website as "shallow" as possible
As web crawlers' default behavior is to "crawl" recursively through all the pages in your website, the more navigation options you give them, the longer it's going to take them to index all your information. Big crawlers that have to process the entire internet are therefore likely to have a limit on the number of links they recurse through before moving on to the next website. In other words: the fewer links it takes to get to the info, the better.
In terms of user friendliness, this is a great tip too: if a recruiter has to click more than two times to get to your resume or contact information, they'll likely run out of time and/or patience and move to the next applicant.
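To illustrate the crawl-depth point, here is a minimal sketch that assumes (as speculated above, not as a documented fact about any search engine) that a crawler stops following links beyond a fixed depth. The two site maps, the crawl function and the max_depth parameter are hypothetical; the "sites" are plain dictionaries, so nothing is downloaded.

```python
from collections import deque

# Hypothetical site maps: each page lists the pages it links to.
DEEP_SITE = {
    "/": ["/about", "/work"],
    "/about": ["/about/contact"],
    "/about/contact": ["/about/contact/resume.pdf"],
    "/about/contact/resume.pdf": [],
    "/work": [],
}
SHALLOW_SITE = {
    "/": ["/about", "/work", "/resume.pdf"],
    "/about": [],
    "/work": [],
    "/resume.pdf": [],
}


def crawl(site, start="/", max_depth=1):
    """Breadth-first crawl that does not follow links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth budget used up: this page's links are not followed
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


deep_hit = any(page.endswith("resume.pdf") for page in crawl(DEEP_SITE))
shallow_hit = any(page.endswith("resume.pdf") for page in crawl(SHALLOW_SITE))
print("resume found on deep site:", deep_hit)
print("resume found on shallow site:", shallow_hit)
```

With a depth budget of one link, the resume buried three clicks deep is never reached, while the one linked from the homepage is. Raising max_depth to 3 finds both, at the cost of visiting more pages.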
Link to your resume PDF/DOC directly from your homepage
One thing crawlers are almost certain to index when they visit your site is your index page. If your resume document is linked directly from that page, a search engine is likely to pick it up along with the page itself. This could result in the search engine displaying a direct link to your resume PDF when someone googles your name followed by "resume"; exactly what a recruiter wants.
Include any pertinent information on your homepage
This goes back to the previous points: the more information on a single page, the less navigation a bot or recruiter has to do in order to get to it. This means that search engines, once they have indexed your site, might display your contact information directly in the search results, saving a potential employer the trouble of actually navigating to your page if they're just collating URLs at that point.
Use the class and id attributes to flag important information
When bots look through your page, they will search through the following:
- all hyperlinks
- any divs and spans with attributes they're looking for
- any text blocks with information they're looking for
- title, metadata, footer and other classic website information containers
To make it easier for them to get to your info, you could include keywords in the class attribute of the div and span tags that wrap important information. Keep in mind that an id must be unique within the page and cannot contain spaces, so it can only hold a single keyword, while class accepts a space-separated list:
<span id="linkedin" class="reference external profile professional"> <a href="http://www.linkedin.com">Linkedin</a> </span>
As you can see, I'm not doing anything fancy here. All this essentially does is attach "metadata" to the important and relevant links that dot your website so crawlers can use them to navigate more efficiently.
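From the scraper's side, picking up such flagged links could look something like the sketch below. It is only an illustration of the idea, not how any particular search engine works: the FlaggedLinkParser class and the sample markup are hypothetical, and it uses Python's standard-library html.parser in place of "beautifulsoup4" so it runs as-is. It treats both the class and the id of a wrapping span or div as a list of keywords.

```python
from html.parser import HTMLParser


class FlaggedLinkParser(HTMLParser):
    """Hypothetical helper: collects links wrapped in a span/div flagged with a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.stack = []  # one bool per open span/div: does it carry the keyword?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("span", "div"):
            tokens = (attrs.get("class") or "").split() + (attrs.get("id") or "").split()
            self.stack.append(self.keyword in tokens)
        elif tag == "a" and any(self.stack) and attrs.get("href"):
            # The link sits inside at least one flagged wrapper.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("span", "div") and self.stack:
            self.stack.pop()


# Made-up page fragment: one flagged link, one unflagged link.
SAMPLE = """
<div class="footer">
  <span id="linkedin" class="reference external profile professional">
    <a href="http://www.linkedin.com">Linkedin</a>
  </span>
  <span class="misc"><a href="http://example.com/blog">Blog</a></span>
</div>
"""

parser = FlaggedLinkParser("linkedin")
parser.feed(SAMPLE)
print(parser.links)
```

Only the link inside the flagged span is collected; the blog link is ignored because none of its wrappers carries the keyword.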
A short conclusion
Web crawlers, while sometimes invasive if you don't want to be visible online, work in a certain way. They are algorithms after all, however complicated. This means they work with a set of rules that we can use to make them work for us and increase our visibility online if we so wish. Obviously this will have only minor impact, and won't ever be as useful as having a LinkedIn profile for example. Nevertheless these insights could prove useful.
That's all for now. I'll keep digging to see how many more useful insights I can get out of this, so stay tuned :)