Category Archives: Search

Google, Yahoo! & Microsoft Collaborate For The Greater Good

Google, Yahoo! and Microsoft have collaborated for the greater good and are all going to support a standardised XML sitemap protocol.

Google was the first to implement XML sitemaps, launching them in June 2005 as a beta product. After a few months of public testing, the beta tag was removed and the service was ready for general consumption. Since then, Google Sitemaps has gained a significant amount of momentum.

It’s wonderful to see that Yahoo! and Microsoft didn’t implement another format specific to their own search engines and have instead collaborated with Google. With the standardised XML format, it’s now possible for content publishers to feed the Google, Yahoo! and Microsoft search engines off the same physical file. That point alone is a huge bonus for the publishing community, though I think it’s even more significant that the standardised format will have the same semantic meaning to all search engines as well.
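
For reference, a minimal sitemap in the shared format looks something like the sketch below. The URL, date and values are made-up placeholders; the full protocol is documented at sitemaps.org:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <!-- one <url> entry per page the engines should know about -->
      <loc>http://www.example.com/</loc>
      <lastmod>2006-11-20</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.8</priority>
    </url>
  </urlset>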

What Is A Search Engine?

A search engine is typically a software application capable of receiving requests and returning results based on a simple, human-readable text phrase or query. When a query is received, the search engine evaluates the request and attempts to find the most relevant results in its database. The relevancy of the returned results is based on complex rules and algorithms which rank each unique resource in the database. The results of a search request are typically sorted in descending order of relevance to the search query.

Search Engine Types

There are three main types of search engines:

  1. human generated
  2. automated or algorithm based
  3. a hybrid of the two previous options

A human generated search engine is what is now generally considered a directory. Users submit their sites and the administrators of the directory review and include sites at their discretion. If a particular web site is accepted, it is evaluated, categorised and subsequently placed within the directory. The most widely known human generated search engine in existence today is the Open Directory (dmoz.org).

An automated or algorithm based search engine does not rely on humans to provide the information that searches run against. Instead, an algorithm based search engine relies on other computer programs, known as web crawlers or spiders, to provide the data. Once the web crawlers or spiders have provided the data, separate computer programs evaluate and categorise the web sites into the index.

Hybrid search engines combine the human generated and algorithm based approaches to increase the quality of the search data. In these systems, the internet is crawled and indexed as in the automated approach; however, the information is reviewed and updated by people as this process takes place.

Search Engine Strengths & Weaknesses

Each technique described above has its own strengths and weaknesses. In a directory style search engine, the quality of the results is often very high because a real person reviews the content on the web site and subsequently takes the appropriate actions. Unfortunately, due to the ever increasing number of web sites and the volume of content on the internet, requiring human intervention to rank and categorise a web site doesn’t scale.

In a purely automated approach, the search engines rely on the speed of software applications to index the internet. While the human based approach might allow for tens or possibly hundreds of pages to be categorised at a time, a search engine spider or crawler is capable of doing thousands or millions of pages simultaneously. The obvious problem with this approach is that since the search engines rely on algorithms, the algorithms can be exploited. In years gone by, “webmasters” cottoned on to how these types of search engines worked and started abusing the system by including keywords in their sites which had nothing to do with the primary focus of the page or domain. The search engine would spider the site and suddenly an online shoe shop was coming up in searches for porn, drugs, gambling and more.

The hybrid based approach attempts to resolve the two aforementioned issues by crawling the internet using software applications and then reviewing the results. The algorithms which rate and categorise a particular web site are tuned appropriately over time and the results they produce are monitored very closely to ensure the accuracy of the search results. Companies which implement a hybrid based approach have teams of people whose sole purpose is to review the validity of various search results. If they find results which they consider to be out of place, those results are marked for investigation. If the results they expect do not come up, that is also noted down and sites can be manually included in the search index.

Now that you know what a search engine is, keep your eyes peeled for a follow-up on how search engines work.

Search Engine Optimisation (SEO): Demystifying The Black Art

Search Engine Optimisation (SEO) is a black art to a lot of people. Anyone publishing content online who is looking for a lot of exposure for their product, service or general announcement really needs to know about it. The unfortunate reality is that most people don’t know about SEO, and those who do know they should use it but don’t really know what it is.

In the coming weeks, I’m going to release a series of short, simple-to-understand posts about what search engine optimisation is and roughly how it works. Through these simple posts, I’m hoping to demystify the black art of search engine optimisation a little, so that less savvy content publishers can understand it and start taking advantage of it.

Below is a short list of some of the items that are going to be covered, with a rough HTML sketch after the list:

  • HTML <title> element
  • HTML heading (<hX>) elements
  • Content
  • URL nomenclature
  • Keywords
  • HTML <meta> data
  • HTML markup
  • Images
  • Inbound links
  • Outbound links
  • Intersite links
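
To make the list a little more concrete, below is a rough sketch of where several of those items live within a page. The page, title, description and links are invented examples for illustration only, not recommendations:

  <html>
  <head>
    <!-- the <title> element: concise, descriptive and unique per page -->
    <title>Handmade Leather Shoes - Acme Shoe Shop</title>
    <!-- <meta> data: a short summary engines may show in their results -->
    <meta name="description" content="Handmade leather shoes, made to order." />
    <meta name="keywords" content="shoes, leather, handmade" />
  </head>
  <body>
    <!-- heading elements: <h1> for the primary topic, <h2> beneath it -->
    <h1>Handmade Leather Shoes</h1>
    <h2>Why Leather?</h2>
    <!-- images: descriptive alt text, since spiders can’t see pixels -->
    <img src="brown-brogue.jpg" alt="Brown leather brogue shoe" />
    <!-- links: descriptive anchor text for internal and outbound links -->
    <a href="/faq-shoe-care.asp">Shoe care FAQ</a>
  </body>
  </html>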

Google & Pingomatic Sitting In A Tree

Following on from my previous post about Google Blog Search accepting pings, it appears that Pingomatic is already sending ping messages into the Google Blog Search service.

By default, WordPress only sends pings to Pingomatic, which then distributes them to a lot of other services. To verify that Pingomatic had already been updated, I amended the previous post’s title and URL. A few minutes later, a search using Google Blog Search showed the newly amended title and URL in the results.
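
For the curious, a ping is just a tiny XML-RPC request; a weblogUpdates.ping call looks roughly like the following, where the blog name and URL are placeholders:

  <?xml version="1.0"?>
  <methodCall>
    <methodName>weblogUpdates.ping</methodName>
    <params>
      <!-- the weblog’s name, followed by its URL -->
      <param><value>My Blog</value></param>
      <param><value>http://www.example.com/</value></param>
    </params>
  </methodCall>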

I love a good service.

Search Engine Optimisation (SEO): Dashes Versus Underscores

Recently I broke a fairly content heavy section of a site down into smaller, more succinct pages. As a by-product, each page now has focused content, instead of sharing a single large page with comparatively jumbled content. Of course, this allows you to target the information on each individual page for search engine optimisation, and thus search engine ranking/placement.

When breaking it into smaller pages, I thought I would use a common naming convention for all of the ASP filenames, for instance faq_<some>_<meaningful>_<name>.asp. I chose an underscore (_) because I’m a fan of C-style programming languages. The underscore character is pretty common there (used for private variable declarations in objects, compiler stuff, …) and I also prefer the ‘look’ of underscores in filenames, as they seem to be ‘out of the way’.

Splitting it all up went well, right up until the pages just weren’t being picked up by Google; however, I could confirm that the site and the parent page were being indexed regularly. I let it slide for a while, in case the links simply hadn’t been followed, for whatever reason, on previous visits. On Wednesday just gone, I decided something had to be wrong for the pages not to be showing up in the various indexes properly.

After investigating all the other pages on the site, the one thing that became apparent was that none of them had underscores in their filenames. The pages whose filenames might have warranted one either concatenated the words with no separating character or used a dash to separate each word. This led me to check how the pages that used dashes (-) were faring in the search engines. It appeared that they had no problems at all and that Google was actually utilising the filename as part of its ‘this page is relevant’ algorithm.

Cruising through some useful searches confirmed that Google considers a dash to be a separating character in a filename. For instance, if you had a filename of faq-some-meaningful-name.asp, Google would see that as “faq some meaningful name” and utilise it when indexing the site. Conversely, an underscore is considered a plain character, which means that unless someone searched for “faq_some_meaningful_name”, that page might not show up as a by-product of its filename.
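
In other words, assuming the behaviour above, the same name in the two styles is indexed quite differently:

  faq-some-meaningful-name.asp → “faq”, “some”, “meaningful”, “name”
  faq_some_meaningful_name.asp → “faq_some_meaningful_name”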

The moral of the story: for the moment, a dash in a filename trumps an underscore, so if you are using underscores in filenames, you might be missing out on valuable search engine ranking.