SearchGuild Search Engine Forums
 Why a search engine crawler is not at all like Lynx.



chris
Site Admin


Joined: 12 Jan 2003
Posts: 4329

Karma:
8388602


Posted: Mon Mar 24, 2003 1:52 pm    Post subject: Why a search engine crawler is not at all like Lynx.

I was going to release this as a separate article, but I decided I wanted to rant here :) If anybody wants to steal it for their sites feel free, but credit it back to SearchGuild.

Those with a basic understanding of crawlers and indexing on an algorithmic level will get this; for the rest it'll be an interesting read at least...

-----------------------------------------------------

We're often told, in the SEO industry, that we should imagine crawlers as a very simple browser like Lynx. Quite why that is, I don't know; I can only assume that it helps lazy search engine software developers. But it has become a general trend to confuse the two. The crawler shares just one basic, superficial similarity with Lynx: it processes web pages in a very simple way. The similarity stops there. I therefore believe it worthwhile to take some time to examine the differences and understand just where we could go wrong if we think of search engine crawlers as Lynx-style browsers.

Let's start by examining what the two pieces of software do:

Lynx: retrieves a web page specified by the user and reformats it for display on a screen. Included in that formatting are various extra bits of information, such as what to do if a user performs a particular action (for example the title attribute in anchor tags).

Search Engine Crawler: retrieves a web page specified by a software program (often known as URL control) and saves it. It extracts additional URLs from the page. Later this information is fed through the indexer to generate the actual search index.

These are very different tasks. Whilst Lynx has to actually understand the elements of the page, the search engine crawler does not. Because the crawler is not re-formatting the page for human viewing, there is greater tolerance for error and it can do its job using simple pattern matching. Let's take extracting URLs as an example. Lynx has to actually display the anchor; the crawler does not. So whilst Lynx would have to understand every element of the following tag:

<a href="http://www.searchguild.com" target="_new" title="Example">

the crawler merely needs to look for the pattern that represents an anchor (<a followed by something and then >, or <anchor something >), then extract the href section. This has two important implications:

1. Whilst Lynx must understand that things could be written in a different order or in a different way, to the crawler's simple pattern match it doesn't matter.
2. Following on from 1, because it is a simple pattern match there is greater tolerance for errors. Consider this bad code:

<a href="http://www.searchguild.com fish="battered" onClick="as()####>

It shouldn't validate, so the browser has to choose how to deal with it. The crawler is just pattern matching; the tag still matches the rules I described earlier, so it's just fine.
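To make that concrete, here's a minimal sketch of this kind of pattern matching in Python (my own illustration, not any real crawler's code): it pulls the href value out of both the valid anchor and the broken one above, with no parsing or validation involved.

```python
import re

# Toy link extractor: find "<a ... href=..." and capture the URL value,
# stopping at a quote, whitespace or ">". No parsing, no validation --
# broken markup matches just as happily as valid markup.
HREF_PATTERN = re.compile(r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def extract_links(html):
    """Return every href-looking value found in the raw page source."""
    return HREF_PATTERN.findall(html)

page = '''
<a href="http://www.searchguild.com" target="_new" title="Example">good</a>
<a href="http://www.searchguild.com fish="battered" onClick="as()####>bad</a>
'''
print(extract_links(page))
# -> ['http://www.searchguild.com', 'http://www.searchguild.com']
```

Note that the deliberately broken second anchor yields exactly the same URL as the valid one.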

Incidentally, this is also why crawlers could, if their programmers chose to, easily find links in Javascript or unlinked citations. There's a fundamental difference between interpreting Javascript and being able to find URLs in Javascript. Thinking about this in human terms: if you give somebody who doesn't know Javascript a bit of code with a URL in it and ask them to tell you what the URL is, the chances are they'll see it.
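As a quick illustration of that point (again just a sketch of the general idea, not any engine's actual behaviour), finding a URL inside Javascript needs no Javascript interpreter at all:

```python
import re

# Find anything URL-shaped in raw text. The crawler never has to run,
# or even understand, the script -- the URL is just a recognisable
# pattern of characters.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>)]+')

script = '''
function go() {
    window.location = "http://www.searchguild.com/forums";
}
'''
print(URL_PATTERN.findall(script))
# -> ['http://www.searchguild.com/forums']
```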

When we get to indexing this retrieved page (which just means creating the database for people to search), it's actually nothing like Lynx either. With indexing we want to break things down as far as possible. So the page is turned into a list of positions of each word that occurs on the page, along with any special attributes. By special attributes I mean things like bold, or a font or color that's different from the rest of the page. This really means that we have a very limited subset of HTML with very few tags, and because it is not actually displaying them the search engine has no need to understand what they mean, merely that they delimit a section of text.
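A toy indexer along those lines might look like this (an illustration of the general word-positions-plus-attributes idea only; real indexers are far more involved):

```python
import re

def index_page(html):
    """Reduce a page to (word, position, emphasized) triples."""
    entries = []
    position = 0
    emphasized = False
    # Split the source into tags and text runs; re.split with a capture
    # group keeps the tags in the output.
    for token in re.split(r'(<[^>]*>)', html):
        if token.startswith('<'):
            name = token.strip('</>').split()
            # The indexer doesn't "understand" <b>; it just flags the
            # stretch of text that the tag pair delimits.
            if name and name[0].lower() in ('b', 'strong', 'em'):
                emphasized = not token.startswith('</')
        else:
            for word in re.findall(r'\w+', token):
                entries.append((word.lower(), position, emphasized))
                position += 1
    return entries

print(index_page('Search <b>engine</b> crawlers'))
# -> [('search', 0, False), ('engine', 1, True), ('crawlers', 2, False)]
```

Everything that isn't a word position or an attribute flag is simply thrown away, which is about as far from rendering a page as you can get.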

I can only presume, then, that those who support the view that Lynx shows pages as a crawler would see them do so because they believe that the more simplistic view must be closer to the crawler's. This again does not hold water. Sure, it shows you a page without images, Javascript, Flash and so on. But that's a very superficial way of looking at things. Take images: what about the filename? That's used in ranking, but it doesn't show in Lynx. All you get, without navigating through its horrible menus, is the alt text, and I can hover my mouse in IE as well as the rest of them. Javascript? Well, I've already mentioned that search engines could read Javascript if they wanted to. It's there, it gets read and it gets processed, just not run. Flash? Doesn't AlltheWeb index Flash? It sure does. Is this going to be a growing trend? You bet it is. So hang on, which of those simplifications is actually giving you a truer or better view when you're using Lynx? My answer is none of them.

Many of the people I've spoken to in an effort to try to understand the Lynx myth have pointed me to the "Google Information for Webmasters", which states:

--"Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as Javascript, cookies, session ID's, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site. --"

We've dispensed with many of these elements already, showing why they don't hold water. Let's pick a couple more:

Cookies. Does Google's crawler support cookies? Nope. Does Lynx? It sure does, so why would we want to test our sites with it to check that the cookies are okay for Google?

Session ID's. Does Google's crawler support session ids? Nope. Does Lynx? It sure does, so why would we want to test our sites with it to check that the session ids are okay for Google?

The answer, of course, is in a little word that many of the people I spoke to forgot to read: "may". This essentially means the whole paragraph could be true, false, or partially true and partially false. The only one that is true for Google there is Flash, and even that's unlikely to stay true for long. And frankly, if you don't know when you're using Flash on your pages, you've got problems :)

In reality, the average person using Lynx to check their site in the light of current advice, given by many SEOs and by Google themselves, is likely to end up making mistakes and not finding them. I don't argue that there is never a time when it's beneficial; I merely argue that a regular old browser, with hovering the mouse or right-clicking, is more often than not less confusing, easier, and has a gentler learning curve. To imply that Lynx is anything like a crawler is to tell newbie Niel that because his site doesn't render or work in Lynx it won't get crawled. That's just plain wrong. It will always get crawled, and the vast majority of the time it will get indexed.

I know that now I've written this there will be those that choose to disbelieve me because of established belief, or because of the perception that the established belief is doing something beneficial for them (i.e. that Lynx helps them). I know this because I've spoken to a few people and that has been the general reaction. My one and only answer to that is that I've programmed crawlers, I know the differences, and thinking of crawlers as Lynx shifts your conceptual understanding of them further away from the truth, not closer to it.

Maybe you believe you can see something in it that you could not see elsewhere, but in all likelihood you are doing yourself more damage than your perceived gain. The benefits you perceive could well exist precisely because you believe that Lynx views things like a crawler, i.e. the logic is circular in nature. Take another look at Lynx and ask yourself: "if this is not a representation of what a crawler sees, then what do I gain from this viewpoint?". In either case I ask you to look at things afresh, not with the eyes of what has been said in the past or the provable bad "may"s of one particular search engine, to make your own reasoned decision and, hopefully, to stop another myth.
wildebeest
Moderator


Joined: 21 Jan 2003
Posts: 2816

Karma:
32


Posted: Mon Mar 24, 2003 2:21 pm

Good rant :D

I think the old advice to use Lynx was just to get webmasters to make sure that the text on their pages was actually "readable".
Webby
Senior Member


Joined: 20 Feb 2003
Posts: 652

Karma:
1


Posted: Mon Mar 24, 2003 3:08 pm

Yes, it normally runs along the lines of "what you can see with Lynx will get indexed..." etc. I think it has always been clear that a robot is not meant to render the code visually but to take it back and store it.
chris
Site Admin


Joined: 12 Jan 2003
Posts: 4329

Karma:
8388602


Posted: Wed Mar 26, 2003 8:41 am

On a related note, what's all this "Crawlers read pages from the top down, left to right" nonsense that people seem to go on about nowadays? Grrrr...
wildebeest
Moderator


Joined: 21 Jan 2003
Posts: 2816

Karma:
32


Posted: Wed Mar 26, 2003 9:19 am

Sounds like people (maybe non-coders) trying to explain how the code for tables is read (i.e. usually a left-hand navigation and a right-hand content pane), meaning the first td is read fully first.
Webby
Senior Member


Joined: 20 Feb 2003
Posts: 652

Karma:
1


Posted: Wed Mar 26, 2003 10:04 am

Hmmm,
so if they do not start at the top of the code, where do they start then?
They must start at the top to get the robots meta data surely.
I seem to have missed out on the resources that point to SE spiders not crawling from the top of a page.
chris
Site Admin


Joined: 12 Jan 2003
Posts: 4329

Karma:
8388602


Posted: Wed Mar 26, 2003 2:04 pm

They don't start anywhere; it's not a linear read. Take this code:

Code:
<html><body>
Testing</body>
<head>
<title>Hello</title>
<meta name="blahrobots" content="blahblarchive">
</head>
</html>


Here our blahrobots tag is at the end of the document. Common sense says the HEAD section should be at the... well... head of the document. However, the standards don't define where it must appear, which is why, when I tested it, both Netscape and IE rendered it properly, including the title.

There are probably numerous other examples, but that's a good one to show how our preconceptions and assumptions distort our view of what crawlers do.

Pattern matching is not the same as reading. It's not a "that comes before that" process. The only time order comes into play is at indexing, when the actual word positions are stored.
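A Python sketch of the same point (hypothetical, with the "blah" placeholders above swapped for the standard robots/noarchive names): a pattern match finds the meta tag wherever it sits in the source, head-at-the-end or not.

```python
import re

# Find a robots meta tag by pattern, wherever it appears in the source.
# The tag's position in the document is irrelevant to the match.
ROBOTS_PATTERN = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE)

# Same shape as the test page above: the HEAD comes after the BODY.
page = '''<html><body>
Testing</body>
<head>
<title>Hello</title>
<meta name="robots" content="noarchive">
</head>
</html>'''

match = ROBOTS_PATTERN.search(page)
print(match.group(1) if match else None)
# -> noarchive
```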
chris
Site Admin


Joined: 12 Jan 2003
Posts: 4329

Karma:
8388602


Posted: Wed Mar 26, 2003 2:07 pm

Quote:
I seem to have missed out on the resources that point to SE spiders not crawling from the top of a page.


<grins> That's probably because this thread is about the sum of those resources. A lot of people just assume and then naturally end up with it being something like reading a book. And then ...
wildebeest
Moderator


Joined: 21 Jan 2003
Posts: 2816

Karma:
32


Posted: Wed Jul 09, 2003 9:25 pm

hehe, just read this little snippet from Google's webmaster guidelines:
Quote:
...because most search engine spiders see your site much as Lynx would
chris
Site Admin


Joined: 12 Jan 2003
Posts: 4329

Karma:
8388602


Posted: Thu Jul 10, 2003 8:50 am

Yeh, it also says:

Quote:
Once your site is online, submit it to Google at http://www.google.com/addurl.html.


;)
Disclaimer: The information on this site represents the opinions of the poster and is not endorsed by this site in any way. Please read and abide by the Forum Rules. Please be aware of our Privacy Policy, Complaints Procedure, and Moderator and Admin Selection Procedure. © Copyright Searchguild.com. All Rights Reserved. Felix qui potuit rerum cognoscere causas.