Surfing and Searching the World Wide Web

What is the Internet? The World Wide Web?

Many people use the terms Internet and World Wide Web interchangeably, but the two are not synonymous. The Internet is a vast collection of connected computers all over the world. The World Wide Web (WWW, or simply the web) is the part of the Internet that consists of pages (HTML documents). As you have learned already, HTML supports hyperlinks to other documents, color, sound, graphics, animation, video, and interactivity. Not all Internet computers are part of the WWW; the WWW is just one of the ways that information can be communicated via the Internet. For instance, e-mail and instant messaging use the Internet but not the WWW.

There are many search engines and web directories that can help you retrieve information if you search with the right keywords. Our Library has an extended list of search engines along with other useful material.

Accessibility of Information beyond imagination!

For example, take a look at the Health Risks of Dihydrogen Monoxide (you can also use a search engine to find it). Take a few minutes and answer the question:

Anyone can now access information on this chemical and be informed. But is it enough?

Who is providing information on the web?

In other words, almost everyone. Searching the web is easy. Accessing the information is easy. But is it reliable?

Recognizing unreliable, incorrect, or just plain bogus information is becoming a skill as important as reading and writing. "Problematic" "information" can be put on the web by anyone:

How big is the problem?

A recent study shows that educated people can be fooled too, because the expectation of reliability that comes with printed material gets shifted onto the web. The study found that:

  1. Students strongly rely on the Internet for information, even when they have trouble finding it.
  2. A majority of students are falling for advertising claims, authoritative misinformation and propaganda. A large minority will even fall for scams.
  3. All students are equally susceptible to misleading claims.

Keep in mind the closing paragraph:

As students continue to view the Internet as a primary source of information, without a significant shift in training methods, this problem will only grow worse. It is vital that they better understand the nature of the Internet and develop an instinctive inclination for verifying all information. This will allow students to take advantage of the tremendous benefits provided by the Internet without falling prey to the pitfalls of online research.

Examples

What are some examples of web sites that might be biased on certain issues?

Discussion: Using Wikipedia

A great online resource is Wikipedia, an online encyclopedia whose entries can be edited by anyone. No entries are signed.

Some researchers feel that Wikipedia is not acceptable as a source for a scholarly article, including work by college students. Indeed, the history department at Middlebury College has declared Wikipedia out of bounds for its students.

What do you think?

Search Engines

You probably know this already, but search engines don't search the web at the moment you ask. They're searching the web all the time, using programs called "web crawlers," "spiders," or "robots," which compile statistics on the words in the web pages they find and build up big databases. Then, when you go to their site and do a search, they look in the database and hand you back a bunch of URLs. Some of those URLs may no longer exist. Some may have changed content. That's all because the search is based on old information.
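The split between crawling ahead of time and looking things up at query time can be sketched in a few lines of Python. This is only an illustration: the pages, URLs, and function names are made up, and a real search engine's database is far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages that a crawler fetched at some earlier time (URL -> text).
CRAWLED_PAGES = {
    "http://example.com/a": "speech writing tips for every speechwriter",
    "http://example.com/b": "tips for searching the web",
    "http://example.com/c": "writing for the web",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

# The index is built ahead of time, not when the user asks.
INDEX = build_index(CRAWLED_PAGES)
print(search(INDEX, "tips web"))
```

Because the search never touches the live web, a page that was deleted or rewritten after the crawl would still be returned exactly as it was indexed.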

Also, it's important to know that search engines are not comprehensive. First, there are millions of new pages being added or updated every day, and the spiders can only get to a fraction of them. Second, the databases are enormous, but they are still limited in size. Consequently, even a very large search engine (such as AltaVista) indexes only 20-30 percent of the web. That percentage is actually dropping over time, as the web grows faster than the databases can keep up.

Another reason that search engines are not comprehensive is that there are parts of the web they can't get to. They can't search archives of newspapers (such as the NYTimes online), either because nothing links to the archived pages (though you can reach them through the site's own search) or because they require payment. There are many online journals and similar resources that require a subscription, such as the MLA Bibliography, Academic Universe, or EconLit. The Wellesley library has subscriptions to many of these sites, so be sure to take advantage of that while you're here. When you're no longer here, just keep in mind that if Google comes up with nothing on a topic, that doesn't mean there's nothing. Talk to a good reference librarian.

Meta Search Engines

Some search engines don't have their own database. What they do is take your query and hand it off to several other search engines and then combine the results. Some people have called them "information carnivores," while regular search engines are "information herbivores," by analogy with the food chain.
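Here is a minimal Python sketch of that hand-off-and-combine step. The two "engines" are stand-in functions returning made-up URLs, and the merging rule (summed reciprocal rank) is just one plausible choice, not a description of how any particular meta search engine works.

```python
def engine_one(query):
    """Hypothetical engine: returns a ranked list of URLs, best first."""
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def engine_two(query):
    """A second hypothetical engine with its own ranking."""
    return ["http://example.com/b", "http://example.com/d"]

def metasearch(query, engines):
    """Send the query to every engine and merge the ranked lists.

    Each URL earns 1/position from every engine that returns it,
    so URLs ranked highly by several engines float to the top.
    """
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(metasearch("speechwriter", [engine_one, engine_two]))
```

Note that the meta search engine keeps no database of its own; everything it knows comes from the engines it queries.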

Page Ranking

Search engine sites try to "rank" the URLs they give you, in order of decreasing "relevance": how well each page seems to match your query.

First, you must read the following, because it's hilarious: http://www.google.com/technology/pigeonrank.html

In the early days, search engines were based on the older field of information retrieval (also known as document retrieval), and the rankings were based entirely on the contents of the documents: how often the search words appear in a given document, weighted by how rare those words are in the entire corpus of documents. The importance of words could be boosted if they appeared in titles or section headers, but it was all document-based. Each search engine had its own custom formula, with discount factors and fudge factors and such.

In fact, all the software you can buy to increase your ranking on search engines (Scott's dad has used some to try to improve the ranking of his speechwriting web site) is based on profiles of what kinds of pages rank well. For example, it would say that the word "speechwriter" has to appear in the title and within the first five words in order to rank well on XYZ search engine.

The following link provides some simple suggestions for helping search engines index your web site. Essentially, you give the robot/spider a list of terms to index your site under, by using the META tag. You try to think of terms that people would use to try to find your site. Notice the variations in the following example:

<META name="keywords" content="speechwriter, speech-writer, speech, communications, talk, ghostwriter">
<META name="description" content="freelance speechwriter, free-lance speech-writer">
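Returning to the content-based ranking described above: the idea of counting how often a search word appears in a document, relative to how rare the word is across all documents, is commonly written as a "TF-IDF" score. The Python sketch below uses one textbook form of that score on three made-up documents. It is not the formula of any actual search engine, each of which had its own custom variations.

```python
import math

# Hypothetical corpus: document id -> text.
DOCS = {
    "doc1": "speechwriter for hire speechwriter with experience",
    "doc2": "hire a plumber with experience",
    "doc3": "web search tips",
}

def tf_idf_score(query, doc_id, docs):
    """Sum, over the query words, of (frequency in this doc) * (rarity in corpus)."""
    words = docs[doc_id].lower().split()
    score = 0.0
    for term in query.lower().split():
        # Term frequency: the fraction of this document's words that are the term.
        tf = words.count(term) / len(words)
        # Number of documents in the corpus that contain the term at all.
        containing = sum(1 for text in docs.values() if term in text.lower().split())
        if containing == 0:
            continue  # a word that appears nowhere contributes nothing
        # Inverse document frequency: rare words get a larger weight.
        idf = math.log(len(docs) / containing)
        score += tf * idf
    return score

def rank(query, docs):
    """Return document ids ordered from most to least relevant."""
    return sorted(docs, key=lambda d: tf_idf_score(query, d, docs), reverse=True)

print(rank("speechwriter hire", DOCS))
```

A word like "speechwriter," which appears in only one document, counts for much more than "hire," which appears in two. This is also why stuffing a page with rare keywords could once improve its ranking.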

Hypertext systems, like the web, also have information about document relationships encoded in their link structure: A link between two items can be regarded as a statement by the author that their document is related to another. Work in the late 1980s sought to exploit this information for indexing and searching. With the advent of the web, more people began to work in this area and researchers at IBM came up with the idea of a web authority, a page (or site) that is frequently linked to by other pages. For example, if Ron Rivest has a great web page on the RSA cryptosystem, lots of web pages will link to his, and this should increase the relevance score of his page for queries about RSA encryption.
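The simplest way to picture an "authority" is to count how many different pages link to each page. The Python sketch below does just that on a tiny made-up link graph (the page names are invented for illustration). The researchers' actual methods were more sophisticated than a raw count.

```python
# Hypothetical link graph: page -> list of pages it links to.
LINKS = {
    "rivest_rsa": [],
    "crypto_course": ["rivest_rsa"],
    "security_blog": ["rivest_rsa", "crypto_course"],
    "student_page": ["rivest_rsa"],
}

def inbound_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):  # a page linking twice still counts once
            if target != source:     # ignore links from a page to itself
                counts[target] = counts.get(target, 0) + 1
    return counts

counts = inbound_counts(LINKS)
# List the pages from most linked-to to least linked-to.
for page in sorted(counts, key=lambda p: (-counts[p], p)):
    print(page, counts[page])
```

In this toy graph, the RSA page is linked to by all three other pages, so it would be treated as the authority on its topic.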

Interestingly, rankings based on link structures and rankings based on document content were separate for quite a while. These two notions of rank were put together in the mid-1990s. (This paper got the ball rolling with a simple strategy for combining the ranks.) Google has rather famously developed this idea to high art.
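To make "combining the ranks" concrete, here is one simple possibility sketched in Python: scale each kind of score to the range 0 to 1, then take a weighted average. This is an assumption for illustration only. It is not the strategy from the paper mentioned above, nor Google's method, and the scores and the weight parameter are made up.

```python
def normalize(scores):
    """Scale scores so the largest becomes 1.0 (all zeros stay zero)."""
    top = max(scores.values(), default=0.0)
    return {page: (value / top if top else 0.0) for page, value in scores.items()}

def combine(content_scores, link_scores, weight=0.5):
    """Weighted average of normalized content and link scores.

    weight=1.0 uses document content only; weight=0.0 uses links only.
    """
    content = normalize(content_scores)
    link = normalize(link_scores)
    pages = set(content) | set(link)
    return {
        page: weight * content.get(page, 0.0) + (1.0 - weight) * link.get(page, 0.0)
        for page in pages
    }

# Hypothetical scores: page "a" matches the query words better,
# but page "b" is linked to by more pages.
content_scores = {"a": 0.4, "b": 0.2}
link_scores = {"a": 1, "b": 4}

print(combine(content_scores, link_scores, weight=0.5))
```

With equal weighting, the heavily linked page "b" comes out ahead; pushing the weight toward 1.0 favors page "a" instead, which shows how much the final order depends on how the two notions are balanced.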

Another issue is "pay for placement." There are search engines where you can pay to get your web page ranked highly. This may be good for you as an author, but it's bad for people searching the web. Suppose you went to a librarian and asked for recommendations of books on a particular subject. Wouldn't you feel a bit concerned if you knew she was receiving a kickback from certain authors to push their books? Don't you want an unbiased opinion about the best book?

What should you do about this "payola"? Try to find out if the search engine accepts "pay for placement." If you think a search engine isn't giving you good results, try another. Don't just look at the first few URLs in a list.

Google Bombing

The page-rank algorithm is not infallible. It can make "mistakes," and it can be tricked. Tricking it is often known as Google bombing; as a trade, it is known as "search engine optimization."
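To see why link-based ranking can be tricked, consider the simplified Python sketch below of the published PageRank idea: a page's score depends on the scores of the pages linking to it. It is a classroom-sized approximation, not Google's production system, and the page names are invented. Real Google bombs also relied on the text of the links, which this sketch ignores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated averaging.

    links maps each page to the list of pages it links to; this sketch
    assumes every link target also appears as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Two competing pages, linked to equally by one hub page.
honest_web = {
    "hub": ["target", "rival"],
    "target": ["hub"],
    "rival": ["hub"],
}

# The same web, plus three throwaway pages created only to link to "target".
bombed_web = dict(honest_web)
for i in range(3):
    bombed_web["farm%d" % i] = ["target"]

before = pagerank(honest_web)
after = pagerank(bombed_web)
print(before["target"], before["rival"])
print(after["target"], after["rival"])
```

In the honest web, the two competing pages score the same. Once the throwaway pages are added, "target" outranks "rival" even though nothing about its content has changed.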

Directories

Search engines tend to return a lot of results, sometimes millions. Of course, most of that is junk. You end up looking at lots of irrelevant pages, trying to find the relevant ones. An alternative is to find a site where topics have been organized and cataloged (by actual human beings, like librarians cataloging books in a library), so that everything in a particular area is relevant. Sites like this are called "directories." Among the best known are Yahoo! (www.yahoo.com) and the Librarian's Index to the Internet (lii.org).

The good side of a directory is that things are well-organized and everything you find is likely to be on-topic, just as in a library. The downside is that there may be pages on the web that aren't yet in the directory. Even a massive search engine like Google can only index 20-30 percent of the web; anything that involves real people reading pages and cataloging them will be much less comprehensive.

We suggest consulting directories sometimes; it's a good antidote to the wheat-and-chaff approach of search engines. A good research effort should seek several sources and search the web in different ways.

Resources

The Wellesley library staff has written a wonderful page about searching the web that you should definitely read: http://www.wellesley.edu/Library/Research/search.html. It includes information about search engines and how to use them, directories, and evaluating web sites. UC Berkeley's library also has a terrific page that goes beyond general web searching.

If you are doing academic research, there are two other kinds of searching you can do from the library's home page ( http://www.wellesley.edu/Library/ ): The Wellesley librarians have actually compiled custom lists of resources for each department at http://www.wellesley.edu/Library/Research/research.html. These pages will lead you to books, articles, and web pages that may be pertinent for your research. There's also an A-Z list of all electronic databases at http://luna.wellesley.edu/screens/a-zlist.html.

If you have a knotty research problem, be sure to talk to the Wellesley library staff. They're there to help you and they're really good at it.

© Computer Science 110 Staff
This work is licensed under a Creative Commons License
Date Modified: Monday, 28-Apr-2008 09:11:13 EDT