This is NOT a chapter on computer-assisted reporting. The Investigative Reporters and Editors organization (www.ire.org) puts on a fine workshop on how to use databases and spreadsheets for such in-depth stories.
Rather, this chapter grew out of "Smarter Surfing: A Workshop for Better Use of Your Web Time," a class taught by Columbia University Graduate School of Journalism Associate Professor Sreenath Sreenivasan (www.sree.net). I am indebted to him for much of the content. I also consulted two books: Nora Paul's "Computer-Assisted Research: A Guide to Tapping Online Information" and "The Associated Press Guide to Internet Research and Reporting," by Frank Bass. I visited the Web sites of the Society of American Business Editors and Writers (www.sabew.org), the University at Albany (www.library.albany.edu), Ithaca College (www.ithaca.edu/library), the University of California at Berkeley (www.lib.berkeley.edu), the University of South Carolina Beaufort Library (www.sc.edu/beaufort/library/bones.html), and Okanagan University College Library (www.ouc.bc.ca/libr/connect96/search.htm). I also drew on Christopher Callahan's site at the University of Maryland (www.reporter.umd.edu/strategies.htm), Paul Grabowicz and J.D. Lasica's page at the USC Annenberg Online Journalism Review (www.ojr.usc.edu), and pages by Mike Wendland (www.poynter.org/research/biblio/bib_car.htm) and Rich Meislin at the New York Times (www.nytimes.com/library/tech/reference/cynavi.html), as well as the sites of Jonathan Oatis (www.oatis.com), Bill Dedman (www.powerreporting.com), Jeremy Caplan (www.jeremycaplan.com/webtips.htm) and Staci Kramer (www.stlouisspj.org/surf/tips.html).
If the World Wide Web is a giant library with most of the contemporary (emphasis added) information we need, we'd still be at the mercy of a reference librarian because everything seems so topsy-turvy. (Let me digress for a moment and define what I mean by "contemporary information." Today, many newspaper and magazine articles automatically get put on a Web site. However, those sites didn't exist in the 1980s or early 1990s. So although we may not have any trouble finding recent articles on historical, scientific or literary figures and events, we have to realize those articles were posted only within the past few years. Some organizations get around this by recruiting volunteers to transcribe articles from past editions. For example, a group affiliated with The Catholic Encyclopedia recently sought out Web-savvy people to retype and then post articles from the past.) OK, now, back to the present: There are billions of sites, but search engines refer us to only a fraction of them. So when approaching Web research, it's best to realize that a search may not find the answer, but it can dig up a clue to the answer.
Here are some steps when dealing with search engines:
Step 1: Don't use a hammer when a Phillips screwdriver will do.
First, to honor IBM, "Think." What do you want to know? How can you find it? Are you searching for a list of new media journalism faculty at Columbia University? Well, since you already know that Columbia's URL will end in .edu, why not just type out www.columbia.edu instead of doing a search on Google? Want to know the spin the president's pals are putting on a congressional initiative? Try www.whitehouse.gov.
But for the times when a search engine or a subject directory is needed, take a minute to refine your approach before blindly scrolling through a gazillion sites. Let's begin by defining terms:
A search engine is a database of Web pages assembled by a "spider" or "robot" ("bot") that follows links from page to page; the engine then ranks the pages according to your keywords or search phrase. So sites with no links pointing to them could be missed. It's important to know that you are searching a database compiled at an earlier date, not the live Web. How often the spider refreshes that database depends on the search engine.
There are two types of search engines:
It's a good idea to use more than one search engine since search engines differ in size, speed and content. Also, no two engines rank things the same way, so each one used will return different results.
In the old days, going alllll the way back to the 1990s, engines would return results based on where or how often the keywords appeared (e.g. in the title, on the first page, or grouped closely together). That was called a first-generation search engine. But now we have Google, which ranks results according to Web page links. Since this is a merger of spiders with somebody's judgment of what is relevant, it is called a second-generation search engine.
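To make the first-generation idea concrete, here is a minimal sketch of ranking by where and how often the keywords appear. The pages, the function name and the weights (a title hit counts three times as much as a body hit) are all made up for illustration; no real engine is known to have used exactly these numbers.

```python
import re

def first_generation_score(page, keywords):
    """Score a page by how often, and where, the keywords appear.

    Hypothetical weighting: a hit in the title counts three times
    as much as a hit in the body text.
    """
    title_words = re.findall(r"[a-z0-9']+", page["title"].lower())
    body_words = re.findall(r"[a-z0-9']+", page["body"].lower())
    score = 0
    for keyword in keywords:
        keyword = keyword.lower()
        score += 3 * title_words.count(keyword)
        score += body_words.count(keyword)
    return score

# Made-up pages, for illustration only.
pages = [
    {"title": "Roller hockey scores", "body": "The Buffalo Wings lost again."},
    {"title": "Buffalo wings history", "body": "The wings story begins in Buffalo."},
]

query = ["wings", "history"]
ranked = sorted(pages, key=lambda p: first_generation_score(p, query), reverse=True)
for page in ranked:
    print(first_generation_score(page, query), page["title"])
```

A second-generation engine would add another ingredient to the score, such as how many other pages link to each page, which this sketch leaves out.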
Metasearch engines are fast, but they don't list all the results they find from each search engine. So they're most useful when doing a simple search or looking for a quick synopsis on a specific subject.
Another useful tool is a subject directory, which is maintained by human editors (e.g. Yahoo) and organized into subject categories. But many subject directories also have a search engine partner, or have been absorbed by search engines, so the distinction may be moot in the future.
Many subject directories index only a site's top page. Therefore, a list of results would be smaller than a search engine's and categorized by an editor's description of a subject. So a searcher would use a subject directory to find out what information is on the Web on particular topics, organizations, sites or products. Unlike search engines, a subject directory does not contain every word of a Web page, but it does tell a searcher where to find the page. And, since human editors are not as fast as a search engine's spider, dead pages can be a problem.
There are several types of subject directories:
Experts in a specific field assemble and maintain databases for portals, vortals and library gateways. The advantage is that they scan the so-called "Invisible Web," and include sites not usually found by search engine spiders. (See Step 6) However, the term Invisible Web may give the false belief that anything on it is vital because the name implies it is all secretive and (potentially) essential. Actually, a lot of it is private pages or something that may not be of general interest.
For online tutorials on search engines, metasearch engines, subject directories, the Invisible Web, and more, go to these sites:
Step 2: Print out the chart.
One of my daughter's favorite TV shows is "Reading Rainbow," hosted by LeVar Burton every morning on PBS in our market. Near the end of every half-hour show Burton talks about other books dealing with the topic du jour, whether it's AP style or Boolean logic. (The books are then reviewed by kids in three 30-second segments.) Burton's tagline is always "... but you don't have to take my word for it ..." and then they cut to the kids.
Stick with me a moment longer as I honor George Orwell and stretch one of his lines: "All search engines are created equal, but some are more equal than others." Each search engine scans data, but the syntax is different for each. For example, some engines allow wildcard characters for those who can't spell well (e.g. j*m will return jam and Jim). In some, the person doing the search has to type in AND, OR, NOT and NEAR to refine the search. Other search engines require plus (+) or minus (-) signs in queries. It's easier for a middle-aged man (like me) to remember all the numbers for phones, addresses and Social Security cards than to remember which search engine uses which syntax. And that's why the Internet for People Project (www.infopeople.org/search/chart.html) has put together a handy cheat sheet to print out and refer to when using a search engine.
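For the curious, the wildcard idea can be tried out in a few lines of Python. This is only a sketch: it uses the standard library's shell-style pattern matcher, which may not behave exactly like any particular engine's wildcard, and the word list is made up.

```python
from fnmatch import fnmatchcase

def wildcard_matches(pattern, words):
    """Return the words matching a pattern in which * stands for any run of characters."""
    # Lowercase both sides so that j*m finds Jim as well as jam.
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

print(wildcard_matches("j*m", ["jam", "Jim", "jump", "journalism"]))
```

Notice that the pattern also catches "journalism," which starts with j and ends with m. Wildcards cast a wide net, so expect some surprises in the results.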
But you don't have to take my word for it. Here's what Columbia University Graduate School of Journalism Associate Professor Sreenath Sreenivasan said about the site in a weekly Web tip column he writes for the Poynter Institute (www.poynter.org/web): "It summarizes the features of each engine and tells you how best to use it ... Think you know Google well? The chart just might show you some ways to improve how you use it."
Step 3: Don't be afraid of people who say Boolean.
British mathematician George Boole (1815-64) had time on his hands one day in 1854 and wrote "An Investigation of the Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities." He obviously had a lousy editor, but his idea of merging logic and algebra flourished and is found today on search engines near and far.
Boole called his logic a "calculus of thought." A searcher types in its operators (the words AND, OR, NOT, NEAR) to refine a query so that it returns either more or fewer Web sites.
For example, a search for Marren on Alta Vista will yield 3,260 sites, some of them in foreign languages and some about Marren Motor Sports in Connecticut. (And remember that Web sites are fluid. What shows up today may not show up tomorrow.) But suppose a reporter wants to find a link to my book on Buffalo's literary and artistic history. Then he or she would type Marren AND authors, which shows 91 sites. Included in that list is the much more appreciated and popular book, "Mergers and Acquisitions: A Valuable Handbook," by Joseph H. Marren. Alas, he is not our hero. So the clever reporter then types Marren AND authors AND NOT Joseph H. Marren. Ahh, now we see some links to my book. (And, of course, the astute searcher would have first tried amazon.com.)
But that's Alta Vista. Each search engine is different and some (e.g. Google) automatically default to AND when doing a search with a complex phrase or multiple nouns. Also keep in mind that most people today swear by Google.
So what's a searcher to do? A rule of thumb is to use Boolean operators for complex searches. Use plus (+) and minus (-) signs for simpler searches. Confusing? You bet it is! See Step 2 above.
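If it helps to see the logic laid bare, Boolean operators behave like operations on sets of pages: AND keeps only pages in both sets, OR keeps pages in either, and NOT removes pages. Here is a minimal sketch in Python; the index and the page names are made up and do not reflect any real search results.

```python
# A toy index: each keyword maps to the set of (made-up) pages that contain it.
index = {
    "marren": {"page1", "page2", "page3", "page4"},
    "authors": {"page2", "page3", "page5"},
    "joseph": {"page3"},
}

marren_and_authors = index["marren"] & index["authors"]   # AND narrows the field
marren_or_authors = index["marren"] | index["authors"]    # OR widens it
without_joseph = marren_and_authors - index["joseph"]     # AND NOT excludes pages

print(sorted(marren_and_authors))
print(sorted(marren_or_authors))
print(sorted(without_joseph))
```

Each added AND or AND NOT can only shrink the list, which is why stacking them is a good way to cut a gazillion results down to a handful.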
A good Boolean tutorial is offered by South Carolina's Beaufort Library at www.sc.edu/beaufort/library/lesson8.html. Lesson 7 from the same site is also helpful.
Step 4: Now put the steps together and dance.
You say you're new to Buffalo and you want to know where chicken wings were "invented" to better fit in with the neighbors in the next parish across the crick?
What do you do? Well, first, find out if there is a book or other resource nearby with the answer. If not, then go to the Web.
(Before searching, though, please remember that all of the references below were valid at the time I wrote this in spring 2002. The Web is fluid and some of the sites I mention may have gone to that great dot-com in the sky.)
If you took a best guess at an Internet domain name, would you likely find it? For example, do you think the Culinary Institute of America would have some info on chicken wings? We know it's a school, so perhaps typing in cia.edu would get us in the front door. Unfortunately, not in this case because cia.edu is the Web site for the Cleveland Institute of Art. How about cia.org? Wrong again; that's some site dealing with Pentagon communications. Don't even bother trying cia.gov; you KNOW where that will take you. OK, what to do?
Is there a distinctive word or phrase that can be put in quote marks? (e.g. "Culinary Institute of America") If so, run it on a search engine (such as Google) and it will take you to a listing of several sites, one of which is the CIA in question (www.ciachef.edu).
But suppose that site offers only chicken wing recipes and doesn't explore the full lore of chicken wing culture and social history. Where to look? How about searching for the concept under a topic in a subject directory, such as Yahoo, which would have titles and descriptions of Web sites?
Or try this: Use one or more distinctive words or phrases in quote marks at a search engine site. Perhaps we could employ some Boolean operators or plus or minus signs (once again, see Step 2). For example, using "chicken wings" on Google returns 93,200 results; "buffalo wings" returns 29,300, including several for a defunct roller hockey team.
It's time to refine our search and think this problem through. We want to find information or links to the history of chicken wings, sometimes called "buffalo wings" in other parts of the country. Here's what we get on a Google search using plus or minus signs:
+buffalo +wings +history = 66,000 sites
+buffalo +wings +food = 53,100
+buffalo +wings +history +food = 16,500
+buffalo +chicken wings +food history = 3,850
That's still too many sites, but near the top is a site for the Anchor Bar, which true fans know is THE home of the wing. And that brings up an important point: DON'T SCROLL. It's better to refine your search, since an engine ranks results by relevance to the keywords or phrases you typed in. (Some engines also rank results based on fiscal considerations, but at least Google will say if a result is a sponsored site.) Therefore, the farther down you look, the lower your chances of finding what you want.
What this all means is that the best searches are the ones with the fewest results. Add or remove restrictions to refine the search and get a more manageable field.
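The narrowing effect of plus and minus signs can be sketched in a few lines of Python. This is a simplified illustration, not how Google or any other engine actually works: the pages are made up, a plus sign means the word must appear, a minus sign means it must not, and unsigned words are simply ignored.

```python
def matches(query, text):
    """Apply +required and -excluded terms to one page's text."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("+") and term[1:] not in words:
            return False
        if term.startswith("-") and term[1:] in words:
            return False
    return True

# Made-up page texts, for illustration only.
pages = [
    "buffalo wings history at the anchor bar",
    "buffalo wings food and recipes",
    "buffalo wings roller hockey team history",
    "buffalo history museum",
]

queries = ["+buffalo +wings", "+buffalo +wings +history", "+buffalo +wings +history -hockey"]
for query in queries:
    hits = [p for p in pages if matches(query, p)]
    print(query, "=", len(hits))
```

Each added restriction trims the field, just as the result counts shrank in the Google searches above.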
One last word. Don't rely on so-called "stop words" unless they are part of a phrase. Despite the name, stop words don't make the spider stop. A "stop word" is a small or common part of speech (conjunctions, prepositions, some common verbs, adjectives or adverbs) that a spider will pass right by in its quest to get you the results quickly.
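Here is a rough sketch of what that means for a query, assuming an engine that drops loose stop words but keeps quoted phrases whole. The stop-word list is a tiny made-up sample, and the function name is hypothetical; each real engine keeps its own list and its own rules.

```python
import re

# A tiny, made-up stop-word list; real engines keep their own.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def effective_terms(query):
    """Return what an engine might actually search for.

    Quoted phrases are kept whole; loose stop words are dropped.
    """
    phrases = re.findall(r'"([^"]+)"', query)
    loose = re.sub(r'"[^"]+"', " ", query).split()
    kept = [word for word in loose if word.lower() not in STOP_WORDS]
    return phrases + kept

print(effective_terms('the history of "Culinary Institute of America"'))
```

The "of" inside the quoted phrase survives, while the loose "the" and "of" vanish from the search.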
More comprehensive (yet still brief) tutorials on searching are available from:
Step 5: Go to the head of the class.
There is an "advanced" way to search the Web by doing a field search. A typical Web page contains several fields. For example, the site that hosts my Web page is at http://www.sree.net/teaching/scrippshoward and it shows the following fields:
A savvy searcher can narrow a search considerably by field searching under title, host, URL, etc. See step 2 to see what search engine uses what syntax on a field search.
Doing a title search: Suppose you want to know the publisher and formal title of a book written by Joe Marren. What do you do? Well, the first step would be to go to amazon.com or barnesandnoble.com. But let's say those sites were down and you're on deadline. Stuck? Nope. Do a title search. (The title is in the banner at the top. Searching for a keyword in the title field, rather than a plain keyword search, can sometimes produce more relevant results.)
For example: Using altavista.com to search title:"joe marren" will yield one result and it's the right result.
Doing a domain search: Let's suppose you're a real Joe Marren fan and want to find more links to stuff about our hero. Rather than typing the keywords "joe marren" in quotes in an engine, try limiting the search to top-level domains to produce more relevant results.
For example: Using altavista.com to search domain:com AND "joe marren" yields 16 results. The first several are about a writer from Buffalo and the rest about a cross country star from the University of Illinois-Chicago.
Still using altavista.com, typing in domain:edu AND "joe marren" will yield three results, all about the college cross country runner.
And typing in domain:org AND "joe marren" on altavista.com shows six results, including one from the eriebar.org site that has a picture and brief bio of amateur thespian Joe Marren playing the part of photographer Harry Bliss in the Bar Association of Erie County's re-creation of the trial of Leon Czolgosz, the assassin of President McKinley in Buffalo back in 1901.
Continuing the same technique on altavista.com under other top-level domain names (aero, biz, coop, gov, info, mil, name, net, museum and pro) yields no results.
OK, time to expand our horizons. The Internet was created in the United States, so US was not assigned as a code to U.S. sites, although it is used on state and local government sites, among others. Foreign countries have a two-letter code on a site to let searchers know the country of origin; e.g., China is CN, India is IN, Ireland is IE and Jamaica is JM. (For a list of country codes, go to www.hotbot.lycos.com/help/domains.asp)
Suppose we want to know about small-business conditions in China and we want to use Google as a search engine. Typing in domain:CN AND small businesses will yield 15 results. (Notice there are no quote marks around the phrase as there have been in past searches. Originally, I tried it with quote marks and it produced no results. The lesson is to keep trying different approaches.) The top result is a university site in Australia from the late '90s that talks about the possibilities of e-commerce in China. We know this because the URL starts out with ausweb.scu.edu.au: the edu signifies it's an education site (usually a university or college) and au is the country code for Australia. OK, that's another idea for a search. Type in domain:CN AND "Internet in China" and you will find a screen full of results, one of which has cn in its URL, which means it is from China itself.
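Reading a URL this way is mechanical enough to sketch in Python. The two lookup tables below are small, hand-picked samples (not complete lists), and the function name is made up for illustration.

```python
from urllib.parse import urlsplit

# A handful of codes for illustration; these are not complete lists.
COUNTRY_CODES = {"au": "Australia", "ca": "Canada", "cn": "China",
                 "ie": "Ireland", "in": "India"}
SITE_TYPES = {"edu": "education", "gov": "government", "mil": "military",
              "com": "commercial", "org": "organization", "net": "network"}

def describe_host(url):
    """Guess the site type and country from the labels at the end of the host name."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    country = COUNTRY_CODES.get(labels[-1])
    site_type = next(
        (SITE_TYPES[label] for label in reversed(labels) if label in SITE_TYPES),
        None,
    )
    return site_type, country

print(describe_host("http://ausweb.scu.edu.au/"))
print(describe_host("http://www.columbia.edu/"))
```

The first call reports an education site in Australia; the second reports an education site with no country code, which usually means a U.S. address.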
Those are all fairly simple because the needed information was at the top. Remember, though, that a search engine reads every word of a site and sometimes what you want to know is buried. For example, suppose you're involved in a killer Trivial Pursuit game and you need to know the name of the new territory carved out of Canada's Northwest Territories, and suppose Mapquest, National Geographic and all other such sites were unavailable. That's a lot of supposing, but bear with me. Typing domain:CA AND "Northwest Territories" on Google will show 23 results. You click on the most likely one but it doesn't mention anything right away about the new territory. In that case, use your find command (usually under the "Edit" menu) and type in the keyword (Northwest Territories). It will take you to the first instance that lists the keyword, and perhaps it, or another instance, will mention that Nunavut was spun off from the Northwest Territories in 1999 to become Canada's third territory.
Doing a host search: This lets a searcher find information that he or she knows is somewhere on a specific host computer. For example, I know there is a tutorial on field searching on the Beaufort Library site hosted by the University of South Carolina, so typing in host:www.sc.edu on Google will take me to a small field of relevant results. By the way, the tutorial is Lesson 9 at www.sc.edu/beaufort/library/bones.html
Doing a URL search: Do you recall that I mentioned I had a Web page I made while a Scripps Howard new media fellow? You don't? Hmmm. Go to the first paragraph in Step 5 to refresh your memory. But let's take advantage of your lousy recall: You know it was a Scripps Howard program, but you don't know who taught it or where or when. Since you desperately want to impress me, you need to know that info. Do a URL search on Alta Vista by typing in url:scrippshoward. Among the handful of results is a site that will say I was a fellow at a Columbia University seminar Jan. 4-7, 2002, taught by Graduate School of Journalism Associate Professor Sreenath Sreenivasan.
Doing a link search: Sree, as he is popularly known, is my brave new media world mentor. He has a hand in a gazillion things and yet also has a life outside the office! To find out the range of his activities, do a link search. Typing in link:www.sree.net on altavista.com will yield more than 200 results; on Google, more than 600 results. Go ahead, be amazed. Be very amazed.
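To pull the field searches together, here is a minimal sketch of how title, domain, host, URL and link searches each look at a different part of a page record. The records, the example.com and example.edu addresses and the function name are made up for illustration; a real engine's index and matching rules are far more elaborate.

```python
from urllib.parse import urlsplit

# Made-up page records; a real engine stores far more than this.
pages = [
    {"title": "Joe Marren", "url": "http://www.example.com/authors/marren.html",
     "links": ["http://www.sree.net/"]},
    {"title": "Cross country results", "url": "http://www.example.edu/sports/marren.html",
     "links": []},
    {"title": "Scripps Howard fellows", "url": "http://www.sree.net/teaching/scrippshoward",
     "links": ["http://www.example.edu/"]},
]

def field_search(pages, field, value):
    """Return the titles of pages whose chosen field matches the value."""
    value = value.lower()
    hits = []
    for page in pages:
        host = urlsplit(page["url"]).hostname or ""
        if field == "title":
            found = value in page["title"].lower()
        elif field == "domain":
            found = host.endswith("." + value)      # top-level domain, e.g. edu
        elif field == "host":
            found = host == value                   # one specific computer
        elif field == "url":
            found = value in page["url"].lower()    # anywhere in the address
        elif field == "link":
            found = any(value in link.lower() for link in page["links"])
        else:
            raise ValueError("unknown field: " + field)
        if found:
            hits.append(page["title"])
    return hits

print(field_search(pages, "title", "joe marren"))
print(field_search(pages, "domain", "edu"))
print(field_search(pages, "url", "scrippshoward"))
print(field_search(pages, "link", "www.sree.net"))
```

The same keyword gives different answers depending on the field, which is the whole point of a field search.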
Notes to note:
Step 6: It's invisible:
The invisible Web refers to sites that can't be found by search engines because spiders can't or won't go into non-text, PDF (portable document format), or subject-specific databases. Therefore, the only way to find such sites is to go into the databases themselves.
Alta Vista and Google can search some graphic, PDF or non-text databases, but another solution is to try either Gary Price's links or the invisible Web site (see Search Engines & Directories under the subject index).
Step 7: "What if ..."
There are a gazillion results: Think of synonyms, or a way to rephrase the search using different keywords, maybe some uncommon nouns.
There aren't enough results: Add or subtract some keywords or phrases from the search string. Try another engine, try a metasearcher, try a subject directory.
There is a "404 file not found" message: That means the file has been renamed, removed, etc. Try another search engine. (Google has cached copies of pages.) Do a field search on the title. Try shortening the URL by getting rid of things after tildes (~) and percent signs (%). If there is a date in the URL, check to see if you typed in the right date.
There is a "Server does not have a DNS entry" message: This could also happen with a "server error" message. It means the network is busy or the server is down. Try again later. Or once again check your spelling.
The wrong home page comes up: Not everything ends in .com, so try guessing. For example, www.sree.com is the home page for a chain of hotels in the South. But a searcher who wants to find the home page of Columbia University Graduate School of Journalism Associate Professor Sreenath Sreenivasan has to type in www.sree.net. And www.georgewbush.com is sponsored by the Republican National Committee; www.gwbush.com is a whole different animal that's NOT endorsed by the GOP.
To find out more, or to learn some rudimentary troubleshooting, go to www.learnthenet.com/english/htm/96error.htm
Step 8: A case study:
Let's tie all this together now by looking at an actual series of events that happened one Sunday morning in May 2000 in Scranton, Pa.
Al Tompkins, the online/broadcast group leader at the Poynter Institute, was giving a seminar at WNEP-TV in Scranton when news filtered in that a plane crashed nearby. Go to www.poynter.org/dj/052600.htm to see how Tompkins used the Web on deadline to help the station get the news out.
Briefly, Tompkins and the reporting staff started with the FAA and National Transportation Safety Board Web sites to track down official phone numbers. But, to be honest, officialdom wasn't much help. So next he went to the Investigative Reporters and Editors site's resources list for some tips. But it all hinged on getting the plane's tail number. Once they had that by making some old-fashioned offline phone calls to sources, they went to the database at landings.com to find out the type of airplane and its possible owners. Also, assignmenteditor.com had links to phone numbers for potential owners. The whole story, including tracking down pictures and making a graphic that showed the plane's route and crash site, makes for fascinating reading. Go to the Poynter site for the real story.
Step 9: A review (with no quiz at the end):
If you have a good idea of what you need to know, go to a search engine. If your topic is rather broad, or you just need general information, use a subject directory.
Speaking of search engines:
Death be not proud, because Google saved (sorta) Dejanews.com. Dejanews.com was an archive of newsgroups that died a dot-com death in 2001. That was bad news for a lot of people, but let's be myopic and selfish and mourn the loss to the media first. A newsgroup is a collection of messages on topical bulletin boards that millions of people use daily. Reporters can use newsgroups to find info, sources and background related to their beats, or to find out what readers think (for lazy reporters who don't want to leave the office).
After Dejanews.com died, Google wisely bought the archives and now Google Groups (http://groups.google.com) has more than 20 years of archives with some 700 million messages. Naturally, there is a lot of junk with the pearls of wisdom. So the best bet to find relevant information would be to look at specific groups and search by date.
The "news" in any group is certainly not the word of God. Try to figure out:
It's easy to use a Google or a Yahoo newsgroup: Simply e-mail the person who posted the message. But be wary if you post a message on a bulletin board because once posted, always posted. Remember what your mother told you and be polite. However, if you were raised by wolves, try the Netiquette Web site at www.albion.com/netiquette/index.html, which gently provides pointers on Internet dos and don'ts.
In an advanced search, you can check what people are posting in newsgroups by typing queries in any of these fields: newsgroup, subject, author, date, or keywords.
There are also discussion groups, trackers, alerts and other assorted mailing lists that can be found in the subject index. A good way to find more is by using Liszt at www.liszt.com
Since everyone can see what you post, don't give away any secrets. The Wall Street Journal reporters know how to delicately mine newsgroups for sources and other information. For a glimpse at how they do it see www.poynter.org/web/020702jon.htm.
Here's another tip: Create a filter or walk-away address so your e-mail inbox isn't filled with hundreds of replies, not all of them polite. Since the possibility also exists of getting a virus, you should delete anything that makes you suspicious, even if it comes from a known source.
Using a newsgroup and its Internet cousin, the e-mail interview, a reporter can contact virtually anyone at any time, but we don't really know who is writing the reply. Is it the sought-after exec or the PR department? Use your instincts to check and double-check.
OK, we have all this nifty info from various Web sites and newsgroups, but now the question is what to believe. Anybody with some dollars can put anything on the Web. So, as journalist Staci Kramer puts it (www.stlouisspj.org/surf/tips.html): Be cautious. Treat information from the Internet the same way you would information from any other source.
Think about it: Who created the page? Does that person or group have an agenda to push or cause to promote? Is it a joke or hoax site? Just because it looks like a duck and quacks like a duck doesn't mean it is a duck. For example, an "academic" study on how cats react to bearded men is at www.sree.net/stories/feline.html. "Official" media stories about the first male pregnancy are at www.malepregnancy.com. A link to find out about hoax (no kidding) Web sites is at www.museumofhoaxes.com.
But not every site is a hoax. Dig deep to see how unreliable supposedly reliable sites are. The site at www.martinlutherking.org looks like a good site until you click on the links and notice the anti-King sentiments. Be wary of the language and viewpoint of such sites.
Realizing everyone and every group has a spin on everything, a handy rule of thumb can be that a government (.gov), military (.mil), or educational (.edu) page is more reliable than a page on a .net, .org, or .com domain. But be wary still. Is an .edu site written by a prof or a group of students as a class project?
A tilde on a site usually means it is a personal page on the server. It could give seemingly credible info, but check to see if there is an inherent bias.
A brief tutorial may help about now: Many file names, like Caesar's Gaul, have three parts:
So the file "reallyimportantpic.jpg.pif" is a vital picture that you have to open immediately, right? Maybe not. The name is a distraction. Pay attention to the entire file name, especially the second extension, because sometimes viruses are hidden behind that second extension. Be wary and ready to delete any attachments with two file extensions.
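Checking for a double extension is simple enough to automate. Here is a minimal sketch using Python's standard library; the function name and the sample file names (other than the one above) are made up for illustration.

```python
from pathlib import PurePath

def has_double_extension(filename):
    """True when a file name carries two or more extensions, e.g. pic.jpg.pif."""
    return len(PurePath(filename).suffixes) >= 2

for name in ["reallyimportantpic.jpg.pif", "vacation.jpg", "backup.tar.gz"]:
    print(name, has_double_extension(name))
```

Note that a harmless name such as backup.tar.gz is flagged too. The check only tells you where to look twice; it does not tell you whether a file is actually dangerous.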
Suppose a site's URL mentions a credible source (e.g. CNN.com) and is followed by a series of numbers after the @ symbol (CNN.firstname.lastname@example.org). Want to know if it is legit? Again, look at the entire URL and don't be fooled by the CNN.com part. Check with the American Registry for Internet Numbers (www.arin.net/whois/index.html) by typing in the numbers to find out about the server.
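Python's standard URL parser shows why the part before the @ symbol is only a decoy: the browser treats it as a user name and contacts whatever comes after the @. The address below is hypothetical (192.0.2.7 sits in a range reserved for documentation), and the function name is made up.

```python
from urllib.parse import urlsplit

def real_host(url):
    """Return (decoy, host): the text before the @ sign and the server actually contacted."""
    parts = urlsplit(url)
    return parts.username, parts.hostname

# Hypothetical URL for illustration only.
print(real_host("http://CNN.com@192.0.2.7/story.html"))
```

The second item, the numeric address, is what you would look up to find out who really runs the server.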
But let's suppose there are no numbers or @ symbol. For example, who owns bytebelt.com? Type in the domain name at http://allwhois.com for the answer.
Another way to determine a site's authenticity is to look for attribution. Are the facts and sources identified? Are they reliable?
Although we all make mistakes, a page filled with spelling or AP style errors could also be playing fast and loose with the facts. Be wary.
To determine the purpose of a site, try to figure out:
Even reputable sites can sometimes be suspect. Don't believe something just because it's on a site. Did anyone review or proofread the info? For example, I met Lia Chang, a photojournalist, model, and actor, when I was at Columbia University. Can I trust the bio on her on www.imdb.com? Well, that would depend on who wrote it. Determine if it was written by a fan or by a recognized authority. If it is unsigned, don't use it until you verify it. If the site contains a blurb or two about any of her movies, determine if it was written by a studio PR person or by a legit critic.