So you have the newest, glitziest, "Fisher Price" version of Windows: XP. How
can you use XP in a way that sets you apart from the boring millions of ordinary
users?
The key to doing amazing things with XP is as simple as D O S. Yes, that's
right, DOS as in MS-DOS, as in MicroSoft Disk Operating System. Windows XP (as
well as NT and 2000) comes with two versions of DOS. Command.com is an old DOS
version. Various versions of command.com come with Windows 95, 98, SE, ME,
Window 3, and DOS only operating systems.
The other DOS, which comes only with XP, 2000 and NT, is cmd.exe. Usually
cmd.exe is better than command.com because it is easier to use, has more
commands, and in some ways resembles the bash shell in Linux and other Unix-type
operating systems. For example, you can repeat a command by using the up arrow
until you back up to the desired command. Unlike bash, however, your DOS command
history is erased whenever you shut down cmd.exe. The reason XP has both
versions of DOS is that sometimes a program that won?t run right in cmd.exe will
work in command.com
note : m not comparing bash to dos
DOS is your number one Windows gateway to the Internet, and the open sesame to
local area networks. From DOS, without needing to download a single hacker
program, you can do amazingly sophisticated explorations and even break into
poorly defended computers.
****************
You can go to jail warning: Breaking into computers is against the law if you do
not have permission to do so from the owner of that computer. For example, if
your friend gives you permission to break into her Hotmail account, that won't
protect you because Microsoft owns Hotmail and they will never give you
permission.
****************
****************
You can get expelled warning: Some kids have been kicked out of school just for
bringing up a DOS prompt on a computer. Be sure to get a teacher's WRITTEN
permission before demonstrating that you can hack on a school computer.
****************
So how do you turn on DOS?
Click All Programs -> Accessories -> Command Prompt
That runs cmd.exe. You should see a black screen with white text on it, saying
something like this:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\>
Your first step is to find out what commands you can run in DOS. If you type
"help" at the DOS prompt, it gives you a long list of commands. However, this
list leaves out all the commands hackers love to use. Here are some of those
left out hacker commands.
TCP/IP commands:
telnet
netstat
nslookup
tracert
ping
ftp
NetBIOS commands (just some examples):
nbtstat
net use
net view
net localgroup
TCP/IP stands for transmission control protocol/Internet protocol. As you can
guess by the name, TCP/IP is the protocol under which the Internet runs. along
with user datagram protocol (UDP). So when you are connected to the Internet,
you can try these commands against other Internet computers. Most local area
networks also use TCP/IP.
NetBIOS (Net Basic Input/Output System) protocol is another way to communicate
between computers. This is often used by Windows computers, and by Unix/Linux
type computers running Samba. You can often use NetBIOS commands over the
Internet (being carried inside of, so to speak, TCP/IP). In many cases, however,
NetBIOS commands will be blocked by firewalls. Also, not many Internet computers
run NetBIOS because it is so easy to break in using them. I will cover NetBIOS
commands in the next article to XP Hacking.
The queen of hacker commands is telnet. To get Windows help for telnet, in the
cmd.exe window give the command:
C:\>telnet /?
Here's what you will get:
telnet [-a][-e escape char][-f log file][-l user][-t term][host
[port]]
-a Attempt automatic logon. Same as --l option except uses the currently logged
on user's name.
-e Escape character to enter telnet cclient prompt.
-f File name for client side logging
-l Specifies the user name to log in with on the remote system. Requires that
the remote system support the TELNET ENVIRON option.
-t Specifies terminal type. Supportedd term types are vt100, vt52, ansi and vtnt
only.
host Specifies the hostname or IP address of the remote computer to connect to.
port Specifies a port number or service name.
****************
Newbie note: what is a port on a computer? A computer port is sort of like a
seaport. It's where things can go in and/or out of a computer. Some ports are
easy to understand, like keyboard, monitor, printer and modem. Other ports are
virtual, meaning that they are created by software. When that modem port of
yours (or LAN or ISDN or DSL) is connected to the Internet, your computer has
the ability to open or close any of over 65,000 different virtual ports, and has
the ability to connect to any of these on another computer - if it is running
that port, and if a firewall doesn?t block it.
****************
****************
Newbie note: How do you address a computer over the Internet? There are two
ways: by number or by name.
****************
The simplest use of telnet is to log into a remote computer. Give the command:
C:/>telnet targetcomputer.com (substituting the name of the computer you want to
telnet into for targetcomputer.com)
If this computer is set up to let people log into accounts, you may get the
message:
login:
Type your user name here, making sure to be exact. You can't swap between lower
case and capital letters. For example, user name Guest is not the same as guest.
****************
Newbie note: Lots of people email me asking how to learn what their user name
and password are. Stop laughing, darn it, they really do. If you don't know your
user name and password, that means whoever runs that computer didn't give you an
account and doesn't want you to log on.
****************
Then comes the message:
Password:
Again, be exact in typing in your password.
What if this doesn't work?
Every day people write to me complaining they can't telnet. That is usually
because they try to telnet into a computer, or a port on a computer that is set
up to refuse telnet connections. Here's what it might look like when a computer
refuses a telnet connection:
C:\ >telnet 10.0.0.3
Connecting To 10.0.0.3...Could not open connection to the host, on port 23. A
connection attempt failed because the connected party did not properly respond
after a period of time, or established connection failed because connected host
has failed to respond.
Or you might see:
C:\ >telnet hotmail.com
Connecting To hotmail.com...Could not open connection to the host, on port
23. No connection could be made because the target machine actively refused it.
If you just give the telnet command without giving a port number, it will
automatically try to connect on port 23, which sometimes runs a telnet server.
**************
Newbie note: your Windows computer has a telnet client program, meaning it will
let you telnet out of it. However you have to install a telnet server before
anyone can telnet into port 23 on your computer.
*************
If telnet failed to connect, possibly the computer you were trying to telnet
into was down or just plain no longer in existence. Maybe the people who run
that computer don't want you to telnet into it.
Even though you can't telnet into an account inside some computer, often you can
get some information back or get that computer to do something interesting for
you. Yes, you can get a telnet connection to succeed -without doing anything
illegal --against almost any computer, even if you don't have permission to log
in. There are many legal things you can do to many randomly chosen computers
with telnet. For example:
C:/telnet freeshell.org 22
SSH-1.99-OpenSSH_3.4p1
That tells us the target computer is running an SSH server, which enables
encrypted connections between computers. If you want to SSH into an account
there, you can get a shell account for free at http://freeshell.org . You can
get a free SSH client program from http://winfiles.com .
***************
You can get punched in the nose warning: Your online provider might kick you off
for making telnet probes of other computers. The solution is to get a local
online provider and make friends with the people who run it, and convince them
you are just doing harmless, legal explorations.
*************
Sometimes a port is running an interesting program, but a firewall won't let you
in. For example, 10.0.0.3, a computer on my local area network, runs an email
sending program, (sendmail working together with Postfix, and using Kmail to
compose emails). I can use it from an account inside 10.0.0.3 to send emails
with headers that hide from where I send things.
If I try to telnet to this email program from outside this computer, here's what
happens:
C:\>telnet 10.0.0.3 25
Connecting To 10.0.0.3...Could not open connection to the host, on port 25. No
connection could be made because the target machine actively refused it.
However, if I log into an account on 10.0.0.3 and then telnet from inside to
port 25, here's what I get:
Last login: Fri Oct 18 13:56:58 2002 from 10.0.0.1
Have a lot of fun...
cmeinel@test-box:~> telnet localhost 25
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1... [Carolyn's note: 127.0.0.1 is the numerical address meaning
localhost, the same computer you are logged into]
Connected to localhost.
Escape character is '^]'.
220 test-box.local ESMTP Postfix
The reason I keep this port 25 hidden behind a firewall is to keep people from
using it to try to break in or to forge email. Now the ubergeniuses reading this
will start to make fun of me because no Internet address that begins with 10. is
reachable from the Internet. However, sometimes I place this "test-box" computer
online with a static Internet address, meaning whenever it is on the Internet,
it always has the same numerical address. I'm not going to tell you what its
Internet address is because I don't want anyone messing with it. I just want to
mess with other people's computers with it, muhahaha. That's also why I always
keep my Internet address from showing up in the headers of my emails.
***************
Newbie note: What is all this about headers? It's stuff at the beginning of an
email that may - or may not - tell you a lot about where it came from and when.
To see full headers, in Outlook click view -> full headers. In Eudora, click the
"Blah blah blah" icon.
****************
Want a computer you can telnet into and mess around with, and not get into
trouble no matter what you do to it? I've set up my techbroker.com
(206.61.52.33) with user xyz, password guest for you to play with. Here's how to
forge email to xyz@techbroker.com using telnet. Start with the command:
C:\>telnet techbroker.com 25
Connecting To Techbroker.com
220
Now you type in who you want the message to appear to come from:
helo santa@techbroker.com
Techbroker.com will answer:
250
Next type in your mail from address:
mail from:santa@techbroker.com
250 Requested mail action okay, completed
Your next command:
rcpt to:xyz@techbroker.com
250 Requested mail action okay, completed
Your next command:
data
354 Start main input; end with
period between the
enter, type a period, then hit enter again. Anyhow, try typing:
This is a test.
.
250 Requested mail action okay, completed
quit
221
Connection to host lost.
Using techbroker's mail server, even if you enable full headers, the message we
just composed looks like:
Status: R
X-status: N
This is a test.
That's a pretty pathetic forged email, huh? No "from", no date. However, you can
make your headers better by using a trick with the data command. After you give
it, you can insert as many headers as you choose. The trick is easier to show
than explain:
220
helo santa@northpole.org
250
mail from:santa@northpole.com
250 Requested mail action okay, completed
rcpt to:cmeinel@techbroker.com
250 Requested mail action okay, completed
data
354 Start main input; end with
from:santa@deer.northpole.org
Date: Mon, 21 Oct 2002 10:09:16 -0500
Subject: Rudolf
This is a Santa test.
.
250 Requested mail action okay, completed
quit
221
Connection to host lost.
The message then looks like:
from:santa@deer.northpole.org
Date: Mon, 21 Oct 2002 10:09:16 -0500
Subject: Rudolf
This is a Santa test.
The trick is to start each line you want in the headers with one word followed
by a colon, and the a line followed by "return". As soon as you write a line
that doesn't begin this way, the rest of what you type goes into the body of the
email.
Notice that the santa@northpole.com from the "mail from:" command didn't show up
in the header. Some mail servers would show both "from" addresses.
You can forge email on techbroker.com within one strict limitation. Your email
has to go to someone at techbroker.com. If you can find any way to send email to
someone outside techbroker, let us know, because you will have broken our
security, muhahaha! Don't worry, you have my permission.
Next, you can read the email you forge on techbroker.com via telnet:
C:\>telnet techbroker.com 110
+OK <30961.5910984301@techbroker.com> service ready
Give this command:
user xyz
+OK user is known
Then type in this:
pass test
+OK mail drop has 2 message(s)
retr 1
+OK message follows
This is a test.
If you want to know all possible commands, give this command:
help
+OK help list follows
USER user
PASS password
STAT
LIST [message]
RETR message
DELE message
NOOP
RSET
QUIT
APOP user md5
TOP message lines
UIDL [message]
HELP
Unless you use a weird online provider like AOL, you can use these same tricks
to send and receive your own email. Or you can forge email to a friend by
telnetting to his or her online provider's email sending computer(s).
With most online providers you need to get the exact name of their email
computer(s). Often it is simply mail.targetcomputer.com (substitute the name of
the online provider for targetcomputer). If this doesn't work, you can find out
the name of their email server with the DOS nslookup program, which only runs
from cmd.exe. Here's an example:
C:\ >nslookup
Default Server: DNS1.wurld.net
Address: 206.61.52.11
> set q=mx
> dimensional.com
Server: DNS1.wurld.net
Address: 206.61.52.11
dimensional.com MX preference = 5, mail exchanger =
mail.dimensional.com
dimensional.com MX preference = 10, mail exchanger =
mx2.dimensional.com
dimensional.com MX preference = 20, mail exchanger =
mx3.dimensional.com
dimensional.com nameserver = ns.dimensional.com
dimensional.com nameserver = ns-1.dimensional.com
dimensional.com nameserver = ns-2.dimensional.com
dimensional.com nameserver = ns-3.dimensional.com
dimensional.com nameserver = ns-4.dimensional.com
mail.dimensional.com internet address = 206.124.0.11
mx2.dimensional.com internet address = 206.124.0.30
mx3.dimensional.com internet address = 209.98.32.54
ns.dimensional.com internet address = 206.124.0.10
ns.dimensional.com internet address = 206.124.26.254
ns.dimensional.com internet address = 206.124.0.254
ns.dimensional.com internet address = 206.124.1.254
ns.dimensional.com internet address = 209.98.32.54
ns.dimensional.com internet address = 206.124.0.32
ns.dimensional.com internet address = 206.124.0.30
ns.dimensional.com internet address = 206.124.0.25
ns.dimensional.com internet address = 206.124.0.15
ns.dimensional.com internet address = 206.124.0.21
ns.dimensional.com internet address = 206.124.0.9
ns-1.dimensional.com internet address = 206.124.26.254
ns-2.dimensional.com internet address = 209.98.32.54
ns-3.dimensional.com internet address = 206.124.1.254
ns-4.dimensional.com internet address = 206.124.0.254
>
The lines that tell you what computers will let you forge email to people with
@dimensional.com addresses are:
dimensional.com MX preference = 5, mail exchanger =
mail.dimensional.com
dimensional.com MX preference = 10, mail exchanger =
mx2.dimensional.com
dimensional.com MX preference = 20, mail exchanger =
mx3.dimensional.com
MX stands for mail exchange. The lower the preference number, the more they
would like you to use that address for email.If that lowest number server is too
busy, then try another server.
Sometimes when you ask about a mail server, nslookup will give you this kind of
error message:
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to [207.217.120.202] timed-out
To get around this problem, you need to find out what are the domain servers for
your target online provider. A good place to start looking is
http://netsol.com/cgi-bin/whois/whois . If this doesn't work, see
http://happyhacker.org/HHA/fightback.shtml for how to find the domain servers
for any Internet address.
****************
Newbie note: A domain name server provides information on the names and numbers
assigned to computers on the Internet. For example, dns1.wurld.net and
dns2.wurld.net contain information on happyhacker.org, techbroker.com,
securitynewsportal.com, thirdpig.com and sage-inc.com. When you query
dns1.wurld.net about other computers, it might have to go hunting for that
information from other name servers. That's why you might get a timed out
failure.
***************
Once you know the domain servers for an online service, set one of them for the
server for your nslookup program. Here's how you do it:
C:\ >nslookup
Default Server: DNS1.wurld.net
Address: 206.61.52.11
Now give the command:
> server 207.217.126.41
Default Server: ns1.earthlink.net
Address: 207.217.126.41
Next command should be:
> set q=mx
> earthlink.net
Server: ns1.earthlink.net
Address: 207.217.126.41
earthlink.net MX preference = 5, mail exchanger = mx04.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx05.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx06.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx00.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx01.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx02.earthlink.net
earthlink.net MX preference = 5, mail exchanger = mx03.earthlink.net
earthlink.net nameserver = ns3.earthlink.net
earthlink.net nameserver = ns1.earthlink.net
earthlink.net nameserver = ns2.earthlink.net
mx00.earthlink.net internet address = 207.217.120.28
mx01.earthlink.net internet address = 207.217.120.29
mx02.earthlink.net internet address = 207.217.120.79
mx03.earthlink.net internet address = 207.217.120.78
mx04.earthlink.net internet address = 207.217.120.249
mx05.earthlink.net internet address = 207.217.120.31
mx06.earthlink.net internet address = 207.217.120.23
ns1.earthlink.net internet address = 207.217.126.41
ns2.earthlink.net internet address = 207.217.77.42
ns3.earthlink.net internet address = 207.217.120.43
>
Your own online service will usually not mind and may even be glad if you use
telnet to read your email. Sometimes a malicious person or faulty email program
will send you a message that is so screwed up that your email program can't
download it. With telnet you can manually delete the bad email. Otherwise tech
support has to do it for you.
If you think about it, this ability to forge email is a huge temptation to
spammers. How can your online provider keep the bad guys from filling up a
victim's email box with garbage? The first time a bad guy tries this, probably
nothing will stop him or her. The second time the online provider might block
the bad guy at the firewall, maybe call the bad guy's online provider and kick
him or her and maybe get the bad guy busted or sued.
**************
You can go to jail warning: Sending hundreds or thousands of junk emails to bomb
someone's email account is a felony in the US.
***************
***************
You can get sued warning: Spamming, where you send only one email to each
person, but send thousands or millions of emails, is borderline legal. However,
spammers have been successfully sued when they forge the email addresses of
innocent people as senders of their spam.
****************
Now that you know how to read and write email with telnet, you definitely have
something you can use to show off with. Happy hacking!
Oh, here's one last goodie for advanced users. Get netcat for Windows. It's a
free program written by Weld Pond and Hobbit, and available from many sites, for
example
http://www.atstake.com/research/tools/#network_utilities . It is basically
telnet on steroids. For example, using netcat, you can set up a port on your
Windows computer to allow people to telnet into a DOS shell by using this
command:
C:\>nc -L -p 5000 -t -e cmd.exe
You can specify a different port number than 5000. Just make sure it doesn't
conflict with another port by checking with the netstat command. Then you and
your friends, enemies and random losers can either telnet in or netcat in with
the command:
C:\>nc -v [ipaddress of target] [port]
Of course you will probably get hacked for setting up this port. However, if you
set up a sniffer to keep track of the action, you can turn this scary back door
into a fascinating honeypot. For example, you could run it on port 23 and watch
all the hackers who attack with telnet hoping to log in. With some programming
you could even fake a unix-like login sequence and play some tricks on your
attackers.
Sunday, March 16, 2008
XPHacking With Windows XP
Subscribe to:
Post Comments (Atom)
Google Page Rank Explained
Grand Valley State University Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. How would you go about doing it? Posed in this way, the problem seems impossible. Yet this description is not too different from the World Wide Web, a huge, highly-disorganized collection of documents in many different formats. Of course, we're all familiar with search engines (perhaps you found this article using one) so we know that there is a solution. This article will describe Google's PageRank algorithm and how it returns pages from the web's collection of 25 billion documents that match search criteria so well that "google" has become a widely used verb. Most search engines, including Google, continually run an army of computer programs that retrieve pages from the web, index the words in each document, and store this information in an efficient format. Each time a user asks for a web search using a search phrase, such as "search engine," the search engine determines all the pages on the web that contains the words in the search phrase. (Perhaps additional information such as the distance between the words "search" and "engine" will be noted as well.) Here is the problem: Google now claims to index 25 billion pages. Roughly 95% of the text in web pages is composed from a mere 10,000 words. This means that, for most searches, there will be a huge number of pages containing the words in the search phrase. What is needed is a means of ranking the importance of the pages that fit the search criteria so that the pages can be sorted with the most important pages at the top of the list. One way to determine the importance of pages is to use a human-generated ranking. For instance, you may have seen pages that consist mainly of a large number of links to other resources in a particular area of interest. Assuming the person maintaining this page is reliable, the pages referenced are likely to be useful. Of course, the list may quickly fall out of date, and the person maintaining the list may miss some important pages, either unintentionally or as a result of an unstated bias. Google's PageRank algorithm assesses the importance of web pages without human evaluation of the content. In fact, Google feels that the value of its service is largely in its ability to provide unbiased results to search queries; Google claims, "the heart of our software is PageRank." As we'll see, the trick is to ask the web itself to rank the importance of pages.
This shows that page 8 wins the popularity contest. Here is the same figure with the web pages shaded in such a way that the pages with higher PageRanks are lighter.
We will illustrate with the example above.
It is natural to ask what these numbers mean. Of course, there can be no absolute measure of a page's importance, only relative measures for comparing the importance of two pages through statements such as "Page A is twice as important as Page B." For this reason, we may multiply all the importance rankings by some fixed quantity without affecting the information they tell us. In this way, we will always assume, for reasons to be explained shortly, that the sum of all the popularities is one.
Here is one way in which our algorithm could proceed:
In this case, the importance rating of both pages is zero, which tells us nothing about the relative importance of these pages. The problem is that P2 has no links. Consequently, it takes some of the importance from page P1 in each iterative step but does not pass it on to any other page. This has the effect of draining all the importance from the web. Pages with no links are called dangling nodes, and there are, of course, many of them in the real web we want to study. We'll see how to deal with them in a minute, but first let's consider a new way of thinking about the matrix H and stationary vector I.
In other words, page P2 has twice the importance of page P1, which may feel about right to you. The matrix S has the pleasant property that the entries are nonnegative and the sum of the entries in each column is one. In other words, it is stochastic. Stochastic matrices have several properties that will prove useful to us. For instance, stochastic matrices always have stationary vectors. For later purposes, we will note that S is obtained from H in a simple way. If A is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which each entry is 1/n, then S = H + A.
In this case, the sequence of vectors I k fails to converge. Why is this? The second eigenvalue of the matrix S satisfies and so the argument we gave to justify the power method no longer holds. To guarantee that . This means that, for some m, Sm has all positive entries. In other words, if we are given two pages, it is possible to get from the first page to the second after following m links. Clearly, our most recent example does not satisfy this property. In a moment, we will see how to modify our matrix S to obtain a primitive, stochastic matrix, which therefore satisfies . This means that, for some m, Sm has all positive entries. In other words, if we are given two pages, it is possible to get from the first page to the second after following m links. Clearly, our most recent example does not satisfy this property. In a moment, we will see how to modify our matrix S to obtain a primitive, stochastic matrix, which therefore satisfies S is
Notice that the PageRanks assigned to the first four web pages are zero. However, this doesn't feel right: each of these pages has links coming to them from other pages. Clearly, somebody likes these pages! Generally speaking, we want the importance rankings of all pages to be positive. The problem with this example is that it contains a smaller web within it, shown in the blue box below. Links come into this box, but none go out. Just as in the example of the dangling node we discussed above, these pages form an "importance sink" that drains the importance out of the other four pages. This happens when the matrix S is reducible; that is, S can be written in block form as Indeed, if the matrix S is irreducible, we can guarantee that there is a stationary vector with all positive entries. A web is called strongly connected if, given any two pages, there is a way to follow links from the first page to the second. Clearly, our most recent example is not strongly connected. However, strongly connected webs provide irreducible matrices S. To summarize, the matrix S is stochastic, which implies that it has a stationary vector. However, we need S to also be (a) primitive so that To find a new matrix that is both primitive and irreducible, we will modify the way our random surfer moves through the web. As it stands now, the movement of our random surfer is determined by S: either he will follow one of the links on his current page or, if at a page with no links, randomly choose any other page to move to. To make our modification, we will first choose a parameter To find a new matrix that is both primitive and irreducible, we will modify the way our random surfer moves through the web. As it stands now, the movement of our random surfer is determined by S: either he will follow one of the links on his current page or, if at a page with no links, randomly choose any other page to move to. To make our modification, we will first choose a parameter between 0 and 1. Now suppose that our random surfer moves in a slightly different way. With probability , he is guided by S. With probability , he chooses the next page at random. If we denote by 1 the matrix whose entries are all one, we obtain the Google matrix: Notice now that G is stochastic as it is a combination of stochastic matrices. Furthermore, all the entries of G are positive, which implies that G is both primitive and irreducible. Therefore, G has a unique stationary vector I that may be found using the power method. The role of the parameter is an important one. Notice that if , then G = S. This means that we are working with the original hyperlink structure of the web. However, if , then . In other words, the web we are considering has a link between any two pages and we have lost the original hyperlink structure of the web. Clearly, we would like to take close to 1 so that we hyperlink structure of the web is weighted heavily into the computation. However, there is another consideration. Remember that the rate of convergence of the power method is governed by the magnitude of the second eigenvalue . For the Google matrix, it has been proven that the magnitude of the second eigenvalue . This means that when is close to 1 the convergence of the power method will be very slow. As a compromise between these two competing interests, Serbey Brin and Larry Page, the creators of PageRank, chose .
How to tell who's important
If you've ever created a web page, you've probably included links to other pages that contain valuable, reliable information. By doing so, you are affirming the importance of the pages you link to. Google's PageRank algorithm stages a monthly popularity contest among all pages on the web to decide which pages are most important. The fundamental idea put forth by PageRank's creators, Sergey Brin and Lawrence Page, is this: the importance of a page is judged by the number of pages linking to it as well as their importance. We will assign to each web page P a measure of its importance I(P), called the page's PageRank. At various sites, you may find an approximation of a page's PageRank. (For instance, the home page of The American Mathematical Society currently has a PageRank of 8 on a scale of 10. Can you find any pages with a PageRank of 10?) This reported value is only an approximation since Google declines to publish actual PageRanks in an effort to frustrate those would manipulate the rankings. Here's how the PageRank is determined. Suppose that page Pj has lj links. If one of those links is to page Pi, then Pj will pass on 1/lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to Pi by Bi, then This may remind you of the chicken and the egg: to determine the importance of a page, we first need to know the importance of all the pages linking to it. However, we may recast the problem into one that is more mathematically familiar. Let's first create a matrix, called the hyperlink matrix, in which the entry in the ith row and jth column is Notice that H has some special properties. First, its entries are all nonnegative. Also, the sum of the entries in a column is one unless the page corresponding to that column has no links. Matrices in which all the entries are nonnegative and the sum of the entries in every column is one are called stochastic; they will play an important role in our story. We will also form a vector whose components are PageRanks--that is, the importance rankings--of all the pages. The condition above defining the PageRank may be expressed as In other words, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call this a stationary vector of H. Let's look at an example. Shown below is a representation of a small collection (eight) of web pages with links represented by arrows. The corresponding matrix iswith stationary vector |
Computing I
There are many ways to find the eigenvectors of a square matrix. However, we are in for a special challenge since the matrix H is a square matrix with one column for each web page indexed by Google. This means that H has about n = 25 billion columns and rows. However, most of the entries in H are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in every column are zero. We will choose a method known as the power method for finding the stationary vector I of the matrix H. How does the power method work? We begin by choosing a vector I 0 as a candidate for I and then producing a sequence of vectors I k by The method is founded on the following general principle that we will soon investigate. General principle: The sequence I k will converge to the stationary vector I. |
I 0 | I 1 | I 2 | I 3 | I 4 | ... | I 60 | I 61 |
1 | 0 | 0 | 0 | 0.0278 | ... | 0.06 | 0.06 |
0 | 0.5 | 0.25 | 0.1667 | 0.0833 | ... | 0.0675 | 0.0675 |
0 | 0.5 | 0 | 0 | 0 | ... | 0.03 | 0.03 |
0 | 0 | 0.5 | 0.25 | 0.1667 | ... | 0.0675 | 0.0675 |
0 | 0 | 0.25 | 0.1667 | 0.1111 | ... | 0.0975 | 0.0975 |
0 | 0 | 0 | 0.25 | 0.1806 | ... | 0.2025 | 0.2025 |
0 | 0 | 0 | 0.0833 | 0.0972 | ... | 0.18 | 0.18 |
0 | 0 | 0 | 0.0833 | 0.3333 | ... | 0.295 | 0.295 |
Three important questions
Three questions naturally come to mind:- Does the sequence I k always converge?
- Is the vector to which it converges independent of the initial vector I 0?
- Do the importance rankings contain the information that we want?
with matrix |
I 0 | I 1 | I 2 | I 3=I |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
A probabilitistic interpretation of H
Imagine that we surf the web at random; that is, when we find ourselves on a web page, we randomly follow one of its links to another page after one second. For instance, if we are on page Pj with lj links, one of which takes us to page Pi, the probability that we next end up on Pi page is then . As we surf randomly, we will denote by the fraction of time that we spend on page Pj. Then the fraction of the time that we end up on page Pi page coming from Pj is . If we end up on Pi, we must have come from a page linking to it. This means that where the sum is over all the pages Pj linking to Pi. Notice that this is the same equation defining the PageRank rankings and so . This allows us to interpret a web page's PageRank as the fraction of time that a random surfer spends on that web page. This may make sense if you have ever surfed around for information about a topic you were unfamiliar with: if you follow links for a while, you find yourself coming back to some pages more often than others. Just as "All roads lead to Rome," these are typically more important pages. Notice that, given this interpretation, it is natural to require that the sum of the entries in the PageRank vector I be one. Of course, there is a complication in this description: If we surf randomly, at some point we will surely get stuck at a dangling node, a page with no links. To keep going, we will choose the next page at random; that is, we pretend that a dangling node has a link to every other page. This has the effect of modifying the hyperlink matrix H by replacing the column of zeroes corresponding to a dangling node with a column in which each entry is 1/n. We call this new matrix S. In our previous example, we now havewith matrix | and eigenvector |
How does the power method work?
In general, the power method is a technique for finding an eigenvector of a square matrix corresponding to the eigenvalue with the largest magnitude. In our case, we are looking for an eigenvector of S corresponding to the eigenvalue 1. Under the best of circumstances, to be described soon, the other eigenvalues of S will have a magnitude smaller than one; that is, S are S are and that |\lambda_2| \geq |\lambda_3| \geq \ldots \geq |\lambda_n| \] " src="http://www.ams.org/featurecolumn/images/december2006/index_15.gif" title="\[ 1 = \lambda_1 > |\lambda_2| \geq |\lambda_3| \geq \ldots \geq |\lambda_n| \] " align="absmiddle"> We will also assume that there is a basis vj of eigenvectors for S with corresponding eigenvalues . This assumption is not necessarily true, but with it we may more easily illustrate how the power method works. We may write our initial vector I 0 as Then Since the eigenvalues with have magnitude smaller than one, it follows that if and therefore , an eigenvector corresponding to the eigenvalue 1. It is important to note here that the rate at which is determined by . When is relatively close to 0, then relatively quickly. For instance, consider the matrix The eigenvalues of this matrix are and . In the figure below, we see the vectors I k, shown in red, converging to the stationary vector I shown in green. Now consider the matrix Here the eigenvalues are and . Notice how the vectors I k converge more slowly to the stationary vector I in this example in which the second eigenvalue has a larger magnitude.When things go wrong
In our discussion above, we assumed that the matrix S had the property that and S is S is Then we seeI 0 | I 1 | I 2 | I 3 | I 4 | I 5 |
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 |
with stationary vector |
Computing I
What we've described so far looks like a good theory, but remember that we need to apply it to matrices where n is about 25 billion! In fact, the power method is especially well-suited to this situation. Remember that the stochastic matrix S may be written as and therefore the Google matrix has the form Therefore, Now recall that most of the entries in H are zero; on average, only ten entries per column are nonzero. Therefore, evaluating HI k requires only ten nonzero terms for each entry in the resulting vector. Also, the rows of A are all identical as are the rows of 1. Therefore, evaluating AI k and 1I k amounts to adding the current importance rankings of the dangling nodes or of all web pages. This only needs to be done once. With the value of chosen to be near 0.85, Brin and Page report that 50 - 100 iterations are required to obtain a sufficiently good approximation to I. The calculation is reported to take a few days to complete. Of course, the web is continually changing. First, the content of web pages, especially for news organizations, may change frequently. In addition, the underlying hyperlink structure of the web changes as pages are added or removed and links are added or removed. It is rumored that Google recomputes the PageRank vector I roughly every month. Since the PageRank of pages can be observed to fluctuate considerably during this time, it is known to some as the Google Dance. (In 2002, Google held a Google Dance!)Summary
Brin and Page introduced Google in 1998, a time when the pace at which the web was growing began to outstrip the ability of current search engines to yield useable results. At that time, most search engines had been developed by businesses who were not interested in publishing the details of how their products worked. In developing Google, Brin and Page wanted to "push more development and understanding into the academic realm." That is, they hoped, first of all, to improve the design of search engines by moving it into a more open, academic environment. In addition, they felt that the usage statistics for their search engine would provide an interesting data set for research. It appears that the federal government, which recently tried to gain some of Google's statistics, feels the same way. There are other algorithms that use the hyperlink structure of the web to rank the importance of web pages. One notable example is the HITS algorithm, produced by Jon Kleinberg, which forms the basis of the Teoma search engine. In fact, it is interesting to compare the results of searches sent to different search engines as a way to understand why some complain of a Googleopoly.Better Search engine
Search Engine name http://www.gahooyoogle.com/ *** http://www.jux2.com/ ***** http://www.dogpile.com/ ***** http://www.sputtr.com/ ***** http://www.spacetime.com/ http://mindset.research.yahoo.com/ http://www.mooter.com/ http://qunu.com/ http://www.webcrawler.com/info.wbcrwl/ http://www.scirus.com/srsapp/ http://www.randomwebsearch.com/ http://www.soople.com/soople_int.php | URL | Category |
Accoona | www.accoona.com | A.I. Search (HM) |
AfterVote (SEM) | www.aftervote.com | Social Search |
Agent 55 | www.agent55.com | MetaSearch |
AllTha.at | www.allth.at | Continuous Search |
AnswerBus | www.answerbus.com | Semantic Search |
Blabline | www.blabline.com | Podcast Search |
Blinkx* | www.blinkx.com | Video Search |
Blogdigger | www.blogdigger.com | Blog Search |
Bookmach.com* | www.bookmach.com | Bookmark Search |
ChaCha* (#1 2006) | www.chacha.com | Guided Search |
ClipBlast!* | www.clipblast.com | Video Search |
Clusty* | www.clusty.com | Clustering Search |
CogHog | www.infactsolutions.com/projects/coghog/demo.htm | Semantic Search |
Collarity* | www.collarity.com | Social Search (HM) |
Congoo* | www.congoo.com | Premium Content Search |
CrossEngine (Mr. Sapo)* | www.crossengine.com | MetaSearch |
Cydral | http://en.cydral.com | Image Search (French) |
Decipho* | www.decipho.com | Filtered Search |
Deepy | www.deepy.com | RIA Search |
Ditto* | www.ditto.com | Visual Search |
Dogpile | www.dogpile.com | MetaSearch |
Exalead* | www.exalead.com/search | Visual Search |
Factbites* | www.factbites.com | Filtered Search |
FeedMiner | www.feedminer.com | RSS Feeds Search |
Feedster | www.feedster.com | RSS Feeds Search |
Filangy | www.filangy.com | Social Search |
Find Forward | www.findforward.com | Meta Feature Search |
FindSounds* | www.findsounds.com | Audio Search |
Fisssh! | www.fisssh.com | Filtered Search (HM) |
FyberSearch | www.fybersearch.com | Meta Feature Search |
Gigablast* | www.gigablast.com | Blog Search |
Girafa* | www.girafa.com | Visual Display |
Gnosh | www.gnosh.org | Meta Search |
GoLexa | www.golexa.com | Meta Feature Search |
GoshMe* (SEM) | www.goshme.com | Meta Meta Search |
GoYams* | www.goyams.com | Meta Search |
Grokker* | www.grokker.com | Meta Search |
Gruuve | www.gruuve.com | Recommendation Search |
Hakia | www.hakia.com | Meaning Based Search |
Hyper Search | http://hypersearch.webhop.org.90.seekdotnet.com | Filtered Search |
iBoogie | www.iboogie.com | Clustering Search |
IceRocket* | www.icerocket.com | Blog Search |
Info.com | www.info.com | MetaSearch |
Ixquick* | www.ixquick.com | Meta Search |
KartOO* | www.kartoo.com | Clustering Search |
KoolTorch (SEM) | www.kooltorch.com | Clustering Search |
Lexxe* | www.lexxe.com | Natural Language Processing (NLP) |
Lijit | www.lijit.com | Search People |
Like* | www.like.com | Visual Search |
LivePlasma* | www.liveplasma.com | Recommendation Search (HM) |
Local.com* | www.local.com | Local Search |
Mamma | www.mamma.com | MetaSearch |
Mnemomap | www.mnemo.org | Clustering Search |
Mojeek* | www.mojeek.com | Custom Search Engines (CSE) |
Mooter* | www.mooter.com | Clustering Search |
Mp3Realm | http://mp3realm.org | MP3 Search |
Mrquery | www.mrquery.com | Clustering Search |
Ms. Dewey* | www.msdewey.com | Unique Interface (HM) |
Nutshell | www.gonutshell.com | MetaSearch |
Omgili | www.omgili.com | Social Search |
Pagebull* | www.pagebull.com | Visual Display |
PeekYou | www.peekyou.com | People Search |
Pipl | http://pipl.com | People Search |
PlanetSearch* | www.planetsearch.com | MetaSearch |
PodZinger | www.podzinger.com | Podcast Search |
PolyMeta | www.polymeta.com | MetaSearch |
Prase | www.prase.us | MetaSearch |
PureVideo | www.purevideo.com | Video Search (HM) |
Qksearch | www.qksearch.com | Clustering Search |
Querycat | http://querycat.com | F.A.Q. Search (HM) |
Quintura* | www.quintura.com | Clustering Search |
RedZee | www.redzee.com | Visual Display |
Retrievr | http://labs.systemone.at/retrievr/ | Visual Search |
Searchbots | www.searchbots.net | Continuous Search |
SearchKindly | www.searchkindly.org | Charity Search |
Searchles* (DumbFind) | www.searchles.com | Social Search |
SearchTheWeb2* | www.searchtheweb2.com | Long Tail Search |
SeeIt | www.seeit.com | Image Search |
Sidekiq* | www.sidekiq.com | MetaSearch |
Slideshow* | http://slideshow.zmpgroup.com/ | Visual Display |
Slifter* | www.slifter.com | Mobile Shopping Search (HM) |
Sphere | www.sphere.com | Blog Search |
Sproose | www.sproose.com | Social Search |
Srchr* | www.srchr.com | MetaSearch |
SurfWax* | www.surfwax.com | Meaning Based Search |
Swamii | www.swamii.com | Continuous Search (HM) |
TheFind.com* | www.thefind.com | Shopping Search |
Trexy* | www.trexy.com | Search Trails |
Turboscout* | www.turboscout.com | MetaSearch |
Twerq | www.twerq.com | Tabbed Results |
Url.com* | www.url.com | Social Search |
WasaLive! | http://en.wasalive.com | RSS Search |
Web 2.0* | www.web20searchengine.com | Web 2.0 Search |
Webbrain* | www.webbrain.com | Clustering Search |
Whonu?* | www.whonu.com | MetaSearch |
Wikio* | www.wikio.com | Web 2.0 Search |
WiseNut* | www.wisenut.com | Clustering Search |
Yoono* | www.yoono.com | Social Search |
ZabaSearch* | www.zabasearch.com | People Search |
Zuula* | www.zuula.com | Tabbed Search (HM) |
No comments:
Post a Comment