Google Hacking - Is your web application secure?

By Paladion

July 15, 2005

Google hacking is a term that refers to applying advanced searching techniques to access unauthorized information through any search engine. In this article we look at some of the vulnerabilities that these techniques exploit and how to safeguard applications from being compromised.

Google Hacking

Web application owners want more and more people to visit their websites every day, and today's search engines make that job easier. Website owners also use search engine optimization techniques to raise their sites' rankings in Google, which brings in more traffic and more visitors. That is the positive side of using Google.

Ironically, hackers seem to get more out of your websites through search engines than legitimate visitors do. For quite some time there has been a lot of buzz about "Google hacking": applying advanced searching techniques to access unauthorized information through any search engine. The term is named after Google because it is the most popular and most advanced search engine available on the Internet.

Some famous instances of Google hacking show how easy it is to hack into websites and compromise the confidentiality of data. The first instance of Google hacking being used against websites came to light in mid-2000. Later, an article published in the online magazine The Register reported that "while searching Google for a vulnerability in Cisco IOS Web Server, Ryan Russell, a SecurityFocus researcher, followed a link and found himself in a switch belonging to a US .gov site." An ITWeb article also reported that "Barry Cribb, MD of IS Digital Networks, says entering a certain string into the Google search window will get a list of about 38,000 sites with admin login pages." In late 2004, a worm known as "Santy" started spreading by using Google to search for online bulletin boards running a vulnerable version of the community forum software PHP Bulletin Board (phpBB).

How Google hacking works

Features such as domain-, page- and file-format-specific searches, which advanced search engines provide, open up many nefarious possibilities for malicious Internet users such as hackers, computer criminals, identity thieves and even terrorists who want to access confidential information. They do this with the special characters and advanced operators that search engines offer to give users a better search experience and more accurate results. Special characters such as '+' and '-', the Boolean OR, and operators such as cache, filetype, link, site, intitle and inurl are the main ones used in Google hacking. Some examples of how these operators can be used for Google hacking are:

  1. site:gov secret - this query searches every website in the .gov domain for the word 'secret'.
  2. intitle:index.of "parent directory" - this query returns web pages where directory listing is enabled and that contain the keywords 'parent directory'.
  3. filetype:doc "for internal use only" - this query searches for Word documents containing the keywords 'for internal use only'.
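
The three queries above all follow the same pattern: one or more operator:value pairs followed by free-text terms. As a rough illustration only, the Python sketch below composes such strings. The function name build_query is hypothetical and is not part of any Google API; the sketch only builds text and does not contact a search engine.

```python
def build_query(terms="", **operators):
    """Compose a search string from operator:value pairs plus free-text terms.

    Hypothetical helper for illustration; it performs no network access.
    """
    parts = []
    for name, value in operators.items():
        # Quote operator values that contain spaces so they stay one phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    if terms:
        parts.append(terms)
    return " ".join(parts)


# Rebuild the three example queries from the list above.
print(build_query("secret", site="gov"))
print(build_query('"parent directory"', intitle="index.of"))
print(build_query('"for internal use only"', filetype="doc"))
```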

Other forms of Google hacking include directory listing, vulnerability scanning, fingerprinting and automated scanning. Fingerprinting is mainly a concern for system administrators, while directory listing and vulnerability scanning are of more concern to web application owners.

Vulnerabilities exploited by Google hackers

Sensitive pages residing on the websites

Most application owners do not realize that nothing on a public website is truly private. Sensitive pages are often placed on web servers in the belief that, because no other page links to them, only the administrators know they exist. In truth, the moment these pages are loaded onto the server they are available for public access. With specifically crafted search terms, it is quite easy to find such pages through a search engine, which can cause catastrophic damage to the company that owns the website. Sometimes sensitive pages are put on a public website only temporarily, but by the time they are removed Google has already indexed and cached them. Attackers can then read those pages without even visiting the website. A query that can be used for this purpose is confidential or "for internal use only"; opening the cached copy of a result retrieves the page from the search engine's cache without connecting to the original site.

Vulnerable files present on the server

There have also been cases where application owners have put confidential files on the web server in the belief that Internet users can reach only web pages. Unfortunately, this is a myth. Any system that is accessible through the Internet and not completely secured is open to attack, and any information on it can be compromised, whether it is an HTML page or any other file type.

Searching for specific file types, combined with specially crafted search terms, helps attackers reach such vulnerable files on the web server. Common search strings used by Google hackers to identify these files are ext:doc confidential (which retrieves Word documents containing the word 'confidential') and filetype:pdf "for internal use only".

Another web server feature that Google hackers can use for file searching is directory listing. Application owners mostly use it to let users view and download files from the directory tree. Sometimes directory listings are not created intentionally: a misconfigured web server generates a listing when the index or main web page is missing. In other cases, directory listings are created for temporary storage of files. Either way, directory listings are one of the favorite targets of Google hackers. An obvious search query to identify them is intitle:index.of, which returns pages having 'index.of' in the title of the document. Adding keywords such as 'parent directory', 'name' and 'size' to this query gives Google hackers much more accurate results. Directory listings also serve as a starting point for other techniques such as fingerprinting and versioning.
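
The same clues a Google hacker searches for can be turned around to check your own pages. The Python sketch below is a heuristic only, assuming a listing page whose title begins with "Index of" and whose body mentions 'parent directory', 'name' or 'size'. The function name and the threshold of two keywords are assumptions made for illustration, not a standard.

```python
import re

# Capture the contents of the first <title> element, ignoring case.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Keywords the article lists as refining an intitle:index.of search.
LISTING_KEYWORDS = ("parent directory", "name", "size")


def looks_like_directory_listing(html):
    """Heuristically decide whether an HTML page is an auto-generated listing."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip().lower() if match else ""
    body = html.lower()
    keyword_hits = sum(keyword in body for keyword in LISTING_KEYWORDS)
    # Require the tell-tale title plus at least two of the keywords.
    return title.startswith("index of") and keyword_hits >= 2
```

A page that passes this check on a production server is worth reviewing by hand, since the heuristic can produce both false positives and false negatives.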

Malicious use of website's own search engine

As mentioned earlier, Google hacking does not stop at Google. Any search engine that supports advanced features can be used as an attack tool, including the search feature that many websites provide for searching their own content. When used maliciously, these site search engines can themselves expose sensitive pages on the server. One of the main reasons is that web application owners leave them out of testing, believing that a search feature by itself poses no threat to the application. Google hackers exploit that oversight to launch attacks on the website. It is therefore very important to understand how the site's search engine works and to test it for vulnerabilities. The Web Application Security Consortium recently published an article on attacks against local search engines.

Prevention strategies against Google hacking

Awareness about sensitive pages

The first step in preventing Google hacking is for application owners to understand the implications of unauthorized access to sensitive pages on the web server. Before uploading a page to the web server, it is very important to analyze its contents and ensure that it holds no sensitive information that could compromise the company's confidentiality.

Use of META tags

There are several mechanisms to prevent search engines from crawling and indexing specific parts of a website. The first is to use META tags within the pages. Application developers can place these tags on pages that contain sensitive information so that search engine robots do not index or cache them. Although not all robots follow the META tag standards, most of the popular search engines are compliant with them. A sample META tag to prevent caching of a page looks like this: <meta name="GOOGLEBOT" content="NOARCHIVE" /> - where 'GOOGLEBOT' is the search engine robot used by Google. The names of the robots used by different search engines can be found here.
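
Before relying on META tags, it helps to confirm that each sensitive page actually carries one. The following Python sketch, using only the standard library's html.parser, collects the directives found in robots-style META tags. The class and function names are hypothetical, and the set of robot names checked ('robots' and 'googlebot') is an assumption that should be extended for other search engines.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"|"googlebot" content="...">."""

    ROBOT_NAMES = ("robots", "googlebot")  # assumed list; extend as needed

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name in self.ROBOT_NAMES:
            # A content value may hold several comma-separated directives.
            tokens = {t.strip() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def robots_directives(html):
    """Return a dict mapping robot name to the set of directives on the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="GOOGLEBOT" content="NOARCHIVE" /></head></html>'
print(robots_directives(page))
```

An empty result for a page that was meant to stay out of the cache means the tag is missing or misspelled.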

Access control using robots.txt

Another method is to use a 'robots.txt' file to tell robots not to scan parts of the site. This method is most useful for mitigating the risks associated with directory listing. The first step is to disable the directory listing feature on the web server, so that a listing cannot be generated on its own. The next step is to create the 'robots.txt' file in the root directory of the site. The file resembles access control on the directories of the website, but it is only a request: search engines whose robots honor the standard will not crawl the parts of the site listed in it, while a robot that ignores the standard is not stopped. It is therefore important both to list every directory that is to be excluded from crawling and to ensure that the site's own security protects the files. In the 'robots.txt' file, the first line defines the User-agent (the name of the robot) and the next line contains the Disallow field, which specifies the directories and files that the search engine's robots should not scan. A sample 'robots.txt' file looks like:

User-agent: *   # wildcard referring to all search robots
Disallow: /norobots/
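
To check that a 'robots.txt' file excludes what you intend, the rules can be evaluated the way a compliant robot would read them. The sketch below uses Python's standard urllib.robotparser module on the sample rules above. The paths being tested and the helper name allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from the article, supplied as text instead of fetched.
RULES = """\
User-agent: *
Disallow: /norobots/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="*"):
    """Return True if a compliant robot named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


# Hypothetical paths, used only to exercise the rules.
print(allowed("/norobots/secret.doc"))
print(allowed("/index.html"))
```

This only shows what a well-behaved robot will do. The file itself is publicly readable, so the directories it names still need real access control on the server.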

With this approach it is also essential to follow secure coding practices and carry out security code reviews. The OWASP Guide gives a better understanding of secure coding practices. Appropriate care must be taken to ensure the security of the applications and file systems.

Removal of already-indexed pages from Google

Google offers several options for dealing with pages that are already indexed and cached. Once you enable one of the above methods to prevent Google from indexing, any pages that are already in the cache will be removed the next time Google crawls your site. You can also request immediate removal of the content from Google's index. To do so, first create the META tags or the 'robots.txt' file on the web server, then register with a Google account and submit a request through Google's automatic URL removal system. A complete explanation of all the removal options is available here.

Google hack - Try it yourself

The best-known technique for protecting websites against Google hacking is for web application owners to try to hack their own sites. Tools and applications that automate Google hacking are available on the Internet, and many of them are written using the Google APIs. A Google account is needed to obtain a license key for using the APIs. Some of the most popular tools are Foundstone Sitedigger, Apollo 2.0, Athena, and Wikto. Most of these tools use the Google Hacking Database (GHDB) to perform their automated searches.
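
A manual self-test can start from the queries used earlier in this article, each restricted to your own domain with the site operator. The Python sketch below only prepares the query strings and matching search URLs for you to open by hand. It does not call the Google APIs or any of the tools named above. The domain example.com, the function names and the choice of patterns are placeholders, and the URL assumes the familiar q= search parameter.

```python
from urllib.parse import quote_plus

# Query patterns taken from the examples in this article.
PATTERNS = [
    'intitle:index.of "parent directory"',
    'filetype:doc "for internal use only"',
    'filetype:pdf "for internal use only"',
    "ext:doc confidential",
]


def self_audit_queries(domain):
    """Restrict each pattern to one site so only your own pages are searched."""
    return [f"site:{domain} {pattern}" for pattern in PATTERNS]


def search_url(query):
    """Build a search URL to open manually in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


# example.com is a placeholder; substitute a domain you own.
for query in self_audit_queries("example.com"):
    print(query)
    print(search_url(query))
```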

The Google Hacking Database (GHDB) is a collection of known Google hacks contributed to the public by the Google hacking community. GHDB is maintained by Johnny Long, a security researcher at CSC and one of the most well-known experts in the field of Google hacking. GHDB is one of the best resources available on the Internet for search engine hacking.

Even though Google hacking originates from search engines, it is better to understand its implications for web applications and protect them directly than to expect the search engines to guard against Google hackers.
