<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>insanesecurity &#187; isf</title>
	<atom:link href="http://insanesecurity.info/blog/tag/isf/feed" rel="self" type="application/rss+xml" />
	<link>http://insanesecurity.info/blog</link>
	<description>security through a distorted eye</description>
	<lastBuildDate>Thu, 25 Feb 2010 22:31:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>WGet all the way</title>
		<link>http://insanesecurity.info/blog/wget-all-the-way</link>
		<comments>http://insanesecurity.info/blog/wget-all-the-way#comments</comments>
		<pubDate>Tue, 07 Jul 2009 17:04:06 +0000</pubDate>
		<dc:creator>dblackshell</dc:creator>
				<category><![CDATA[How To]]></category>
		<category><![CDATA[Toolbox]]></category>
		<category><![CDATA[isf]]></category>

		<guid isPermaLink="false">http://insanesecurity.info/blog/?p=121</guid>
		<description><![CDATA[There are a couple of security auditing frameworks out there, and the temptation to create your own is high, whether in Perl, Ruby, Python or, why not, PHP. Needless to say, I too was tempted to create my own framework. Ideas kept flowing in, the project had been started and then BAM, [...]]]></description>
			<content:encoded><![CDATA[<p>There are a couple of security auditing frameworks out there, and the temptation to create your own is high, whether in Perl, Ruby, Python or, why not, PHP.</p>
<p>Needless to say, I too was tempted to create my own framework. Ideas kept flowing in, the project had been started and then BAM, I read an interesting article on <a href="http://www.gnucitizen.org/blog/you-dont-need-the-ultimate-pen-testing-framework/">GNUCITIZEN</a>, which made me rethink my strategy&#8230;</p>
<p>One of the comments pointed it out very well:</p>
<blockquote><p>most of the stuff we need is on the shell already. pentesting frameworks is like the new security-testing hype. first we had hundreds of portscanners, then hundreds of webapp MiTM proxies, then hundreds of fuzzers, then hundreds of SQL injectors, now it’s about pentesting frameworks :)</p></blockquote>
<p>So instead of writing redundant code, I started learning the command line tools that are already available, which have years of development behind them and cover almost everything they need to.</p>
<p>Basically I&#8217;m building my framework around already available tools, and only coding up things that do not exist yet, or that handle some very particular cases.<br />
<span id="more-121"></span></p>
<h2>So why WGet?</h2>
<p>Well, I had to start my series of articles with something (it&#8217;s gonna be a series), and <code>wget</code> seemed to be a good starting point.</p>
<p>If you&#8217;ve never dealt with <code>wget</code> (which I sincerely doubt), the following description sums it up best:</p>
<blockquote><p>GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc</p></blockquote>
<p>Without further useless rambling let&#8217;s see in which scenarios you would use wget; apart from downloading <code>psyBNC</code> archives, like seen on many h4x00r websites.</p>
<h2>Website crawling</h2>
<p>There are a couple of tools that facilitate website crawling. I even mentioned one in my <a href="http://insanesecurity.info/blog/intercepting-proxies">Intercepting Proxies?</a> article, with the difference that liveHTTPHeaders is used for passive crawling of websites&#8230;</p>
<p>So how would we go about crawling a website with <code>wget</code>?</p>
<pre>wget -r -nd --spider -o links.txt http://insanesecurity.info</pre>
<p>Where:</p>
<ul>
<li>r &#8211; recursive crawling</li>
<li>nd &#8211; don&#8217;t create directories</li>
<li>spider &#8211; do not save pages; each one is discarded once its links have been collected</li>
<li>o &#8211; write wget&#8217;s log messages to the file links.txt</li>
</ul>
<p>But what if we want to restrict the crawl to a single directory, and filter out CSS, image and JavaScript files?</p>
<pre>wget -r -nd --spider -o links.txt -np -R js,css,jpg,png,gif http://insanesecurity.info/blog/</pre>
<p>Where:</p>
<ul>
<li>np &#8211; do not go to parent directory</li>
<li>R &#8211; one or more extensions to reject (comma separated)</li>
</ul>
<p>After the command finishes, the content of <code>links.txt</code> will look like this (assuming you&#8217;ve run the first command):</p>
<pre>Spider mode enabled. Check if remote file exists.
--2009-07-07 16:46:58--  http://insanesecurity.info/
Resolving insanesecurity.info... 93.115.201.3
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://insanesecurity.info/blog/ [following]
Spider mode enabled. Check if remote file exists.
--2009-07-07 16:47:01--  http://insanesecurity.info/blog/
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2009-07-07 16:47:01--  http://insanesecurity.info/blog/
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'</pre>
<p>From this point, retrieving the links is just a matter of using <code>grep</code>, <code>cut</code>, <code>sort</code> and <code>uniq</code>.</p>
<pre>cat links.txt | grep -P "\-\-\d{4}" | cut -d " " -f 4 | sort | uniq</pre>
<p>And the output would look like this:</p>
<pre>http://insanesecurity.info/

http://insanesecurity.info/2009/01/hacking-yahoogmailhotmail-accounts-a-z-guide/

http://insanesecurity.info/2009/01/javascript-userscript-keylogger/

http://insanesecurity.info/2009/01/logging-the-http-requests/

http://insanesecurity.info/2009/01/password-insecurity-wordlists-dictionaries/

http://insanesecurity.info/2009/01/the-future-of-av-or-not/

http://insanesecurity.info/2009/01/the-hackers-underground-handbook-review/

http://insanesecurity.info/2009/01/useratuh-frontend-to-backend-encryption/</pre>
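<p>The extraction step can be tried without touching the network. The sketch below is only an illustration: it feeds a few log lines copied from the excerpt above through <code>awk</code> instead of <code>grep -P</code> and <code>cut</code>, on the assumption that you may be on a system where <code>grep</code> lacks the GNU-only <code>-P</code> switch. The variable names <code>log</code> and <code>urls</code> are made up for the example.</p>

```shell
#!/bin/sh
# A few lines from the wget log shown earlier stand in for links.txt.
log='Spider mode enabled. Check if remote file exists.
--2009-07-07 16:46:58--  http://insanesecurity.info/
Resolving insanesecurity.info... 93.115.201.3
--2009-07-07 16:47:01--  http://insanesecurity.info/blog/
HTTP request sent, awaiting response... 200 OK
--2009-07-07 16:47:01--  http://insanesecurity.info/blog/'

# Keep only the request lines (they start with "--YYYY-"), print the
# third whitespace-separated field (the URL) and drop duplicates.
urls=$(printf '%s\n' "$log" \
  | awk '/^--[0-9][0-9][0-9][0-9]-/ { print $3 }' \
  | sort -u)

printf '%s\n' "$urls"
```

<p>Because <code>awk</code> splits on runs of whitespace, the URL is field 3 here, while <code>cut -d " "</code> counts the empty field between the two spaces and therefore needs field 4.</p>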
<h2>Copying websites</h2>
<p>Or website mirroring, as people usually call it.</p>
<p>There are a couple of reasons why you would do this:</p>
<ul>
<li>To have a copy you can carry around on a CD, DVD, memory card, etc., with the links converted to point to local files.</li>
<li>To have content to feed to email scrapers.</li>
</ul>
<p>For our first scenario, you would run <code>wget</code> in the following manner:</p>
<pre>wget -m -k -p -np http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/</pre>
<p>Where:</p>
<ul>
<li>m &#8211; mirror website</li>
<li>k &#8211; convert the links in the downloaded HTML to point to local files</li>
<li>p &#8211; get page dependencies (css, images)</li>
<li>np &#8211; do not go to parent folders</li>
</ul>
<p>As for scraping, we use the simplest of commands:</p>
<pre>
wget http://tinyurl.com/nqa48q
cat downloaded-file | grep -o -P "\w+\[at\]\w+" > emails.txt
</pre>
<p>This way I gathered a list of 40 email addresses&#8230; here is a sample:</p>
<pre>
ureachnirav[at]yahoo
reachjag[at]yahoo
g2[at]g2designindia
g2design[at]rediffmail
archsumitjoshi[at]yahoo
joshi[at]hexagon
rohankarswapnil[at]yahoo
sandy004[at]yahoo
idc[at]iitb
visakan[at]gmail
</pre>
<p>Of course, to replace the [at] in them, we can simply use <code>sed</code>:</p>
<pre>
cat emails.txt | sed -r "s/\[at\]/@/g" > normal-emails.txt
</pre>
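<p>The extract and replace steps can also be chained into a single pipeline. What follows is a rough sketch, not the exact commands used above: the page text and both addresses are invented for illustration, and a POSIX character class replaces <code>\w</code> so that <code>grep -P</code> is not required.</p>

```shell
#!/bin/sh
# Hypothetical page content; the addresses are made up for this example.
page='Contact: alice[at]example or bob[at]example for details.
Duplicate entry: alice[at]example'

# Pull out the obfuscated tokens, drop duplicates, restore the @ sign.
emails=$(printf '%s\n' "$page" \
  | grep -oE '[[:alnum:]_]+\[at\][[:alnum:]_]+' \
  | sort -u \
  | sed 's/\[at\]/@/')

printf '%s\n' "$emails"
```

<p>Like the original commands, this only captures the part of the domain before the first dot, so the results still need a manual look before they are usable.</p>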
<h2>Blog spam</h2>
<p>This is also possible (and simple) to achieve with <code>wget</code> and a little help from <code>grep</code> and <code>sed</code>, but for this one I will not post an example&#8230; There are already hundreds of spammers out there, so why add more to the list? If you are interested enough in <code>wget</code>, you will find out how&#8230;</p>
<p>Of course, for this you will need to script the behavior (bash, bat, perl, python, etc.), or create a lengthy command line.</p>
<h2>FTP Copy</h2>
<p>As mentioned at the beginning of the article, <code>wget</code> works just as well with the FTP protocol.</p>
<pre>
wget -r --ftp-user=anonymous --ftp-password=some@email.com ftp://ftp.ro.freebsd.org/pub/FreeBSD/
</pre>
<p>I think there is no need to explain the command line arguments here; they are pretty obvious.</p>
<h2>Anonymous mode</h2>
<p>By anonymous mode, I mean channeling <code>wget</code> requests through proxy servers. First you need to configure your <code>.wgetrc</code> file. If you haven&#8217;t got one, you may as well create it now.</p>
<pre>
#############################
###
### Sample Wget initialization file .wgetrc
###

## You can use this file to change the default behaviour of wget or to
## avoid having to type many many command-line options. This file does
## not contain a comprehensive list of commands -- look at the manual
## to find out what you can put into this file.
##
## Wget initialization file can reside in /usr/local/etc/wgetrc
## (global, for all users) or $HOME/.wgetrc (for a single user).
##
## To use the settings in this file, you will have to uncomment them,
## as well as change them, in most cases, as the values on the
## commented-out lines are the default values (e.g. "off").

##
## Global settings (useful for setting up in /usr/local/etc/wgetrc).
## Think well before you change them, since they may reduce wget's
## functionality, and make it behave contrary to the documentation:
##

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes).  The
# default quota is unlimited.
#quota = inf

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
#tries = 20

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval.  The default is 5.
#reclevel = 5

# Many sites are behind firewalls that do not allow initiation of
# connections from the outside.  On these sites you have to use the
# `passive' feature of FTP.  If you are behind such a firewall, you
# can turn this on to make Wget use passive FTP by default.
#passive_ftp = off

# The "wait" command below makes Wget wait between every connection.
# If, instead, you want Wget to wait only between retries of failed
# downloads, set waitretry to maximum number of seconds to wait (Wget
# will use "linear backoff", waiting 1 second after the first failure
# on a file, 2 seconds after the second failure, etc. up to this max).
waitretry = 10

##
## Local settings (for a user to set in his $HOME/.wgetrc).  It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

# Set this to on to use timestamping by default:
#timestamping = off

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors).  Wget does *not* send `From:' by default.
#header = From: Your Name 

# You can set up other headers, like Accept-Language.  Accept-Language
# is *not* sent by default.
#header = Accept-Language: en

# You can set the default proxies for Wget to use for http and ftp.
# They will override the value in the environment.
http_proxy = http://1.2.3.4:8080/
#ftp_proxy = http://proxy.yoyodyne.com:18023/

# If you do not want to use proxy at all, set this to off.
use_proxy = on

# You can customize the retrieval outlook.  Valid options are default,
# binary, mega and micro.
#dot_style = default

# Setting this to off makes Wget not download /robots.txt.  Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

# It can be useful to make Wget wait between connections.  Set this to
# the number of seconds you want Wget to wait.
#wait = 0

# You can force creating directory structure, even if a single is being
# retrieved, by setting this to on.
#dirstruct = off

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
#recursive = off

# To always back up file X as X.orig before converting its links (due
# to -k / --convert-links / convert_links = on having been specified),
# set this variable to on:
#backup_converted = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
#follow_ftp = off
</pre>
<p>I saved the file in my <code>C:/Windows</code> folder, but you may save it anywhere else you like. Under Linux you may already have this file in your <code>/etc</code> folder, so just modify it there.</p>
<p>As you may have noticed in the configuration file above (if you&#8217;ve looked closely), I have enabled <code>http_proxy</code> and set up a proxy.</p>
<pre>
set WGETRC=C:/Windows/.wgetrc
wget --proxy=on http://insanesecurity.info/blog/
</pre>
<p>The first line is necessary under Windows if you haven&#8217;t pointed <code>wget</code> at the custom <code>.wgetrc</code> until now, while the second command enables the proxy and executes a request.</p>
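<p>On Linux or any other Unix shell, the same setup might look roughly like the sketch below. It writes a minimal configuration to a temporary file and points <code>wget</code> at it through the <code>WGETRC</code> environment variable. The temporary file and the variable name <code>rc</code> are assumptions made for the example, and <code>1.2.3.4:8080</code> is the placeholder proxy from the configuration above, not a working server.</p>

```shell
#!/bin/sh
# Write a throwaway wgetrc holding only the two proxy settings.
rc=$(mktemp)
printf '%s\n' \
  'use_proxy = on' \
  'http_proxy = http://1.2.3.4:8080/' > "$rc"

# Make wget read that file for the rest of this shell session.
WGETRC=$rc
export WGETRC
echo "wget will read its settings from: $WGETRC"

# With a real proxy address in the file, a request would then be:
#   wget http://insanesecurity.info/blog/
```

<p>Keeping the proxy settings in a separate file like this leaves your regular <code>.wgetrc</code> untouched, which is handy if you only want to go through the proxy now and then.</p>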
<h2>Tweaking it</h2>
<p>This is just a quick intro to the most common uses of <code>wget</code>&#8230; Besides the ones mentioned here, it also comes with a handful of other configuration options which you may want to look into&#8230;</p>
<p>Hopefully this article will be read by those who write scrapers and spiders all day long&#8230; I&#8217;m tired of constantly bumping into those types of scripts :)</p>
]]></content:encoded>
			<wfw:commentRss>http://insanesecurity.info/blog/wget-all-the-way/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
