How to Copy Website Content Using the wget Command in Linux

This post was last updated on July 9th, 2020 at 05:21 pm

Similar to wget-ftp-index.html

Need a directory listing for a web site that is presented like an FTP site, as a tree of "Index of" pages? Here’s my first working run over HTTP.

I did the following:

# Create an empty index.html first, so that wget saves the first page as index.html.1 rather than index.html
mkdir /tmp/theserver
cd /tmp/theserver
: > index.html

# Grab the first page, it will be saved as index.html.1
wget http://theserver

# The script I used: it starts at page 1 and keeps going until there are no more pages
COUNT=1; while :; do FILES=$(grep 'a href' "index.html.$COUNT" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep '/$' | grep -v '^/$'); if [ -n "$FILES" ]; then PAGE=$(grep '<b>Index of' "index.html.$COUNT" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'); echo "$FILES" | while read -r DIR; do URL="http://theserver$PAGE$DIR"; echo "GETTING: $URL"; wget "$URL"; done; fi; COUNT=$((COUNT + 1)); if [ ! -e "index.html.$COUNT" ]; then break; fi; done

# Here I’ve split it for easy reading:

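COUNT=1
while :; do
     # Links on this page that end in "/" (subdirectories), skipping the bare "/" link
     FILES=$(grep 'a href' "index.html.$COUNT" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep '/$' | grep -v '^/$')
     if [ -n "$FILES" ]; then
          # Directory this page lists, normalized to one leading and one trailing slash
          PAGE=$(grep '<b>Index of' "index.html.$COUNT" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,')
          echo "$FILES" | while read -r DIR; do
               URL="http://theserver$PAGE$DIR"
               echo "GETTING: $URL"
               # wget saves each new page as the next free index.html.N
               wget "$URL"
          done
     fi
     COUNT=$((COUNT + 1))
     # Stop once there is no next saved page to read
     if [ ! -e "index.html.$COUNT" ]; then
          break
     fi
done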

Every page is saved in the same directory as index.html.N, where N increases by one for each page fetched until there are no more pages.
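The extraction steps inside the loop can be tried offline before pointing the script at a real server. The following is a minimal sketch, assuming the listing marks its heading as `<b>Index of ...` and puts one link per line; the sample page and the helper names `list_subdirs` and `page_path` are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical listing page; the markup on a real server may differ.
workdir=$(mktemp -d)
cat > "$workdir/index.html.1" <<'EOF'
<html><head><title>Index of /software/Docs</title></head><body>
<b>Index of /software/Docs</b>
<a href="linux/">linux/</a>
<a href="windows/">windows/</a>
<a href="readme.txt">readme.txt</a>
<a href="/">Home</a>
</body></html>
EOF

# Print the hrefs that end in "/" (subdirectories), skipping the bare "/" link.
list_subdirs() {
    grep 'a href' "$1" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep '/$' | grep -v '^/$'
}

# Print the directory named on the "<b>Index of" line, normalized to
# exactly one leading and one trailing slash.
page_path() {
    grep '<b>Index of' "$1" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'
}

subdirs=$(list_subdirs "$workdir/index.html.1")
path=$(page_path "$workdir/index.html.1")
echo "$subdirs"
echo "$path"
```

On this sample the sketch prints linux/ and windows/, then /software/Docs/. The plain file and the bare "/" link are dropped, so only subdirectories would be queued for wget.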

Here is a sample of the pages it was reading:

Index of /software/Docs

      Name                    Last modified       Size

[DIR] Parent Directory        30-Aug-2004 14:41      -
[DIR] linux/                  09-Jan-2004 15:07      -
[DIR] windows/                30-Nov-2004 20:04      -

The saved pages can then be searched any way you want. To begin, here’s a sample search:

cd /tmp/theserver
grep linux *
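Because every page is saved as index.html.N, a grep hit alone does not say which directory it came from. One possible way to label the files is sketched below, assuming each saved page carries the same `<b>Index of ...` line the script relies on; the function name `label_pages` and the two sample pages are hypothetical.

```shell
#!/bin/sh
# Hypothetical saved pages standing in for the real index.html.N files.
workdir=$(mktemp -d)
printf '%s\n' '<b>Index of /software/Docs</b>' \
    '<a href="linux/">linux/</a>' > "$workdir/index.html.1"
printf '%s\n' '<b>Index of /software/Docs/linux</b>' \
    '<a href="kernel/">kernel/</a>' > "$workdir/index.html.2"

# Print "file<TAB>directory" for every numbered page in the given directory.
label_pages() {
    for f in "$1"/index.html.*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        dir=$(grep '<b>Index of' "$f" | awk '{print $3}' | sed -e 's,<.*$,,')
        printf '%s\t%s\n' "${f##*/}" "$dir"
    done
}

labels=$(label_pages "$workdir")
echo "$labels"
```

For the real crawl, run `label_pages /tmp/theserver` and pipe the result through grep to find which numbered file holds a given directory.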


About author

Sibananda Sahu
