Scripting: how to extract links from html pages

erikson
Posts: 735
Joined: Wed 27 Feb 2008, 09:22
Location: Ghent, Belgium

#1 Post by erikson »

Sometimes (e.g. for web authors) it may be handy to find and extract all links from a web page.

The pattern to look for is href="some_link" (as a regular expression: href="[^"]*"). Note that several such links may appear on the same line of HTML, and we want to extract them all, one per output line.

Here is a one-liner that does the job:

Code:

cat page.htm | grep -o 'href="[^"]*"'

To extract only the link targets (the some_link part), one can use:

Code:

cat page.htm | grep -o 'href="[^"]*"' | sed 's/.*"\([^"]*\)".*/\1/'

Further commands can be piped on to filter out specific link types, for example grep -v '^#' (to exclude in-page anchors such as #top) or grep -v '^ftp:' (to exclude ftp links).
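
The whole pipeline can be wrapped in a small function. This is only a sketch: the function name extract_links and the sample page are made up for the illustration, and it assumes a grep that supports -o, as the one-liners above do.

```sh
#!/bin/sh
# extract_links FILE: print every href target in FILE, one per line,
# skipping in-page anchors (#...) and ftp links.
# The function name is illustrative, not part of the original post.
extract_links() {
    grep -o 'href="[^"]*"' "$1" |
        sed 's/.*"\([^"]*\)".*/\1/' |
        grep -v -e '^#' -e '^ftp:'
}

# Hypothetical sample page with two links on each line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p><a href="http://example.org/a.html">A</a> <a href="#top">top</a></p>
<p><a href="ftp://example.org/file">F</a> <a href="page.php">P</a></p>
EOF

extract_links "$sample"    # prints http://example.org/a.html and page.php
rm -f "$sample"
```

Passing the file name to grep directly also avoids the extra cat process.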

Mysp
Posts: 47
Joined: Mon 08 Jun 2009, 10:39
Location: Czech Republic


#2 Post by Mysp »

I am a beginner with regular expressions. I spent two hours trying to construct this myself, or to find it in articles about regexps, without success. I would like an example like the first one, that is:

Code:

cat page.htm | grep -o 'href="[^"]*"'

but listing ONLY links ending with a certain extension, e.g. ".php" (NOT case sensitive), that is href="....php". It is probably easy, but not for me.

Thank you

MU
Posts: 13649
Joined: Wed 24 Aug 2005, 16:52
Location: Karlsruhe, Germany

#3 Post by MU »

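Code:

cat default.html | grep -o 'href="[^"]*"' | grep -i "\.php[\">]" | sed 's/.*"\([^"]*\)".*/\1/'

This lists links in which .php is followed by
"
or
>

grep -i
- search case-insensitively

\.php
- a dot followed by php
- an unescaped dot is itself a regular expression (it matches any character), so it must be escaped with a backslash to be taken as a literal dot.

[\">]
- square brackets hold a set of characters; the expression matches any one of them.
- no | is needed between them: inside square brackets | is an ordinary character, not "or".
- \" means: a quotation mark. As it sits inside the double quotes of the grep argument, it must be escaped with a backslash.
- because grep -o 'href="[^"]*"' already ends every match at the closing quotation mark, only the " alternative can actually occur here.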

http://tldp.org/LDP/Bash-Beginners-Guid ... ap_04.html
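
Because grep -o 'href="[^"]*"' ends every match at the closing quotation mark, the extension test can also be anchored to the end of the line. A minimal sketch, assuming the extension consists of plain letters and digits; the function name links_with_ext and the sample page are invented for the example:

```sh
#!/bin/sh
# links_with_ext FILE EXT: print href targets in FILE whose URL ends in
# .EXT, compared case-insensitively. The function name is illustrative.
links_with_ext() {
    grep -o 'href="[^"]*"' "$1" |
        grep -i "\\.$2\"\$" |
        sed 's/.*"\([^"]*\)".*/\1/'
}

# Hypothetical sample page.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="index.PHP">1</a> <a href="notes.php.txt">2</a>
<a href="view.php">3</a> <a href="style.css">4</a>
EOF

links_with_ext "$sample" php    # prints index.PHP and view.php
rm -f "$sample"
```

The anchored pattern rejects notes.php.txt, where .php appears in the middle of the name rather than at the end.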

Mark

big_bass
Posts: 1740
Joined: Mon 13 Aug 2007, 12:21

#4 Post by big_bass »

Here is another tool that does something similar, followed by an example of how to reformat the result into something useful for the forum.

Note that HREF= is in capital letters, because this was made for SeaMonkey bookmarks.

What it does:
1.) finds the SeaMonkey bookmarks that you made
2.) strips all the HTML code from the links
3.) saves a copy of the stripped links
4.) reads the stripped links one by one and reformats them with custom code, so that you can post them in the forum with the correct formatting for "click_here" links. All you do is click on the script; everything is generated for you automatically :D

*There is an optional message string that you can edit to say download_here or something else if you need.

Code:

#!/bin/sh
#code from Joe Arose big_bass built for special use to bulk make lists 

#this is a special tool to remove the links from the html formatted bookmarks from seamonkey 
#with the end goal of easily making the correct formatting for the forum also


#this removes the bookmark links from the html code for seamonkey 
#It will get read then reformatted again 


grep 'HREF="http://' /root/.mozilla/default/*.slt/bookmarks.html| cut -f 2 -d '"'>/root/bookmarks_stripped.txt


#-----------------------------------------------
message=click_here
#message=download_here
: >/root/bookmarks_post    # start with an empty output file
# read one URL per line and wrap it in forum link tags
while IFS= read -r i
do printf '[url=%s]%s[/url]\n' "$i" "$message" >>/root/bookmarks_post
done </root/bookmarks_stripped.txt

/root/bookmarks_stripped.txt contains only the links, with no HTML code.
/root/bookmarks_post contains the links in the "click here" format for posting in the forum.
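
The two steps (cut the URL out at the quotes, then wrap it in forum tags) can be tried without touching any real bookmarks file. A small sketch: the function name to_bbcode and the sample bookmark line are made up for the illustration, and the loop reads line by line instead of word by word.

```sh
#!/bin/sh
# to_bbcode LABEL: read one URL per line on stdin and print a forum
# link for each. The function name is illustrative.
to_bbcode() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip empty lines
        printf '[url=%s]%s[/url]\n' "$url" "$1"
    done
}

# Hypothetical bookmark line in the SeaMonkey style, cut at the quotes.
printf '%s\n' '<DT><A HREF="http://example.org/" ADD_DATE="1">Example</A>' |
    grep 'HREF="http://' | cut -f 2 -d '"' | to_bbcode click_here
# prints [url=http://example.org/]click_here[/url]
```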


I also made a drag-and-drop URL script with a GUI, which I posted in the slaxer_pup thread.


*With Firefox 3, the bookmarks need to be exported first before you can format them.*


Joe

Bruce B

#5 Post by Bruce B »

This code is specifically for getting nicely formatted,
ready-to-use links (not just URLs) from the SeaMonkey
bookmarks file.


Here is the script, without comments. If you have questions, ask.

Code:

#!/bin/bash

main() {

    vars
    process_file
    view_output

}

vars() {

    bm=`find /root/.mozilla/default -name bookmarks.html | head -n 1`
    [ -z "$bm" ] && echo "didn't find bookmarks.html" && exit 1
    dest=/root/my-documents/seamonkey-links.html

}

process_file() {

    whiteout() {
        while read f
        do echo $f
        done
    }

    <$bm dos2unix | grep HREF | whiteout \
    | cut -b 5-300 | getlinks-bin | whiteout \
    | sort >>$dest

}

view_output() {

    viewer=`which less` || viewer=more    # fall back to more if less is missing
    "$viewer" "$dest"

}

main
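
The whiteout helper in the script relies on the shell's word splitting: read strips leading and trailing blanks, and the unquoted echo $f collapses inner runs of blanks to single spaces. An unquoted $f is also subject to wildcard expansion, so the following sketch does the same squeezing with globbing switched off. The name squeeze_ws is made up here, and this is a variation for illustration, not the script's own code.

```sh
#!/bin/sh
# squeeze_ws: strip leading/trailing blanks and collapse inner runs of
# blanks to one space, like whiteout, but without expanding wildcards.
squeeze_ws() {
    while read -r line; do
        set -f                 # disable globbing while splitting
        set -- $line           # split the line on whitespace
        set +f
        printf '%s\n' "$*"     # rejoin the words with single spaces
    done
}

printf '   a   b  \n' | squeeze_ws    # prints: a b
```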

The C source for the binary used in the pipe (getlinks-bin), with comments

Code:

/*	C source by Bruce B (Puppy forum)  */
/*	distributed as GPL2 licensed       */
/*	source code                        */
	
/*	original filename: getlinks-bin    */

/*	purpose: to assist in stripping    */
/*	out characters from the seamonkey  */
/*	bookmarks.html file, resulting in  */
/*	nicely formatted html links        */

/*	intended usage is as a pipe with   */
/*	a bash script called getlinks      */

#include <stdio.h>

int main(void) {

    int ch, cnt = 0;

    /*  process input one   */
    /*  character at a time */

    while ((ch = getchar()) != EOF) {

        /* each line starts with cnt at 0, */
        /* increments at each quote, after */
        /* two quotes, no printing until   */
        /* the > character found           */

        if (ch == 0x22 && cnt < 2) {    /* 0x22 is the " character */
            putchar(ch);
            cnt++;
            continue;
        }

        /* don't print after second quote  */
        /* until we reach the > character  */

        if (ch != 0x3E && cnt >= 2)     /* 0x3E is the > character */
            continue;

        /* > character encountered:        */
        /* print it and set cnt negative   */

        if (ch == 0x3E) {
            cnt = -10;
            putchar(ch);
            continue;
        }

        /* at linefeed set counter to zero */
        /* print the <BR>+linefeed         */

        if (ch == 0x0A) {
            cnt = 0;
            printf("<BR>\n");
            continue;
        }

        /* if no other conditions were     */
        /* encountered from above, we      */
        /* will filter three weird bytes   */
        /* (0xE2 0x80 0x94, the UTF-8      */
        /* encoding of U+2014), and print  */
        /* the rest                        */

        if (ch != 0xE2 && ch != 0x80 && ch != 0x94)
            putchar(ch);

    }

    return 0;
}
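
For the common case of one bookmark per line, the filtering that the C program does can be approximated with sed. This is only a sketch for comparison, not a port: it does not drop the three special bytes, and the function name strip_attrs and the sample line are invented for the example.

```sh
#!/bin/sh
# strip_attrs: keep everything up to and including the first quoted
# value, drop the remaining attributes up to the first >, and append
# <BR> to each line. An approximation of getlinks-bin, not a port.
strip_attrs() {
    sed -e 's/^\([^"]*"[^"]*"\)[^>]*>/\1>/' -e 's/$/<BR>/'
}

# Hypothetical bookmark line as it looks after cut -b 5-300.
printf '%s\n' '<A HREF="http://example.org/" ADD_DATE="1245900000" LAST_CHARSET="UTF-8">Example</A>' |
    strip_attrs
# prints <A HREF="http://example.org/">Example</A><BR>
```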

The attachment:

Code:

Archive:  getlinks.zip

 Length     Date   Time    Name
 --------    ----   ----    ----
      607  06-25-09 05:53   getlinks
     3000  06-25-09 05:53   getlinks-bin
     1549  06-25-09 06:04   getlinks-bin.c
 --------                   -------
     5156                   3 files
Enjoy

Attachment: getlinks.zip (2.76 KiB)

Bruce B

#6 Post by Bruce B »

Script for lower case tags

SeaMonkey bookmarks use tags like this:
<A HREF

If you prefer tags like this:
<a href

use this script instead. You can rename it to match the other
script, but I changed the name so you wouldn't have conflicts or
overwrites.

The script is shown below for your review, but it is better to use
the downloaded one, as we don't want any
unintentional white space after the \ characters.

Code:

#!/bin/bash

main() {

    vars
    process_file
    view_output

}

vars() {

    bm=`find /root/.mozilla/default -name bookmarks.html | head -n 1`
    [ -z "$bm" ] && echo "didn't find bookmarks.html" && exit 1
    dest=/root/my-documents/seamonkey-links.html
}

process_file() {

    whiteout() {
        while read f
        do echo $f
        done
    }

    <$bm dos2unix | grep HREF | whiteout \
    | cut -b 5-300 | getlinks-bin | whiteout \
    | sort | sed "s/A HREF/a href/g" \
    | sed "s/\/A/\/a/g" | sed "s/BR>$/br>/g" \
    >>$dest

}

view_output() {

    viewer=`which less` || viewer=more    # fall back to more if less is missing
    "$viewer" "$dest"

}

main

Attachment: getlinks-lcase.zip (499 Bytes)
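
One thing to watch with the lower-casing step: the pattern \/A matches any /A, so a URL such as http://example.org/About would be changed to /about as well. A hedged sketch that matches only the tag markup, using a single sed call; the function name lowercase_tags and the sample line are made up for the illustration:

```sh
#!/bin/sh
# lowercase_tags: lower-case the <A HREF=...>, </A> and trailing <BR>
# markup only, leaving the URL and the link text alone.
lowercase_tags() {
    sed -e 's/<A HREF=/<a href=/g' \
        -e 's/<\/A>/<\/a>/g' \
        -e 's/<BR>$/<br>/'
}

printf '%s\n' '<A HREF="http://example.org/About">About</A><BR>' | lowercase_tags
# prints <a href="http://example.org/About">About</a><br>
```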

big_bass
Posts: 1740
Joined: Mon 13 Aug 2007, 12:21

#7 Post by big_bass »

You could also just do this in the console:

Code:

gtkmoz /root/.mozilla/default/*.slt/bookmarks.html
***********************************************************

Bruce, thanks for posting.
I first tried copying and pasting your code and then compiling it.
I did see that you attached the binary already compiled;
I just like to go through the steps.
It needs just one more space on the last line.

So I copied the above text and saved it as getlinks-bin.c (remember to add one space at the end),

then compiled it at the console:

Code:

 gcc getlinks-bin.c -o getlinks-bin

It worked, outputting to /root/my-documents/seamonkey-links.html.


Having some tools for formatting HTML
is a good project.

Joe

Bruce B

#8 Post by Bruce B »

big_bass wrote: I first tried copying and pasting your code then compile it
I did see that you attached the bin already compiled
I just like to go through the steps
it needs just one more space on the last line
Joe,

Thanks for the compliment.

* The idea I had was that anyone compiling would do it from
the attachment, which has two line feeds at the end of the
source.

* The SeaMonkey bookmark file has lots of junk between
the <A HREF and the first >, as in lots of junk.

We need part of that information, of course, but the rest is
junk for making a links file. And the junk was not
necessarily consistent either, which made it harder to
figure out how to keep what I wanted and discard the rest.

So:

I got the idea of checking each character as it goes
through the pipe, and making decisions based on the
character. It works and the pipe is fast.

Bruce

big_bass
Posts: 1740
Joined: Mon 13 Aug 2007, 12:21

#9 Post by big_bass »

Hey Bruce


I was playing around a bit today with showing just the stripped links
and having them work as HTML at the same time,
so that you can use "copy link location" or drag and drop a URL.

You have to run the first script I wrote so that you have the correct input file.
Afterwards you can change the input file to any list of clean URLs
and make your own quick "clickable" reference links.

Run the script from post #4 first, so that an already formatted list is generated for you automatically (it writes /root/bookmarks_stripped.txt, which the next script reads).



Now run this new script, which turns the URLs into HTML:

Code:

#!/bin/sh
echo '
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of </title>
 </head>
 <body>
<h1>Index of  new_index_list</h1>
<pre><img src="/usr/local/lib/X11/pixmaps/archive48.png" alt="Icon "> <a href="?C=N;O=D">Name</a>' >/root/new_index_list


cd /root/
# read one URL per line and emit a link with a quoted href value
while IFS= read -r i
do printf '<a href="%s">%s</a>\n' "$i" "$i" >>/root/new_index_list
done </root/bookmarks_stripped.txt

echo '<hr></pre>
<address>Apache Server at distro.ibiblio.org Port 80</address>
</body></html>' >>/root/new_index_list



Then click on the file /root/new_index_list to see what happened :D
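
The index-making step on its own can be sketched as a filter that reads URLs and writes a page. Everything here is an illustration: the function name urls_to_html, the title and the sample URLs are made up, and the sketch assumes the URLs contain no characters that need HTML escaping (such as & or ").

```sh
#!/bin/sh
# urls_to_html TITLE: read one URL per line on stdin and print a small
# HTML page with one clickable link per URL. The name is illustrative.
urls_to_html() {
    printf '<html>\n<head><title>%s</title></head>\n<body>\n<pre>\n' "$1"
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip empty lines
        printf '<a href="%s">%s</a>\n' "$url" "$url"
    done
    printf '</pre>\n</body>\n</html>\n'
}

printf '%s\n' 'http://example.org/' 'ftp://example.org/pub/' | urls_to_html 'My links'
```

Quoting the href value keeps the link intact even when the URL contains characters such as = or a space.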

Joe

Bruce B

#10 Post by Bruce B »

Joe,

Sure, I'm happy to test the script, but can you fix it so it doesn't filter out saved ftp and https sites?

Bruce

Bruce B

#11 Post by Bruce B »

Joe,

It works great, do you want to do more work on it?

Bruce

big_bass
Posts: 1740
Joined: Mon 13 Aug 2007, 12:21

#12 Post by big_bass »

Hey Bruce

Note: this is safe. It doesn't change any of your original files;
it just makes new ones with different names.

Thanks for reminding me, I forgot ftp and https.
It was no problem to add them to the code.
I rolled both scripts together to make it a one-clicker,

and I removed the "click here" formatting; I'll keep that on its own as a special
script.

It now opens the viewer, to make it user friendly.

Thanks for the feedback 8)

Joe

Code:

#!/bin/sh

#code from Joe Arose big_bass built for special use
#to strip URL's from seamonkey then generate a custom index

#with the end goal of easily making the correct formatting
#simple and clean 
#as just the URL's then generate a new index 

#/root/new_index_list  #this is the index generated 


#added in ftp and https filtering   6-29-09 

#-----------------strip URL'S------------------------------
rm -f /root/bookmarks_stripped.txt #start clean, no error if the file is missing

grep 'HREF="http://' /root/.mozilla/default/*.slt/bookmarks.html| cut -f 2 -d '"'>>/root/bookmarks_stripped.txt
grep 'HREF="https://' /root/.mozilla/default/*.slt/bookmarks.html| cut -f 2 -d '"'>>/root/bookmarks_stripped.txt
grep 'HREF="ftp://' /root/.mozilla/default/*.slt/bookmarks.html| cut -f 2 -d '"'>>/root/bookmarks_stripped.txt



#---------------index maker--------------------------------
echo '
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of </title>
 </head>
 <body>
<h1>Index of  new_index_list</h1>
<pre><img src="/usr/local/lib/X11/pixmaps/archive48.png" alt="Icon "> <a href="?C=N;O=D">Name</a>' >/root/new_index_list


cd /root/
# read one URL per line and emit a link with a quoted href value
while IFS= read -r i
do printf '<a href="%s">%s</a>\n' "$i" "$i" >>/root/new_index_list
done </root/bookmarks_stripped.txt

echo '<hr></pre>
<address>Apache Server at distro.ibiblio.org Port 80</address>
</body></html>' >>/root/new_index_list

gtkmoz /root/new_index_list

Attachment: seamonkey_bookmarks_reformatted.tar.gz (768 Bytes)
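
The three grep passes for http, https and ftp could also be folded into one extended regular expression, which keeps the URLs in bookmark-file order instead of grouping them by scheme. A minimal sketch: the function name bookmark_urls and the sample file are invented for the example.

```sh
#!/bin/sh
# bookmark_urls FILE...: print the http, https and ftp URLs found in
# HREF attributes, in file order. The function name is illustrative.
bookmark_urls() {
    grep -E 'HREF="(https?|ftp)://' "$@" | cut -f 2 -d '"'
}

# Hypothetical bookmarks file in the SeaMonkey style.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<DT><A HREF="https://example.org/" ADD_DATE="1">Secure</A>
<DT><A HREF="javascript:void(0)" ADD_DATE="2">Bookmarklet</A>
<DT><A HREF="ftp://example.org/pub/" ADD_DATE="3">Files</A>
EOF

bookmark_urls "$sample"    # prints https://example.org/ and ftp://example.org/pub/
rm -f "$sample"
```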
