By Marco Fioretti
January 2, 2012, 9:00 AM PST
Takeaway: Marco Fioretti shows two examples of shell functions that you can use for web scraping when all you need is a quick way to extract text from a given website.
Use shell functions to fetch information online
I don't know if this will be useful or even belongs in the Programming section. I just saw it and thought it looked like it might.
- technosaurus
If anyone wants more examples, I have written quite a few for web scraping. L18L is using my Google Translate code for localizing shell scripts, jpeps has started using my Yahoo Finance example, and Barry incorporated my Google search grokking into Puppy's alternative man command after die.net changed their formatting. There are a lot more, but that is all I can remember.
here is the basic process:
use the forms to get the appropriate results (keep a note of what does what)
save and open the html of the page and look for <form> .... </form>
(you will need to add the website and any subdirectories to the "action")
grok the hell out of that till you get it down to a minimum
[stop here if you just want to use it in a web page]
each name=name1 value=value1 pair in the form translates to a corresponding name1=value1 in the URL (the first one follows a ?, the rest are joined with &)
you can simulate the form being submitted by opening a browser to:
<URLofpage><action>?name1=value1&name2=value2....
[stop here if you just want to use it to get a page]
if that works, try it with wget (you may need to add -U firefox to wget to get past anti-crawler blocks)
if you send wget's output to stdout (-O -), you can pipe it through sed, grep, cut, etc. to format it however you like
[see other tutorials for various types of formatting]
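
Here is a rough sketch of the steps above as shell functions. It is not the code from any of the examples mentioned earlier, just a minimal illustration. The function names (build_query_url, fetch_page, strip_tags) are made up for this post, and it assumes the form's values need no URL-encoding.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Join name=value pairs into <URLofpage><action>?name1=value1&name2=value2
# Values are used as-is; anything with spaces or special characters would
# need URL-encoding first, which this sketch does not attempt.
build_query_url() {
    url="$1$2"      # $1 = site URL, $2 = the form's "action"
    shift 2
    sep='?'         # first pair follows ?, later pairs follow &
    for pair in "$@"; do
        url="$url$sep$pair"
        sep='&'
    done
    printf '%s\n' "$url"
}

# Fetch a page to stdout: -q silences progress output, -O - writes the
# page to stdout, -U sets the user agent string.
fetch_page() {
    wget -q -U firefox -O - "$1"
}

# Crude formatting step: delete anything that looks like an HTML tag.
# Tags that span several lines will not be caught.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}
```

To try it, source the file and run something like `fetch_page "$(build_query_url "http://www.example.com" "/search" "q=puppy")" | strip_tags | grep -i puppy`, where the site, action and field name are placeholders for whatever you found in the real form.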
Check out my github repositories (https://github.com/technosaurus). I may eventually get around to updating my blogspot (http://bashismal.blogspot.com).