
I am trying to parse a fairly simple web page for information in a shell script. The page I'm working with is generated at http://aruljohn.com/details.php. For example, I would like to pull the internet service provider information into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet, or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"
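
For example, with XMLStarlet (one of the tools mentioned above) the placeholder might be filled like this. This is only a sketch: the fo -H -R HTML-recovery step, the use of - for standard input, and the XPath expression are all assumptions about the tool's flags and the page's current layout.

# Sketch: recover the HTML as well-formed XML, then query it (- reads stdin).
ISP="$(printf '%s' "$HTMLISPInformation" \
  | xmlstarlet fo -H -R - 2>/dev/null \
  | xmlstarlet sel -t -v '//td[text()="Internet Provider"]/following-sibling::td' - 2>/dev/null)"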

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/


5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html

You can find the XPath of your node using Chrome's Developer Tools: inspect the node, right-click it, and select Copy XPath. Note that DevTools shows the browser's DOM, which inserts <tbody> elements the raw HTML may not contain, so the copied expression can need adjusting.

I wouldn't rely on this too much, though; it is not very robust.

All the information on your page can be found elsewhere anyway: run whois on your own IP, for instance.
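
To land the value in a shell variable, the same query can be run over standard input. A sketch, assuming the live page keeps this table layout: the row/column indexes are guesses, the tbody from the Chrome-copied path is dropped since the raw HTML may not contain it (see the comments below), and /text() plus the trailing - for stdin follow xmllint's documented usage.

# Sketch: fetch the page and evaluate the XPath directly; parser warnings silenced.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
  | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
echo "$ISP"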


2 Comments

I get "XPath set is empty", even piping through `sed -e 's/xmlns=".*"//g'``
I had some success with specifying full path from root /html/body/table/tr/td ... and also removing the <!DOCTYPE directive.

You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without root access.

It can directly read the value from the webpage without involving other programs.

With XPath:

xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'

Or with pattern-matching:

xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
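
To capture the result in a shell variable, per the original question (a sketch; --silent is xidel's flag for suppressing status output):

ISP="$(xidel --silent http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td')"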



Consider using PhantomJS. It is a headless WebKit browser that lets you execute JavaScript/CoffeeScript against a web page, which could help you solve your issue.

pjscrape is a useful web-scraping tool built on top of PhantomJS.
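
Driven from the shell, PhantomJS might be used roughly like this. A sketch only: the cell-scanning logic is an assumption about the page's layout, while the webpage-module calls (create, open, evaluate) follow PhantomJS's documented API.

# Sketch: write a small PhantomJS script, run it, capture its stdout.
cat > /tmp/isp.js <<'EOF'
// Open the page and return the cell that follows the "Internet Provider" label.
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    var isp = page.evaluate(function () {
        var cells = document.getElementsByTagName('td');
        for (var i = 0; i < cells.length - 1; i++) {
            if (cells[i].textContent.trim() === 'Internet Provider') {
                return cells[i + 1].textContent.trim();
            }
        }
        return '';
    });
    console.log(isp);
    phantom.exit(0);
});
EOF
ISP="$(phantomjs /tmp/isp.js)"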

2 Comments

Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint.
Do you need root access? You could just copy it into your user folder and run it from there.

xpup

XML

A command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain

See the related feature request for XPath support.

Installation

Install by: go get github.com/ericchiang/pup.
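
Applied to the original question, pup's text{} output can be combined with standard grep/sed to pull the ISP into a variable. A sketch; the cell label and its position relative to the value are assumptions about the page's layout.

# Sketch: dump all table-cell text, take the line after the label, trim whitespace.
ISP="$(curl -s http://aruljohn.com/details.php \
  | pup 'td text{}' \
  | grep -A1 'Internet Provider' \
  | tail -n1 \
  | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')"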



HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect, which matches CSS selectors).

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
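
Combining the two for the original question might look like this. A sketch only: hxnormalize -x rewrites the HTML as well-formed XML first, and the -q/-e flags and reading from standard input are assumptions based on the Debian (libxml-xpath-perl) build of xpath.

# Sketch: clean the HTML into XML, then evaluate the XPath quietly.
ISP="$(curl -s http://aruljohn.com/details.php \
  | hxnormalize -x 2>/dev/null \
  | xpath -q -e '//td[text()="Internet Provider"]/following-sibling::td/text()' 2>/dev/null)"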

Related: Command line tool to query HTML elements (on Super User)

