
I am trying to parse a fairly simple web page for information in a shell script. The page I'm working with is generated at http://aruljohn.com/details.php. For example, I would like to pull the internet service provider information into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet, or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"
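
For example, with XMLStarlet (one of the tools mentioned above) the placeholder might be filled like this. This is only a sketch: the fo -H -R HTML-recovery step, the use of - for standard input, and the XPath expression are all assumptions about the tool's flags and the page's current layout.

# Sketch: recover the HTML as well-formed XML, then query it (- reads stdin).
ISP="$(printf '%s' "$HTMLISPInformation" \
  | xmlstarlet fo -H -R - 2>/dev/null \
  | xmlstarlet sel -t -v '//td[text()="Internet Provider"]/following-sibling::td' - 2>/dev/null)"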

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/


5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html

You can find the XPath of your node using Chrome's Developer Tools: inspect the node, right-click it, and select Copy XPath. Note that DevTools shows the browser's DOM, which inserts <tbody> elements the raw HTML may not contain, so the copied expression can need adjusting.

I wouldn't rely on this too much, though; it is not very robust.

All the information on your page can be found elsewhere anyway: run whois on your own IP, for instance.
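
To land the value in a shell variable, the same query can be run over standard input. A sketch, assuming the live page keeps this table layout: the row/column indexes are guesses, the tbody from the Chrome-copied path is dropped since the raw HTML may not contain it (see the comments below), and /text() plus the trailing - for stdin follow xmllint's documented usage.

# Sketch: fetch the page and evaluate the XPath directly; parser warnings silenced.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
  | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
echo "$ISP"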


2 Comments

I get "XPath set is empty", even piping through `sed -e 's/xmlns=".*"//g'``
I had some success with specifying full path from root /html/body/table/tr/td ... and also removing the <!DOCTYPE directive.

You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without root access.

It can directly read the value from the webpage without involving other programs.

With XPath:

xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'

Or with pattern-matching:

xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
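
To capture the result in a shell variable, per the original question (a sketch; --silent is xidel's flag for suppressing status output):

ISP="$(xidel --silent http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td')"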



Consider using PhantomJS. It is a headless WebKit browser that lets you execute JavaScript/CoffeeScript against a web page, which could help you solve your issue.

pjscrape is a useful web-scraping tool built on top of PhantomJS.
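
Driven from the shell, PhantomJS might be used roughly like this. A sketch only: the cell-scanning logic is an assumption about the page's layout, while the webpage-module calls (create, open, evaluate) follow PhantomJS's documented API.

# Sketch: write a small PhantomJS script, run it, capture its stdout.
cat > /tmp/isp.js <<'EOF'
// Open the page and return the cell that follows the "Internet Provider" label.
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    var isp = page.evaluate(function () {
        var cells = document.getElementsByTagName('td');
        for (var i = 0; i < cells.length - 1; i++) {
            if (cells[i].textContent.trim() === 'Internet Provider') {
                return cells[i + 1].textContent.trim();
            }
        }
        return '';
    });
    console.log(isp);
    phantom.exit(0);
});
EOF
ISP="$(phantomjs /tmp/isp.js)"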

2 Comments

Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint.
Do you need root access? You could just copy it into your user folder and run it from there.

xpup

XML

A command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain

See the related feature request for XPath support.

Installation

Install by: go get github.com/ericchiang/pup.
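
Applied to the original question, pup's text{} output can be combined with standard grep/sed to pull the ISP into a variable. A sketch; the cell label and its position relative to the value are assumptions about the page's layout.

# Sketch: dump all table-cell text, take the line after the label, trim whitespace.
ISP="$(curl -s http://aruljohn.com/details.php \
  | pup 'td text{}' \
  | grep -A1 'Internet Provider' \
  | tail -n1 \
  | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')"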



HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect, which matches CSS selectors).

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
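
Combining the two for the original question might look like this. A sketch only: hxnormalize -x rewrites the HTML as well-formed XML first, and the -q/-e flags and reading from standard input are assumptions based on the Debian (libxml-xpath-perl) build of xpath.

# Sketch: clean the HTML into XML, then evaluate the XPath quietly.
ISP="$(curl -s http://aruljohn.com/details.php \
  | hxnormalize -x 2>/dev/null \
  | xpath -q -e '//td[text()="Internet Provider"]/following-sibling::td/text()' 2>/dev/null)"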

Related: Command line tool to query HTML elements (on Super User)

