Parse Wiley Online Library

Question

I would like to extract the DOIs of all chapters from Ullmann's Encyclopedia of Industrial Chemistry with Python and BeautifulSoup.

So from

<h2 class="meta__title meta__title__margin"><span class="hlFld-Title"><a href="/doi/10.1002/14356007.c01_c01.pub2">Aerogels</a></span></h2>

I would like to get "Aerogels" and "/doi/full/10.1002/14356007.c01_c01.pub2"

Bigger sample:

     <ul class="chapter_meta meta__authors rlist--inline comma">
        <li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=H%C3%BCsing%2C+Nicola"><span>Nicola Hüsing</span></a></span></li>
        <li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=Schubert%2C+Ulrich"><span>Ulrich Schubert</span></a></span></li>
     </ul><span class="meta__epubDate"><span>First published: </span>15 December 2006</span><div class="content-item-format-links">
        <ul class="rlist--inline separator">
           <li><a title="Abstract" href="/doi/abs/10.1002/14356007.c01_c01.pub2">Abstract</a></li>
           <li><a title="Full text" href="/doi/full/10.1002/14356007.c01_c01.pub2">
                 Full text
                 </a></li>

For the title I've tried:

span['hlFld-Title'].a

For the DOI I've tried:

for link in soup.find_all('a'.title):
    print(link.get('href'))

But sadly I'm a full noob (fool) and it doesn't work.

The URLs are https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={1..59}

Thanks for any help.

agmoermann · Accepted Answer · 2018-04-18 14:50:42Z

Here is a quick solution that prints "DOI;title" pairs to the command line:

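import requests
from bs4 import BeautifulSoup

# startPage runs 0..58 here; the question lists 1..59, so use range(1, 60) if paging is 1-based
for i in range(59):
    page = requests.get("https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={}".format(i))

    soup = BeautifulSoup(page.content, 'lxml')

    # each chapter title sits in <span class="hlFld-Title"><a href="...">Title</a></span>
    content = soup.find_all("span", class_="hlFld-Title")

    for c in content:
        # print "DOI link;title", one chapter per line
        print(c.a.get('href') + ";" + c.get_text())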
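The loop above prints the title link as it appears in the page, while the question asks for the "/doi/full/..." form. A minimal offline sketch of that conversion follows, run against the sample markup from the question instead of the live site. The helper names full_text_path and extract_chapters are hypothetical, and the "/doi/<DOI>" to "/doi/full/<DOI>" pattern is an assumption read off the sample links, not a documented Wiley rule.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Sample markup from the question (href written as a site-relative path).
SAMPLE_HTML = """
<h2 class="meta__title meta__title__margin"><span class="hlFld-Title">
<a href="/doi/10.1002/14356007.c01_c01.pub2">Aerogels</a></span></h2>
"""


def full_text_path(href):
    """Turn a '/doi/<DOI>' link into '/doi/full/<DOI>' (assumed URL pattern)."""
    path = urlparse(href).path        # drops scheme and host if present
    doi = path.split("/doi/", 1)[1]   # everything after the '/doi/' prefix
    return "/doi/full/" + doi


def extract_chapters(html):
    """Return (title, full-text path) pairs for every chapter title link."""
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    # CSS selector: <a> elements nested inside <span class="hlFld-Title">
    for link in soup.select("span.hlFld-Title a[href]"):
        title = link.get_text(strip=True)
        chapters.append((title, full_text_path(link["href"])))
    return chapters


if __name__ == "__main__":
    for title, path in extract_chapters(SAMPLE_HTML):
        print(title, path)
```

Passing page.content from the loop above to extract_chapters would apply the same extraction to each listing page. The selector only matches title links, so the "Abstract" and "Full text" links further down each entry are not picked up.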

