I want to find all links in a div, for example:

<div>
  <a href="#0"></a>
  <a href="#1"></a>
  <a href="#2"></a>
</div>

So I wrote a function as follows:

def get_links(div):
    links = []
    if div.tag == 'a':
        links.append(div)
        return links   
    else:
        for a in div:
            links + get_links(a)
        return links

Why is the result [] rather than [a, a, a]?

I suspect this has to do with list references. Could you explain in some detail?

This is the complete module:

import lxml.html


def get_links(div):
    links = []
    if div.tag == 'a':
        links.append(div)
        return links   
    else:
        for a in div:
            links + get_links(a)
        return links


if __name__ == '__main__':

    fragment = '''
        <div>
          <a href="#0">1</a>
          <a href="#1">2</a>
          <a href="#2">3</a>
        </div>'''
    fragment = lxml.html.fromstring(fragment)
    links = get_links(fragment)    # <---------------

  • Try changing links + get_links(a) to links += get_links(a). Commented Jan 5, 2015 at 8:04
  • If you don't change links, who else should do it? Commented Jan 5, 2015 at 8:17
  • Yes, this is the right way. Thanks. I meant to write += but forgot, and because I believed I had written += I didn't spot the error. I assumed it was a list reference problem. Commented Jan 5, 2015 at 8:26

3 Answers

List addition in Python returns a new list obtained by concatenating the arguments; it doesn't change them:

x = [1, 2, 3, 4]
print(x + [5, 6])  # displays [1, 2, 3, 4, 5, 6]
print(x)           # here x is still [1, 2, 3, 4]

To change the list in place you can use the extend method:

x.extend([5, 6])

or +=:

x += [5, 6]

The latter is, in my opinion, a bit "strange" because it is a case in which x = x + y is not the same as x += y, so I prefer to avoid it and make the in-place extension explicit with extend.

For your code

links = links + get_links(a)

would also be acceptable, but remember that it does a different thing: it allocates a new list holding the concatenation and then rebinds the name links to it. It doesn't change the original object that links referenced:

x = [1, 2, 3, 4]
y = x
x = x + [5, 6]
print(x)   # displays [1, 2, 3, 4, 5, 6]
print(y)   # displays [1, 2, 3, 4]

but

x = [1, 2, 3, 4]
y = x
x += [5, 6]
print(x)   # displays [1, 2, 3, 4, 5, 6]
print(y)   # displays [1, 2, 3, 4, 5, 6]
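
The same distinction matters when a list is passed to a function, which is where the "list reference" question usually bites. Below is a small sketch; the function names extend_in_place and rebind are made up for illustration and are not part of the question's code.

```python
def extend_in_place(items, extra):
    # += on a list mutates the object the caller passed in.
    items += extra


def rebind(items, extra):
    # + builds a new list; only the local name `items` is rebound,
    # so the caller's list is left untouched.
    items = items + extra
    return items


a = [1, 2]
extend_in_place(a, [3])
print(a)          # displays [1, 2, 3]

b = [1, 2]
result = rebind(b, [3])
print(b)          # displays [1, 2]
print(result)     # displays [1, 2, 3]
```

In the question's get_links, the result of links + get_links(a) is never bound to any name at all, so it is simply discarded.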

If the tag is not 'a', your code effectively does this:

# You create an empty list

links = []
for a in div:
    # You combine <links> with result of get_links() but you do not assign it to anything
    links + get_links(a)
# So you return an empty list   
return links

You should replace + with +=:

links += get_links(a)

Or use extend()

links.extend(get_links(a))
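
Putting the fix together, here is a minimal sketch of the corrected function. As an assumption to keep it dependency-free, it uses the standard library's xml.etree.ElementTree instead of lxml; elements in both libraries expose .tag and can be iterated to get their children. The fragment is the one from the question.

```python
import xml.etree.ElementTree as ET


def get_links(element):
    """Recursively collect every <a> element at or below `element`."""
    links = []
    if element.tag == 'a':
        links.append(element)
    else:
        for child in element:
            # extend() mutates `links` in place; a bare `links + ...`
            # would build a new list and immediately throw it away.
            links.extend(get_links(child))
    return links


fragment = ET.fromstring('''
    <div>
      <a href="#0">1</a>
      <a href="#1">2</a>
      <a href="#2">3</a>
    </div>''')
links = get_links(fragment)
print([a.get('href') for a in links])   # displays ['#0', '#1', '#2']
```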


Another option is to use the xpath method to get all a tags inside the div at any level.

Code:

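from lxml import etree
# content is the markup to search; here, the fragment from the question
content = '<div><a href="#0">1</a><a href="#1">2</a><a href="#2">3</a></div>'
root = etree.fromstring(content)
print(root.xpath('//div//a'))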

Output:

[<Element a at 0xb6cef0cc>, <Element a at 0xb6cef0f4>, <Element a at 0xb6cef11c>]
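
If XPath is more than you need, a similar result can be had with the iter method, which walks the whole subtree in document order. This is a sketch that assumes the standard library's xml.etree.ElementTree so it runs without lxml installed; lxml elements provide an iter method as well. The sample markup is made up to show a nested link.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><a href="#0">1</a><p><a href="#1">2</a></p></div>')

# iter('a') visits the element itself and all descendants,
# so links nested inside other tags are found too.
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)   # displays ['#0', '#1']
```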

2 Comments

Your code only returns a tags that are direct children of the div tag. '//div//a' is better.
@infgeoax: Yes, agreed. I updated the code to get a tags from the div at any level. Thanks.
