I am trying to scrape an image URL from Zara, but the only thing I get back is the URL of the transparent background.

This is the link I'm trying to scrape: https://static.zara.net/photos///2022/V/0/1/p/9598/176/406/2/w/850/9598176406_1_1_1.jpg?ts=1640187784252

This is the link I keep getting: https://static.zara.net/stdstatic/1.249.0-b.13/images/transparent-background.png

Any ideas? This is my code. Thank you in advance! Note: I used extract() for the image instead of extract_first() to see whether there were several links, but they are all the same.

import scrapy
from scrapy.linkextractors import LinkExtractor

from Zara.items import Producto


class ZaraSpider(scrapy.Spider):
    name = 'zara'
    allowed_domains = ['zara.com']
    start_urls = [
        'https://www.zara.com/es/es/jersey-punto-cuello-subido-p09598176.html'
    ]

    def parse(self, response):
        producto = Producto()

        # Extract the links
        links = LinkExtractor(
            allow_domains=['zara.com'],
            restrict_xpaths=["//a"],
            allow="/es/es/"
        ).extract_links(response)

        outlinks = []  # List of all the links
        for link in links:
            url = link.url
            outlinks.append(url)  # Add the link to the list
            yield scrapy.Request(url, callback=self.parse)  # Issue the request

        product = response.xpath('//meta[@content="product"]').extract()
        if product:
            # Extract the URL, the product name, the description and the price
            producto['url'] = response.request.url
            producto['nombre'] = response.xpath('//h1[@class="product-detail-info__name"]/text()').extract_first()
            producto['precio'] = response.xpath('//span[@class="price__amount-current"]/text()').extract_first()
            producto['descripcion'] = response.xpath('//div[@class="expandable-text__inner-content"]//text()').extract_first()

            producto['imagen'] = response.xpath('//img[@class="media-image__image media__wrapper--media"]/@src').extract()
            #producto['links'] = outlinks

        yield producto
  • What is the start_urls list? Commented Dec 28, 2021 at 19:01
  • I edited the code; that is the full code now. Thank you! Commented Dec 29, 2021 at 9:28

1 Answer


The problem is that the image is loaded with JavaScript, so the src attribute in the raw HTML only holds a placeholder. Request the page with scrapy shell and inspect the response; you'll see that the image URL you want is available elsewhere, for example in the og:image meta tag:

import scrapy
from scrapy.linkextractors import LinkExtractor
# from Zara.items import Producto


class Producto(scrapy.Item):
    url = scrapy.Field()
    nombre = scrapy.Field()
    precio = scrapy.Field()
    descripcion = scrapy.Field()
    imagen = scrapy.Field()
    links = scrapy.Field()


class ZaraSpider(scrapy.Spider):
    name = 'zara'
    allowed_domains = ['zara.com']
    start_urls = [
        'https://www.zara.com/es/es/jersey-punto-cuello-subido-p09598176.html'
    ]

    def parse(self, response):
        producto = Producto()
    
        # Extract the links
        links = LinkExtractor(
            allow_domains=['zara.com'],
            restrict_xpaths=["//a"],
            allow="/es/es/"
        ).extract_links(response)
    
        outlinks = []   # List of all the links
        for link in links:
            url = link.url
            outlinks.append(url)    # Add the link to the list
            yield scrapy.Request(url, callback=self.parse)  # Issue the request

        product = response.xpath('//meta[@content="product"]').get()
        if product:
            # Extract the URL, the product name, the description and the price
            producto['url'] = response.request.url
            producto['nombre'] = response.xpath('//h1[@class="product-detail-info__name"]/text()').get()
            producto['precio'] = response.xpath('//span[@class="price__amount-current"]/text()').get()
            producto['descripcion'] = response.xpath('//div[@class="expandable-text__inner-content"]//text()').get()
            producto['imagen'] = response.xpath('//meta[@property="og:image"]/@content').get()
            #producto['links'] = outlinks
    
            yield producto
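
To see why the og:image selector in the spider above helps, here is a minimal standard-library sketch that is independent of Scrapy. The HTML is a hypothetical, heavily trimmed stand-in for what the server returns before any JavaScript runs (it is not Zara's actual markup), and ImageUrlCollector is a helper name made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the server-rendered page: the <img> still points
# at the placeholder, while the og:image <meta> tag carries the real URL.
SAMPLE_HTML = """
<html>
  <head>
    <meta property="og:image" content="https://static.zara.net/photos///2022/V/0/1/p/9598/176/406/2/w/850/9598176406_1_1_1.jpg?ts=1640187784252">
  </head>
  <body>
    <img class="media-image__image media__wrapper--media"
         src="https://static.zara.net/stdstatic/1.249.0-b.13/images/transparent-background.png">
  </body>
</html>
"""


class ImageUrlCollector(HTMLParser):
    """Collects every <img src> value and the og:image <meta> content."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "src" in attributes:
            self.img_srcs.append(attributes["src"])
        elif tag == "meta" and attributes.get("property") == "og:image":
            self.og_image = attributes.get("content")


collector = ImageUrlCollector()
collector.feed(SAMPLE_HTML)
print("img src values:", collector.img_srcs)
print("og:image:", collector.og_image)
```

If the real response looks like this sample, any selector on the img element can only return the placeholder, because the real src is filled in later by JavaScript that Scrapy never runs.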

BTW check out CrawlSpider.
