Scrapy-splash Can't find image source url

Question

I am trying to scrape a product page from ZARA, like this one: https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115

My scrapy-splash container is running. In the Scrapy shell I fetch the page:

fetch('http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115')
2021-05-14 14:30:42 [scrapy.core.engine] INFO: Spider opened
2021-05-14 14:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115> (referer: None)
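
One caveat about the request above: the product URL is passed to Splash without percent-encoding, so everything after the `&` is parsed as a separate query parameter of `render.html` rather than as part of the target URL. A minimal sketch of building the request URL with the standard library follows. The helper name `splash_render_url` is hypothetical, and the optional `wait` argument is only an assumption about what might help here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SPLASH_RENDER = "http://localhost:8050/render.html"
TARGET = (
    "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html"
    "?v1=108967877&v2=1718115"
)


def splash_render_url(target_url, wait=None):
    """Build a render.html URL with the target URL percent-encoded.

    Hypothetical helper for illustration; `wait` (seconds) is optional.
    """
    params = {"url": target_url}
    if wait is not None:
        params["wait"] = wait
    return SPLASH_RENDER + "?" + urlencode(params)


# Unencoded: the query string splits at "&", so "v2" is no longer
# part of the "url" parameter.
naive = parse_qs(urlsplit(SPLASH_RENDER + "?url=" + TARGET).query)
print(naive["url"][0])

# Encoded: the "url" parameter round-trips to the full target URL.
encoded = parse_qs(urlsplit(splash_render_url(TARGET)).query)
print(encoded["url"][0])
```

In the Scrapy shell the result would be used as `fetch(splash_render_url(TARGET))`.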

Everything works so far, and I am able to get the header and price. However, I also want to get the image URLs of the product.

I try to get them with:

response.css('img.media-image__image::attr(src)').getall()

But the result is this:

['https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png']

These are all the transparent placeholder image, not the real product images. The browser displays the images, and I can see them being loaded in the network requests. Is it because they are loaded with AJAX requests? How do I solve this?

If it's an AJAX request, it might take a lot of time to recreate it. It may be easier to automate a browser using Playwright for Python (github.com/microsoft/playwright-python), which should allow you to get the image URL.
– 576i, Commented May 16, 2021 at 20:03

tomjn · Accepted Answer · 2021-05-19 09:47:15Z

@samuelhogg deserves the credit for finding the JSON, but here is an example spider showing how to get all the image URLs from the page. Note that you don't even need Splash here. I have not tested it with Splash, but I think it should still work.

from scrapy import Spider
import json


class Zara(Spider):
    name = "zara"
    start_urls = [
        "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115"
    ]
  
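    def parse(self, response):
        # Find the JSON-LD block identified by @samuelhogg
        data = response.css("script[type='application/ld+json']::text").get()
        if data is None:
            # No JSON-LD script on the page, so there is nothing to parse
            self.logger.warning("No JSON-LD script found on %s", response.url)
            return
        # Make a set of all the images in the JSON
        images = {image for i in json.loads(data) for image in i["image"]}
        # Do what you want with them!
        print(images)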
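
The set comprehension in `parse` assumes that the JSON is a list and that every entry has an `image` list. As a hedged variation, here is a minimal sketch of a more tolerant helper that could replace that line. The function name `image_urls_from_json_ld` and the `example.com` URLs are made up for illustration, and whether ZARA's markup ever deviates from the list shape is an assumption.

```python
import json


def image_urls_from_json_ld(raw):
    """Return image URLs from a JSON-LD string, de-duplicated, in first-seen order.

    Hypothetical helper: accepts a single object or a list of objects, and an
    "image" value that is either one string or a list of strings.
    """
    if not raw:
        return []
    data = json.loads(raw)
    items = data if isinstance(data, list) else [data]
    urls = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        images = item.get("image", [])
        if isinstance(images, str):
            images = [images]
        for url in images:
            # dict keys keep insertion order, which a set would not
            urls.setdefault(url, None)
    return list(urls)


# Made-up sample data, not taken from the real page
sample = json.dumps([
    {"@type": "Product", "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"]},
    {"@type": "Product", "image": "https://example.com/b.jpg"},
    {"@type": "Product"},
])
print(image_urls_from_json_ld(sample))
```

Inside the spider this would be called as `image_urls_from_json_ld(data)`, and each URL could then be yielded as an item instead of printed.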

samuelhogg · Accepted Answer · 2021-05-16 19:42:00Z

I have only started looking into web scraping in the last week, so I am not sure if I can be much help, but I did find something.

The source code showed this in the script at the top:

_mkt_imageDir = /BASE_IMAGES_URL=(.*?);/.test(document.cookie) && RegExp.$1 || 'https://static.zara.net/photos/';

and this further down:

"originalUrl":"/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115","imageBaseUrl":"https://static.zara.net/photos/"

and then all the image URLs, in what appears to be JSON inside a script tag:

[{"@context":"http://schema.org/","@type":"Product","sku":"108967877-046-1","name":"FITTED HOUNDSTOOTH BLAZER","mpn":"108967877-046-1","brand":"ZARA","description":"","image":["https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_1_1_1.jpg?ts=1620821843383","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_1_1.jpg?ts=1620821851988","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_2_1.jpg?ts=1620821839280","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_1_1.jpg?ts=1620655538200","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_2_1.jpg?ts=1620655535611","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_3_1.jpg?ts=1620656141718","https://static.zara.net/photos///contents/cm/w/1920/sustainability-extrainfo-label-JL78_0.jpg?ts=1602602200357"]

I have no idea how you will scrape them but I will be interested to know the answer when you find out.

Regards Samuel
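
As a rough illustration of how that embedded JSON could be pulled out of the page source, here is a minimal standard-library sketch. It assumes the block sits in a `script` tag with `type="application/ld+json"`; the class name `JsonLdCollector` and the sample HTML are made up for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        # Script text can arrive in several pieces, so accumulate it
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


# Made-up page source standing in for the real product page
html = (
    "<html><head>"
    '<script type="text/javascript">var x = 1;</script>'
    '<script type="application/ld+json">'
    '[{"@type": "Product", "image": ["https://example.com/1.jpg?ts=1"]}]'
    "</script>"
    "</head><body></body></html>"
)

collector = JsonLdCollector()
collector.feed(html)
collector.close()

products = json.loads(collector.blocks[0])
print(products[0]["image"])
```

In a Scrapy spider the same block is easier to reach with a CSS selector on the response, so this parser is mainly useful for checking saved page source outside Scrapy.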

samuelhogg · Accepted Answer · 2021-05-16 20:37:02Z

It looks like the URLs are in JSON, which I believe you can scrape the URLs from.

There is some info/code about scraping from json here

edited May 16, 2021 at 20:37

answered May 16, 2021 at 20:26
