
I am trying to scrape a product page from ZARA, like this one: https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115

My scrapy-splash container is running. In the shell I fetch the page

fetch('http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115')
2021-05-14 14:30:42 [scrapy.core.engine] INFO: Spider opened
2021-05-14 14:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115> (referer: None)

Everything is working so far, and I am able to get the header and price. However, I also want to get the image URLs of the product.

I try to select them with:

response.css('img.media-image__image::attr(src)').getall()

But the response is this:

['https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png']

These are all the transparent background image, not the real product images. The images display in the browser, and I can see them arriving in the network requests. Is it because they are loaded with AJAX requests? How do I solve this?

  • If it's an AJAX request, it might take a lot of time to recreate it. It may be easier to automate a browser using Playwright for Python (github.com/microsoft/playwright-python), which should allow you to get the image URLs. Commented May 16, 2021 at 20:03

3 Answers


@samuelhogg deserves the credit for finding the JSON, but here is an example spider showing how to get all the image URLs from the page. Note that you don't even need Splash here. I have not tested it with Splash, but I think it should still work.

from scrapy import Spider
import json


class Zara(Spider):
    name = "zara"
    start_urls = [
        "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115"
    ]
  
    def parse(self, response):
        # Find the JSON-LD block identified by @samuelhogg
        data = response.css("script[type='application/ld+json']::text").get()
        if data is None:
            # The script tag is missing, so there is nothing to parse
            return
        # Build a set of all the image URLs listed in the JSON
        images = {image for i in json.loads(data) for image in i["image"]}
        # Do what you want with them!
        print(images)
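
If you want something a little more defensive than the set comprehension, here is a rough sketch of a helper. The function name `image_urls_from_ld_json` is made up for illustration, and I am assuming a page may carry more than one ld+json script, that a block may hold a single object instead of a list, and that "image" may be a single string. None of that is confirmed for ZARA; it is only what JSON-LD permits in general.

```python
import json


def image_urls_from_ld_json(blocks):
    """Collect image URLs from JSON-LD script bodies, keeping first-seen order.

    `blocks` is an iterable of raw script texts (hypothetical helper, not
    part of Scrapy).
    """
    seen = []
    for raw in blocks:
        try:
            payload = json.loads(raw)
        except (TypeError, ValueError):
            # Skip None, empty, or malformed blocks instead of crashing
            continue
        # A block may hold one object or a list of objects
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if not isinstance(item, dict):
                continue
            images = item.get("image", [])
            if isinstance(images, str):
                images = [images]
            elif not isinstance(images, list):
                # Any other shape is ignored in this sketch
                continue
            for url in images:
                # De-duplicate while preserving order
                if isinstance(url, str) and url not in seen:
                    seen.append(url)
    return seen
```

Inside the spider you could call it as image_urls_from_ld_json(response.css("script[type='application/ld+json']::text").getall()). Unlike a set, the returned list keeps the images in page order.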

I have only started looking into web scraping in the last week, so I am not sure if I can be much help, but I did find something.

The source code showed this in the script at the top:

_mkt_imageDir = /BASE_IMAGES_URL=(.*?);/.test(document.cookie) && RegExp.$1 || 'https://static.zara.net/photos/';

and this further down:

"originalUrl":"/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115","imageBaseUrl":"https://static.zara.net/photos/"

then all the images are listed here, in what appears to be JSON inside a script tag:

[{"@context":"http://schema.org/","@type":"Product","sku":"108967877-046-1","name":"FITTED HOUNDSTOOTH BLAZER","mpn":"108967877-046-1","brand":"ZARA","description":"","image":["https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_1_1_1.jpg?ts=1620821843383","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_1_1.jpg?ts=1620821851988","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_2_1.jpg?ts=1620821839280","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_1_1.jpg?ts=1620655538200","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_2_1.jpg?ts=1620655535611","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_3_1.jpg?ts=1620656141718","https://static.zara.net/photos///contents/cm/w/1920/sustainability-extrainfo-label-JL78_0.jpg?ts=1602602200357"]

I have no idea how you will scrape them but I will be interested to know the answer when you find out.

Regards Samuel
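
For what it is worth, the block above can be read without Scrapy at all. This is only a sketch using the Python standard library: the class name `LdJsonCollector` and the function `product_images` are invented for the example, and it assumes the JSON sits in a script tag with type application/ld+json (as the spider in the other answer selects) and that "image" is a list of URLs, as in the snippet above.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text can arrive in pieces, so accumulate it
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def product_images(html):
    """Return the image URLs found in the page's JSON-LD blocks."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    urls = []
    for block in collector.blocks:
        data = json.loads(block)
        # Assumption: each block is an object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            # Assumption: "image" is a list of URL strings
            urls.extend(item.get("image", []))
    return urls
```

Passing the page source (for example, the text of a downloaded response) to product_images should give back the list of URLs in the order they appear in the JSON.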


It looks like the URLs are in a JSON file, which I believe you can scrape the URLs from.

There is some info/code about scraping from json here
