A few days ago, I watched the movie Clueless and saw how the main character, Cher, had her own virtual wardrobe. I was so inspired to make one for myself that I needed to gather my own set of clothing images to build a diverse wardrobe. So I resorted to building my own "Clothes Scraper" to gather information about the clothes and grab all the images in one go. I will walk you through my thought process of building this scraper and the nitty-gritty details of scraping.
The site we will be scraping is Cotton On's Sale for Women page.
We will need to have a few things installed before we start web scraping: Google Chrome, the Chromedriver matching your Chrome version, Python 3, and a code editor such as Visual Studio Code. Have all of these installed and you are good to go!
Now open up Visual Studio Code and put your Chromedriver file inside the directory you will be scraping from.
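Your project folder might look something like this (scraper.py is just a placeholder name for your Python file; the webScraping folder matches the driver path we will use below):

webScraping/
├── chromedriver
└── scraper.py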
Now we will need to pip install a few packages!
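Selenium is actually the only third-party package this walkthrough needs; time, csv, os, shutil, date and urllib all ship with Python's standard library.

pip install selenium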
Alright, now we will need to import these packages at the start of our Python file.
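Based on what the scraper uses later on, these are the imports we need:

from selenium import webdriver
from datetime import date
import time
import csv
import os
import shutil
import urllib.request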
Now that we have imported these libraries, let's start by instantiating the webdriver object using Selenium.
driver = webdriver.Chrome(executable_path="webScraping/chromedriver")
url = "https://cottonon.com/MY/sale/sale-womens/"
driver.get(url)
Make sure to point the driver to your own path!
Cool! Now watch Selenium work its magic: run the Python file and watch your browser automatically open Cotton On's Sale page.
Now we have to tackle the website's use of lazy loading, which loads images only once the user has scrolled down to them. If we were to scrape the site right away, we would not get the entire list of images on the page, as the browser only renders the first few images upon opening the page.
# scrolls the page down by 1000 pixels every second
y = 1000
for timer in range(0, 23):
    driver.execute_script("window.scrollTo(0, " + str(y) + ")")
    y += 1000
    time.sleep(1)
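The 23 iterations above are tuned to this particular page. If you want something less brittle, a common alternative (a sketch, not from the original scraper) is to keep scrolling until the page height stops growing:

# keep scrolling to the bottom until the page stops getting taller
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height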
Alright, now open up the page inspector (Ctrl+Shift+I, or right-click and choose Inspect) and notice how you can find the huge "div" that contains the list of images, along with the number of images on the page.
# get the number of clothes on the page
numberOfClothesID = "search-result-items"
totalItems = driver.find_element_by_id(numberOfClothesID).get_attribute("data-total-page-tiles")
print("Number of clothes on this page: " + totalItems)
Cool! We have managed to get the number of items on the page. Next we need to access every "li" element inside the big "div" we just grabbed.
listofitems = driver.find_element_by_id(numberOfClothesID)
elems = listofitems.find_elements_by_tag_name("li")
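As a quick sanity check (my addition, not in the original article), you can compare the number of "li" elements we found against the count the page reports:

# the number of li tiles found should match the data-total-page-tiles count
print(len(elems))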
Okay, now that we have that done, we need to initialise a folder named after the date of the scrape and create a CSV file for storing the clothes' information.
# create a new folder for the images, named after today's date
currentdate = date.today()
newpath = f"/YOURPATH/{currentdate}"
if not os.path.exists(newpath):
    os.makedirs(newpath)

# open a csv file for the clothes' information
# (keep the scraping loop below nested inside this with-block,
#  otherwise the file is closed before we can write to it)
with open(str(currentdate) + ".csv", "w") as fp:
    wr = csv.writer(fp, dialect='excel')
Now we can start looping through the list items, extracting their information and downloading the images.
counter = 1
for img in elems:
    print(counter)
    counter = counter + 1
    # get the product id
    imageidori = img.find_element_by_class_name("product-tile").get_attribute("id")
    imageid = '"' + imageidori + '"'
    # get the product's details and split them into one entry per line
    imagedetails = img.find_element_by_class_name("product-tile").text
    clothesinfo = imagedetails.splitlines()
Here we are looping through the "li" items and getting each product's details.
I have also wrapped the product ID in quotes, as you can see above, because we will need it to build the XPath for downloading the images. One other thing to note is that we split the product details on new lines so that we can easily write each field into its own column in the CSV.
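The snippet above doesn't show the write itself, but inside the loop it is a one-liner along these lines (a sketch: the exact column order depends on the order of the lines in the product tile's text):

# write one row per product: the id first, then the split-out details
wr.writerow([imageidori] + clothesinfo)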
Now we can download each item's image based on its product ID. Inspecting the image elements by their XPath shows that they all share the same path format, so we can build the XPath from the product ID, grab the image's link, and download it with the urllib library.
imagesrc = driver.find_element_by_xpath('//*[@id=' + imageid + ']/div[2]/a/img').get_attribute("src")
urllib.request.urlretrieve(imagesrc, str(imageidori) + ".jpg")
The downloaded files go straight into our root folder! To stay organised, we should move them into the folder named after the date of the scrape.
# move each file into its respective directory
originalfile = f"/yourpath/{str(imageidori)}.jpg"
movedfile = f"/yourpath/{currentdate}/{str(imageidori)}.jpg"
shutil.move(originalfile, movedfile)
Finally, we are done! You will get a beautiful CSV with all the names, original prices, discounted prices and types of sale, and you will also have a folder of full images of the sale items.
My scraped images in one folder
Congratulations! You made it to the end of this tutorial! Give yourself a pat on the back! This was an introduction to web scraping, and of course there are many other libraries out there that you can use as well, such as Puppeteer, BeautifulSoup, Scrapy and many others.
Github Link: https://github.com/eliseching99/FashionScrape
Email us directly at hello@sigmaschool.co!
Want to find out more about what we do?
Learn more here: https://sigmaschool.co
Let’s get social! Find us on:
Facebook: https://www.facebook.com/joinsigma/
Instagram: https://www.instagram.com/joinsigma/
Linkedin: https://linkedin.com/company/79085028/