How to Scrape Dynamic E-Commerce Product Pages in Python Using BeautifulSoup and Selenium


  • Web Scraping in Python using BeautifulSoup and Selenium


    There are many Python libraries you can use for data scraping, and plenty of online tutorials on how to get started.


    Today, we will discuss scraping e-commerce product data from dynamic pages, concentrating on how to do it with BeautifulSoup and Selenium.


    E-commerce product list pages are usually dynamic, so different product details are shown to different users: for example, airline prices change depending on users’ locations, or products are ranked by relevance based on browsing behaviour. The product information is generally populated in the browser using JavaScript. That is where Selenium comes in. It can programmatically load and interact with web pages in a browser. We can then use BeautifulSoup to parse the page source and scrape the required product data from the HTML elements.


    This blog shows how to automatically retrieve product data from product listing pages and turn it into a clean, usable format for analysis.

    Why do this? Understanding your competitors, comparing prices across different retailers, and analyzing market trends are just a few practical applications.




    Installation


    This blog uses Pandas, BeautifulSoup, and Selenium. The re, requests, and time modules are optional and are only used in the more advanced steps. If you don’t have everything installed, the easiest way is to install with pip.



    <pre>pip install selenium
    pip install beautifulsoup4
    pip install lxml
    pip install pandas
    pip install requests
    </pre>


    You also need a web driver for your browser. For Chrome, for instance, download ChromeDriver and place the executable in one of the directories on your PATH.
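    As a quick sanity check, the standard library can report whether a driver executable is visible on PATH. This is only a sketch: it assumes the executable is named chromedriver, which may differ on your system.

```python
import shutil

# Look for an executable named "chromedriver" in the directories on PATH.
# The executable name is an assumption; adjust it for other browsers.
driver_path = shutil.which("chromedriver")

if driver_path is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {driver_path}")
```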




    Page Scraping


    For the demo, we will scrape books.toscrape.com, a fictional book store. Its pages are static rather than dynamic, but the approach works the same way.



    <pre>import pandas as pd
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import re
    import requests
    import time

    url = 'https://books.toscrape.com/catalogue/page-1.html'
    driver = webdriver.Chrome()
    driver.implicitly_wait(30)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source,'lxml')
    driver.quit()
    </pre>


    The code above loads the URL in a Chrome browser, allows up to 30 seconds for elements to load, passes the page source to BeautifulSoup, and ends the browser session. For pages that take a long time to load, you may need to adjust the waiting time (in seconds).


    With the page parsed into soup, it’s time to start scraping useful elements!



    Scraping Elements


    To get an element, we can filter by its tag name, or by an attribute name and attribute value.


    To scrape all the product names on the first page of the fictional book store, let’s identify which elements they are stored in. It looks like the text is consistently stored in the h3 tag.

    soup.find() gets the first element that matches our filter: here, the tag name matches ‘h3’.

    Adding .string returns only the element’s text.
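    To make these two calls concrete, here is a minimal sketch run against a small inline HTML snippet instead of the live page. The markup and book names are made up for illustration, and html.parser is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified from a product listing page.
html = """
<ol>
  <li><h3><a href="book-1.html">First Book</a></h3></li>
  <li><h3><a href="book-2.html">Second Book</a></h3></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element as a Tag object.
first_heading = soup.find("h3")

# .string returns the text when the element wraps a single string.
first_name = first_heading.string

print(first_heading)  # <h3><a href="book-1.html">First Book</a></h3>
print(first_name)     # First Book
```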

    soup.find_all() gets all the elements that match our filter and returns them in a list. Note: soup.find_all() and soup() work the same way, in case you’re a fan of brevity.

    Finally, calling .string on each element in a list comprehension returns the elements’ text. Now we have a list of 20 product names!
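    The same idea, sketched on a small inline snippet with three made-up products rather than the 20 on the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three products.
html = """
<ol>
  <li><h3><a href="book-1.html">First Book</a></h3></li>
  <li><h3><a href="book-2.html">Second Book</a></h3></li>
  <li><h3><a href="book-3.html">Third Book</a></h3></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching element in document order.
headings = soup.find_all("h3")

# Calling the soup object directly is shorthand for find_all().
shorthand = soup("h3")

# Extract the text of each heading with a list comprehension.
names = [heading.string for heading in headings]

print(len(headings) == len(shorthand))  # True
print(names)  # ['First Book', 'Second Book', 'Third Book']
```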

    The same can be done for all the product details. To find all the product prices, we filter by the attribute name ‘class’ and the attribute value ‘price_color’.
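    A minimal sketch of filtering by attribute, again on made-up inline markup. BeautifulSoup also accepts the class_ keyword as a shortcut for the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the prices are illustrative.
html = """
<ol>
  <li><h3>First Book</h3><p class="price_color">£10.00</p></li>
  <li><h3>Second Book</h3><p class="price_color">£12.50</p></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by attribute name ('class') and attribute value ('price_color').
prices = [p.string for p in soup.find_all(attrs={"class": "price_color"})]

# Equivalent filter using the class_ keyword argument.
prices_shortcut = [p.string for p in soup.find_all(class_="price_color")]

print(prices)  # ['£10.00', '£12.50']
```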

    You could stop here and work with separate lists of product details, and that might work well for websites with clean HTML. However, e-commerce websites are not always clean, and troubleshooting the exceptions can be the most time-consuming part of the process.




    Missing Elements


    This is the most common exception we have encountered.


    What happens when elements are missing for certain products? For instance, a product may be temporarily unavailable, so there is no tag containing its price. Rather than having null values in the list, we would get a price list that is shorter than the list of product names, and we would risk matching prices to the wrong products.


    To avoid that, we found it best to first filter to the outer elements that contain all the data for one product, and then, within each outer element, get the particular inner elements such as the product’s name and price. We can add a condition that returns a null value when an inner element is missing from a product tile. This makes sure all the product data stays in the same order across our lists.


    <pre>[Return null value if inner element is missing else
    return text of inner element
    for x in all outer elements]
    </pre>
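    The pattern above can be sketched on a small inline example in which the second product has no price. The markup, class name 'product', and values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical product tiles; "Book B" has no price element.
html = """
<ul>
  <li class="product"><h3>Book A</h3><p class="price_color">£10.00</p></li>
  <li class="product"><h3>Book B</h3></li>
  <li class="product"><h3>Book C</h3><p class="price_color">£7.25</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Naive approach: one flat list of prices, which ends up shorter than the names.
naive_prices = [p.string for p in soup.find_all(attrs={"class": "price_color"})]

# Safer approach: loop over the outer elements and look inside each one.
tiles = soup.find_all(attrs={"class": "product"})
names = [tile.find("h3").string for tile in tiles]
prices = []
for tile in tiles:
    price = tile.find(attrs={"class": "price_color"})
    # find() returns None when the inner element is missing.
    prices.append(None if price is None else price.string)

print(naive_prices)  # ['£10.00', '£7.25']
print(prices)        # ['£10.00', None, '£7.25']
```

    With the naive list, the price of Book C would be paired with Book B. The per-tile lookup keeps the lists aligned.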




    Putting It All Together in a Pandas DataFrame



    <pre># Outer elements: one tile per product.
    tiles = soup.find_all(attrs={'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

    rows = []
    for tile in tiles:
        # find() returns None when an inner element is missing from the tile.
        name = tile.find('h3')
        price = tile.find(attrs={'class':'price_color'})
        stock = tile.find(attrs={'class':'instock availability'})
        rating = tile.find(attrs={'class':re.compile(r'star-rating$')})
        rows.append((
            None if name is None else name.string,
            None if price is None else price.string.replace('£',''),
            None if stock is None else stock.text.strip(),
            None if rating is None else rating.get('class')[1]))

    df = pd.DataFrame(rows, columns=['product_name','price','availability','rating'])
    </pre>


    We can put the product data straight into a Pandas DataFrame and name each column.


    For all the other ways to navigate the elements, see the BeautifulSoup documentation.


    This is also a good time to preprocess some features, such as removing the currency symbols and stripping the whitespace around the text.
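    If you want numeric columns for analysis, the cleanup can also be done in Pandas after the DataFrame is built. The sketch below uses a small made-up DataFrame. The mapping from rating words to numbers is an assumption about how the ratings are labelled.

```python
import pandas as pd

# Made-up scraped values, with stray whitespace and a currency symbol.
df = pd.DataFrame({
    "product_name": ["Book A", "Book B"],
    "price": [" £10.00 ", "£7.25"],
    "rating": ["Three", "Five"],
})

# Strip whitespace, drop the currency symbol, then convert to a numeric dtype.
df["price"] = pd.to_numeric(
    df["price"].str.strip().str.replace("£", "", regex=False)
)

# Map rating words to integers (assumed labels: One to Five).
rating_words = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
df["rating"] = df["rating"].map(rating_words)

print(df)
```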


    To make this easy to reuse, we can put everything in a function so that a page can be scraped with a single line of code.



    <pre>def scrape_page(url):
        # Load the page in Chrome and parse the rendered HTML.
        driver = webdriver.Chrome()
        driver.implicitly_wait(30)
        try:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source,'lxml')
        finally:
            driver.quit() #Always end the browser session.
        rows = []
        for tile in soup.find_all(attrs={'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'}):
            # find() returns None when an inner element is missing from the tile.
            name = tile.find('h3')
            price = tile.find(attrs={'class':'price_color'})
            stock = tile.find(attrs={'class':'instock availability'})
            rating = tile.find(attrs={'class':re.compile(r'star-rating$')})
            rows.append((
                None if name is None else name.string,
                None if price is None else price.string.replace('£',''),
                None if stock is None else stock.text.strip(),
                None if rating is None else rating.get('class')[1]))
        return pd.DataFrame(rows, columns=['product_name','price','availability','rating'])

    scrape_page('https://books.toscrape.com/catalogue/page-1.html')
    </pre>




    Pagination


    To scrape products that span multiple pages, we can wrap this in a function that iterates through each page’s URL and combines the DataFrames from all the scraped pages.



    <pre>def scrape_multiple_pages(url,pages):
        #Input parameters of url and number of pages to scrape. Put {} in place of page number in url.
        frames = []
        for page in range(1, pages+1): #Loops through each page number in url, starting at 1.
            page_url = url.format(page)
            if requests.get(page_url, timeout=30).status_code != 200: #Stop at the first page that does not return an OK 200 response.
                break
            frames.append(scrape_page(page_url))
            time.sleep(5) #Wait 5 seconds.
        if not frames: #No page could be scraped.
            return pd.DataFrame(columns=['product_name','price','availability','rating'])
        #DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat.
        return pd.concat(frames, ignore_index=True)

    scrape_multiple_pages('https://books.toscrape.com/catalogue/page-{}.html',pages=2)
    </pre>


    In the url parameter, we dynamically populate the page number using {} and .format(). The pages parameter defines the maximum number of pages to scrape, beginning at page 1.
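    As a small illustration of how the placeholder is filled in, the sketch below builds the page URLs without making any requests. The variable names are illustrative.

```python
# URL template with {} standing in for the page number.
template = "https://books.toscrape.com/catalogue/page-{}.html"
pages = 3

# Page numbers start at 1, so range runs from 1 to pages inclusive.
page_urls = [template.format(number) for number in range(1, pages + 1)]

for page_url in page_urls:
    print(page_url)
```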


    We have also added two extra steps: a page is scraped only if its URL returns a 200 OK response, and the script sleeps for 5 seconds between pages.




    Conclusion


    Here are some things to think about:


    It’s important to consider how much you scrape, as extra server calls can quickly add up.


    Because we need to load and run JavaScript for every page, this technique is slow by programmatic standards and not very scalable.


    As with any scraping, the element selection needs to be tailored to each site, as HTML structures differ across websites.


    However, for scraping a few pages at a time, it is an easy and helpful solution that takes only a few lines of code.


    For more information, you can contact 3i Data Scraping or ask for a free quote!

