
Web Scraping with Scrapy (8 Code Examples)

May 10, 2024
in Data Science & ML
Reading Time: 6 mins read


In this Python tutorial, we’ll go over web scraping using Scrapy, and we’ll work through a sample e-commerce website scraping project.

By 2025, the internet is expected to grow to more than 175 zettabytes of data. Unfortunately, a large portion of it is unstructured and not machine-readable: you can access the data through websites, but only in the form of HTML pages. Is there an easier way to not just access this web data but also download it in a structured format, so it becomes machine-readable and ready to yield insights? This is where web scraping and Scrapy can help you. Web scraping is the process of extracting structured data from websites, and Scrapy, one of the most popular web scraping frameworks, is a great choice if you want to learn how to scrape data from the web. In this tutorial, you’ll learn how to get started with Scrapy, and you’ll also implement an example project to scrape an e-commerce website. Let’s get started!

Prerequisites

To complete this tutorial, you need Python installed on your system, and a basic knowledge of Python coding is recommended.

Installing Scrapy

To use Scrapy, you first need to install it. Luckily, there’s a very easy way to do it via pip: run pip install scrapy. You can also find other installation options in the Scrapy docs. It’s recommended to install Scrapy within a Python virtual environment.


virtualenv env
source env/bin/activate
pip install scrapy
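
To quickly verify the installation, you can check the installed version from a Python session (a minimal sanity check; the version shown in the comment is just an example):


import scrapy

print(scrapy.__version__)  # e.g. "2.11.2"; confirms Scrapy is importable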

Scrapy Project Structure

Whenever you create a new Scrapy project you need to use a specific file structure to make sure Scrapy knows where to look for each of its modules. Luckily, Scrapy has a handy command that can help you create an empty Scrapy project with all the modules of Scrapy:


scrapy startproject bookscraper

Running this command creates a new Scrapy project, based on a template, that looks like this:

📦bookscraper
┣ 📂bookscraper
┃ ┣ 📂spiders
┃ ┃ ┗ 📜bookscraper.py
┃ ┣ 📜items.py
┃ ┣ 📜middlewares.py
┃ ┣ 📜pipelines.py
┃ ┗ 📜settings.py
┗ 📜scrapy.cfg

This is a typical Scrapy project file structure. Let’s quickly examine these files and folders on a high level so you understand what each of the elements does:

  • spiders folder: This folder contains all of our future Scrapy spider files that extract the data.
  • items: This file contains item objects that behave like Python dictionaries and provide an abstraction layer to store scraped data within the Scrapy framework.
  • middlewares (advanced): Scrapy middlewares are useful if you want to modify how Scrapy runs and makes requests to the server (e.g., to get around antibot solutions). For simple scraping projects, you don’t need to modify middlewares.
  • pipelines: Scrapy pipelines are for extra data processing steps you want to implement after you extract data. You can clean, organize, or even drop data in these pipelines.
  • settings: General settings for how Scrapy runs, for example, delays between requests, caching, file download settings, etc. (a short sketch of the pipelines and settings modules follows this list).
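
To make the pipelines and settings modules more concrete, here is a minimal sketch. The PriceToFloatPipeline class and the specific setting values are illustrative assumptions for this project, not something the startproject template generates:


# pipelines.py: a hypothetical post-processing step for the data we'll scrape later
class PriceToFloatPipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol and convert the scraped price string to a float
        if item.get("price"):
            item["price"] = float(item["price"].replace("£", "").strip())
        return item


# settings.py: a few of the general settings mentioned above (values are examples)
DOWNLOAD_DELAY = 0.5        # wait half a second between requests
HTTPCACHE_ENABLED = True    # cache responses locally while developing
ITEM_PIPELINES = {
    "bookscraper.pipelines.PriceToFloatPipeline": 300,  # enable the pipeline above
}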

In this tutorial, we focus on two Scrapy modules: spiders and items. With these two modules, you can implement simple and effective web scrapers that can extract data from any website. After you’ve successfully installed Scrapy and created a new Scrapy project, let’s learn how to write a Scrapy spider (also called a scraper) that extracts product data from an e-commerce store.

Scraping Logic

As an example, this tutorial uses a website that was specifically created for practicing web scraping: Books to Scrape. Before coding the spider, it’s important to have a look at the website and analyze the path the spider needs to take to access and scrape the data. We’ll use this website to scrape all the books that are available. As you can see on the site, there are multiple categories of books and multiple items on each category page. This means that our scraper needs to go to each category page and open each book page. Let’s break down what the scraper needs to do on the website:

  1. Open the website (http://books.toscrape.com/).
  2. Find all the category URLs.
  3. Find all the book URLs on the category pages.
  4. Open each URL one by one and extract book data.

In Scrapy, we have to store scraped data in Item classes. In our case, an Item will have fields like title, price, upc, image_url, and url. Let’s implement the item!

Scrapy Item

Create a new Scrapy item that stores the scraped data. Let’s call this item BookItem and add the data fields that represent each book:

  • title
  • price
  • upc
  • image_url
  • url

In code, this is how you create a new Item class in Scrapy:


from scrapy import Item, Field


class BookItem(Item):
    title = Field()
    price = Field()
    upc = Field()
    image_url = Field()
    url = Field()

As you can see in the code snippet, you need to import two Scrapy objects: Item and Field. Item is used as the parent class for BookItem, so Scrapy knows this object will be used throughout the project to store and reference the scraped data fields. Field is an object stored as part of an Item class to indicate the data fields within the item. Once you’ve created the BookItem class, you can go ahead and work on the Scrapy spider that handles the scraping logic and extraction.
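
As a quick illustration of that dictionary-like behavior (this snippet is just a sanity check, not one of the project files, and the field values are made up), you can instantiate and fill a BookItem like this:


from bookscraper.items import BookItem

book = BookItem(title="A Light in the Attic", price="£51.77")
book["upc"] = "a897fe39b1053632"  # fields can also be set with dict-style access
print(dict(book))                 # items convert cleanly to plain dictionaries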

Scrapy Spider

Create a new Python file in the spiders folder called bookscraper.py


touch bookscraper.py

This spider file contains the spider logic and scraping code. In order to determine what needs to go in this file, let’s inspect the website!

Website Inspection

Website inspection is a tedious but important step in the web scraping process. Without a proper inspection, you won’t know how to locate and extract data from websites efficiently. Inspection is usually done using your browser’s “inspect” tool or a third-party browser plugin that lets you “look under the hood” and analyze the source code of a website. While you’re analyzing the website, it’s recommended to turn off JavaScript execution in your browser – this way you can see the website the same way your Scrapy spider will see it. Let’s recap which URLs and data fields we need to locate in the source code of the website:

  • category URLs
  • book URLs
  • finally, book data fields

Inspect the source code to locate category URLs in the HTML: by inspecting the website, you can notice that category URLs are stored within a ul HTML element with the class nav nav-list. This is crucial information, because you can use this CSS class and the surrounding HTML elements to locate all of the category URLs on the page – exactly what we need! Let’s keep this in mind and dig deeper to find other potential CSS selectors we can use in our spider.
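
One way to test a selector before writing the spider is an interactive check from a plain Python session. This is a minimal sketch using requests plus parsel (the selector library Scrapy is built on); scrapy shell would work equally well:


import requests
from parsel import Selector

# Download the start page and wrap it in a Selector, as Scrapy does internally
html = requests.get("http://books.toscrape.com/").text
sel = Selector(text=html)

# Category links live inside the <ul class="nav nav-list"> sidebar element
category_links = sel.css(".nav-list > li > ul > li > a::attr(href)").getall()
print(len(category_links), category_links[:3])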

Inspect the HTML to find book page URLs: individual book page URLs are located under an article HTML element with the CSS class product_pod. We can use this CSS rule to find the book page URLs with our scraper.

Finally, inspect the website to find individual data fields on the book page: this time it’s slightly trickier, as we’re looking for multiple data fields on the page, not just one, so we’ll need multiple CSS selectors to find each field. On the book page, some data fields (like UPC and price) can be found in an HTML table, but other fields (like the title) are at the top of the page in a different kind of HTML element. After inspecting the page and finding all the data fields and URL locators we need, you can implement the spider:


from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bookscraper.items import BookItem


class BookScraper(CrawlSpider):
    name = "bookscraper"
    start_urls = ["http://books.toscrape.com/"]

    rules = (
        # Follow every category link in the sidebar
        Rule(LinkExtractor(restrict_css=".nav-list > li > ul > li > a"), follow=True),
        # Open every book link found on the category pages and parse it
        Rule(LinkExtractor(restrict_css=".product_pod > h3 > a"), callback="parse_book"),
    )

    def parse_book(self, response):
        book_item = BookItem()
        book_item["image_url"] = response.urljoin(response.css(".item.active > img::attr(src)").get())
        book_item["title"] = response.css(".col-sm-6.product_main > h1::text").get()
        book_item["price"] = response.css(".price_color::text").get()
        book_item["upc"] = response.css(".table.table-striped > tr:nth-child(1) > td::text").get()
        book_item["url"] = response.url
        return book_item

Let’s break down what’s happening in this code snippet: Scrapy will open the website http://books.toscrape.com/. It will start iterating over the category pages defined by…
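
To run the spider, you would normally call scrapy crawl bookscraper from the project directory. As an alternative, here is a minimal sketch that starts the crawl from a Python script and writes the scraped items to a JSON feed (the output filename is just an example, and the import path assumes the file layout created above):


from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bookscraper.spiders.bookscraper import BookScraper

settings = get_project_settings()
settings.set("FEEDS", {"books.json": {"format": "json", "overwrite": True}})

process = CrawlerProcess(settings)
process.crawl(BookScraper)
process.start()  # blocks until the crawl is finished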


