WAN-scraper

Repo #634528101 | |
---|---|
Author | Lorenzo Rottigni |
Created | 2023-04-30 |
Updated | 2025-05-16 |
Pushed | 2023-04-30 |
Size | 9 MB |
Primary language | Python |
Stars | 0 |
Default branch | main |
Overview
This is a web scraping project that uses Python and Scrapy to collect data from a list of random websites. The goal is to gather information on each website's structure, content, and other relevant data for analysis.
Requirements
- Python 3.x
- Scrapy
Installation & Usage
- Clone the repository: `git clone https://github.com/LorenzoRottigni/WAN-scraper.git`
- Create a virtual environment named "scraper": `python3 -m venv scraper`
- Activate the "scraper" virtual environment: `source scraper/bin/activate`
- Install the dependencies: `pip3 install -r requirements.txt`
- Run the Scrapy spider: `scrapy crawl wan_scraper`
- The spider will visit the URLs in the `start_urls` list in `wan_scraper/spiders/wan_scraper.py`, collect data, and save it to a CSV file located in the project directory.
Additional Information
- The spider is set to obey the `robots.txt` file on each website visited, but please use caution and follow ethical scraping practices.
- Feel free to modify the spider's behavior to suit your needs by editing the code in `wan_scraper/spiders/wan_scraper.py`.
- For more information on using Scrapy, please refer to the official documentation.
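To illustrate what obeying `robots.txt` means in practice, here is a small standard-library sketch. It is separate from Scrapy, which performs this check itself when its `ROBOTSTXT_OBEY` setting is enabled. The rules, the URLs, and the `is_allowed` helper are hypothetical, and the `robots.txt` content is inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, user_agent: str = "wan_scraper") -> bool:
    """Return True if the inlined rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
```

With these rules, the first URL is permitted and the second is blocked, because its path falls under the `Disallow: /private/` rule.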