WAN-scraper

Repo #634528101 | |
---|---|
Author | Lorenzo Rottigni |
Created | 2023-04-30 |
Updated | 2025-05-16 |
Pushed | 2023-04-30 |
Size | 9 MB |
Primary language | Python |
Stars | 0 |
Default branch | main |
Python
README.md
WAN-scraper
Overview
This is a web scraping project built with Python and Scrapy that collects data from a list of arbitrary websites. The goal is to gather information on each site's structure, content, and other relevant data for analysis.
Requirements
- Python 3.x
- Scrapy
Installation & Usage
- Clone the repository:
git clone https://github.com/LorenzoRottigni/WAN-scraper.git
- Create a virtual environment named "scraper":
python3 -m venv scraper
- Activate the "scraper" virtual environment:
source scraper/bin/activate
- Install dependencies:
pip3 install -r requirements.txt
- Run the Scrapy spider:
scrapy crawl wan_scraper
- The spider will visit the URLs in the start_urls list in wan_scraper/spiders/wan_scraper.py, collect data, and save it to a CSV file located in the project directory.
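This README does not say how the CSV file is produced. One common way to get CSV output from a Scrapy project is the FEEDS setting in the project's settings.py; the sketch below assumes that approach and a hypothetical file name, output.csv, so the repository's actual configuration may differ.

```python
# settings.py fragment (illustrative sketch, not the repository's actual configuration)

# FEEDS maps an output location to export options; Scrapy writes every scraped item there.
# "output.csv" is a hypothetical file name, relative to the directory the crawl is run from.
FEEDS = {
    "output.csv": {
        "format": "csv",     # use Scrapy's built-in CSV exporter
        "encoding": "utf8",  # write the file as UTF-8
        "overwrite": True,   # replace the file on each run (requires Scrapy 2.4 or newer)
    },
}
```

With a setting like this in place, scrapy crawl wan_scraper needs no extra flags. A similar result is available without touching the settings by passing an output file on the command line: scrapy crawl wan_scraper -o output.csv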
Additional Information
- The spider is set to obey the robots.txt file on each website visited, but please use caution and follow ethical scraping practices.
- Feel free to modify the spider's behavior to suit your needs by editing the code in wan_scraper/spiders/wan_scraper.py.
- For more information on using Scrapy, please refer to the official documentation.
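As a sketch of what cautious crawling can look like in practice, the following settings.py fragment pairs the robots.txt behavior with a few of Scrapy's built-in politeness settings. The setting names are Scrapy's own; the values and the contact URL in the user agent are assumptions chosen for illustration, not taken from this repository.

```python
# settings.py fragment (illustrative values, not the repository's actual configuration)

ROBOTSTXT_OBEY = True               # skip URLs that a site's robots.txt disallows
DOWNLOAD_DELAY = 1.0                # wait roughly one second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap parallel requests to any single domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server response times

# Hypothetical user agent: identify the crawler and give site owners a way to reach you.
USER_AGENT = "wan_scraper (+https://example.com/contact)"
```

Lower concurrency and a longer delay make a crawl slower but reduce the load placed on each site, which matters most when the list of start URLs includes small or personally hosted websites.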