A modern Python tool for extracting links, emails, social media profiles, author names, phone numbers, images, documents, tables, and metadata from any website. Output results directly to your terminal or save them in multiple formats.
- Extracts:
  - Links
  - Email addresses
  - Social media profiles (Facebook, Twitter, Instagram, etc.)
  - Author names
  - Phone numbers (country-specific)
  - Images (with optional download)
  - Documents (PDF, DOCX, XLSX, etc.)
  - Tables (with optional CSV export)
  - Metadata (title, meta tags)
- Output to terminal (with colors) or file
- Supports TXT, JSON, CSV, Markdown, Excel, and SQLite formats
- Recursive and parallel scraping
- Live preview mode
- Scheduled scraping
- Data filtering and processing (deduplication, sorting)
- Modular codebase for easy extension
1. Clone the repository:

   ```bash
   git clone https://github.com/Togeee12/web-scraper-project.git
   cd web-scraper-project
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
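Optionally, the dependencies can be isolated by creating a virtual environment before step 2. This is standard Python tooling, not specific to this project:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```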
Run the scraper from the command line:

```bash
python main.py --url <website_url> --output <terminal|file> [options]
```
Key Arguments:

- `--url` (required): Website URL to scrape.
- `--output`: Output mode (`terminal` or `file`).
- `--format`: File format (`txt`, `json`, `csv`, `md`, `xlsx`, `sqlite`).
- `--filename`: Output filename.
- `--country`: Country code for phone numbers (default: `US`).
- `--depth`: Depth for recursive scraping.
- `--recursive`: Enable recursive scraping.
- `--parallel`: Enable parallel scraping.
- `--urls`: List of URLs for parallel scraping.
- `--max-workers`: Number of parallel workers.
- `--schedule`: Schedule scraping every X hours.
- `--schedule-output`: Output file for scheduled scraping.
- `--filter-keyword`: Filter results by keyword.
- `--filter-regex`: Filter results by regex pattern.
- `--process`: Deduplicate and sort data.
- `--download-images`: Download images locally.
- `--live-preview`: Enable live preview mode.
Examples:

- Output to terminal:

  ```bash
  python main.py --url https://example.com --output terminal
  ```

- Output to file (JSON):

  ```bash
  python main.py --url https://example.com --output file --format json --filename results.json
  ```

- Recursive scraping:

  ```bash
  python main.py --url https://example.com --recursive --depth 2
  ```

- Parallel scraping:

  ```bash
  python main.py --parallel --urls https://site1.com https://site2.com --output file --format csv
  ```

- Live preview:

  ```bash
  python main.py --url https://example.com --live-preview
  ```
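The remaining flags compose in the same way. The invocations below are illustrative, built from the argument descriptions above rather than verified against the code:

```bash
# Scrape every 6 hours, writing results to a dedicated output file
python main.py --url https://example.com --schedule 6 --schedule-output results.json

# Keep only results matching a keyword, then deduplicate and sort
python main.py --url https://example.com --output file --format csv \
  --filename contacts.csv --filter-keyword contact --process

# Parse phone numbers as UK numbers and download images locally
python main.py --url https://example.co.uk --country GB --download-images
```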
```text
web-scraper-project/
├── config.json
├── CONTRIBUTING.md
├── main.py
├── README.md
├── requirements.txt
└── scraper/
    ├── extractors.py
    ├── output.py
    └── scraper.py
```
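The layout suggests that new extractors belong in `scraper/extractors.py`. A minimal sketch of what a custom extractor could look like, assuming extractors are plain functions over raw HTML; the function name and structure are illustrative and not taken from the actual codebase:

```python
# Hypothetical custom extractor in the style suggested by scraper/extractors.py.
import re

from bs4 import BeautifulSoup


def extract_emails(html: str) -> list[str]:
    """Return de-duplicated email addresses found in a page."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    # Scan both visible text and mailto: links.
    found = set(pattern.findall(soup.get_text(" ")))
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("mailto:"):
            found.add(a["href"][len("mailto:"):])
    return sorted(found)


if __name__ == "__main__":
    sample = '<a href="mailto:hello@example.com">Contact</a> or admin@example.com'
    print(extract_emails(sample))
```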
- beautifulsoup4
- requests
- colorama
- phonenumbers
- tqdm
- pandas (for Excel/CSV export)
- openpyxl (for Excel export)
- schedule (for scheduled scraping)
Install all dependencies with:

```bash
pip install -r requirements.txt
```
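If `requirements.txt` is unavailable, the packages listed above can also be installed directly:

```bash
pip install beautifulsoup4 requests colorama phonenumbers tqdm pandas openpyxl schedule
```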
Contributions are welcome!
See CONTRIBUTING.md for guidelines.
MIT License. See LICENSE for details.
- Created by Togeee12
- Thanks to the developers of the open-source Python libraries this project builds on