The Web Scraping Tool is a Python script that simplifies extracting useful data from web pages. It lets users collect links, email addresses, social media links, author names, and phone numbers from websites of their choice.


πŸ•ΈοΈ Web Scraper Project

A modern Python tool for extracting links, emails, social media profiles, author names, phone numbers, images, documents, tables, and metadata from any website. Output results directly to your terminal or save them in multiple formats.


🚀 Features

  • Extracts:
    • Links
    • Email addresses
    • Social media profiles (Facebook, Twitter, Instagram, etc.)
    • Author names
    • Phone numbers (country-specific)
    • Images (with optional download)
    • Documents (PDF, DOCX, XLSX, etc.)
    • Tables (with optional CSV export)
    • Metadata (title, meta tags)
  • Output to terminal (with colors) or file
  • Supports TXT, JSON, CSV, Markdown, Excel, and SQLite formats
  • Recursive and parallel scraping
  • Live preview mode
  • Scheduled scraping
  • Data filtering and processing (deduplication, sorting)
  • Modular codebase for easy extension
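
Under the hood, link and email extraction can be as simple as walking anchor tags and applying an email regex. The sketch below uses only the Python standard library and is illustrative only; the actual `scraper/extractors.py` may use different libraries and patterns:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Simple email pattern; real-world extractors often use stricter regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class LinkCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links_and_emails(html, base_url):
    """Return (absolute links, sorted unique email addresses) found in a page."""
    parser = LinkCollector()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.hrefs]
    emails = sorted(set(EMAIL_RE.findall(html)))
    return links, emails
```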

📦 Installation

  1. Clone the repository:

    git clone https://github.com/Togeee12/web-scraper-project.git
    cd web-scraper-project
  2. Install dependencies:

    pip install -r requirements.txt

πŸ“ Usage

Run the scraper from the command line:

python main.py --url <website_url> --output <terminal|file> [options]

Key Arguments:

  • --url (required): Website URL to scrape.
  • --output: Output mode (terminal or file).
  • --format: File format (txt, json, csv, md, xlsx, sqlite).
  • --filename: Output filename.
  • --country: Country code for phone numbers (default: US).
  • --depth: Depth for recursive scraping.
  • --recursive: Enable recursive scraping.
  • --parallel: Enable parallel scraping.
  • --urls: List of URLs for parallel scraping.
  • --max-workers: Number of parallel workers.
  • --schedule: Schedule scraping every X hours.
  • --schedule-output: Output file for scheduled scraping.
  • --filter-keyword: Filter results by keyword.
  • --filter-regex: Filter results by regex pattern.
  • --process: Deduplicate and sort data.
  • --download-images: Download images locally.
  • --live-preview: Enable live preview mode.
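
A minimal `argparse` sketch of how `main.py` might wire up a subset of these flags. Flag names follow the list above, but the defaults and choices here are illustrative assumptions, not necessarily what the script uses:

```python
import argparse

def build_parser():
    """Build a CLI parser mirroring the documented flags (illustrative subset)."""
    p = argparse.ArgumentParser(description="Web Scraper Project")
    p.add_argument("--url", help="Website URL to scrape")
    p.add_argument("--output", choices=["terminal", "file"], default="terminal")
    p.add_argument("--format", choices=["txt", "json", "csv", "md", "xlsx", "sqlite"])
    p.add_argument("--filename", help="Output filename")
    p.add_argument("--country", default="US", help="Country code for phone numbers")
    p.add_argument("--recursive", action="store_true", help="Enable recursive scraping")
    p.add_argument("--depth", type=int, default=1, help="Depth for recursive scraping")
    p.add_argument("--parallel", action="store_true", help="Enable parallel scraping")
    p.add_argument("--urls", nargs="+", help="URLs for parallel scraping")
    p.add_argument("--max-workers", type=int, default=4)
    return p
```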

Examples:

  • Output to terminal:

    python main.py --url https://example.com --output terminal
  • Output to file (JSON):

    python main.py --url https://example.com --output file --format json --filename results.json
  • Recursive scraping:

    python main.py --url https://example.com --recursive --depth 2
  • Parallel scraping:

    python main.py --parallel --urls https://site1.com https://site2.com --output file --format csv
  • Live preview:

    python main.py --url https://example.com --live-preview
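
Recursive scraping with `--depth` can be pictured as a depth-limited breadth-first crawl. The sketch below is a simplified model, not the project's actual implementation; `get_links` stands in for whatever page-fetching function the scraper uses:

```python
def crawl(start_url, get_links, depth=2):
    """Depth-limited breadth-first crawl.

    get_links(url) must return the links found on that page; seen URLs
    are tracked so each page is visited at most once.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen
```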

πŸ—‚οΈ Project Structure

web-scraper-project/
├── config.json
├── CONTRIBUTING.md
├── main.py
├── README.md
├── requirements.txt
└── scraper/
    ├── extractors.py
    ├── output.py
    └── scraper.py
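
`scraper/output.py` presumably handles saving results. A minimal sketch of JSON and CSV output, assuming results are a list of dicts (the real module also covers txt, md, xlsx, and sqlite):

```python
import csv
import json

def save_results(results, fmt, filename):
    """Write a list of result dicts to disk as JSON or CSV (sketch only)."""
    if fmt == "json":
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    elif fmt == "csv":
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```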

πŸ› οΈ Dependencies

Install all dependencies with:

pip install -r requirements.txt

🤝 Contributing

Contributions are welcome!
See CONTRIBUTING.md for guidelines.


📄 License

MIT License. See LICENSE for details.


πŸ™ Acknowledgments

  • Created by Togeee12
  • Thanks to the developers of the open-source Python libraries this project relies on
