The Web Scraping Tool is a Python script that simplifies extracting useful data from web pages. It lets users collect links, email addresses, social media links, author names, and phone numbers from websites of their choice.


πŸ•ΈοΈ Web Scraper Project

A modern Python tool for extracting links, emails, social media profiles, author names, phone numbers, images, documents, tables, and metadata from any website. Output results directly to your terminal or save them in multiple formats.


🚀 Features

  • Extracts:
    • Links
    • Email addresses
    • Social media profiles (Facebook, Twitter, Instagram, etc.)
    • Author names
    • Phone numbers (country-specific)
    • Images (with optional download)
    • Documents (PDF, DOCX, XLSX, etc.)
    • Tables (with optional CSV export)
    • Metadata (title, meta tags)
  • Output to terminal (with colors) or file
  • Supports TXT, JSON, CSV, Markdown, Excel, and SQLite formats
  • Recursive and parallel scraping
  • Live preview mode
  • Scheduled scraping
  • Data filtering and processing (deduplication, sorting)
  • Modular codebase for easy extension
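
Under the hood, link and email extraction can be as simple as walking anchor tags and applying an email regex. The sketch below uses only the Python standard library and is illustrative only; the actual `scraper/extractors.py` may use different libraries and patterns:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Simple email pattern; real-world extractors often use stricter regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class LinkCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links_and_emails(html, base_url):
    """Return (absolute links, sorted unique email addresses) found in a page."""
    parser = LinkCollector()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.hrefs]
    emails = sorted(set(EMAIL_RE.findall(html)))
    return links, emails
```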

📦 Installation

  1. Clone the repository:

    git clone https://github.com/Togeee12/web-scraper-project.git
    cd web-scraper-project
  2. Install dependencies:

    pip install -r requirements.txt

πŸ“ Usage

Run the scraper from the command line:

python main.py --url <website_url> --output <terminal|file> [options]

Key Arguments:

  • --url (required): Website URL to scrape.
  • --output: Output mode (terminal or file).
  • --format: File format (txt, json, csv, md, xlsx, sqlite).
  • --filename: Output filename.
  • --country: Country code for phone numbers (default: US).
  • --depth: Depth for recursive scraping.
  • --recursive: Enable recursive scraping.
  • --parallel: Enable parallel scraping.
  • --urls: List of URLs for parallel scraping.
  • --max-workers: Number of parallel workers.
  • --schedule: Schedule scraping every X hours.
  • --schedule-output: Output file for scheduled scraping.
  • --filter-keyword: Filter results by keyword.
  • --filter-regex: Filter results by regex pattern.
  • --process: Deduplicate and sort data.
  • --download-images: Download images locally.
  • --live-preview: Enable live preview mode.
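
A minimal `argparse` sketch of how `main.py` might wire up a subset of these flags. Flag names follow the list above, but the defaults and choices here are illustrative assumptions, not necessarily what the script uses:

```python
import argparse

def build_parser():
    """Build a CLI parser mirroring the documented flags (illustrative subset)."""
    p = argparse.ArgumentParser(description="Web Scraper Project")
    p.add_argument("--url", help="Website URL to scrape")
    p.add_argument("--output", choices=["terminal", "file"], default="terminal")
    p.add_argument("--format", choices=["txt", "json", "csv", "md", "xlsx", "sqlite"])
    p.add_argument("--filename", help="Output filename")
    p.add_argument("--country", default="US", help="Country code for phone numbers")
    p.add_argument("--recursive", action="store_true", help="Enable recursive scraping")
    p.add_argument("--depth", type=int, default=1, help="Depth for recursive scraping")
    p.add_argument("--parallel", action="store_true", help="Enable parallel scraping")
    p.add_argument("--urls", nargs="+", help="URLs for parallel scraping")
    p.add_argument("--max-workers", type=int, default=4)
    return p
```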

Examples:

  • Output to terminal:

    python main.py --url https://example.com --output terminal
  • Output to file (JSON):

    python main.py --url https://example.com --output file --format json --filename results.json
  • Recursive scraping:

    python main.py --url https://example.com --recursive --depth 2
  • Parallel scraping:

    python main.py --parallel --urls https://site1.com https://site2.com --output file --format csv
  • Live preview:

    python main.py --url https://example.com --live-preview
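
Recursive scraping with `--depth` can be pictured as a depth-limited breadth-first crawl. The sketch below is a simplified model, not the project's actual implementation; `get_links` stands in for whatever page-fetching function the scraper uses:

```python
def crawl(start_url, get_links, depth=2):
    """Depth-limited breadth-first crawl.

    get_links(url) must return the links found on that page; seen URLs
    are tracked so each page is visited at most once.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen
```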

πŸ—‚οΈ Project Structure

web-scraper-project/
├── config.json
├── CONTRIBUTING.md
├── main.py
├── README.md
├── requirements.txt
└── scraper/
    ├── extractors.py
    ├── output.py
    └── scraper.py
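
`scraper/output.py` presumably handles saving results. A minimal sketch of JSON and CSV output, assuming results are a list of dicts (the real module also covers txt, md, xlsx, and sqlite):

```python
import csv
import json

def save_results(results, fmt, filename):
    """Write a list of result dicts to disk as JSON or CSV (sketch only)."""
    if fmt == "json":
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    elif fmt == "csv":
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```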

πŸ› οΈ Dependencies

Install all dependencies with:

pip install -r requirements.txt

🤝 Contributing

Contributions are welcome!
See CONTRIBUTING.md for guidelines.


📄 License

MIT License. See LICENSE for details.


πŸ™ Acknowledgments

  • Created by Togeee12
  • Thanks to the developers of the open-source Python libraries this project relies on
