Intrusion-Detection-Systems-using-ML

📜 Project Overview

This project builds an Intrusion Detection System (IDS) using machine learning techniques. The IDS analyzes automotive cybersecurity datasets (CAN bus traffic from the Car Hacking Dataset) to identify potential intrusions and anomalies. The current focus is efficient preprocessing of large datasets, so that clean, high-quality data is ready for downstream ML modeling; model training itself is not implemented yet.

🚀 Features

Dual Data Processing Engines: Supports both Polars and Pandas for flexible and efficient data handling.
Optimized Dataset Loading: Uses Polars for high-speed data ingestion with large files (~9 million rows).
Data Preprocessing: Clean, transform, and sample data using intuitive Pandas workflows.
Exploratory Analysis: Offers visual and statistical analysis tools via Jupyter notebooks.
Modular Design: Shared utility functions for easier maintenance and reusability.

📂 Repository Structure

notebooks: Exploratory data analysis, visualization, and prototyping. Methods are tested and refined here before they are finalized.
src: The finalized Python scripts that implement the project's core functionality. Once a method has been tested and refined in a notebook, the final code is moved here for consistent, optimized execution.
notebooks/eda.ipynb: The first exploration of the raw data, before any sampling or preprocessing. The goal is to understand the data's structure, detect anomalies, and identify candidate features for further analysis.
notebooks/solve_dlc_flag_issue.ipynb: Handles the issue of misplaced flag values in datasets with variable DLC (Data Length Code). When dlc < 8, the flag sometimes appears in one of the byte columns instead of the flag column. This notebook detects and corrects such cases.
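
The correction described for solve_dlc_flag_issue.ipynb can be sketched in Pandas as follows. This is a minimal illustration, not the notebook's actual code. The column names (`dlc`, `data0` to `data7`, `flag`) are hypothetical, and it assumes the misplaced flag lands in the first byte column after the valid payload (`data{dlc}`) while the `flag` column is left empty.

```python
import pandas as pd

# Hypothetical column names for the eight payload bytes.
BYTE_COLS = [f"data{i}" for i in range(8)]


def fix_misplaced_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Move a flag that landed in a byte column back into the flag column."""
    out = df.copy()
    for dlc in range(8):  # rows with dlc == 8 need no repair
        spill_col = BYTE_COLS[dlc]  # assumed location of the misplaced flag
        mask = (out["dlc"] == dlc) & out["flag"].isna() & out[spill_col].notna()
        out.loc[mask, "flag"] = out.loc[mask, spill_col]
        out.loc[mask, spill_col] = pd.NA  # the byte slot was never real payload
    return out


# Two illustrative rows: a full 8-byte frame and a 2-byte frame whose
# flag ended up in data2.
rows = [
    ["0316", 8, "05", "21", "68", "09", "21", "21", "00", "6f", "R"],
    ["02c0", 2, "15", "00", "R", None, None, None, None, None, None],
]
frame = pd.DataFrame(rows, columns=["can_id", "dlc", *BYTE_COLS, "flag"])
fixed = fix_misplaced_flag(frame)
```

The function works on a copy, so the original frame stays available for comparing rows before and after the fix.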

Intrusion-Detection-Systems-using-ML/
├── input/                                   # Raw dataset files from Car Hacking Dataset
│   ├── attack_free.txt
│   ├── dos_dataset.csv
│   ├── fuzzy_dataset.csv
├── output/                                  # Processed datasets ready for analysis
│   ├── attack_free_df.csv
│   ├── dos_df.csv
│   ├── fuzzy_df.csv
├── notebooks/                              # Jupyter notebooks for analysis
│   ├── eda.ipynb                           # Exploratory data analysis using Pandas
│   ├── preprocess_data_with_pandas.ipynb   # Data cleaning and transformation using Pandas
│   ├── preprocess_data_with_polars.ipynb   # Data cleaning and transformation using Polars
│   ├── solve_dlc_flag_issue.ipynb          # Fixing misplaced DLC/flag column
│   ├── utils.ipynb                         # Helper functions for notebooks
│   ├── visualize_data.ipynb                # Data visualization with charts
├── src/                                    # Python scripts for production-ready data processing
│   ├── load_data_with_polars.py            # 🚀 Actively used: Efficient loading using Polars
│   ├── preprocess_data_with_pandas.py      # ✅ Actively used: Sampling & cleaning using Pandas
│   ├── utils.py                            # ✅ Actively used: Shared helper functions
│   ├── load_data_with_pandas.py            # ⚠️ Not used (slow on large data, kept for reference)
│   ├── preprocess_data_with_polars.py      # ⚠️ Not used (replaced with Pandas version)
│   ├── train_model.py                      # ML model training (coming soon)
├── README.md                               # Project documentation

✅ Currently Used Code Files

File	Purpose
`load_data_with_polars.py`	Loads full datasets efficiently using Polars
`preprocess_data_with_pandas.py`	Preprocesses sampled data using Pandas, suitable for ML workflows
`utils.py`	Stores common helper functions used across both engines

❌ Deprecated / Reference Files

File	Notes
`load_data_with_pandas.py`	Legacy loader; not recommended for large-scale data loading
`preprocess_data_with_polars.py`	Old preprocessing logic; replaced by the Pandas version for better maintainability

🧠 Why Both Pandas & Polars?

Polars handles the initial full data load because of its speed and memory efficiency on large files (~9 million rows).
Pandas handles preprocessing of the sampled data because it is more intuitive and integrates well with visualization and ML tools.

📊 Datasets

The raw datasets come from the Car Hacking Dataset, which contains CAN bus records for intrusion detection. This project uses the following files:

attack_free.txt: Attack-free dataset.
dos_dataset.csv: Denial of Service (DoS) dataset.
fuzzy_dataset.csv: Fuzzy intrusion dataset.

Processed datasets are saved in the output folder as:

attack_free_df.csv
dos_df.csv
fuzzy_df.csv
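
The naming pattern above (drop any `_dataset` suffix, append `_df.csv`) can be captured in a small helper. This is only a sketch of the convention visible in the file lists; `processed_path` is a hypothetical name and is not claimed to exist in utils.py.

```python
from pathlib import Path


def processed_path(raw_file: str, output_dir: str = "output") -> Path:
    """Map a raw input file to the processed CSV name used in the output folder."""
    stem = Path(raw_file).stem.removesuffix("_dataset")  # requires Python 3.9+
    return Path(output_dir) / f"{stem}_df.csv"


# The three raw files listed above and their processed counterparts.
mapping = {
    name: processed_path(name)
    for name in ("input/attack_free.txt", "input/dos_dataset.csv", "input/fuzzy_dataset.csv")
}
```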

🛠️ Setup Instructions

Clone the repository:

git clone https://github.com/yourusername/Intrusion-Detection-Systems-using-ML.git
cd Intrusion-Detection-Systems-using-ML

Install dependencies:
Create and activate a virtual environment, then install requirements:

python -m venv venv
source venv/bin/activate   # For Linux/Mac
venv\Scripts\activate      # For Windows
pip install -r requirements.txt

Run preprocessing scripts: Use the src scripts to generate processed datasets:

python src/load_data_with_polars.py
python src/preprocess_data_with_pandas.py

📝 Usage

Load Full Dataset: Use src/load_data_with_polars.py for quick ingestion of large files.
Preprocess Data: Run src/preprocess_data_with_pandas.py to sample the loaded data and clean it at a manageable size.
Explore Data: Open notebooks/eda.ipynb for insights into distributions, anomalies, and patterns.
Visualize Data: Generate visual summaries using notebooks/visualize_data.ipynb.

📜 License

This project is licensed under the MIT License.

🙌 Acknowledgments

Car Hacking Dataset — source of real CAN bus data.
Polars — for blazing-fast data loading.
Open-source community — for tools and guidance that power this project.
