This project focuses on building an Intrusion Detection System (IDS) using machine learning techniques. The IDS analyzes automotive cybersecurity datasets to identify potential intrusions and anomalies. It emphasizes efficient preprocessing of large datasets, preparing high-quality data for downstream ML modeling.
- Dual Data Processing Engines: Supports both Polars and Pandas for flexible and efficient data handling.
- Optimized Dataset Loading: Uses Polars for high-speed data ingestion with large files (~9 million rows).
- Data Preprocessing: Clean, transform, and sample data using intuitive Pandas workflows.
- Exploratory Analysis: Offers visual and statistical analysis tools via Jupyter notebooks.
- Modular Design: Shared utility functions for easier maintenance and reusability.
- notebooks: Used for exploratory data analysis, visualization, and prototyping. The notebooks here are for testing and experimenting with the data before the methods are finalized.
- src: Contains the finalized Python scripts that implement the core functionality of the project. After testing and refining methods in the notebooks, the final code is written into these files for consistent and optimized execution.
- notebooks/eda.ipynb: This is the first exploration of the data, where no sampling or preprocessing has been done yet. The goal here is to learn from the raw data, understand its structure, detect any anomalies, and identify potential features for further analysis.
- notebooks/solve_dlc_flag_issue.ipynb: Handles the issue of misplaced flag values in datasets with variable DLC (Data Length Code). When dlc < 8, the flag sometimes appears in one of the byte columns instead of the flag column. This notebook detects and corrects such cases.
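
The misplaced-flag problem can be pictured at the level of a single row. The sketch below is not the notebook's code; it assumes a raw row has the layout timestamp, CAN ID, DLC, then exactly DLC payload bytes, then the flag, and that the flag is one of the labels R or T. The function name `normalize_row` and the constant names are hypothetical. It pads the payload to eight byte slots so the flag always ends up in the last position.

```python
from typing import List, Optional

# Assumption: flags in the raw files are the single letters "R" or "T".
FLAG_VALUES = {"R", "T"}
MAX_DLC = 8  # a classic CAN frame carries at most 8 data bytes


def normalize_row(fields: List[Optional[str]]) -> List[Optional[object]]:
    """Return a fixed-width row: timestamp, CAN ID, DLC, 8 byte slots, flag.

    `fields` is one already-split row whose assumed layout is
    [timestamp, can_id, dlc, byte_0 .. byte_{dlc-1}, flag, ...].
    """
    timestamp, can_id = fields[0], fields[1]
    dlc = int(fields[2])
    if not 0 <= dlc <= MAX_DLC:
        raise ValueError(f"unexpected DLC value: {dlc}")

    # The first `dlc` values after the DLC field are real payload bytes.
    payload = list(fields[3:3 + dlc])
    if len(payload) != dlc:
        raise ValueError("row is shorter than its DLC claims")

    # Whatever follows the payload should be the flag; for dlc < 8 this is
    # the value that would otherwise sit in a byte column.
    rest = fields[3 + dlc:]
    flag = rest[0] if rest and rest[0] in FLAG_VALUES else None

    # Pad the unused byte slots with None so every row has the same width.
    padded = payload + [None] * (MAX_DLC - dlc)
    return [timestamp, can_id, dlc, *padded, flag]
```

For a row with DLC 2 such as `["1.0", "043f", "2", "00", "ff", "R"]`, the function keeps the two payload bytes, fills the six unused byte slots with `None`, and places `"R"` in the final flag position. The same check can be applied column-wise in Pandas or Polars; the row-level form is used here only to keep the idea visible.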
Intrusion-Detection-Systems-using-ML/
├── input/ # Raw dataset files from Car Hacking Dataset
│ ├── attack_free.txt
│ ├── dos_dataset.csv
│ ├── fuzzy_dataset.csv
├── output/ # Processed datasets ready for analysis
│ ├── attack_free_df.csv
│ ├── dos_df.csv
│ ├── fuzzy_df.csv
├── notebooks/ # Jupyter notebooks for analysis
│ ├── eda.ipynb # Exploratory data analysis using Pandas
│ ├── preprocess_data_with_pandas.ipynb # Data cleaning and transformation using Pandas
│ ├── preprocess_data_with_polars.ipynb # Data cleaning and transformation using Polars
│ ├── solve_dlc_flag_issue.ipynb # Fixing misplaced DLC/flag column
│ ├── utils.ipynb # Helper functions for notebooks
│ ├── visualize_data.ipynb # Data visualization with charts
├── src/ # Python scripts for production-ready data processing
│ ├── load_data_with_polars.py # 🚀 Actively used: Efficient loading using Polars
│ ├── preprocess_data_with_pandas.py # ✅ Actively used: Sampling & cleaning using Pandas
│ ├── utils.py # ✅ Actively used: Shared helper functions
│ ├── load_data_with_pandas.py # ⚠️ Not used (slow on large data, kept for reference)
│ ├── preprocess_data_with_polars.py # ⚠️ Not used (replaced with Pandas version)
│ ├── train_model.py # ML model training (coming soon)
├── README.md # Project documentation
| File | Purpose |
|---|---|
| load_data_with_polars.py | Loads full datasets efficiently using Polars |
| preprocess_data_with_pandas.py | Preprocesses sampled data using Pandas, suitable for ML workflows |
| utils.py | Stores common helper functions used across both engines |
| File | Notes |
|---|---|
| load_data_with_pandas.py | Legacy loader; not recommended for large-scale data loading |
| preprocess_data_with_polars.py | Old preprocessing logic; replaced for better maintainability with Pandas |
- Polars is preferred for initial full data loading due to its speed and memory efficiency.
- Pandas is used for preprocessing sampled data—it's more intuitive and integrates well with visualization and ML tools.
The raw datasets are taken from the Car Hacking Dataset, which contains records for intrusion detection, such as:
- attack_free.txt: Attack-free dataset.
- dos_dataset.csv: Denial of Service (DoS) dataset.
- fuzzy_dataset.csv: Fuzzy intrusion dataset.
Processed datasets are saved in the output folder as:
- attack_free_df.csv
- dos_df.csv
- fuzzy_df.csv
- Clone the repository:

      git clone https://github.com/yourusername/Intrusion-Detection-Systems-using-ML.git
      cd Intrusion-Detection-Systems-using-ML

- Install dependencies: create and activate a virtual environment, then install the requirements:

      python -m venv venv
      source venv/bin/activate # For Linux/Mac
      venv\Scripts\activate # For Windows
      pip install -r requirements.txt

- Run preprocessing scripts: use the src scripts to generate processed datasets:

      python src/load_data_with_polars.py
      python src/preprocess_data_with_pandas.py
- Load Full Dataset: Use src/load_data_with_polars.py for quick ingestion of large files.
- Preprocess Data: Run src/preprocess_data_with_pandas.py after sampling for manageable processing.
- Explore Data: Open notebooks/eda.ipynb for insights into distributions, anomalies, and patterns.
- Visualize Data: Generate visual summaries using notebooks/visualize_data.ipynb.
This project is licensed under the MIT License.
- Car Hacking Dataset — source of real CAN bus data.
- Polars — for blazing-fast data loading.
- Open-source community — for tools and guidance that power this project.