A machine learning pipeline for detecting exoplanets from Kepler mission light curve data, combining a hierarchical KNN+DTW classifier with a 2D Convolutional Neural Network. Built as part of MATH336 Mathematical Modelling at FLAME University.
Authors: Piya Shah, Saharsh Kanchan, Prithvi Dhyani
This project uses the transit method — detecting periodic dips in stellar brightness caused by an orbiting planet crossing the line of sight — to classify stars as exoplanet hosts or non-hosts from Kepler light curve data.
Two independent detection approaches were implemented and evaluated:
| Approach | Accuracy | Exoplanet F1 | Notes |
|---|---|---|---|
| Hierarchical KNN + DTW | 95% | 0.78 | +10pp recall over baseline |
| 2D CNN (phase-folded images) | 99.2% | — | AUC: 0.97 |
When a planet of radius
The duration of the transit
Kepler measures stellar brightness continuously. A transit event appears as a periodic, symmetric dip in the light curve
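As a rough numerical illustration of how small these dips are, the sketch below uses the standard geometric approximation in which the fractional depth equals the squared ratio of planet radius to stellar radius. It ignores limb darkening and grazing geometries, and the function name and the radii are illustrative assumptions rather than values taken from this project.

```python
def transit_depth(planet_radius_km: float, star_radius_km: float) -> float:
    """Fractional flux drop during a full transit (no limb darkening)."""
    return (planet_radius_km / star_radius_km) ** 2


# Nominal radii, used here only for illustration.
R_SUN_KM = 695_700.0
R_JUPITER_KM = 71_492.0
R_EARTH_KM = 6_371.0

jupiter_depth = transit_depth(R_JUPITER_KM, R_SUN_KM)
earth_depth = transit_depth(R_EARTH_KM, R_SUN_KM)
print(f"Jupiter-like depth: {jupiter_depth:.4%}")
print(f"Earth-like depth:   {earth_depth:.4%}")
```

A Jupiter-sized planet around a Sun-like star gives a dip of roughly one percent, while an Earth-sized planet gives a dip about two orders of magnitude smaller, which is why careful calibration and detrending matter.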
Raw Kepler data arrives as Target Pixel Files (TPFs) — 2D arrays of pixel intensities over time. Let
Background signal
Pixel-level sensitivity variations are corrected using a flat-field factor
Flux is extracted by summing calibrated intensities over an optimal aperture
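The calibration and extraction steps above can be sketched with NumPy as follows. This is a minimal illustration, assuming the flat-field factor is applied by division and the aperture is a boolean pixel mask; the function name `extract_flux` and the synthetic arrays are hypothetical and not taken from the notebooks.

```python
import numpy as np


def extract_flux(tpf, background, flat, aperture):
    """Sum calibrated pixel intensities inside a boolean aperture mask.

    tpf:        (T, H, W) raw pixel intensities over time
    background: (T, H, W) background estimate
    flat:       (H, W) flat-field sensitivity factors (assumed multiplicative)
    aperture:   (H, W) boolean mask of pixels to include
    """
    calibrated = (tpf - background) / flat
    # Boolean indexing keeps only aperture pixels: shape (T, n_pixels).
    return calibrated[:, aperture].sum(axis=1)


# Synthetic example: a star on the centre pixel with faint wings.
T, H, W = 4, 3, 3
signal = np.zeros((H, W))
signal[1, 1] = 100.0
signal[0, 1] = signal[2, 1] = signal[1, 0] = signal[1, 2] = 10.0

flat = np.ones((H, W))
flat[1, 1] = 1.1                      # one pixel is 10% more sensitive
background = np.full((T, H, W), 5.0)
tpf = signal * flat + background      # what the detector would record

aperture = signal > 0
flux = extract_flux(tpf, background, flat, aperture)
print(flux)
```

In this toy case every cadence recovers the same total flux, because the background and the sensitivity difference are both undone before summing.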
The optimal aperture maximises the signal-to-noise ratio:
where
The observed flux contains both systematic and random noise:
Systematic noise is modelled as a linear combination of regressors via a design matrix
PCA is applied to reduce the dimensionality of
The corrected flux is then:
Uncertainty is propagated as:
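To make the detrending idea concrete (this sketch covers the systematics correction only, not the uncertainty propagation), the example below reduces a set of regressors with PCA, fits them to the flux by least squares, and subtracts the fitted systematics. The function name `detrend`, the number of components, and the synthetic trends are assumptions for illustration. A real pipeline would typically also mask in-transit cadences before fitting so the transit itself is not absorbed.

```python
import numpy as np


def detrend(flux, regressors, n_components=2):
    """Subtract systematics modelled as a linear combination of regressors.

    flux:       (T,) observed flux
    regressors: (T, R) candidate systematic trends
    """
    centred = regressors - regressors.mean(axis=0)
    # PCA via SVD: columns of u are the principal-component time series.
    u, _, _ = np.linalg.svd(centred, full_matrices=False)
    design = np.column_stack([u[:, :n_components], np.ones(len(flux))])
    weights, *_ = np.linalg.lstsq(design, flux, rcond=None)
    # Remove only the trend terms; the constant column keeps the baseline.
    return flux - design[:, :-1] @ weights[:-1]


rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
base = np.column_stack([t, np.sin(2 * np.pi * 3 * t)])   # two hidden trends
mixing = np.array([[1.0, 0.5, -0.2, 0.3],
                   [0.1, 1.0, 0.7, -0.4]])
regressors = base @ mixing                               # four observed regressors
flux = 1.0 + base @ np.array([0.5, -0.3]) + rng.normal(0.0, 1e-4, t.size)

corrected = detrend(flux, regressors)
print(f"scatter before: {flux.std():.4f}, after: {corrected.std():.6f}")
```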
The observed flux is shaped not just by astrophysical signals but by the instrument's Pixel Response Function (PRF):
The PRF encodes the optical and electronic response of the detector, including the Point Spread Function (PSF). Residual PRF systematics are implicitly absorbed by the classifiers during training.
Comparing two light curves
DTW finds the optimal alignment between two time series by solving a dynamic programming problem.
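Cost matrix: pairwise squared Euclidean distances between the samples of two series $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_m)$:

$$C(i, j) = (x_i - y_j)^2$$

Cumulative cost matrix: computed via the recurrence:

$$D(i, j) = C(i, j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$

with boundary conditions $D(0, 0) = 0$ and $D(i, 0) = D(0, j) = \infty$ for $i, j \geq 1$.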
DTW distance:
This allows flexible many-to-one alignments, making it robust to timing distortions that would mislead Euclidean distance. DTW can be viewed as a discrete subordinated process: the warping path
Naively, DTW between two series of length
This is computationally infeasible for large datasets and long light curves.
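The dynamic programme described above can be sketched in a few lines, and the nested loop makes the quadratic cost visible. This is a minimal illustration, assuming squared pointwise costs and returning the accumulated cost at the final cell; some formulations take a square root of that value, and the notebook's exact definition may differ. The function name `dtw_distance` is hypothetical.

```python
import numpy as np


def dtw_distance(x, y):
    """Accumulated DTW cost between two 1-D series using squared differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = (x[:, None] - y[None, :]) ** 2          # pairwise cost matrix
    acc = np.full((n + 1, m + 1), np.inf)          # padded cumulative matrix
    acc[0, 0] = 0.0                                # boundary condition
    for i in range(1, n + 1):                      # n * m cell updates in total
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])


# A dip shifted by one sample: Euclidean distance penalises it, DTW does not.
a = [0, 0, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0]
squared_euclidean = float(np.sum((np.array(a) - np.array(b)) ** 2))
print(dtw_distance(a, b), squared_euclidean)
```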
A two-stage filter reduces the search space:
Stage 1 — Feature-space KNN filter:
Extract a compact feature vector from each light curve:
where
The DC component is zeroed out (
Nearest neighbours are found in feature space using Manhattan (L1) distance:
Stage 2 — DTW re-ranking:
Apply DTW only within the
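The two-stage idea can be sketched end to end as below. Several details are assumptions made for illustration only: the feature vector is taken to be low-frequency FFT magnitudes with the DC term zeroed, the final label is a majority vote over the `n_vote` DTW-closest candidates, and the names `fft_features`, `hierarchical_predict`, and `make_curve` are hypothetical. The notebook's actual features and voting rule may differ.

```python
from collections import Counter

import numpy as np


def dtw_distance(x, y):
    """Accumulated DTW cost with squared pointwise differences."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = (x[i - 1] - y[j - 1]) ** 2 + step
    return acc[n, m]


def fft_features(curve, n_coeffs=8):
    """Assumed compact features: low-frequency FFT magnitudes, DC zeroed."""
    spectrum = np.abs(np.fft.rfft(curve))
    spectrum[0] = 0.0          # ignore mean brightness
    return spectrum[:n_coeffs]


def hierarchical_predict(query, train_curves, train_labels, k=10, n_vote=3):
    feats = np.array([fft_features(c) for c in train_curves])
    # Stage 1: cheap Manhattan (L1) filter in feature space.
    l1 = np.abs(feats - fft_features(query)).sum(axis=1)
    candidates = np.argsort(l1)[:k]
    # Stage 2: expensive DTW, but only on the k surviving candidates.
    dtw = np.array([dtw_distance(query, train_curves[i]) for i in candidates])
    best = candidates[np.argsort(dtw)[:n_vote]]
    votes = Counter(int(train_labels[i]) for i in best)
    return votes.most_common(1)[0][0]


rng = np.random.default_rng(0)


def make_curve(has_transit, shift, length=64):
    """Synthetic light curve: noise plus an optional periodic dip."""
    flux = 1.0 + rng.normal(0.0, 0.001, length)
    if has_transit:
        flux[(np.arange(length) + shift) % 16 < 2] -= 0.02
    return flux


train_curves = np.array([make_curve(i % 2 == 1, shift=i % 16) for i in range(40)])
train_labels = np.array([i % 2 for i in range(40)])

pred_host = hierarchical_predict(make_curve(True, shift=6), train_curves, train_labels)
pred_flat = hierarchical_predict(make_curve(False, shift=0), train_curves, train_labels)
print(pred_host, pred_flat)
```

The design point is that DTW runs only `k` times per query instead of once per training curve, so the expensive comparison is confined to a shortlist produced by the cheap feature-space search.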
| Model | Class | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline KNN (k=20, Manhattan) | Non-host | 0.96 | 0.97 | 0.97 |
| | Exoplanet | 0.75 | 0.65 | 0.70 |
| | Overall accuracy: 0.94 | | | |
| Hierarchical KNN+DTW (k=10) | Non-host | 0.97 | 0.98 | 0.98 |
| | Exoplanet | 0.82 | 0.75 | 0.78 |
| | Overall accuracy: 0.95 | | | |
DTW post-filtering improved exoplanet precision by 7pp (0.75 to 0.82) and recall by 10pp (0.65 to 0.75) without sacrificing non-host classification.
A light curve
This stacks all transit events on top of each other, producing a folded curve with a clearly visible dip. Each folded light curve is rendered as a 128×128 grayscale image, which is the input to the CNN.
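A minimal sketch of the folding and rendering steps is shown below. It assumes the period and a reference transit time `t0` are already known, centres the transit at phase 0.5, and rasterises the folded points into a binary 128×128 array. The rendering scheme and the names `phase_fold` and `render_image` are assumptions for illustration; the notebook may draw the images differently, for example with a plotting library.

```python
import numpy as np


def phase_fold(time, flux, period, t0=0.0):
    """Map each timestamp to an orbital phase in [0, 1), transit at 0.5."""
    phase = ((time - t0) / period + 0.5) % 1.0
    order = np.argsort(phase)
    return phase[order], flux[order]


def render_image(phase, flux, size=128):
    """Rasterise (phase, flux) points into a size x size grayscale array.

    Assumes the flux is not constant, so the vertical scale is well defined.
    """
    lo, hi = flux.min(), flux.max()
    cols = np.minimum((phase * size).astype(int), size - 1)
    # Row 0 is the brightest flux, the last row is the deepest point of the dip.
    rows = np.minimum(((hi - flux) / (hi - lo) * size).astype(int), size - 1)
    image = np.zeros((size, size), dtype=np.float32)
    image[rows, cols] = 1.0
    return image


# Synthetic light curve: 90 days, a 1% dip every 3.5 days.
period, t0 = 3.5, 1.2
time = np.arange(0.0, 90.0, 0.02)
in_transit = np.abs((time - t0 + period / 2) % period - period / 2) < 0.1
flux = np.where(in_transit, 0.99, 1.0)

phase, folded = phase_fold(time, flux, period, t0)
image = render_image(phase, folded)
print(image.shape, int(image.sum()))
```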
A 2D convolutional layer applies learned filters
Each filter learns to detect a local spatial pattern — e.g. the characteristic symmetric dip shape of a planetary transit. Successive layers with increasing filter counts (32 → 64 → 128) learn increasingly abstract representations.
ReLU activation introduces non-linearity after each convolution: $\mathrm{ReLU}(x) = \max(0, x)$
Max pooling reduces spatial dimensionality and provides local translation invariance:
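The three operations above can be written out directly in NumPy for a single channel. This is an illustrative sketch, not the framework implementation: it assumes 'valid' padding and stride 1 for the convolution, and the hand-picked kernel stands in for a filter that a network would learn.

```python
import numpy as np


def conv2d_valid(image, kernel, bias=0.0):
    """Single-channel 'valid' cross-correlation, as computed by a CNN layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out


def relu(x):
    return np.maximum(0.0, x)


def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# A bright image with one dark column, loosely mimicking a transit dip.
image = np.ones((6, 6))
image[:, 3] = 0.0
# Hand-made filter that responds when the centre column is darker than its neighbours.
kernel = np.array([[1.0, -2.0, 1.0]] * 3)

feature_map = relu(conv2d_valid(image, kernel))
pooled = max_pool(feature_map)
print(feature_map)
print(pooled)
```

The filter fires only where the window is centred on the dark column, ReLU discards the negative side responses, and pooling keeps the strong response while halving each spatial dimension.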
Input: 128×128×1 grayscale image
→ Conv2D(32, 3×3) + ReLU + MaxPool(2×2)
→ Conv2D(64, 3×3) + ReLU + MaxPool(2×2)
→ Conv2D(128, 3×3) + ReLU + MaxPool(2×2)
→ Flatten
→ Dense(64, ReLU)
→ Dropout(0.4)
→ Dense(1, Sigmoid)
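To see how the tensor shrinks through this stack, the sketch below walks the layer list and counts trainable parameters. It assumes stride-1 convolutions with 'valid' padding (the Keras default) and non-overlapping 2×2 pooling; the source does not state the padding, so the numbers would change with 'same' padding.

```python
def conv_out(size, kernel=3):
    """Output width of a stride-1 'valid' convolution."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output width of non-overlapping pooling."""
    return size // pool


size, channels, total_params = 128, 1, 0
for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias.
    total_params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after Conv2D({filters}) + MaxPool: {size}x{size}x{channels}")

flat = size * size * channels
total_params += (flat + 1) * 64      # Dense(64)
total_params += (64 + 1) * 1         # Dense(1)
print(f"flattened features: {flat}, trainable parameters: {total_params}")
```

Under these assumptions the feature maps go 63×63×32, 30×30×64, 14×14×128, and almost all of the parameters sit in the first dense layer.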
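Loss function: binary crossentropy over $N$ examples with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$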
Optimised with Adam. Early stopping monitors validation loss with patience=3.
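The loss and the stopping rule can be illustrated without a deep learning framework. The sketch below computes mean binary crossentropy with clipping for numerical safety, and mimics patience-based early stopping on a list of validation losses. The helper names and the example loss values are hypothetical; in practice the framework's own loss and early-stopping callback would be used.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary crossentropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best, wait = np.inf, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch
    return None


uninformed = binary_crossentropy([1, 0], [0.5, 0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
stop_at = early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.81, 0.7])
print(uninformed, confident, stop_at)
```

Note that in the example the validation loss would have improved again at the last epoch, which training never reaches; a small patience trades that risk for shorter runs and less overfitting.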
Confirmed exoplanet hosts are a small minority of all Kepler targets. Three measures were applied:
- Positive class (exoplanet hosts) oversampled to balance training distribution
- Random temporal shifts applied as data augmentation to improve generalisation
- Classification threshold tuned to 0.67 (rather than the default 0.50) based on the precision-recall tradeoff, selecting the F1-equivalent operating point on the ROC curve (AUC: 0.97)
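The three measures above can be sketched as small NumPy helpers. Everything here is an illustrative assumption rather than the notebook's code: positives are oversampled by drawing with replacement until the classes match, the temporal shift is implemented as a circular roll along the phase axis of the image, and the threshold is chosen by sweeping a grid and maximising F1 on held-out scores.

```python
import numpy as np


def oversample_positives(images, labels, rng):
    """Duplicate positive examples (with replacement) until classes balance."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    extra = rng.choice(pos, size=max(len(neg) - len(pos), 0), replace=True)
    idx = rng.permutation(np.concatenate([neg, pos, extra]))
    return images[idx], labels[idx]


def random_phase_shift(image, rng, max_shift=8):
    """Circularly shift an image along the horizontal (phase) axis."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(image, shift, axis=1)


def best_f1_threshold(labels, scores, grid=None):
    """Sweep thresholds and return the (threshold, F1) pair with the highest F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 91)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


rng = np.random.default_rng(0)
images = rng.random((10, 4, 4))
labels = np.array([0] * 8 + [1] * 2)
bal_images, bal_labels = oversample_positives(images, labels, rng)
shifted = random_phase_shift(images[0], rng)

val_labels = np.array([0, 0, 0, 0, 1, 1])
val_scores = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.9])
threshold, f1 = best_f1_threshold(val_labels, val_scores)
print(bal_labels.sum(), len(bal_labels), threshold, f1)
```

In the toy validation set one non-host scores 0.6, so the default threshold of 0.50 would produce a false positive, while a threshold just above 0.6 separates the classes cleanly. Oversampling should be applied to the training split only, so duplicated positives never leak into validation or test data.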
Test accuracy: 99.20%
AUC score: 0.97
Optimal threshold: 0.67
Training images: 5,657 (3 corrupted files skipped)
Remaining misclassifications arise from:
- Eclipsing binaries: produce periodic transit-like dips photometrically indistinguishable from planetary transits
- Residual PRF systematics: instrumental effects not fully removed by detrending, implicitly learned but occasionally misleading
├── DTW_Kepler.ipynb # Hierarchical KNN+DTW classifier
├── cnn_2d_model.ipynb # 2D CNN training and evaluation
├── cnn_2d_image.ipynb # Phase-folding and image generation
├── END_TERM_REPORT.pdf # Full mathematical report with derivations
├── requirements.txt
└── README.md
git clone https://github.com/piyarshah/kepler-exoplanet-detection
cd kepler-exoplanet-detection
pip install -r requirements.txt
Light curve data sourced from the Kepler mission via the MAST archive. Additional labels from the NASA Exoplanet Archive.
The CNN training set consists of 5,660 phase-folded light curve images labelled by KOI disposition (0 = non-host, 1 = confirmed exoplanet host).
- Shallue & Vanderburg (2018) — AstroNet: dual-branch CNN for Kepler exoplanet detection
- Tiensuu et al. (2019) — Image encoding of light curves for CNN classification
- Malik et al. (2020) — Cross-mission pipeline; 98% accuracy on TESS, AUC 0.948 on Kepler
- Priyadarshini & Puri (2021) — Stacked CNN ensemble, 99.62% accuracy on Kepler
- Feinstein et al. (2019) — Lightkurve library
- Bishop (2006) — Pattern Recognition and Machine Learning
Full bibliography in END_TERM_REPORT.pdf.