AI Voice Clone is an open-source project aimed at developing advanced voice synthesis and cloning technologies using artificial intelligence. The project focuses on creating realistic voice replicas from audio samples, enabling applications in entertainment, accessibility, education, and more.
- Programming Language: Python 3.8+
- Deep Learning Framework: PyTorch
- Audio Processing: Torchaudio, Librosa
- Voice Synthesis: Tacotron2, WaveGlow, or similar TTS models
- Machine Learning: Scikit-learn for preprocessing
- Web Framework (future): FastAPI or Flask for API deployment
- Basic voice recording and preprocessing
- Audio feature extraction (MFCC, spectrograms)
- Model training pipeline setup
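As a rough illustration of the spectrogram half of the feature-extraction step, the sketch below frames a signal, applies a Hann window, and takes DFT magnitudes. It is a dependency-free approximation written for this README, not the project's `FeatureExtractor`. The function names, frame length, and hop length are assumptions, and it does not compute MFCCs. In practice the listed libraries (Torchaudio, Librosa) would normally handle this far more efficiently.

```python
import cmath
import math


def frame_signal(samples, frame_length, hop_length):
    """Split a 1-D sample list into overlapping frames (no padding)."""
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


def hann_window(length):
    """Periodic Hann window, a common choice for short-time analysis."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_length=64, hop_length=32):
    """Return one list of bin magnitudes per frame (hypothetical defaults)."""
    window = hann_window(frame_length)
    return [
        magnitude_spectrum([s * w for s, w in zip(frame, window)])
        for frame in frame_signal(samples, frame_length, hop_length)
    ]


# Example: a 1 kHz tone sampled at 8 kHz. With 64-sample frames each bin is
# 8000 / 64 = 125 Hz wide, so the energy should concentrate in bin 1000 / 125 = 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time frame and each column one frequency bin, which is the layout a mel filterbank or MFCC stage would typically consume next.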
- Phase 1: Implement basic voice cloning with pre-trained models
- Phase 2: Custom model training from user audio samples
- Phase 3: Real-time voice conversion
- Phase 4: Multi-speaker voice cloning
- Phase 5: Web interface and API deployment
The model architecture is currently a compact encoder/decoder skeleton. Upcoming roadmap goals build on Tacotron2-style architectures, attention mechanisms, and multi-speaker conditioning, so these are the most useful topics to study before contributing to the model.
- Data collection and preprocessing pipeline
- Neural network architectures for voice synthesis
- Training scripts and utilities
- Evaluation metrics and testing framework
- Deployment and inference scripts
git clone https://github.com/PtiCalin/ai_voice-clone.git
cd ai_voice-clone
pip install -r requirements.txt

# Record a voice sample
python ai_voice-clone/main.py --mode record --duration 5 --output my_voice.wav
# Train a model with your voice
python ai_voice-clone/main.py --mode train --input my_voice.wav
# Generate cloned voice
python ai_voice-clone/main.py --mode clone --input my_voice.wav --text "Hello, this is my cloned voice!" --output cloned_voice.wav

from ai_voice_clone import VoiceCloner, AudioInput, FeatureExtractor, Trainer, InferenceEngine, Config
# Initialize components
config = Config()
config.load()
audio_input = AudioInput()
feature_extractor = FeatureExtractor(config)
model = VoiceCloner(config)
trainer = Trainer(model, feature_extractor, config)
inference_engine = InferenceEngine(model, feature_extractor, config)
# Record or load audio
audio_data = audio_input.record_audio(duration=5)
# or load an existing recording instead (also returns the sample rate)
audio_data, sr = audio_input.load_audio("path/to/audio.wav")
# Train model (if needed)
trainer.train("path/to/training/audio.wav")
# Generate cloned voice
cloned_audio = inference_engine.generate_voice("Hello, world!", "path/to/reference/audio.wav")
# Save result
audio_input.save_audio(cloned_audio, "output.wav")

ai_voice-clone/
├── main.py # CLI entry point
├── config.py # Configuration management
├── audio_input.py # Audio recording and loading
├── feature_extraction.py # Audio feature extraction
├── model.py # Neural network models
├── training.py # Model training logic
├── inference.py # Voice generation
├── __init__.py # Package initialization
├── requirements.txt # Dependencies
└── ...
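The flags used in the command-line examples earlier (`--mode`, `--duration`, `--input`, `--text`, `--output`) suggest how the CLI entry point in `main.py` might parse its arguments. The following is a minimal sketch with `argparse`, not the project's actual code. The helper names `build_parser` and `validate`, the default duration, and the rules about which flags each mode requires are all assumptions inferred from those examples.

```python
import argparse


def build_parser():
    """Build a parser covering the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Voice cloning CLI (illustrative sketch)"
    )
    parser.add_argument("--mode", choices=["record", "train", "clone"], required=True)
    parser.add_argument("--duration", type=float, default=5.0,
                        help="recording length in seconds (assumed default)")
    parser.add_argument("--input", help="path to an input audio file")
    parser.add_argument("--text", help="text to synthesize in clone mode")
    parser.add_argument("--output", help="path for the audio file to write")
    return parser


def validate(args):
    """Return a list of problems; empty when the flag combination is usable.

    The per-mode requirements below are assumptions, not documented behavior.
    """
    problems = []
    if args.mode == "record" and not args.output:
        problems.append("--output is required in record mode")
    if args.mode in ("train", "clone") and not args.input:
        problems.append("--input is required in train and clone modes")
    if args.mode == "clone" and not args.text:
        problems.append("--text is required in clone mode")
    return problems


# Parse the same flags as the clone example from the usage section.
args = build_parser().parse_args([
    "--mode", "clone",
    "--input", "my_voice.wav",
    "--text", "Hello, this is my cloned voice!",
    "--output", "cloned_voice.wav",
])
incomplete = build_parser().parse_args(["--mode", "clone"])
```

Keeping the per-mode checks in a separate function, rather than in the parser itself, makes them easy to unit test without invoking the command line.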
The system uses a YAML configuration file (config.yaml) with the following main sections:
- audio: Audio processing parameters
- features: Feature extraction settings
- model: Neural network architecture
- training: Training hyperparameters
- inference: Generation parameters
- vocoder: Mel-to-audio backend selection (Griffin-Lim, HiFi-GAN, or WaveGlow)
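One way to catch configuration mistakes early is to check a parsed config mapping against the sections listed above. The sketch below assumes the YAML file has already been loaded into a plain Python dictionary. The function name `check_config` is hypothetical, treating all six sections as mandatory is an assumption, and only the backend key `hifigan` comes from the project's example; the spellings `griffin_lim` and `waveglow` are guesses.

```python
# Section names taken from the list above; requiring all six is an assumption.
REQUIRED_SECTIONS = ("audio", "features", "model", "training", "inference", "vocoder")

# "hifigan" appears in the project's example; the other two keys are assumed.
KNOWN_VOCODER_BACKENDS = ("griffin_lim", "hifigan", "waveglow")


def check_config(config):
    """Return a list of problems found in a parsed config mapping."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or non-mapping section: {section}")
    vocoder = config.get("vocoder")
    if isinstance(vocoder, dict):
        backend = vocoder.get("backend")
        if backend not in KNOWN_VOCODER_BACKENDS:
            problems.append(f"unknown vocoder backend: {backend!r}")
    return problems


# Section contents are left empty here because their keys are not documented.
example = {
    "audio": {},
    "features": {},
    "model": {},
    "training": {},
    "inference": {},
    "vocoder": {"backend": "hifigan"},
}
problems = check_config(example)
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in the file at once.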
Example vocoder configuration:
vocoder:
  backend: hifigan
  hifigan:
    model_path: /path/to/hifigan-torchscript.pt

See CONTRIBUTION.md for guidelines.
This project is licensed under the MIT License - see LICENSE.md for details.
See TESTING.md for testing procedures, automated test expansion guidance, and evaluation metric reporting.
See UPDATE-LOG.md for version history.