Support Multi-Source Blending in Grain Data Pipeline #1801
Conversation
Thanks for adding this support! Could you also add a comment on the supported data_file_patterns in base.yml and in the doc https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#using-grain ?
Thanks for checking and for the comments! I added documentation and testing code as well.
LGTM in general, just a minor suggestion. Thanks!
@bzantium - A few things that we would require before merging:
Thanks for checking! I ran
Could you help me check [2]? Thanks a lot!
The tests are passing now. Thank you!
Merged commit f827a03 into AI-Hypercomputer:main
Description
The current Grain-based data loading pipeline in MaxText (configured via dataset_type: grain and related flags/config) primarily supports loading data from a single source path or dataset specification (e.g., grain_train_files). This feature request proposes enhancing the existing pipeline to natively support loading, blending, and processing data records from multiple distinct sources simultaneously. This would allow complex data mixtures to be defined and handled efficiently directly within the MaxText training framework, leveraging Grain's underlying capabilities such as MapDataset.mix.
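For context, here is a minimal sketch of the Grain blending primitive this proposal builds on (MapDataset.mix). The two in-memory lists are placeholders for real data sources (e.g., ArrayRecord files), and the exact import path may vary between Grain releases; this is an illustration, not the implementation added by this PR.

```python
# Minimal sketch of Grain's multi-source blending primitive (MapDataset.mix).
# The in-memory lists stand in for real data sources such as ArrayRecord files.
import grain  # in some releases the same API is exposed under `grain.python`

web_text = grain.MapDataset.source([f"web-{i}" for i in range(10)])
code = grain.MapDataset.source([f"code-{i}" for i in range(10)])

# Blend the two sources with 80/20 sampling weights.
mixed = grain.MapDataset.mix([web_text, code], weights=[0.8, 0.2])

# Records from both sources are interleaved according to the weights.
for i in range(5):
    print(mixed[i])
```

A multi-source MaxText pipeline would expose these weights through configuration, so that changing a mixture ratio does not require re-materializing any data offline.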
Motivation / Use Case:
Creating optimal data mixtures by blending multiple corpora (e.g., web text, books, code) with specific sampling ratios is fundamental for LLM pre-training. Fine-tuning also often requires mixing pre-training data with smaller, specialized datasets.
Currently, users of the MaxText Grain pipeline need to perform potentially complex and storage-intensive offline pre-processing to merge diverse datasets into a single input source consumable by MaxText. This lacks flexibility and makes experimenting with different data mixture configurations cumbersome.
Directly supporting multi-source blending within the MaxText pipeline would significantly streamline these common LLM workflows by allowing users to:
Define complex pre-training data mixtures (e.g., {C4: 60%, GitHub: 20%, Books: 20%}) via MaxText configuration.
Easily blend pre-training and fine-tuning datasets with controlled ratios during fine-tuning runs.
Manage multilingual datasets more effectively.
FIXES: #1610
Tests
Please describe how you tested this change, and include any instructions and/or commands to reproduce.
Checklist
Before submitting this PR, please make sure (put X in square brackets):