
Conversation

@bzantium (Contributor) commented Jun 3, 2025

Description

The current data loading pipeline in MaxText that uses Grain (configured via dataset_type: grain and related flags, e.g. grain_train_files) primarily supports loading data from a single source path or dataset specification. This feature request proposes enhancing the pipeline to natively load, blend, and process records from multiple distinct sources simultaneously, leveraging Grain's underlying capabilities such as MapDataset.mix. This would allow complex data mixtures to be defined and handled efficiently directly within the MaxText training framework.
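For illustration, the weighted-blending behavior that MapDataset.mix provides can be sketched in pure Python. The function name, the credit-based sampling scheme, and the sample data below are assumptions for exposition only, not Grain's real API or MaxText's implementation:

```python
# Pure-Python sketch of weighted dataset mixing, in the spirit of Grain's
# MapDataset.mix. Everything here is an illustrative assumption, not the
# actual MaxText/Grain code.

def mix(sources, weights):
    """Deterministically interleave `sources` in proportion to `weights`."""
    fractions = [w / sum(weights) for w in weights]
    credits = [0.0] * len(sources)
    iters = [iter(s) for s in sources]
    while True:
        # Accrue credit to every source, then emit from the source
        # that is currently owed the most samples.
        for i, f in enumerate(fractions):
            credits[i] += f
        i = max(range(len(sources)), key=lambda j: credits[j])
        credits[i] -= 1.0
        try:
            yield next(iters[i])
        except StopIteration:
            return  # one possible policy: stop when any source runs dry

web = (f"web_{i}" for i in range(100))
code = (f"code_{i}" for i in range(100))
stream = mix([web, code], weights=[3, 1])
first8 = [next(stream) for _ in range(8)]
# first8 holds 6 web records and 2 code records -- a 3:1 blend.
```

The long-run ratio of emitted records matches the requested weights, which is the property a multi-source pipeline needs for defining data mixtures.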

Motivation / Use Case:
Creating optimal data mixtures by blending multiple corpora (e.g., web text, books, code) with specific sampling ratios is fundamental for LLM pre-training. Fine-tuning also often requires mixing pre-training data with smaller, specialized datasets.

Currently, users of the MaxText Grain pipeline need to perform potentially complex and storage-intensive offline pre-processing to merge diverse datasets into a single input source consumable by MaxText. This lacks flexibility and makes experimenting with different data mixture configurations cumbersome.

Directly supporting multi-source blending within the MaxText pipeline would significantly streamline these common LLM workflows by allowing users to:

Define complex pre-training data mixtures (e.g., {C4: 60%, GitHub: 20%, Books: 20%}) via MaxText configuration.
Easily blend pre-training and fine-tuning datasets with controlled ratios during fine-tuning runs.
Manage multilingual datasets more effectively.
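As a purely hypothetical sketch of what such a configuration might look like (the actual key names, weight syntax, and bucket paths are assumptions here; the supported patterns are defined by this PR in base.yml and the data-input doc):

```yaml
# Hypothetical example only -- real MaxText keys / weight syntax may differ.
dataset_type: grain
grain_train_files: "gs://my-bucket/c4/*.arrayrecord:0.6;gs://my-bucket/github/*.arrayrecord:0.2;gs://my-bucket/books/*.arrayrecord:0.2"
```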

FIXES: #1610

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@aireenmei (Collaborator) left a comment


Thanks for adding this support! Could you also add a comment describing the supported data_file_patterns in base.yml and in the doc https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#using-grain ?

@bzantium (Contributor, Author)

Thanks for the review and comments! I added documentation and test code as well.
cc: @aireenmei @SurbhiJainUSC

@aireenmei (Collaborator)


LGTM in general, just a minor suggestion. Thanks!

@SurbhiJainUSC (Collaborator) left a comment


@bzantium - A few things we would need before merging:

  1. Run bash code_style.sh.
  2. Squash the commits into one.
  3. Check the unit tests:
    [screenshot of unit test results]

@bzantium (Contributor, Author) commented Jun 14, 2025

Thanks for checking! I ran bash code_style.sh and fixed some bugs.

  1. I think we can squash the commits when merging on GitHub.
  2. It is hard to write a unit test for this case because it requires the datasets on GCS to be ready...

Could you help me check [2]? Thanks a lot!

@SurbhiJainUSC (Collaborator)


The tests are passing now. Thank you!

@copybara-service copybara-service bot merged commit f827a03 into AI-Hypercomputer:main Jun 17, 2025
32 of 37 checks passed
@bzantium bzantium deleted the feature/#1610 branch June 18, 2025 01:16

Successfully merging this pull request may close these issues.

Support Multi-Source Blending in Grain Data Pipeline
3 participants