Support Multi-Source Blending in Grain Data Pipeline #1801
Conversation
Thanks for adding this support! Could you also add a comment on the supported data_file_patterns in base.yml and in the doc https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#using-grain ?
Thanks for checking and for the comments! I added documentation and testing code as well.
LGTM in general, just a minor suggestion. Thanks!
@bzantium - A few things that we would require before merging:
Thanks for checking! I ran
Could you help me check [2]? Thanks a lot!
The tests are passing now. Thank you!
Merged commit f827a03 into AI-Hypercomputer:main
Description
The current Grain-based data loading pipeline in MaxText (configured via dataset_type: grain and related flags/config) primarily supports loading data from a single source path or dataset specification (e.g., grain_train_files). This feature request proposes enhancing the existing pipeline to natively support loading, blending, and processing data records from multiple distinct sources simultaneously. This would allow complex data mixtures to be defined and handled efficiently directly within the MaxText training framework, leveraging Grain's underlying capabilities such as MapDataset.mix.
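For context, here is a minimal sketch of the Grain blending primitive this proposal builds on (MapDataset.mix). The two in-memory lists are placeholders for real data sources (e.g., ArrayRecord files), and the exact import path may vary between Grain releases; this is an illustration, not the implementation added by this PR.

```python
# Minimal sketch of Grain's multi-source blending primitive (MapDataset.mix).
# The in-memory lists stand in for real data sources such as ArrayRecord files.
import grain  # in some releases the same API is exposed under `grain.python`

web_text = grain.MapDataset.source([f"web-{i}" for i in range(10)])
code = grain.MapDataset.source([f"code-{i}" for i in range(10)])

# Blend the two sources with 80/20 sampling weights.
mixed = grain.MapDataset.mix([web_text, code], weights=[0.8, 0.2])

# Records from both sources are interleaved according to the weights.
for i in range(5):
    print(mixed[i])
```

A multi-source MaxText pipeline would expose these weights through configuration, so that changing a mixture ratio does not require re-materializing any data offline.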
Motivation / Use Case:
Creating optimal data mixtures by blending multiple corpora (e.g., web text, books, code) with specific sampling ratios is fundamental for LLM pre-training. Fine-tuning also often requires mixing pre-training data with smaller, specialized datasets.
Currently, users of the MaxText Grain pipeline need to perform potentially complex and storage-intensive offline pre-processing to merge diverse datasets into a single input source consumable by MaxText. This lacks flexibility and makes experimenting with different data mixture configurations cumbersome.
Directly supporting multi-source blending within the MaxText pipeline would significantly streamline these common LLM workflows by allowing users to:
Define complex pre-training data mixtures (e.g., {C4: 60%, GitHub: 20%, Books: 20%}) via MaxText configuration.
Easily blend pre-training and fine-tuning datasets with controlled ratios during fine-tuning runs.
Manage multilingual datasets more effectively.
FIXES: #1610
Tests
Please describe how you tested this change, and include any instructions and/or commands to reproduce.
Checklist
Before submitting this PR, please make sure (put X in square brackets):