[C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files #35331

@ei-grad

Description

Summary

Currently, the pyarrow.parquet.RowGroupMetaData class does not expose the sorting_columns information available in the Parquet format's RowGroup struct. This information is useful for users who need to understand the local sorting order of columns within each RowGroup. It would be beneficial to expose this information in the RowGroupMetaData class.

Details

The Parquet format includes an optional sorting_columns field in the RowGroup struct, which stores information about the sorting order of columns within the RowGroup. This information is defined in the SortingColumn struct in the parquet.thrift file:

struct SortingColumn {
  1: required i32 column_idx;
  2: required bool descending;
  3: required bool nulls_first;
}

In the RowGroup struct, the sorting_columns field is defined as follows:

struct RowGroup {
  1: required list<ColumnChunk> columns;
  2: required i64 total_byte_size;
  3: required i64 num_rows;
  4: optional list<SortingColumn> sorting_columns;
}

However, the pyarrow.parquet.RowGroupMetaData class does not expose this information. As a result, users cannot access the local sorting information of columns within RowGroups.

Proposal

I propose adding a new method or property in the RowGroupMetaData class to expose the sorting_columns information. This could be implemented as a new method, such as get_sorting_columns(), or as a property, such as sorting_columns. The output should include the column index, sorting order (ascending or descending), and whether null values appear first or last in the sorted order.
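As a rough illustration of the shape such an output could take, here is a minimal sketch. The `SortingColumn` dataclass and the `describe` helper are hypothetical names invented for this example and are not existing pyarrow API; the three fields simply mirror the Thrift struct quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SortingColumn:
    """Hypothetical Python mirror of the Thrift SortingColumn struct."""
    column_index: int   # ordinal position of the column within the row group
    descending: bool    # True if the column is sorted in descending order
    nulls_first: bool   # True if nulls come before non-null values


def describe(sorting_columns: List[SortingColumn],
             column_names: List[str]) -> List[str]:
    """Render each sort key as a readable string (illustrative helper)."""
    out = []
    for sc in sorting_columns:
        direction = "DESC" if sc.descending else "ASC"
        nulls = "NULLS FIRST" if sc.nulls_first else "NULLS LAST"
        out.append(f"{column_names[sc.column_index]} {direction} {nulls}")
    return out


# Example: a row group sorted by "ts" ascending, then by "id" descending.
keys = [SortingColumn(1, False, False), SortingColumn(0, True, True)]
print(describe(keys, ["id", "ts"]))
```

A property returning a list of such objects (possibly empty or None when the writer recorded no sort order) would keep the Python API close to the underlying format.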

Use Case

Users working with sorted Parquet files can benefit from understanding the local sorting order of columns within RowGroups. This information is particularly useful when analyzing large datasets or performing operations that require knowledge of the sort order, such as range queries, filtering, or merging.

By exposing the sorting_columns information in the RowGroupMetaData class, users can more easily work with sorted Parquet files and perform advanced data processing operations.
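To make the range-query case concrete, the sketch below shows how a reader might use the sort information once it is available. It works on a plain Python list standing in for one column of a row group; `rows_in_range` and its `sorted_ascending` flag are hypothetical and only illustrate the idea. When the column is known to be sorted ascending with no nulls, a binary search replaces the full scan.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional


def rows_in_range(values: List[Optional[int]], lo: int, hi: int,
                  sorted_ascending: bool) -> List[int]:
    """Return the row indices whose value lies in the closed range [lo, hi].

    `sorted_ascending` stands in for what the sorting metadata would tell us:
    True only if the column is sorted ascending and contains no nulls.
    """
    if sorted_ascending:
        # Known sort order: two binary searches locate the matching run.
        start = bisect_left(values, lo)
        stop = bisect_right(values, hi)
        return list(range(start, stop))
    # Unknown sort order: every row has to be inspected.
    return [i for i, v in enumerate(values)
            if v is not None and lo <= v <= hi]


column = [1, 3, 3, 5, 8, 13]
print(rows_in_range(column, 3, 8, sorted_ascending=True))
```

Both branches return the same rows for sorted input; the difference is that the first one touches only O(log n) values, which is the kind of saving the metadata would enable.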

Component(s)

Python
