Skip to content

feat: V3 DeleteFile fields end-to-end through Java + protocol + bridge (#27789)#27789

Open
apurva-meta wants to merge 1 commit into
prestodb:masterfrom
apurva-meta:export-D104959645
Open

feat: V3 DeleteFile fields end-to-end through Java + protocol + bridge (#27789)#27789
apurva-meta wants to merge 1 commit into
prestodb:masterfrom
apurva-meta:export-D104959645

Conversation

@apurva-meta
Copy link
Copy Markdown
Contributor

@apurva-meta apurva-meta commented May 13, 2026

Summary:

Stacked on D104959593 (which added V3 typed fields to
velox::IcebergDeleteFile and the dataSequenceNumber parameter on
the production HiveIcebergSplit constructor). This diff plumbs all
of those fields through the Java side and the Prestissimo bridge so
the typed Velox structs are populated end-to-end.

Adds:

  • dataSequenceNumber, referencedDataFile, contentOffset,
    contentSize on presto-iceberg/.../delete/DeleteFile.java
  • The same four fields on the C++ protocol struct
    presto::protocol::iceberg::DeleteFile
  • All bridge wiring in
    presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp:
    • Forwards the four DeleteFile fields into the Velox
      IcebergDeleteFile struct (fields added in D104959593)
    • Forwards protocol::iceberg::IcebergSplit::dataSequenceNumber
      into the Velox HiveIcebergSplit ctor (parameter added in
      D104959593)
    • Populates infoColumns[$first_row_id] from
      icebergSplit->firstRowId alongside the existing
      $data_sequence_number entry, so the V3 _row_id system
      column finally returns non-NULL values in production
    • Uses the constants from IcebergMetadataColumns.h instead of
      raw string literals (no behavior change beyond removing
      duplication)
    • Uses folly::to<std::string> in place of std::to_string
      (Android-portable per Meta C++ convention)
    • Routes the protocol pointer through a const-reference local
      after the VELOX_CHECK_NOT_NULL so clang-tidy
      bugprone-unchecked-pointer-access can prove null-safety

Why these are needed

  • Equality-delete conflict resolution (V2+ spec): without the typed
    dataSequenceNumber plumbed all the way from manifest -> protocol
    -> bridge -> Velox struct, shouldSkipBySequenceNumber() always
    evaluates against 0 and the filter never fires -- equality deletes
    can over-apply to rows from later snapshots that share the same
    equality key.

  • V3 _row_id system column: without $first_row_id in
    infoColumns, IcebergSplitReader::firstRowId_ stays nullopt
    and _row_id returns NULL for every row instead of the per-row
    file-position-based ID.

  • DV blob location: the existing lowerBounds[100] /
    upperBounds[101] smuggling kept the C++ reader functional but
    meant the Velox struct couldn't be reasoned about by anyone who
    didn't know about the secret field IDs. Typed fields make DV
    blob location a first-class concern.

  • referencedDataFile filter: not strictly required while the
    coordinator filters per-split, but adds protection against
    pre-V3-spec compliance bugs in the planner.

Pairing requirement

D104959593 MUST land first, or the Prestissimo bridge will fail to
compile when it tries to pass contentOffset / contentLength /
referencedDataFile (and the trailing dataSequenceNumber ctor
arg) into Velox structs that don't yet declare them.

Differential Revision: D104959645

@apurva-meta apurva-meta requested review from a team, ZacBlanco and hantangwangd as code owners May 13, 2026 09:07
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented May 13, 2026

Reviewer's Guide

Plumbs newly added Iceberg V3 delete file metadata (data sequence number, content offset/size, referenced data file) from the Java Iceberg DeleteFile representation through the Presto native protocol into Velox, uses these typed fields in the C++ DeletionVectorReader and IcebergSplitReader, and adds tests to cover the new code paths and migration away from the old bounds-map hack.

Sequence diagram for plumbed Iceberg V3 DeleteFile metadata

sequenceDiagram
  participant JavaDeleteFile as Java_DeleteFile
  participant ProtoDeleteFile as Proto_DeleteFile
  participant Connector as IcebergPrestoToVeloxConnector
  participant VeloxDeleteFile as Velox_IcebergDeleteFile
  participant SplitReader as IcebergSplitReader
  participant DVReader as DeletionVectorReader

  JavaDeleteFile->>JavaDeleteFile: fromIceberg(deleteFile)
  JavaDeleteFile->>ProtoDeleteFile: new DeleteFile(..., dataSequenceNumber, referencedDataFile, contentOffset, contentSize)

  ProtoDeleteFile->>ProtoDeleteFile: to_json(j, DeleteFile)
  ProtoDeleteFile->>ProtoDeleteFile: from_json(j, DeleteFile)

  ProtoDeleteFile->>Connector: toVeloxSplit(DeleteFile)
  Connector->>VeloxDeleteFile: IcebergDeleteFile(..., dataSequenceNumber, contentOffset, contentSize, referencedDataFile)
  Connector->>Connector: infoColumns[kDataSequenceNumberInfoColumn]
  Connector->>Connector: infoColumns[kFirstRowIdInfoColumn]

  VeloxDeleteFile->>SplitReader: prepareSplit()
  alt [referencedDataFile != split.dataFile]
    SplitReader-->>SplitReader: [skip deletion vector]
  end

  VeloxDeleteFile->>DVReader: [dvFile_.contentOffset, dvFile_.contentLength]
  alt [contentLength == 0]
    DVReader-->>DVReader: [fallback to bounds map]
  end
Loading

File-Level Changes

Change Details Files
Extend Presto Iceberg DeleteFile Java model and native protocol schema to carry V3 delete-file metadata fields.
  • Add dataSequenceNumber, referencedDataFile, contentOffset, and contentSize fields to the Java DeleteFile class, including JSON bindings and fromIceberg() wiring with null-safe defaults.
  • Update the C++ DeleteFile protocol struct to include the same fields and extend to_json/from_json to serialize/deserialize them.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/delete/DeleteFile.java
presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.h
presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp
Wire new delete-file metadata through the Presto-to-Velox Iceberg bridge into Velox IcebergDeleteFile and infoColumns.
  • Pass dataSequenceNumber, contentOffset, contentSize, and referencedDataFile from the protocol DeleteFile into the Velox IcebergDeleteFile constructor.
  • Populate the $first_row_id info column (using IcebergMetadataColumns constants) alongside $data_sequence_number when building HiveIcebergSplit infoColumns, instead of using string literals.
presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp
(Context from description, not fully shown in diff) Update DeletionVectorReader and IcebergSplitReader to prefer typed DV location fields and enforce referencedDataFile matching, and adjust tests accordingly.
  • Make DeletionVectorReader prefer dvFile_.contentOffset/contentLength over legacy bounds-map field IDs, falling back only when contentLength is zero.
  • Add a referencedDataFile guard in IcebergSplitReader::prepareSplit() to skip mismatched deletion vectors, as a belt-and-suspenders safety check.
  • Adjust DeletionVectorReader tests to construct IcebergDeleteFile using the new typed contentOffset/contentLength fields and ensure behavior parity with the old bounds-map path.
  • Regenerate BUCK deps for the DeletionVectorReader test target after the rename/migration.
velox/connectors/hive/iceberg/DeletionVectorReader.cpp
velox/connectors/hive/iceberg/IcebergSplitReader.cpp
velox/connectors/hive/iceberg/tests/DeletionVectorReaderTest.cpp
velox/connectors/hive/iceberg/tests/BUCK

Possibly linked issues

  • #native(Iceberg) V3 DVs: PR wires V3 DeleteFile metadata and DV reader behavior needed for correct Iceberg V3 deletion vector support.
  • #native-iceberg-v3-support: PR implements V3 DeleteFile fields and DV behavior in Prestissimo/Velox, contributing to Iceberg V3 native support

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In IcebergPrestoToVeloxConnector::toVeloxSplit, consider also replacing the hardcoded "$path" key with a constant from IcebergMetadataColumns (or a similar central place) for consistency with the other metadata columns.
  • In DeleteFile (Java), representing referencedDataFile as an empty string when absent may conflate "unset" with a legitimately empty value; consider using Optional<String> or a nullable field in the protocol to preserve the distinction between unset and empty.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `IcebergPrestoToVeloxConnector::toVeloxSplit`, consider also replacing the hardcoded `"$path"` key with a constant from `IcebergMetadataColumns` (or a similar central place) for consistency with the other metadata columns.
- In `DeleteFile` (Java), representing `referencedDataFile` as an empty string when absent may conflate "unset" with a legitimately empty value; consider using `Optional<String>` or a nullable field in the protocol to preserve the distinction between unset and empty.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

apurva-meta added a commit to apurva-meta/presto that referenced this pull request May 13, 2026
…gh bridge + readers (prestodb#27789)

Summary:
Pull Request resolved: prestodb#27789

Stacked on the previous diff (which added the V3 fields to the Java,
C++ protocol, and Velox structs). This diff wires the new fields
through the Prestissimo->Velox bridge into the typed Velox struct,
populates the missing `$first_row_id` info column, switches the
C++ DeletionVectorReader from the bounds-map smuggling hack to the
typed `contentOffset` / `contentLength` fields, and adds a
belt-and-suspenders `referencedDataFile` filter to IcebergSplitReader.

## What this changes

- `presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp`:
  - Forward `deleteFile.dataSequenceNumber`, `.contentOffset`,
    `.contentSize`, and `.referencedDataFile` into the Velox
    `IcebergDeleteFile` constructor. Without this, the typed
    fields stay at their defaults (0 / "") and the V3 reader paths
    silently degrade.
  - Populate `infoColumns[kFirstRowIdInfoColumn]` from
    `icebergSplit->firstRowId` alongside the existing
    `kDataSequenceNumberInfoColumn`. Without this entry,
    `IcebergSplitReader::firstRowId_` is always `nullopt` in production
    and the V3 `_row_id` system column always returns NULL.
  - Use the constants from `IcebergMetadataColumns.h` instead of new
    string literals (no behavior change beyond removing duplication).

- `velox/connectors/hive/iceberg/DeletionVectorReader.cpp`: Prefer the
  typed `dvFile_.contentOffset` / `dvFile_.contentLength` fields over
  the legacy `kDvOffsetFieldId` / `kDvLengthFieldId` bounds-map
  smuggling. Falls back to the bounds map only when `contentLength == 0`,
  giving callers a one-release graceful migration window.

- `velox/connectors/hive/iceberg/IcebergSplitReader.cpp`: In the
  `kDeletionVector` branch of `prepareSplit()`, skip the DV when
  `referencedDataFile` is set and does not equal the split's data file.
  Belt-and-suspenders -- the coordinator already filters per-split, so
  this guard rarely fires in practice but protects against split
  shipping bugs that would otherwise mask wrong rows.

- `velox/connectors/hive/iceberg/tests/DeletionVectorReaderTest.cpp`:
  Migrate `makeDvDeleteFile` to populate `contentOffset` /
  `contentLength` directly on `IcebergDeleteFile` instead of via the
  bounds-map smuggling. Validates that the new typed-field path
  produces the same output as the old bounds-map path.

- `velox/connectors/hive/iceberg/tests/BUCK`: autodeps regenerated the
  `DeletionVectorReader_cpp_lib` -> `DeletionVectorReader` rename
  consequence after the migration touched the test target.

## Why these are needed

- Equality-delete conflict resolution (V2+ spec): without the typed
  `dataSequenceNumber` on the Velox `IcebergDeleteFile`,
  `shouldSkipBySequenceNumber()` always evaluates against 0 and the
  filter never fires -- equality deletes can over-apply to rows from
  *later* snapshots that share the same equality key.
- V3 `_row_id` system column: without `$first_row_id` in `infoColumns`,
  `IcebergSplitReader::firstRowId_` stays `nullopt` and `_row_id`
  returns NULL for every row instead of the per-row file-position-based
  ID.
- DV blob location: the existing `lowerBounds[100]` /
  `upperBounds[101]` smuggling kept the C++ reader functional but
  meant the Velox struct couldn't be reasoned about by anyone who
  didn't know about the secret field IDs. Typed fields make DV blob
  location a first-class concern.
- `referencedDataFile` filter: not strictly required while the
  coordinator filters per-split, but adds protection against
  pre-V3-spec compliance bugs in the planner.

Differential Revision: D104959645
@meta-codesync meta-codesync Bot changed the title feat: [velox][hive-iceberg][iceberg] Plumb V3 DeleteFile fields through bridge + readers feat: [velox][hive-iceberg][iceberg] Plumb V3 DeleteFile fields through bridge + readers (#27789) May 13, 2026
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 13, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

apurva-meta added a commit to apurva-meta/presto that referenced this pull request May 13, 2026
…gh bridge + readers (prestodb#27789)

Summary:
Pull Request resolved: prestodb#27789

Stacked on the previous diff (which added the V3 fields to the Java,
C++ protocol, and Velox structs). This diff wires the new fields
through the Prestissimo->Velox bridge into the typed Velox struct,
populates the missing `$first_row_id` info column, switches the
C++ DeletionVectorReader from the bounds-map smuggling hack to the
typed `contentOffset` / `contentLength` fields, and adds a
belt-and-suspenders `referencedDataFile` filter to IcebergSplitReader.

## What this changes

- `presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp`:
  - Forward `deleteFile.dataSequenceNumber`, `.contentOffset`,
    `.contentSize`, and `.referencedDataFile` into the Velox
    `IcebergDeleteFile` constructor. Without this, the typed
    fields stay at their defaults (0 / "") and the V3 reader paths
    silently degrade.
  - Populate `infoColumns[kFirstRowIdInfoColumn]` from
    `icebergSplit->firstRowId` alongside the existing
    `kDataSequenceNumberInfoColumn`. Without this entry,
    `IcebergSplitReader::firstRowId_` is always `nullopt` in production
    and the V3 `_row_id` system column always returns NULL.
  - Use the constants from `IcebergMetadataColumns.h` instead of new
    string literals (no behavior change beyond removing duplication).

- `velox/connectors/hive/iceberg/DeletionVectorReader.cpp`: Prefer the
  typed `dvFile_.contentOffset` / `dvFile_.contentLength` fields over
  the legacy `kDvOffsetFieldId` / `kDvLengthFieldId` bounds-map
  smuggling. Falls back to the bounds map only when `contentLength == 0`,
  giving callers a one-release graceful migration window.

- `velox/connectors/hive/iceberg/IcebergSplitReader.cpp`: In the
  `kDeletionVector` branch of `prepareSplit()`, skip the DV when
  `referencedDataFile` is set and does not equal the split's data file.
  Belt-and-suspenders -- the coordinator already filters per-split, so
  this guard rarely fires in practice but protects against split
  shipping bugs that would otherwise mask wrong rows.

- `velox/connectors/hive/iceberg/tests/DeletionVectorReaderTest.cpp`:
  Migrate `makeDvDeleteFile` to populate `contentOffset` /
  `contentLength` directly on `IcebergDeleteFile` instead of via the
  bounds-map smuggling. Validates that the new typed-field path
  produces the same output as the old bounds-map path.

- `velox/connectors/hive/iceberg/tests/BUCK`: autodeps regenerated the
  `DeletionVectorReader_cpp_lib` -> `DeletionVectorReader` rename
  consequence after the migration touched the test target.

## Why these are needed

- Equality-delete conflict resolution (V2+ spec): without the typed
  `dataSequenceNumber` on the Velox `IcebergDeleteFile`,
  `shouldSkipBySequenceNumber()` always evaluates against 0 and the
  filter never fires -- equality deletes can over-apply to rows from
  *later* snapshots that share the same equality key.
- V3 `_row_id` system column: without `$first_row_id` in `infoColumns`,
  `IcebergSplitReader::firstRowId_` stays `nullopt` and `_row_id`
  returns NULL for every row instead of the per-row file-position-based
  ID.
- DV blob location: the existing `lowerBounds[100]` /
  `upperBounds[101]` smuggling kept the C++ reader functional but
  meant the Velox struct couldn't be reasoned about by anyone who
  didn't know about the secret field IDs. Typed fields make DV blob
  location a first-class concern.
- `referencedDataFile` filter: not strictly required while the
  coordinator filters per-split, but adds protection against
  pre-V3-spec compliance bugs in the planner.

Differential Revision: D104959645
apurva-meta added a commit to apurva-meta/presto that referenced this pull request May 13, 2026
…gh bridge + readers (prestodb#27789)

Summary:

Stacked on the previous diff (which added the V3 fields to the Java,
C++ protocol, and Velox structs). This diff wires the new fields
through the Prestissimo->Velox bridge into the typed Velox struct,
populates the missing `$first_row_id` info column, switches the
C++ DeletionVectorReader from the bounds-map smuggling hack to the
typed `contentOffset` / `contentLength` fields, and adds a
belt-and-suspenders `referencedDataFile` filter to IcebergSplitReader.

## What this changes

- `presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp`:
  - Forward `deleteFile.dataSequenceNumber`, `.contentOffset`,
    `.contentSize`, and `.referencedDataFile` into the Velox
    `IcebergDeleteFile` constructor. Without this, the typed
    fields stay at their defaults (0 / "") and the V3 reader paths
    silently degrade.
  - Populate `infoColumns[kFirstRowIdInfoColumn]` from
    `icebergSplit->firstRowId` alongside the existing
    `kDataSequenceNumberInfoColumn`. Without this entry,
    `IcebergSplitReader::firstRowId_` is always `nullopt` in production
    and the V3 `_row_id` system column always returns NULL.
  - Use the constants from `IcebergMetadataColumns.h` instead of new
    string literals (no behavior change beyond removing duplication).

- `velox/connectors/hive/iceberg/DeletionVectorReader.cpp`: Prefer the
  typed `dvFile_.contentOffset` / `dvFile_.contentLength` fields over
  the legacy `kDvOffsetFieldId` / `kDvLengthFieldId` bounds-map
  smuggling. Falls back to the bounds map only when `contentLength == 0`,
  giving callers a one-release graceful migration window.

- `velox/connectors/hive/iceberg/IcebergSplitReader.cpp`: In the
  `kDeletionVector` branch of `prepareSplit()`, skip the DV when
  `referencedDataFile` is set and does not equal the split's data file.
  Belt-and-suspenders -- the coordinator already filters per-split, so
  this guard rarely fires in practice but protects against split
  shipping bugs that would otherwise mask wrong rows.

- `velox/connectors/hive/iceberg/tests/DeletionVectorReaderTest.cpp`:
  Migrate `makeDvDeleteFile` to populate `contentOffset` /
  `contentLength` directly on `IcebergDeleteFile` instead of via the
  bounds-map smuggling. Validates that the new typed-field path
  produces the same output as the old bounds-map path.

- `velox/connectors/hive/iceberg/tests/BUCK`: autodeps regenerated the
  `DeletionVectorReader_cpp_lib` -> `DeletionVectorReader` rename
  consequence after the migration touched the test target.

## Why these are needed

- Equality-delete conflict resolution (V2+ spec): without the typed
  `dataSequenceNumber` on the Velox `IcebergDeleteFile`,
  `shouldSkipBySequenceNumber()` always evaluates against 0 and the
  filter never fires -- equality deletes can over-apply to rows from
  *later* snapshots that share the same equality key.
- V3 `_row_id` system column: without `$first_row_id` in `infoColumns`,
  `IcebergSplitReader::firstRowId_` stays `nullopt` and `_row_id`
  returns NULL for every row instead of the per-row file-position-based
  ID.
- DV blob location: the existing `lowerBounds[100]` /
  `upperBounds[101]` smuggling kept the C++ reader functional but
  meant the Velox struct couldn't be reasoned about by anyone who
  didn't know about the secret field IDs. Typed fields make DV blob
  location a first-class concern.
- `referencedDataFile` filter: not strictly required while the
  coordinator filters per-split, but adds protection against
  pre-V3-spec compliance bugs in the planner.

Differential Revision: D104959645
@meta-codesync meta-codesync Bot changed the title feat: [velox][hive-iceberg][iceberg] Plumb V3 DeleteFile fields through bridge + readers (#27789) feat: Plumb V3 DeleteFile fields through bridge + readers [velox][hive-iceberg][iceberg] (#27789) May 13, 2026
apurva-meta added a commit to apurva-meta/presto that referenced this pull request May 13, 2026
…e-iceberg][iceberg] (prestodb#27789)

Summary:

Stacked on the previous diff (which added the V3 fields to the Java,
C++ protocol, and Velox structs). This diff wires the new fields
through the Prestissimo->Velox bridge into the typed Velox struct,
populates the missing `$first_row_id` info column, switches the
C++ DeletionVectorReader from the bounds-map smuggling hack to the
typed `contentOffset` / `contentLength` fields, and adds a
belt-and-suspenders `referencedDataFile` filter to IcebergSplitReader.

## What this changes

- `presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp`:
  - Forward `deleteFile.dataSequenceNumber`, `.contentOffset`,
    `.contentSize`, and `.referencedDataFile` into the Velox
    `IcebergDeleteFile` constructor. Without this, the typed
    fields stay at their defaults (0 / "") and the V3 reader paths
    silently degrade.
  - Populate `infoColumns[kFirstRowIdInfoColumn]` from
    `icebergSplit->firstRowId` alongside the existing
    `kDataSequenceNumberInfoColumn`. Without this entry,
    `IcebergSplitReader::firstRowId_` is always `nullopt` in production
    and the V3 `_row_id` system column always returns NULL.
  - Use the constants from `IcebergMetadataColumns.h` instead of new
    string literals (no behavior change beyond removing duplication).

- `velox/connectors/hive/iceberg/DeletionVectorReader.cpp`: Prefer the
  typed `dvFile_.contentOffset` / `dvFile_.contentLength` fields over
  the legacy `kDvOffsetFieldId` / `kDvLengthFieldId` bounds-map
  smuggling. Falls back to the bounds map only when `contentLength == 0`,
  giving callers a one-release graceful migration window.

- `velox/connectors/hive/iceberg/IcebergSplitReader.cpp`: In the
  `kDeletionVector` branch of `prepareSplit()`, skip the DV when
  `referencedDataFile` is set and does not equal the split's data file.
  Belt-and-suspenders -- the coordinator already filters per-split, so
  this guard rarely fires in practice but protects against split
  shipping bugs that would otherwise mask wrong rows.

- `velox/connectors/hive/iceberg/tests/DeletionVectorReaderTest.cpp`:
  Migrate `makeDvDeleteFile` to populate `contentOffset` /
  `contentLength` directly on `IcebergDeleteFile` instead of via the
  bounds-map smuggling. Validates that the new typed-field path
  produces the same output as the old bounds-map path.

- `velox/connectors/hive/iceberg/tests/BUCK`: autodeps regenerated the
  `DeletionVectorReader_cpp_lib` -> `DeletionVectorReader` rename
  consequence after the migration touched the test target.

## Why these are needed

- Equality-delete conflict resolution (V2+ spec): without the typed
  `dataSequenceNumber` on the Velox `IcebergDeleteFile`,
  `shouldSkipBySequenceNumber()` always evaluates against 0 and the
  filter never fires -- equality deletes can over-apply to rows from
  *later* snapshots that share the same equality key.
- V3 `_row_id` system column: without `$first_row_id` in `infoColumns`,
  `IcebergSplitReader::firstRowId_` stays `nullopt` and `_row_id`
  returns NULL for every row instead of the per-row file-position-based
  ID.
- DV blob location: the existing `lowerBounds[100]` /
  `upperBounds[101]` smuggling kept the C++ reader functional but
  meant the Velox struct couldn't be reasoned about by anyone who
  didn't know about the secret field IDs. Typed fields make DV blob
  location a first-class concern.
- `referencedDataFile` filter: not strictly required while the
  coordinator filters per-split, but adds protection against
  pre-V3-spec compliance bugs in the planner.

=== NO RELEASE NOTES ===

Differential Revision: D104959645
@meta-codesync meta-codesync Bot changed the title feat: Plumb V3 DeleteFile fields through bridge + readers [velox][hive-iceberg][iceberg] (#27789) feat: V3 DeleteFile fields end-to-end through Java + protocol + bridge May 13, 2026
prestodb#27789)

Summary:

Stacked on D104959593 (which added V3 typed fields to
`velox::IcebergDeleteFile` and the `dataSequenceNumber` parameter on
the production `HiveIcebergSplit` constructor).  This diff plumbs all
of those fields through the Java side and the Prestissimo bridge so
the typed Velox structs are populated end-to-end.

Adds:
- `dataSequenceNumber`, `referencedDataFile`, `contentOffset`,
  `contentSize` on `presto-iceberg/.../delete/DeleteFile.java`
- The same four fields on the C++ protocol struct
  `presto::protocol::iceberg::DeleteFile`
- All bridge wiring in
  `presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp`:
  - Forwards the four DeleteFile fields into the Velox
    `IcebergDeleteFile` struct (fields added in D104959593)
  - Forwards `protocol::iceberg::IcebergSplit::dataSequenceNumber`
    into the Velox `HiveIcebergSplit` ctor (parameter added in
    D104959593)
  - Populates `infoColumns[$first_row_id]` from
    `icebergSplit->firstRowId` alongside the existing
    `$data_sequence_number` entry, so the V3 `_row_id` system
    column finally returns non-NULL values in production
  - Uses the constants from `IcebergMetadataColumns.h` instead of
    raw string literals (no behavior change beyond removing
    duplication)
  - Uses `folly::to<std::string>` in place of `std::to_string`
    (Android-portable per Meta C++ convention)
  - Routes the protocol pointer through a const-reference local
    after the VELOX_CHECK_NOT_NULL so clang-tidy
    bugprone-unchecked-pointer-access can prove null-safety

## Why these are needed

- Equality-delete conflict resolution (V2+ spec): without the typed
  `dataSequenceNumber` plumbed all the way from manifest -> protocol
  -> bridge -> Velox struct, `shouldSkipBySequenceNumber()` always
  evaluates against 0 and the filter never fires -- equality deletes
  can over-apply to rows from later snapshots that share the same
  equality key.

- V3 `_row_id` system column: without `$first_row_id` in
  `infoColumns`, `IcebergSplitReader::firstRowId_` stays `nullopt`
  and `_row_id` returns NULL for every row instead of the per-row
  file-position-based ID.

- DV blob location: the existing `lowerBounds[100]` /
  `upperBounds[101]` smuggling kept the C++ reader functional but
  meant the Velox struct couldn't be reasoned about by anyone who
  didn't know about the secret field IDs.  Typed fields make DV
  blob location a first-class concern.

- `referencedDataFile` filter: not strictly required while the
  coordinator filters per-split, but adds protection against
  pre-V3-spec compliance bugs in the planner.

## Pairing requirement

D104959593 MUST land first, or the Prestissimo bridge will fail to
compile when it tries to pass `contentOffset` / `contentLength` /
`referencedDataFile` (and the trailing `dataSequenceNumber` ctor
arg) into Velox structs that don't yet declare them.

Differential Revision: D104959645
@meta-codesync meta-codesync Bot changed the title feat: V3 DeleteFile fields end-to-end through Java + protocol + bridge feat: V3 DeleteFile fields end-to-end through Java + protocol + bridge (#27789) May 14, 2026
// reader VELOX_CHECK_GE-asserts non-negativity if the key is present, so
// emitting -1 would crash every V1/V2 read. Mirrors the Java guard in
// `IcebergPageSourceProvider.java`.
if (icebergSplit.firstRowId >= 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this change in this PR? Its also present in https://github.com/prestodb/presto/pull/27743/changes where row lineage related changes are being made.

@nmahadevuni
Copy link
Copy Markdown
Member

Thanks @apurva-meta for this change. You mentioned changes in D104959593 must precede this PR. Can you link the PR for D104959593 here?

@apurva-meta
Copy link
Copy Markdown
Contributor Author

Thanks @apurva-meta for this change. You mentioned changes in D104959593 must precede this PR. Can you link the PR for D104959593 here?

It has already been merged in velox: facebookincubator/velox#17503

@aditi-pandit
Copy link
Copy Markdown
Contributor

@apurva-meta : Can you check the test failures in prestocpp builds.

Copilot is suggesting
The failure is a compile break in presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp during the Build engine step of .github/workflows/prestocpp-linux-build-and-unit-test.yml at 36c8c14d2e3f4d45d381b33b9b8afcb6208cf4f7.

Failing code:

Error:

  • error: no matching function for call to '...IcebergDeleteFile::IcebergDeleteFile(...)'

Root cause:

  • The PR is constructing velox::connector::hive::iceberg::IcebergDeleteFile with an outdated argument list.
  • The new Velox submodule version changed the IcebergDeleteFile constructor signature, so this call no longer matches.

Solution

Update the construction of IcebergDeleteFile in toVeloxSplit() to match the current Velox API instead of passing the old positional parameter list directly.

Current code:

velox::connector::hive::iceberg::IcebergDeleteFile icebergDeleteFile(
    toVeloxFileContent(deleteFile.content),
    deleteFile.path,
    toVeloxFileFormat(deleteFile.format),
    deleteFile.recordCount,
    deleteFile.fileSizeInBytes,
    std::vector(deleteFile.equalityFieldIds),
    lowerBounds,
    upperBounds,
    deleteFile.dataSequenceNumber,
    deleteFile.contentOffset,
    deleteFile.contentSize,
    deleteFile.referencedDataFile);

Recommended fix

Adjust the constructor call to the updated Velox parameter order / field set. Most likely the break is from one of these:

  • a removed or reordered trailing argument
  • changed optional types for contentOffset, contentSize, or referencedDataFile
  • a move from positional construction to aggregate/struct-style initialization

A robust fix is to switch to named/aggregate initialization if IcebergDeleteFile is now an aggregate in Velox. For example:

velox::connector::hive::iceberg::IcebergDeleteFile icebergDeleteFile{
    .content = toVeloxFileContent(deleteFile.content),
    .path = deleteFile.path,
    .format = toVeloxFileFormat(deleteFile.format),
    .recordCount = deleteFile.recordCount,
    .fileSizeInBytes = deleteFile.fileSizeInBytes,
    .equalityFieldIds = std::vector<int32_t>(
        deleteFile.equalityFieldIds.begin(),
        deleteFile.equalityFieldIds.end()),
    .lowerBounds = std::move(lowerBounds),
    .upperBounds = std::move(upperBounds),
    .dataSequenceNumber = deleteFile.dataSequenceNumber,
    .contentOffset = deleteFile.contentOffset,
    .contentSize = deleteFile.contentSize,
    .referencedDataFile = deleteFile.referencedDataFile,
};

If Velox now expects fewer fields, remove the no-longer-supported trailing ones from the initializer/call. If it expects std::optional<...> wrappers, convert explicitly, e.g.:

.contentOffset = deleteFile.contentOffset ? std::optional(*deleteFile.contentOffset) : std::nullopt,
.contentSize = deleteFile.contentSize ? std::optional(*deleteFile.contentSize) : std::nullopt,
.referencedDataFile = deleteFile.referencedDataFile,

Minimal likely patch

If the API still uses a constructor but dropped the last field, try:

velox::connector::hive::iceberg::IcebergDeleteFile icebergDeleteFile(
    toVeloxFileContent(deleteFile.content),
    deleteFile.path,
    toVeloxFileFormat(deleteFile.format),
    deleteFile.recordCount,
    deleteFile.fileSizeInBytes,
    std::vector<int32_t>(
        deleteFile.equalityFieldIds.begin(),
        deleteFile.equalityFieldIds.end()),
    lowerBounds,
    upperBounds,
    deleteFile.dataSequenceNumber,
    deleteFile.contentOffset,
    deleteFile.contentSize);

Why this is the right place to fix

The workflow’s failing build step is:

The compile failure occurs specifically in:

Suggested final edit

Replace the direct positional constructor with whatever matches the current Velox IcebergDeleteFile declaration, preferably aggregate initialization if supported. This will make the code resilient to constructor churn in the updated submodule and should unblock the build.

@aditi-pandit
Copy link
Copy Markdown
Contributor

aditi-pandit commented May 15, 2026

Thanks — the overall direction makes sense, but I don’t think this is ready to approve yet. The main blocker is the Prestocpp build: this PR depends on the Velox-side API from PR 17503, and the new IcebergDeleteFile / HiveIcebergSplit construction in IcebergPrestoToVeloxConnector.cpp looks fragile against any dependency skew. Can you confirm the exact Velox version/submodule state expected by Presto CI and fix the current build failure first?

Two smaller nits:

  • referencedDataFile currently normalizes null to "" in Java — can you confirm losing null-vs-empty distinction is intentional?
  • Consider replacing the raw "$path" info column key with a constant as well for consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants