Skip to content

Simplify get_field over inline struct constructors#22239

Open
adriangb wants to merge 6 commits into
apache:mainfrom
adriangb:simplify-get-field-named-struct
Open

Simplify get_field over inline struct constructors#22239
adriangb wants to merge 6 commits into
apache:mainfrom
adriangb:simplify-get-field-named-struct

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 16, 2026

Which issue does this PR close?

#22240

Rationale for this change

Constructing a struct with named_struct(...) (or struct(...)) and then immediately reading a field back out of it is pure overhead — the intermediate struct never needs to be materialized. This pattern shows up after view/CTE inlining and projection pushdown, where a named_struct projection feeds a get_field in a parent node.

For example:

CREATE VIEW t AS (
  SELECT named_struct('type', type, 'value', value) as s
);

SELECT get_field(s, 'value') FROM t;

get_field(named_struct('type', type, 'value', value), 'type') is equivalent to type, so the simplifier can drop the struct entirely. This is especially important because without this simplification no statistics pruning can be applied.

What changes are included in this PR?

A new logical simplification, added to GetFieldFunc::simplify (the same hook that already flattens nested get_field calls):

  • get_field(named_struct('min', a, 'max', b), 'max') => b (lookup by name)
  • get_field(struct(a, b), 'c1') => b (positional c0, c1, ... fields)
  • nested constructors collapse all the way through, e.g. named_struct('outer', named_struct('inner', a))['outer']['inner'] => a

The rewrite is conservative and bails out (leaving the expression untouched) whenever it cannot be proven safe:

  • a non-literal field key,
  • a named_struct with a non-literal field name (which could shadow the requested field at runtime),
  • a field the constructor does not produce,
  • non-canonical struct field spellings such as c01.

Casts are intentionally not unwrapped: a struct→struct cast can rename, retype and reorder fields, so resolving through one correctly is a larger, separate change.

Are these changes tested?

Yes:

  • Unit tests in getfield.rs covering matches, duplicate names, nested constructors, flatten-then-resolve, and every bail-out guard.
  • struct.slt: query + EXPLAIN tests showing the field access collapses to the underlying column.
  • order.slt: two EXPLAIN expectations updated — resolving get_field(named_struct(...), 'a') lets the extract_leaf_expressions rule skip a now-pointless sort-key extraction. The SortExec is still present, as those tests intend, and the sort-elimination cases are unchanged.

The full sqllogictest suite, the optimizer crate tests, cargo clippy --all-targets --all-features -D warnings and cargo fmt all pass.

Are there any user-facing changes?

No API changes. Query plans involving get_field over an inline struct constructor are simpler; results are unchanged.

🤖 Generated with Claude Code

Add a logical simplifier rule that resolves field access against inline
`named_struct(...)` and `struct(...)` constructors at plan time, so a
struct that is built and immediately read back never needs to be
materialized:

* `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
* `get_field(struct(a, b), 'c1')` => `b`
* nested constructors collapse all the way through

The rewrite lives in `GetFieldFunc::simplify` (the same hook that already
flattens nested `get_field` calls) and bails out whenever it cannot be
proven safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels May 16, 2026
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb
The simplification logic looks solid overall and I did not find any blocking issues. I left a few small suggestions around readability, test coverage, and documenting a couple of optimizer tradeoffs.

};
assert_eq!(simplified(args), col("b"));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small readability nit: this test reconstructs get_field(...), immediately destructures it back into args, and then calls simplified(args). The neighboring tests already pass args directly, which feels a bit easier to follow. It may be worth switching this one to the same style for consistency.

Copy link
Copy Markdown
Contributor Author

@adriangb adriangb May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 11a390e — the test now builds args directly and passes it to simplified(args), matching the neighboring tests.

Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())),
new_args,
)))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The re-wrap path creates a fresh Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())) each time it builds an intermediate get_field node. Since the same construction already appears elsewhere in simplify, it might be nice to centralize this behind a small helper like get_field_udf() or a shared OnceLock. Not a big deal performance-wise, but it would make the intent and reuse a bit clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cd20ebb — added a get_field_udf() helper backed by a OnceLock and routed both construction sites (the re-wrap path here and the flatten path in simplify) through it.

.try_as_str()
.flatten()
.filter(|s| !s.is_empty())?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_as_str().flatten() correctly handles ScalarValue::Utf8(None) here, so the guard is safe. It could still be nice to add a dedicated test for a null literal field name, something like simplify_get_field_null_field_name_left_alone, just to document the invariant explicitly.

Copy link
Copy Markdown
Contributor Author

@adriangb adriangb May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b2e0978 — added simplify_get_field_null_field_name_left_alone covering the ScalarValue::Utf8(None) case, asserting the expression is left unchanged.

02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, projection=[named_struct(a, a@0, b, b@1) as s], file_type=csv, has_header=true

# Simple column ordering tests using a table ordered by (a)
statement ok
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that stood out in the updated plan is that the sort key now evaluates get_field(s@0, a) against the materialized struct instead of sorting directly on an extracted column. Functionally this looks correct, but it may introduce some extra per-row sort-key evaluation work compared with the previous shape. Might be worth tracking in a follow-up issue whether physical sort-key normalization could simplify get_field(named_struct(...), f) back into a direct column reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. Filed a tracking issue for the physical sort-key normalization opportunity: #22487. As noted there, this is a tradeoff rather than a clear regression (sort keys are materialized into an array once before sorting, and the new shape drops the extra extracted column + recovery projection), so the issue calls for a benchmark before investing in the rewrite.


impl GetFieldFunc {
pub fn new() -> Self {
Self {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simplification can eliminate evaluation of struct fields that are never accessed, for example get_field(named_struct('a', 1/0, 'b', x), 'b') simplifying down to x. That matches the existing immutable-expression optimizer contract, but I think a short doc comment here would help future readers understand that unused field expressions are intentionally not evaluated after simplification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4f70d80 — added a paragraph to the function doc comment noting that replacing the access drops the other fields' expressions so they are no longer evaluated (e.g. get_field(named_struct('a', 1/0, 'b', x), 'b') => x), which matches the optimizer's immutable-expression contract.

if format!("c{index}") != field_name {
return None;
}
ctor_args.get(index)?.clone()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return None on a dynamic field name makes sense because a later runtime name could shadow an earlier literal match. One subtle detail is that this also prevents simplification even when a matching literal field was already seen earlier in the loop. It may be worth calling out in the comment that this is a deliberate conservative approximation rather than an accidental missed optimization.

Copy link
Copy Markdown
Contributor Author

@adriangb adriangb May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — expanded the comment to call out that bailing on a non-literal field name is a deliberate conservative approximation rather than an accidental missed optimization.

To be precise about the rationale: a non-literal name appearing after the first literal match can't change the result, since column_by_name returns the first match. The bail is only strictly required when the non-literal name appears before the first match (it could shadow it at runtime); applying it unconditionally is the conservative-for-simplicity choice. Reworded the comment accordingly in f799333 (force-pushed; new branch tip b2e0978).

@adriangb
Copy link
Copy Markdown
Contributor Author

thank you @kosiew . i will go through the feedback next week!

adriangb and others added 2 commits May 23, 2026 22:36
Both the re-wrap path in simplify_get_field_over_struct_constructor and
the flatten path in GetFieldFunc::simplify built a fresh
Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())) each time. Share a
single OnceLock-backed instance via get_field_udf() to make the reuse and
intent clearer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document on simplify_get_field_over_struct_constructor that replacing the
access with the selected field drops the expressions for the other fields,
so they are no longer evaluated (e.g. get_field(named_struct('a', 1/0, 'b',
x), 'b') => x). This matches the optimizer's immutable-expression contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
adriangb and others added 3 commits May 23, 2026 22:51
Rewrite the comment in the named_struct branch to state the rationale
accurately. A non-literal name appearing *before* the first match could
evaluate to field_name at runtime and become the real first match, so we
must bail. But once a literal match has been found, a later non-literal name
can never precede it (column_by_name returns the first match), so bailing
there is a deliberate approximation we accept for simplicity, not a
correctness requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
simplify_get_field_named_struct_returns_matching_value built a get_field
expr only to immediately destructure it back into args. Pass args directly,
matching the style of the neighboring tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add simplify_get_field_null_field_name_left_alone to explicitly cover the
ScalarValue::Utf8(None) case, where try_as_str().flatten() yields no field
name and the expression must be left unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the simplify-get-field-named-struct branch from 52e7567 to b2e0978 Compare May 24, 2026 03:51
@adriangb adriangb requested a review from kosiew May 24, 2026 13:25
@adriangb
Copy link
Copy Markdown
Contributor Author

@kosiew I think I've addressed the feedback, ready for another round of review 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants