[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-02。"],[],[],null,["# Feature engineering\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nThis document describes how Feature Transform Engine performs feature\nengineering.\nFeature Transform Engine performs feature selection and feature transformations.\nIf feature selection is enabled, Feature Transform Engine creates a ranked set of important\nfeatures. If feature transformations are enabled, Feature Transform Engine\nprocesses the features to ensure that the input for model training and model\nserving is consistent. Feature Transform Engine can be used on its own or together with any of\nthe [tabular training workflows](/vertex-ai/docs/tabular-data/tabular-workflows/overview).\nIt supports both TensorFlow and non-TensorFlow frameworks.\n\n\u003cbr /\u003e\n\nInputs\n------\n\nProvide the following inputs to Feature Transform Engine:\n\n- Raw data (BigQuery or CSV dataset).\n- Data split configuration.\n- Feature selection configuration.\n- Feature transformation configuration.\n\nOutputs\n-------\n\nFeature Transform Engine generates the following outputs:\n\n- `dataset_stats`: Statistics that describe the raw dataset. For example, `dataset_stats` gives the number of rows in the dataset.\n- `feature_importance`: The importance score of the features. This output is generated if [feature selection](#feature-selection) is enabled.\n- `materialized_data`, which is the transformed version of a data split group containing the training split, the evaluation split, and the test split.\n- `training_schema`: Training data schema in OpenAPI specification, which describes the data types of the training data.\n- `instance_schema`: Instance schema in OpenAPI specification, which describes the data types of the inference data.\n- `transform_output`: Metadata of the transformation. If you use TensorFlow for transformation, the metadata includes the TensorFlow graph.\n\nProcessing steps\n----------------\n\nFeature Transform Engine performs the following steps:\n\n- Generate [dataset splits](/vertex-ai/docs/tabular-data/data-splits) for training, evaluation, and testing.\n- Generate input dataset statistics `dataset_stats` that describe the raw dataset.\n- Perform [feature selection](#feature-selection).\n- Process the transform configuration using the dataset statistics, resolving automatic transformation parameters into manual transformation parameters.\n- [Transform raw features into engineered features](/vertex-ai/docs/datasets/data-types-tabular). Different transformations are done for different types of features.\n\nFeature selection\n-----------------\n\nThe main purpose of feature selection is to reduce the number of features used\nin the model. The reduced feature set captures most of the label's\ninformation in a more compact manner. 
### Conditional Mutual Information Maximization (CMIM)

CMIM is a greedy algorithm that chooses features iteratively based on the
conditional mutual information of candidate features with respect to the
already selected features. In each iteration, it selects the candidate
feature that maximizes the minimum mutual information with the label
conditioned on the selected features; that is, the feature that adds the most
information about the label that the selected features haven't already
captured.

CMIM is robust in dealing with feature redundancy, and it works well in
typical cases.

### Joint Mutual Information Maximization (JMIM)

JMIM is a greedy algorithm that is similar to CMIM. JMIM selects the feature
that, together with the pre-selected features, maximizes the joint mutual
information with the label, whereas CMIM takes redundancy more into account.

JMIM is a high-quality feature selection algorithm.

### Maximum Relevance Minimum Redundancy (MRMR)

MRMR is a greedy algorithm that works iteratively. It is similar to CMIM. In
each iteration, it chooses the feature that maximizes relevance with respect
to the label while minimizing pairwise redundancy with respect to the features
selected in previous iterations.

MRMR is a high-quality feature selection algorithm. For a minimal
illustration of this greedy pattern, see the sketch at the end of this page.

What's next
-----------

After performing feature engineering, you can train a model for classification
or regression:

- Train a model with [End-to-End AutoML](/vertex-ai/docs/tabular-data/tabular-workflows/e2e-automl).
- Train a model with [TabNet](/vertex-ai/docs/tabular-data/tabular-workflows/tabnet).
- Train a model with [Wide & Deep](/vertex-ai/docs/tabular-data/tabular-workflows/wide-and-deep).
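CMIM, JMIM, and MRMR all follow the same greedy pattern: score each remaining
candidate against the label and against the features already chosen, then add
the best candidate. The following MRMR-style sketch makes that pattern
concrete using scikit-learn's mutual information estimators. It is an
illustration of the general technique, not the Feature Transform Engine
implementation; the `mrmr_select` helper and the relevance-minus-mean-redundancy
score are simplifying assumptions made for this example.

```python
# Illustrative MRMR-style greedy selection sketch. Relevance is the mutual
# information between a feature and the label; redundancy is the mean mutual
# information between a candidate and the already selected features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X: np.ndarray, y: np.ndarray, k: int) -> list[int]:
    """Greedily picks k column indices, trading relevance against redundancy."""
    relevance = mutual_info_classif(X, y)    # I(feature; label) per column
    selected = [int(np.argmax(relevance))]   # seed with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # Mean pairwise MI between the candidate and the selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s])[0] for s in selected
            ])
            score = relevance[j] - redundancy  # maximum relevance, minimum redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

CMIM and JMIM differ from this sketch mainly in the scoring step: instead of
subtracting mean redundancy, they condition on, or jointly combine with, the
selected features when measuring information about the label.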