[truncated notebook cell output: embedding preview table with columns ml_generate_embedding_result, ml_generate_embedding_status, ml_generate_embedding_start_sec, ml_generate_embedding_end_sec, content (2 rows x 5 columns)]
diff --git a/notebooks/kaggle/vector-search-with-bigframes-over-national-jukebox.ipynb b/notebooks/kaggle/vector-search-with-bigframes-over-national-jukebox.ipynb
new file mode 100644
index 0000000000..fe2d567d1b
--- /dev/null
+++ b/notebooks/kaggle/vector-search-with-bigframes-over-national-jukebox.ipynb
@@ -0,0 +1,1137 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "194%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": "subslide"
+ },
+ "tags": []
+ },
+ "source": [
+ "# Creating a searchable index of the National Jukebox\n",
+ "\n",
+ "_Extracting text from audio and indexing it with BigQuery DataFrames_\n",
+ "\n",
+    "* Tim Swena (formerly Swast)\n",
+ "* swast@google.com\n",
+ "* https://vis.social/@timswast on Mastodon\n",
+ "\n",
+ "This notebook lives in\n",
+ "\n",
+ "* https://github.com/tswast/code-snippets\n",
+ "* at https://github.com/tswast/code-snippets/blob/main/2025/national-jukebox/transcribe_songs.ipynb\n",
+ "\n",
+ "To follow along, you'll need a Google Cloud project\n",
+ "\n",
+ "* Go to https://cloud.google.com/free to start a free trial."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "z-index": "0",
+ "zoom": "216%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+    "The National Jukebox is a project of the U.S. Library of Congress to provide access to thousands of acoustic sound recordings from the earliest days of the commercial record industry.\n",
+ "\n",
+ "* Learn more at https://www.loc.gov/collections/national-jukebox/about-this-collection/\n",
+ "\n",
+    ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "z-index": "0",
+ "zoom": "181%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "\n",
+ "To search the National Jukebox, we combine powerful features of BigQuery:\n",
+ "\n",
+    "\n",
+ "\n",
+ "1. Integrations with multi-modal AI models to extract information from unstructured data, in this case audio files.\n",
+ "\n",
+ " https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial\n",
+ " \n",
+ "2. Vector search to find similar text using embedding models.\n",
+ "\n",
+ " https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial\n",
+ "\n",
+ "3. BigQuery DataFrames to use Python instead of SQL.\n",
+ "\n",
+ " https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "275%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Getting started with BigQuery DataFrames (bigframes)\n",
+ "\n",
+ "Install the bigframes package."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "214%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:02.494188Z",
+ "iopub.status.busy": "2025-08-14T15:53:02.493469Z",
+ "iopub.status.idle": "2025-08-14T15:53:08.492291Z",
+ "shell.execute_reply": "2025-08-14T15:53:08.491183Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:02.494152Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "%pip install --upgrade bigframes google-cloud-automl google-cloud-translate google-ai-generativelanguage tensorflow "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "z-index": "4",
+ "zoom": "236%"
+ }
+ }
+ }
+ }
+ },
+ "source": [
+ "**Important:** restart the kernel by going to \"Run -> Restart & clear cell outputs\" before continuing.\n",
+ "\n",
+ "Configure bigframes to use your GCP project. First, go to \"Add-ons -> Google Cloud SDK\" and click the \"Attach\" button. Then,"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:08.494636Z",
+ "iopub.status.busy": "2025-08-14T15:53:08.494313Z",
+ "iopub.status.idle": "2025-08-14T15:53:08.609706Z",
+ "shell.execute_reply": "2025-08-14T15:53:08.608705Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:08.494604Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "from kaggle_secrets import UserSecretsClient\n",
+ "user_secrets = UserSecretsClient()\n",
+ "user_credential = user_secrets.get_gcloud_credential()\n",
+ "user_secrets.set_tensorflow_credential(user_credential)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "193%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:08.610982Z",
+ "iopub.status.busy": "2025-08-14T15:53:08.610686Z",
+ "iopub.status.idle": "2025-08-14T15:53:17.658993Z",
+ "shell.execute_reply": "2025-08-14T15:53:17.657745Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:08.610961Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+    "import bigframes.pandas as bpd\n",
+ "\n",
+ "bpd.options.bigquery.location = \"US\"\n",
+ "\n",
+ "# Set to your GCP project ID.\n",
+ "bpd.options.bigquery.project = \"swast-scratch\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "207%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Reading data\n",
+ "\n",
+ "BigQuery DataFrames can read data from BigQuery, GCS, or even local sources. With `engine=\"bigquery\"`, BigQuery's distributed processing reads the file without it ever having to reach your local Python environment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "225%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:17.662234Z",
+ "iopub.status.busy": "2025-08-14T15:53:17.661901Z",
+ "iopub.status.idle": "2025-08-14T15:53:34.486799Z",
+ "shell.execute_reply": "2025-08-14T15:53:34.485777Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:17.662207Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "df = bpd.read_json(\n",
+ " \"gs://cloud-samples-data/third-party/usa-loc-national-jukebox/jukebox.jsonl\",\n",
+ " engine=\"bigquery\",\n",
+ " orient=\"records\",\n",
+ " lines=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "122%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:34.488610Z",
+ "iopub.status.busy": "2025-08-14T15:53:34.488332Z",
+ "iopub.status.idle": "2025-08-14T15:53:40.347014Z",
+ "shell.execute_reply": "2025-08-14T15:53:40.345773Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:34.488589Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "# Use `peek()` instead of `head()` to see arbitrary rows rather than the \"first\" rows.\n",
+ "df.peek()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "134%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:53:40.348376Z",
+ "iopub.status.busy": "2025-08-14T15:53:40.348021Z",
+ "iopub.status.idle": "2025-08-14T15:53:40.364129Z",
+ "shell.execute_reply": "2025-08-14T15:53:40.363204Z",
+ "shell.execute_reply.started": "2025-08-14T15:53:40.348351Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "df.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:55:55.448664Z",
+ "iopub.status.busy": "2025-08-14T15:55:55.448310Z",
+ "iopub.status.idle": "2025-08-14T15:55:59.440964Z",
+ "shell.execute_reply": "2025-08-14T15:55:59.439988Z",
+ "shell.execute_reply.started": "2025-08-14T15:55:55.448637Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "# For the purposes of a demo, select only a subset of rows.\n",
+ "df = df.sample(n=250)\n",
+ "df.cache()\n",
+ "df.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "161%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:56:02.040804Z",
+ "iopub.status.busy": "2025-08-14T15:56:02.040450Z",
+ "iopub.status.idle": "2025-08-14T15:56:06.544384Z",
+ "shell.execute_reply": "2025-08-14T15:56:06.543240Z",
+ "shell.execute_reply.started": "2025-08-14T15:56:02.040777Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "# As a side effect of how I extracted the song information from the HTML DOM,\n",
+ "# we ended up with lists in places where we only expect one item.\n",
+ "#\n",
+ "# We can \"explode\" to flatten these lists.\n",
+ "flattened = df.explode([\n",
+ " \"Recording Repository\",\n",
+ " \"Recording Label\",\n",
+ " \"Recording Take Number\",\n",
+ " \"Recording Date\",\n",
+ " \"Recording Matrix Number\",\n",
+ " \"Recording Catalog Number\",\n",
+ " \"Media Size\",\n",
+ " \"Recording Location\",\n",
+ " \"Summary\",\n",
+ " \"Rights Advisory\",\n",
+ " \"Title\",\n",
+ "])\n",
+ "flattened.peek()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:56:06.546531Z",
+ "iopub.status.busy": "2025-08-14T15:56:06.546140Z",
+ "iopub.status.idle": "2025-08-14T15:56:06.566005Z",
+ "shell.execute_reply": "2025-08-14T15:56:06.564355Z",
+ "shell.execute_reply.started": "2025-08-14T15:56:06.546494Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "flattened.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "216%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "tags": []
+ },
+ "source": [
+ "To access unstructured data from BigQuery, create a URI pointing to a file in Google Cloud Storage (GCS). Then, construct a \"blob\" (also known as an \"Object Ref\" in BigQuery terms) so that BigQuery can read from GCS."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "211%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:56:07.394879Z",
+ "iopub.status.busy": "2025-08-14T15:56:07.394509Z",
+ "iopub.status.idle": "2025-08-14T15:56:12.217017Z",
+ "shell.execute_reply": "2025-08-14T15:56:12.215852Z",
+ "shell.execute_reply.started": "2025-08-14T15:56:07.394853Z"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "flattened = flattened.assign(**{\n",
+ " \"GCS Prefix\": \"gs://cloud-samples-data/third-party/usa-loc-national-jukebox/\",\n",
+ " \"GCS Stub\": flattened['URL'].str.extract(r'/(jukebox-[0-9]+)/'),\n",
+ "})\n",
+ "flattened[\"GCS URI\"] = flattened[\"GCS Prefix\"] + flattened[\"GCS Stub\"] + \".mp3\"\n",
+ "flattened[\"GCS Blob\"] = flattened[\"GCS URI\"].str.to_blob()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "317%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "tags": []
+ },
+ "source": [
+ "BigQuery (and BigQuery DataFrames) provide access to powerful models and multimodal capabilities. Here, we transcribe audio to text."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:56:20.908198Z",
+ "iopub.status.busy": "2025-08-14T15:56:20.907791Z",
+ "iopub.status.idle": "2025-08-14T15:58:45.909086Z",
+ "shell.execute_reply": "2025-08-14T15:58:45.908060Z",
+ "shell.execute_reply.started": "2025-08-14T15:56:20.908170Z"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "flattened[\"Transcription\"] = flattened[\"GCS Blob\"].blob.audio_transcribe(\n",
+ " model_name=\"gemini-2.0-flash-001\",\n",
+ " verbose=True,\n",
+ ")\n",
+ "flattened[\"Transcription\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "229%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+    "Sometimes the model encounters transient errors. Check the status column to see which rows failed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "177%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:59:43.609239Z",
+ "iopub.status.busy": "2025-08-14T15:59:43.607976Z",
+ "iopub.status.idle": "2025-08-14T15:59:44.515118Z",
+ "shell.execute_reply": "2025-08-14T15:59:44.514275Z",
+ "shell.execute_reply.started": "2025-08-14T15:59:43.609201Z"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "print(f\"Successful rows: {(flattened['Transcription'].struct.field('status') == '').sum()}\")\n",
+ "print(f\"Failed rows: {(flattened['Transcription'].struct.field('status') != '').sum()}\")\n",
+ "flattened.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "141%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:59:44.820256Z",
+ "iopub.status.busy": "2025-08-14T15:59:44.819926Z",
+ "iopub.status.idle": "2025-08-14T15:59:53.147159Z",
+ "shell.execute_reply": "2025-08-14T15:59:53.146281Z",
+ "shell.execute_reply.started": "2025-08-14T15:59:44.820232Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "# Show transcribed lyrics.\n",
+ "flattened[\"Transcription\"].struct.field(\"content\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "152%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:59:53.149222Z",
+ "iopub.status.busy": "2025-08-14T15:59:53.148783Z",
+ "iopub.status.idle": "2025-08-14T15:59:58.868959Z",
+ "shell.execute_reply": "2025-08-14T15:59:58.867804Z",
+ "shell.execute_reply.started": "2025-08-14T15:59:53.149198Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+    "# Find all instrumental songs\n",
+ "instrumental = flattened[flattened[\"Transcription\"].struct.field(\"content\") == \"\"]\n",
+ "print(instrumental.shape)\n",
+ "song = instrumental.peek(1)\n",
+ "song"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "152%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T15:59:58.870143Z",
+ "iopub.status.busy": "2025-08-14T15:59:58.869868Z",
+ "iopub.status.idle": "2025-08-14T16:00:15.502470Z",
+ "shell.execute_reply": "2025-08-14T16:00:15.500813Z",
+ "shell.execute_reply.started": "2025-08-14T15:59:58.870123Z"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "import gcsfs\n",
+ "import IPython.display\n",
+ "\n",
+ "fs = gcsfs.GCSFileSystem(project='bigframes-dev')\n",
+ "with fs.open(song[\"GCS URI\"].iloc[0]) as song_file:\n",
+ " song_bytes = song_file.read()\n",
+ "\n",
+ "IPython.display.Audio(song_bytes)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "181%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Creating a searchable index\n",
+ "\n",
+ "To be able to search by semantics rather than just text, generate embeddings and then create an index to efficiently search these.\n",
+ "\n",
+ "See also, this example: https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_vector_search.ipynb"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "163%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:00:15.506380Z",
+ "iopub.status.busy": "2025-08-14T16:00:15.505775Z",
+ "iopub.status.idle": "2025-08-14T16:00:25.134987Z",
+ "shell.execute_reply": "2025-08-14T16:00:25.134124Z",
+ "shell.execute_reply.started": "2025-08-14T16:00:15.506337Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "from bigframes.ml.llm import TextEmbeddingGenerator\n",
+ "\n",
+ "text_model = TextEmbeddingGenerator(model_name=\"text-multilingual-embedding-002\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "125%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:00:25.136017Z",
+ "iopub.status.busy": "2025-08-14T16:00:25.135744Z",
+ "iopub.status.idle": "2025-08-14T16:00:34.860878Z",
+ "shell.execute_reply": "2025-08-14T16:00:34.859925Z",
+ "shell.execute_reply.started": "2025-08-14T16:00:25.135997Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "df_to_index = (\n",
+ " flattened\n",
+ " .assign(content=flattened[\"Transcription\"].struct.field(\"content\"))\n",
+ " [flattened[\"Transcription\"].struct.field(\"content\") != \"\"]\n",
+ ")\n",
+ "embedding = text_model.predict(df_to_index)\n",
+ "embedding.peek(1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "178%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:01:20.816923Z",
+ "iopub.status.busy": "2025-08-14T16:01:20.816523Z",
+ "iopub.status.idle": "2025-08-14T16:01:22.480554Z",
+ "shell.execute_reply": "2025-08-14T16:01:22.479604Z",
+ "shell.execute_reply.started": "2025-08-14T16:01:20.816894Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "# Check the status column to look for errors.\n",
+ "print(f\"Successful rows: {(embedding['ml_generate_embedding_status'] == '').sum()}\")\n",
+ "print(f\"Failed rows: {(embedding['ml_generate_embedding_status'] != '').sum()}\")\n",
+ "embedding.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "224%"
+ }
+ }
+ }
+ }
+ },
+ "source": [
+ "We're now ready to save this to a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "172%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:03:43.611592Z",
+ "iopub.status.busy": "2025-08-14T16:03:43.611265Z",
+ "iopub.status.idle": "2025-08-14T16:03:47.459025Z",
+ "shell.execute_reply": "2025-08-14T16:03:47.458079Z",
+ "shell.execute_reply.started": "2025-08-14T16:03:43.611568Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "embedding_table_id = f\"{bpd.options.bigquery.project}.kaggle.national_jukebox\"\n",
+ "embedding.to_gbq(embedding_table_id, if_exists=\"replace\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "183%"
+ }
+ }
+ }
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Searching the database\n",
+ "\n",
+ "To search by semantics, we:\n",
+ "\n",
+ "1. Turn our search string into an embedding using the same model as our index.\n",
+ "2. Find the closest matches to the search string."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "92%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:03:52.674429Z",
+ "iopub.status.busy": "2025-08-14T16:03:52.673629Z",
+ "iopub.status.idle": "2025-08-14T16:03:59.962635Z",
+ "shell.execute_reply": "2025-08-14T16:03:59.961482Z",
+ "shell.execute_reply.started": "2025-08-14T16:03:52.674399Z"
+ },
+ "slideshow": {
+ "slide_type": "skip"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "import bigframes.pandas as bpd\n",
+ "\n",
+ "df_written = bpd.read_gbq(embedding_table_id)\n",
+ "df_written.peek(1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "127%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:03:59.964634Z",
+ "iopub.status.busy": "2025-08-14T16:03:59.964268Z",
+ "iopub.status.idle": "2025-08-14T16:04:55.051531Z",
+ "shell.execute_reply": "2025-08-14T16:04:55.050393Z",
+ "shell.execute_reply.started": "2025-08-14T16:03:59.964598Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "from bigframes.ml.llm import TextEmbeddingGenerator\n",
+ "\n",
+ "search_string = \"walking home\"\n",
+ "\n",
+ "text_model = TextEmbeddingGenerator(model_name=\"text-multilingual-embedding-002\")\n",
+ "search_df = bpd.DataFrame([search_string], columns=['search_string'])\n",
+ "search_embedding = text_model.predict(search_df)\n",
+ "search_embedding"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "175%"
+ }
+ }
+ }
+ },
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:05:46.473357Z",
+ "iopub.status.busy": "2025-08-14T16:05:46.473056Z",
+ "iopub.status.idle": "2025-08-14T16:05:50.564470Z",
+ "shell.execute_reply": "2025-08-14T16:05:50.563277Z",
+ "shell.execute_reply.started": "2025-08-14T16:05:46.473336Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "import bigframes.bigquery as bbq\n",
+ "\n",
+ "vector_search_results = bbq.vector_search(\n",
+    "    base_table=embedding_table_id,\n",
+ " column_to_search=\"ml_generate_embedding_result\",\n",
+ " query=search_embedding,\n",
+ " distance_type=\"COSINE\",\n",
+ " query_column_to_search=\"ml_generate_embedding_result\",\n",
+ " top_k=5,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:05:50.566930Z",
+ "iopub.status.busy": "2025-08-14T16:05:50.566422Z",
+ "iopub.status.idle": "2025-08-14T16:05:50.576293Z",
+ "shell.execute_reply": "2025-08-14T16:05:50.575186Z",
+ "shell.execute_reply.started": "2025-08-14T16:05:50.566893Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "vector_search_results.dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "158%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:05:54.787080Z",
+ "iopub.status.busy": "2025-08-14T16:05:54.786649Z",
+ "iopub.status.idle": "2025-08-14T16:05:55.581285Z",
+ "shell.execute_reply": "2025-08-14T16:05:55.580012Z",
+ "shell.execute_reply.started": "2025-08-14T16:05:54.787054Z"
+ },
+ "slideshow": {
+ "slide_type": "slide"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "results = vector_search_results[[\"Title\", \"Summary\", \"Names\", \"GCS URI\", \"Transcription\", \"distance\"]].sort_values(\"distance\").to_pandas()\n",
+ "results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "@deathbeds/jupyterlab-fonts": {
+ "styles": {
+ "": {
+ "body[data-jp-deck-mode='presenting'] &": {
+ "zoom": "138%"
+ }
+ }
+ }
+ },
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:05:56.142373Z",
+ "iopub.status.busy": "2025-08-14T16:05:56.142038Z",
+ "iopub.status.idle": "2025-08-14T16:05:56.149020Z",
+ "shell.execute_reply": "2025-08-14T16:05:56.147966Z",
+ "shell.execute_reply.started": "2025-08-14T16:05:56.142350Z"
+ },
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "print(results[\"Transcription\"].struct.field(\"content\").iloc[0])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-14T16:06:04.542878Z",
+ "iopub.status.busy": "2025-08-14T16:06:04.542537Z",
+ "iopub.status.idle": "2025-08-14T16:06:04.843052Z",
+ "shell.execute_reply": "2025-08-14T16:06:04.841220Z",
+ "shell.execute_reply.started": "2025-08-14T16:06:04.542854Z"
+ },
+ "scrolled": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [],
+ "trusted": true
+ },
+ "outputs": [],
+ "source": [
+ "import gcsfs\n",
+ "import IPython.display\n",
+ "\n",
+ "fs = gcsfs.GCSFileSystem(project='bigframes-dev')\n",
+ "with fs.open(results[\"GCS URI\"].iloc[0]) as song_file:\n",
+ " song_bytes = song_file.read()\n",
+ "\n",
+ "IPython.display.Audio(song_bytes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "trusted": true
+ },
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kaggle": {
+ "accelerator": "none",
+ "dataSources": [
+ {
+ "databundleVersionId": 13238728,
+ "sourceId": 110281,
+ "sourceType": "competition"
+ }
+ ],
+ "dockerImageVersionId": 31089,
+ "isGpuEnabled": false,
+ "isInternetEnabled": true,
+ "language": "python",
+ "sourceType": "notebook"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/noxfile.py b/noxfile.py
index 7adf499a08..cc38a3b8c0 100644
--- a/noxfile.py
+++ b/noxfile.py
@@ -838,11 +838,10 @@ def notebook(session: nox.Session):
]
)
- # Convert each Path notebook object to a string using a list comprehension.
+    # Convert each Path notebook object to a string using a list comprehension,
+    # and drop notebooks that we choose not to test, including Kaggle notebooks
+    # that depend on Kaggle-specific secrets.
notebooks = [str(nb) for nb in notebooks_list]
-
- # Remove tests that we choose not to test.
- notebooks = list(filter(lambda nb: nb not in denylist, notebooks))
+ notebooks = [nb for nb in notebooks if nb not in denylist and "/kaggle/" not in nb]
# Regionalized notebooks
notebooks_reg = {
diff --git a/samples/dbt/.dbt.yml b/samples/dbt/.dbt.yml
index a2fd2ffd4c..a4301a0bab 100644
--- a/samples/dbt/.dbt.yml
+++ b/samples/dbt/.dbt.yml
@@ -1,4 +1,4 @@
-# Copyright 2019 Google LLC
+# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/samples/dbt/README.md b/samples/dbt/README.md
index c52b633116..986aa2eae3 100644
--- a/samples/dbt/README.md
+++ b/samples/dbt/README.md
@@ -10,6 +10,8 @@ It includes basic configurations and sample models to help you get started quick
- `dbt_project.yml`: configures your dbt project - **dbt_sample_project**.
- `dbt_bigframes_code_sample_1.py`: An example to read BigQuery data and perform basic transformation.
- `dbt_bigframes_code_sample_2.py`: An example to build an incremental model that leverages BigFrames UDF capabilities.
+- `prepare_table.py`: An ML example that consolidates several data sources into a single, unified table for later use.
+- `prediction.py`: An ML example that trains a model and generates predictions using the prepared table.
## Requirements
diff --git a/samples/dbt/dbt_sample_project/dbt_project.yml b/samples/dbt/dbt_sample_project/dbt_project.yml
index aef376e1fc..789f4d2549 100644
--- a/samples/dbt/dbt_sample_project/dbt_project.yml
+++ b/samples/dbt/dbt_sample_project/dbt_project.yml
@@ -1,4 +1,4 @@
-# Copyright 2019 Google LLC
+# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_1.py b/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_1.py
index e397549afe..2e24596b79 100644
--- a/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_1.py
+++ b/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_1.py
@@ -1,4 +1,4 @@
-# Copyright 2019 Google LLC
+# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_2.py b/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_2.py
index 3795d0eee9..1f060cd60b 100644
--- a/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_2.py
+++ b/samples/dbt/dbt_sample_project/models/example/dbt_bigframes_code_sample_2.py
@@ -1,4 +1,4 @@
-# Copyright 2019 Google LLC
+# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/samples/dbt/dbt_sample_project/models/ml_example/prediction.py b/samples/dbt/dbt_sample_project/models/ml_example/prediction.py
new file mode 100644
index 0000000000..d2fb54b384
--- /dev/null
+++ b/samples/dbt/dbt_sample_project/models/ml_example/prediction.py
@@ -0,0 +1,67 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This DBT Python model prepares and trains a machine learning model to predict
+# ozone levels.
+# 1. Data Preparation: The model first gets a prepared dataset and splits it
+# into three subsets based on the year: training data (before 2017),
+# testing data (2017-2019), and prediction data (2020 and later).
+# 2. Model Training: It then uses the LinearRegression model from BigFrames
+# ML library. The model is trained on the historical data, using other
+# atmospheric parameters to predict the 'o3' (ozone) levels.
+# 3. Prediction: Finally, the trained model makes predictions on the most
+# recent data (from 2020 onwards) and returns the resulting DataFrame of
+# predicted ozone values.
+#
+# See more details from the related blog post: https://docs.getdbt.com/blog/train-linear-dbt-bigframes
+
+
+def model(dbt, session):
+ dbt.config(submission_method="bigframes", timeout=6000)
+
+ df = dbt.ref("prepare_table")
+
+ # Define the rules for separating the training, test and prediction data.
+ train_data_filter = (df.date_local.dt.year < 2017)
+ test_data_filter = (
+ (df.date_local.dt.year >= 2017) & (df.date_local.dt.year < 2020)
+ )
+ predict_data_filter = (df.date_local.dt.year >= 2020)
+
+    # Redefine index_columns here, since each dbt model runs as a standalone script.
+ index_columns = ["state_name", "county_name", "site_num", "date_local", "time_local"]
+
+ # Separate the training, test and prediction data.
+ df_train = df[train_data_filter].set_index(index_columns)
+ df_test = df[test_data_filter].set_index(index_columns)
+ df_predict = df[predict_data_filter].set_index(index_columns)
+
+ # Finalize the training dataframe.
+ X_train = df_train.drop(columns="o3")
+ y_train = df_train["o3"]
+
+ # Finalize the prediction dataframe.
+ X_predict = df_predict.drop(columns="o3")
+
+ # Import the LinearRegression model from bigframes.ml module.
+ from bigframes.ml.linear_model import LinearRegression
+
+ # Train the model.
+ model = LinearRegression()
+ model.fit(X_train, y_train)
+
+ # Make the prediction using the model.
+ df_pred = model.predict(X_predict)
+
+ return df_pred
diff --git a/samples/dbt/dbt_sample_project/models/ml_example/prepare_table.py b/samples/dbt/dbt_sample_project/models/ml_example/prepare_table.py
new file mode 100644
index 0000000000..23b54a9122
--- /dev/null
+++ b/samples/dbt/dbt_sample_project/models/ml_example/prepare_table.py
@@ -0,0 +1,93 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This DBT Python model processes EPA historical air quality data from BigQuery
+# using BigFrames. The primary goal is to merge several hourly summary
+# tables into a single, unified DataFrame for later prediction. It includes the
+# following steps:
+# 1. Reading and Cleaning: It reads individual hourly summary tables from
+# BigQuery for various atmospheric parameters (like CO, O3, temperature,
+# and wind speed). Each table is cleaned by sorting, removing duplicates,
+# and renaming columns for clarity.
+# 2. Combining Data: It then merges these cleaned tables into a single,
+# comprehensive DataFrame. An inner join is used to ensure the final output
+# only includes records with complete data across all parameters.
+# 3. Final Output: The unified DataFrame is returned as the model's output,
+# creating a corresponding BigQuery table for future use.
+#
+# See more details from the related blog post: https://docs.getdbt.com/blog/train-linear-dbt-bigframes
+
+
+import bigframes.pandas as bpd
+
+def model(dbt, session):
+ # Optional: override settings from dbt_project.yml.
+ # When both are set, dbt.config takes precedence over dbt_project.yml.
+ dbt.config(submission_method="bigframes", timeout=6000)
+
+ # Define the dataset and the columns of interest representing various parameters
+ # in the atmosphere.
+ dataset = "bigquery-public-data.epa_historical_air_quality"
+ index_columns = ["state_name", "county_name", "site_num", "date_local", "time_local"]
+ param_column = "parameter_name"
+ value_column = "sample_measurement"
+
+ # Initialize a list for collecting dataframes from individual parameters.
+ params_dfs = []
+
+ # Collect dataframes from tables which contain data for single parameter.
+ table_param_dict = {
+ "co_hourly_summary" : "co",
+ "no2_hourly_summary" : "no2",
+ "o3_hourly_summary" : "o3",
+ "pressure_hourly_summary" : "pressure",
+ "so2_hourly_summary" : "so2",
+ "temperature_hourly_summary" : "temperature",
+ }
+
+ for table, param in table_param_dict.items():
+ param_df = bpd.read_gbq(
+ f"{dataset}.{table}",
+ columns=index_columns + [value_column]
+ )
+ param_df = param_df\
+ .sort_values(index_columns)\
+ .drop_duplicates(index_columns)\
+ .set_index(index_columns)\
+ .rename(columns={value_column : param})
+ params_dfs.append(param_df)
+
+ # Collect dataframes from the table containing wind speed.
+ # Optionally: collect dataframes from other tables containing
+ # wind direction, NO, NOx, and NOy data as needed.
+ wind_table = f"{dataset}.wind_hourly_summary"
+
+ wind_speed_df = bpd.read_gbq(
+ wind_table,
+ columns=index_columns + [value_column],
+ filters=[(param_column, "==", "Wind Speed - Resultant")]
+ )
+ wind_speed_df = wind_speed_df\
+ .sort_values(index_columns)\
+ .drop_duplicates(index_columns)\
+ .set_index(index_columns)\
+ .rename(columns={value_column: "wind_speed"})
+ params_dfs.append(wind_speed_df)
+
+ # Combine data for all the selected parameters.
+ df = bpd.concat(params_dfs, axis=1, join="inner")
+ df = df.reset_index()
+
+ return df
diff --git a/specs/2025-08-11-anywidget-align-text.md b/specs/2025-08-11-anywidget-align-text.md
new file mode 100644
index 0000000000..03305538dc
--- /dev/null
+++ b/specs/2025-08-11-anywidget-align-text.md
@@ -0,0 +1,132 @@
+# Anywidget: align text left and numerics right
+
+The "anywidget" rendering mode currently outputs one HTML table per page, but
+the values are not yet aligned according to their data type.
+
+## Background
+
+Anywidget currently renders pages like the following:
+
+```html
+<table>
+  <thead>
+    <tr>
+      <th>state</th>
+      <th>gender</th>
+      <th>year</th>
+      <th>name</th>
+      <th>number</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>VA</td>
+      <td>M</td>
+      <td>1930</td>
+      <td>Pat</td>
+      <td>6</td>
+    </tr>
+    <tr>
+      <td>TX</td>
+      <td>M</td>
+      <td>1968</td>
+      <td>Kennith</td>
+      <td>18</td>
+    </tr>
+  </tbody>
+</table>
+```
+
+* This change fixes internal issue b/437697339.
+* Numeric data should be right aligned so that it is easier to compare numbers,
+ especially if they all are rounded to the same precision.
+* Text data is better left aligned, since many languages read left to right.
+
+## Acceptance Criteria
+
+- [ ] Header cells should align left.
+- [ ] Header cells should use the resize CSS property to allow resizing.
+- [ ] STRING columns are left-aligned in the output of `TableWidget` in
+ `bigframes/display/anywidget.py`.
+- [ ] Numeric columns (INT64, FLOAT64, NUMERIC, BIGNUMERIC) are right-aligned
+ in the output of `TableWidget` in `bigframes/display/anywidget.py`.
+- [ ] Create option `DisplayOptions.precision` in
+ `bigframes/_config/display_options.py` that can override the output
+ precision (defaults to 6, just like `pandas.options.display.precision`).
+- [ ] All other non-numeric column types, including BYTES, BOOLEAN, TIMESTAMP,
+ and more, are left-aligned in the output of `TableWidget` in
+ `bigframes/display/anywidget.py`.
+- [ ] There are parameterized unit tests verifying the alignment is set
+ correctly.
+
+## Detailed Steps
+
+### 1. Create Display Precision Configuration
+
+- [ ] In `bigframes/_config/display_options.py`, add a new `precision` attribute to the `DisplayOptions` dataclass.
+- [ ] Set the default value to `6`.
+- [ ] Add precision to the items in `def pandas_repr` that get passed to the pandas options context.
+- [ ] Add a docstring explaining that it controls the floating point output precision, similar to `pandas.options.display.precision`.
+- [ ] Check these items off with `[x]` as they are completed.
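+
+A minimal sketch of the option, assuming the existing `DisplayOptions` dataclass
+layout (other fields elided; the dict-returning shape of `pandas_repr` below is
+an assumption, not the actual signature):
+
+```python
+import dataclasses
+
+
+@dataclasses.dataclass
+class DisplayOptions:
+    # Controls the floating point output precision, mirroring
+    # pandas.options.display.precision.
+    precision: int = 6
+
+
+def pandas_repr(display_options: DisplayOptions) -> dict:
+    # Items handed to pandas.option_context(...) when rendering repr output.
+    return {"display.precision": display_options.precision}
+```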
+
+### 2. Improve the headers
+
+- [ ] Create `bigframes/display/html.py`.
+- [ ] In `bigframes/display/html.py`, create a `def render_html(*, dataframe: pandas.DataFrame, table_id: str)` method.
+- [ ] Loop through the column names to create the table head.
+- [ ] Apply the `text-align: left` style to the header.
+- [ ] Wrap the cell text in a resizable `div`.
+- [ ] Check these items off with `[x]` as they are completed.
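+
+A minimal sketch of the header rendering, under the `render_html` signature
+named in this step (body rows omitted; the styling details are illustrative,
+not final):
+
+```python
+import html
+
+import pandas
+
+
+def render_html(*, dataframe: pandas.DataFrame, table_id: str) -> str:
+    """Render left-aligned, resizable header cells for the table."""
+    header_cells = "".join(
+        # The resize property only takes effect when overflow is not `visible`.
+        '<th style="text-align: left;">'
+        f'<div style="resize: horizontal; overflow: auto;">{html.escape(str(col))}</div>'
+        "</th>"
+        for col in dataframe.columns
+    )
+    return f'<table id="{html.escape(table_id)}"><thead><tr>{header_cells}</tr></thead></table>'
+```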
+
+### 3. Implement Alignment and Precision Logic in TableWidget
+
+- [ ] Create a helper function `_is_dtype_numeric(dtype)` that takes a pandas
+      dtype and returns True for types that should be right-aligned. These
+      dtypes should correspond to the BigQuery data types: `INT64`, `FLOAT64`,
+      `NUMERIC`, `BIGNUMERIC`. Use the `bigframes.dtypes` module to map from
+      the pandas type to the BigQuery type (see the sketch after this list).
+- [ ] In the loop that generates the table rows (`` elements), add a function to determine the style based on the column's `dtype`.
+- [ ] If the column's `dtype` is in the numeric set, apply the CSS style `text-align: right`.
+- [ ] For all other `dtypes` (including `STRING`, `BYTES`, `BOOLEAN`, `TIMESTAMP`, etc.), apply `text-align: left`.
+- [ ] When formatting floating-point numbers for display, use the `bigframes.options.display.precision` value.
+- [ ] In `bigframes/display/anywidget.py`, modify the `_set_table_html` method of the `TableWidget` class to call `bigframes.display.html.render_html(...)`.
+- [ ] Render the notebook at `notebooks/dataframes/anywidget_mode.ipynb` with
+ the `jupyter nbconvert --to notebook --execute notebooks/dataframes/anywidget_mode.ipynb`
+ command and validate that the rendered notebook includes the desired
+ changes to the HTML tables.
+- [ ] Check these items off with `[x]` as they are completed.
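+
+A minimal sketch of the alignment helper, using plain pandas dtype checks in
+place of the `bigframes.dtypes` mapping this step calls for:
+
+```python
+import pandas
+
+
+def _is_dtype_numeric(dtype) -> bool:
+    """Return True for dtypes that should be right-aligned."""
+    # Booleans count as numeric to pandas but should stay left-aligned per
+    # the acceptance criteria, so exclude them explicitly.
+    return pandas.api.types.is_numeric_dtype(
+        dtype
+    ) and not pandas.api.types.is_bool_dtype(dtype)
+
+
+def _column_style(dtype) -> str:
+    return "text-align: right;" if _is_dtype_numeric(dtype) else "text-align: left;"
+```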
+
+### 4. Add Parameterized Unit Tests
+
+- [ ] Create a new test file: `tests/unit/display/test_html.py`.
+- [ ] Create a parameterized test method, e.g., `test_render_html_alignment_and_precision`.
+- [ ] Use `@pytest.mark.parametrize` to test various scenarios.
+- [ ] **Scenario 1: Alignment.**
+ - Create a sample `bigframes.dataframe.DataFrame` with columns of different types: a string, an integer, a float, and a boolean.
+ - Render the `pandas.DataFrame` to HTML.
+ - Assert that the integer and float column headers and data cells (` | ` and ` | `) have `style="text-align: right;"`.
+ - Assert that the string and boolean columns have `style="text-align: left;"`.
+- [ ] **Scenario 2: Precision.**
+ - Create a `bigframes.dataframe.DataFrame` with a `FLOAT64` column containing a number with many decimal places (e.g., `3.14159265`).
+ - Set `bigframes.options.display.precision = 4`.
+ - Render the `pandas.DataFrame` to HTML.
+ - Assert that the output string contains the number formatted to 4 decimal places (e.g., `3.1416`).
+ - Remember to reset the option value after the test to avoid side effects.
+- [ ] Check these items off with `[x]` as they are completed.
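+
+A parameterized test sketch, assuming the module from step 2 exists at the
+path named there and applies the alignment logic from step 3:
+
+```python
+import pandas
+import pytest
+
+from bigframes.display.html import render_html
+
+
+@pytest.mark.parametrize(
+    ("values", "expected_align"),
+    [
+        (["a", "b"], "left"),
+        ([True, False], "left"),
+        (pandas.array([1, 2], dtype="Int64"), "right"),
+        (pandas.array([1.5, 2.5], dtype="Float64"), "right"),
+    ],
+)
+def test_render_html_alignment(values, expected_align):
+    dataframe = pandas.DataFrame({"col": values})
+    rendered = render_html(dataframe=dataframe, table_id="test-table")
+    assert f"text-align: {expected_align}" in rendered
+```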
+
+## Verification
+
+- [ ] The `nox -r -s format lint lint_setup_py` linter should pass.
+- [ ] The `nox -r -s mypy` static type checker should pass.
+- [ ] The `nox -r -s docs docfx` docs should successfully build and include relevant docs in the output.
+- [ ] All new and existing unit tests `pytest tests/unit` should pass.
+- [ ] Identify all related system tests in the `tests/system` directories.
+- [ ] All related system tests `pytest tests/system/small/path_to_relevant_test.py::test_name` should pass.
+- [ ] Check these items off with `[x]` as they are completed.
+
+## Constraints
+
+Follow the guidelines listed in GEMINI.md at the root of the repository.
diff --git a/tests/system/large/blob/test_function.py b/tests/system/large/blob/test_function.py
index a594b144f5..c8fa63d493 100644
--- a/tests/system/large/blob/test_function.py
+++ b/tests/system/large/blob/test_function.py
@@ -302,37 +302,16 @@ def test_blob_image_normalize_to_bq(images_mm_df: bpd.DataFrame, bq_connection:
@pytest.mark.parametrize(
- "verbose, expected",
+ "verbose",
[
- (
- True,
- pd.Series(
- [
- {"status": "File has not been decrypted", "content": ""},
- {
- "status": "",
- "content": "Sample PDF This is a testing file. Some dummy messages are used for testing purposes. ",
- },
- ]
- ),
- ),
- (
- False,
- pd.Series(
- [
- "",
- "Sample PDF This is a testing file. Some dummy messages are used for testing purposes. ",
- ],
- name="pdf",
- ),
- ),
+        True,
+        False,
],
)
def test_blob_pdf_extract(
pdf_mm_df: bpd.DataFrame,
verbose: bool,
bq_connection: str,
- expected: pd.Series,
):
actual = (
pdf_mm_df["pdf"]
@@ -341,49 +320,44 @@ def test_blob_pdf_extract(
.to_pandas()
)
- pd.testing.assert_series_equal(
- actual,
- expected,
- check_dtype=False,
- check_index=False,
+ # check relative length
+ expected_text = "Sample PDF This is a testing file. Some dummy messages are used for testing purposes."
+ expected_len = len(expected_text)
+
+ actual_text = ""
+ if verbose:
+        # The first entry is for a file that has not been decrypted, so check the first successful one.
+ successful_results = actual[actual.apply(lambda x: x["status"] == "")]
+ actual_text = successful_results.apply(lambda x: x["content"]).iloc[0]
+ else:
+ actual_text = actual[actual != ""].iloc[0]
+ actual_len = len(actual_text)
+
+ relative_length_tolerance = 0.25
+ min_acceptable_len = expected_len * (1 - relative_length_tolerance)
+ max_acceptable_len = expected_len * (1 + relative_length_tolerance)
+ assert min_acceptable_len <= actual_len <= max_acceptable_len, (
+ f"Item (verbose={verbose}): Extracted text length {actual_len} is outside the acceptable range "
+ f"[{min_acceptable_len:.0f}, {max_acceptable_len:.0f}]. "
+ f"Expected reference length was {expected_len}. "
)
+ # check for major keywords
+ major_keywords = ["Sample", "PDF", "testing", "dummy", "messages"]
+ for keyword in major_keywords:
+ assert (
+ keyword.lower() in actual_text.lower()
+ ), f"Item (verbose={verbose}): Expected keyword '{keyword}' not found in extracted text. "
+
@pytest.mark.parametrize(
- "verbose, expected",
+ "verbose",
[
- (
- True,
- pd.Series(
- [
- {"status": "File has not been decrypted", "content": []},
- {
- "status": "",
- "content": [
- "Sample PDF This is a testing file. Some ",
- "dummy messages are used for testing ",
- "purposes. ",
- ],
- },
- ]
- ),
- ),
- (
- False,
- pd.Series(
- [
- pd.NA,
- "Sample PDF This is a testing file. Some ",
- "dummy messages are used for testing ",
- "purposes. ",
- ],
- ),
- ),
+        True,
+        False,
],
)
-def test_blob_pdf_chunk(
- pdf_mm_df: bpd.DataFrame, verbose: bool, bq_connection: str, expected: pd.Series
-):
+def test_blob_pdf_chunk(pdf_mm_df: bpd.DataFrame, verbose: bool, bq_connection: str):
actual = (
pdf_mm_df["pdf"]
.blob.pdf_chunk(
@@ -397,13 +371,36 @@ def test_blob_pdf_chunk(
.to_pandas()
)
- pd.testing.assert_series_equal(
- actual,
- expected,
- check_dtype=False,
- check_index=False,
+ # check relative length
+ expected_text = "Sample PDF This is a testing file. Some dummy messages are used for testing purposes."
+ expected_len = len(expected_text)
+
+ actual_text = ""
+ if verbose:
+        # The first entry is for a file that has not been decrypted, so check the first successful one.
+ successful_results = actual[actual.apply(lambda x: x["status"] == "")]
+ actual_text = "".join(successful_results.apply(lambda x: x["content"]).iloc[0])
+ else:
+ # First entry is NA
+ actual_text = "".join(actual.dropna())
+ actual_len = len(actual_text)
+
+ relative_length_tolerance = 0.25
+ min_acceptable_len = expected_len * (1 - relative_length_tolerance)
+ max_acceptable_len = expected_len * (1 + relative_length_tolerance)
+ assert min_acceptable_len <= actual_len <= max_acceptable_len, (
+ f"Item (verbose={verbose}): Extracted text length {actual_len} is outside the acceptable range "
+ f"[{min_acceptable_len:.0f}, {max_acceptable_len:.0f}]. "
+ f"Expected reference length was {expected_len}. "
)
+ # check for major keywords
+ major_keywords = ["Sample", "PDF", "testing", "dummy", "messages"]
+ for keyword in major_keywords:
+ assert (
+ keyword.lower() in actual_text.lower()
+ ), f"Item (verbose={verbose}): Expected keyword '{keyword}' not found in extracted text. "
+
@pytest.mark.parametrize(
"model_name, verbose",
diff --git a/tests/system/large/functions/test_managed_function.py b/tests/system/large/functions/test_managed_function.py
index 5349529f1d..262f5f0fe2 100644
--- a/tests/system/large/functions/test_managed_function.py
+++ b/tests/system/large/functions/test_managed_function.py
@@ -963,3 +963,151 @@ def float_parser(row):
cleanup_function_assets(
float_parser_mf, session.bqclient, ignore_failures=False
)
+
+
+def test_managed_function_df_where(session, dataset_id, scalars_dfs):
+ try:
+
+        # The return type has to be bool for a callable `where` condition.
+ def is_sum_positive(a, b):
+ return a + b > 0
+
+ is_sum_positive_mf = session.udf(
+ input_types=[int, int],
+ output_type=bool,
+ dataset=dataset_id,
+ name=prefixer.create_prefix(),
+ )(is_sum_positive)
+
+ scalars_df, scalars_pandas_df = scalars_dfs
+ int64_cols = ["int64_col", "int64_too"]
+
+ bf_int64_df = scalars_df[int64_cols]
+ bf_int64_df_filtered = bf_int64_df.dropna()
+ pd_int64_df = scalars_pandas_df[int64_cols]
+ pd_int64_df_filtered = pd_int64_df.dropna()
+
+ # Use callable condition in dataframe.where method.
+ bf_result = bf_int64_df_filtered.where(is_sum_positive_mf).to_pandas()
+        # pandas doesn't support this case; compute the equivalent mask as a workaround.
+ pd_result = pd_int64_df_filtered.where(pd_int64_df_filtered.sum(axis=1) > 0)
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(bf_result, pd_result, check_dtype=False)
+
+ # Make sure the read_gbq_function path works for this function.
+ is_sum_positive_ref = session.read_gbq_function(
+ function_name=is_sum_positive_mf.bigframes_bigquery_function
+ )
+
+ bf_result_gbq = bf_int64_df_filtered.where(
+ is_sum_positive_ref, -bf_int64_df_filtered
+ ).to_pandas()
+ pd_result_gbq = pd_int64_df_filtered.where(
+ pd_int64_df_filtered.sum(axis=1) > 0, -pd_int64_df_filtered
+ )
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(
+ bf_result_gbq, pd_result_gbq, check_dtype=False
+ )
+
+ finally:
+ # Clean up the gcp assets created for the managed function.
+ cleanup_function_assets(
+ is_sum_positive_mf, session.bqclient, ignore_failures=False
+ )
+
+
+def test_managed_function_df_where_series(session, dataset_id, scalars_dfs):
+ try:
+
+        # The return type has to be bool for a callable `where` condition.
+ def is_sum_positive_series(s):
+ return s["int64_col"] + s["int64_too"] > 0
+
+ is_sum_positive_series_mf = session.udf(
+ input_types=bigframes.series.Series,
+ output_type=bool,
+ dataset=dataset_id,
+ name=prefixer.create_prefix(),
+ )(is_sum_positive_series)
+
+ scalars_df, scalars_pandas_df = scalars_dfs
+ int64_cols = ["int64_col", "int64_too"]
+
+ bf_int64_df = scalars_df[int64_cols]
+ bf_int64_df_filtered = bf_int64_df.dropna()
+ pd_int64_df = scalars_pandas_df[int64_cols]
+ pd_int64_df_filtered = pd_int64_df.dropna()
+
+ # Use callable condition in dataframe.where method.
+ bf_result = bf_int64_df_filtered.where(is_sum_positive_series).to_pandas()
+ pd_result = pd_int64_df_filtered.where(is_sum_positive_series)
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(bf_result, pd_result, check_dtype=False)
+
+ # Make sure the read_gbq_function path works for this function.
+ is_sum_positive_series_ref = session.read_gbq_function(
+ function_name=is_sum_positive_series_mf.bigframes_bigquery_function,
+ is_row_processor=True,
+ )
+
+ # This is for callable `other` arg in dataframe.where method.
+ def func_for_other(x):
+ return -x
+
+ bf_result_gbq = bf_int64_df_filtered.where(
+ is_sum_positive_series_ref, func_for_other
+ ).to_pandas()
+ pd_result_gbq = pd_int64_df_filtered.where(
+ is_sum_positive_series, func_for_other
+ )
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(
+ bf_result_gbq, pd_result_gbq, check_dtype=False
+ )
+
+ finally:
+ # Clean up the gcp assets created for the managed function.
+ cleanup_function_assets(
+ is_sum_positive_series_mf, session.bqclient, ignore_failures=False
+ )
+
+
+def test_managed_function_series_where(session, dataset_id, scalars_dfs):
+ try:
+
+        # The return type has to be bool for a callable `where` condition.
+ def _is_positive(s):
+ return s + 1000 > 0
+
+ is_positive_mf = session.udf(
+ input_types=int,
+ output_type=bool,
+ dataset=dataset_id,
+ name=prefixer.create_prefix(),
+ )(_is_positive)
+
+ scalars, scalars_pandas = scalars_dfs
+
+ bf_int64 = scalars["int64_col"]
+ bf_int64_filtered = bf_int64.dropna()
+ pd_int64 = scalars_pandas["int64_col"]
+ pd_int64_filtered = pd_int64.dropna()
+
+ # The cond is a callable (managed function) and the other is not a
+ # callable in series.where method.
+ bf_result = bf_int64_filtered.where(
+ cond=is_positive_mf, other=-bf_int64_filtered
+ ).to_pandas()
+ pd_result = pd_int64_filtered.where(cond=_is_positive, other=-pd_int64_filtered)
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_series_equal(bf_result, pd_result, check_dtype=False)
+
+ finally:
+ # Clean up the gcp assets created for the managed function.
+ cleanup_function_assets(is_positive_mf, session.bqclient, ignore_failures=False)
diff --git a/tests/system/large/functions/test_remote_function.py b/tests/system/large/functions/test_remote_function.py
index a93435d11a..9e2c1e2c81 100644
--- a/tests/system/large/functions/test_remote_function.py
+++ b/tests/system/large/functions/test_remote_function.py
@@ -2847,3 +2847,125 @@ def foo(x: int) -> int:
finally:
# clean up the gcp assets created for the remote function
cleanup_function_assets(foo, session.bqclient, session.cloudfunctionsclient)
+
+
+@pytest.mark.flaky(retries=2, delay=120)
+def test_remote_function_df_where(session, dataset_id, scalars_dfs):
+ try:
+
+        # The return type has to be bool for a callable `where` condition.
+ def is_sum_positive(a, b):
+ return a + b > 0
+
+ is_sum_positive_mf = session.remote_function(
+ input_types=[int, int],
+ output_type=bool,
+ dataset=dataset_id,
+ reuse=False,
+ cloud_function_service_account="default",
+ )(is_sum_positive)
+
+ scalars_df, scalars_pandas_df = scalars_dfs
+ int64_cols = ["int64_col", "int64_too"]
+
+ bf_int64_df = scalars_df[int64_cols]
+ bf_int64_df_filtered = bf_int64_df.dropna()
+ pd_int64_df = scalars_pandas_df[int64_cols]
+ pd_int64_df_filtered = pd_int64_df.dropna()
+
+ # Use callable condition in dataframe.where method.
+ bf_result = bf_int64_df_filtered.where(is_sum_positive_mf, 0).to_pandas()
+        # pandas doesn't support this case; compute the equivalent mask as a workaround.
+ pd_result = pd_int64_df_filtered.where(pd_int64_df_filtered.sum(axis=1) > 0, 0)
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(bf_result, pd_result, check_dtype=False)
+
+ finally:
+ # Clean up the gcp assets created for the remote function.
+ cleanup_function_assets(
+ is_sum_positive_mf, session.bqclient, ignore_failures=False
+ )
+
+
+@pytest.mark.flaky(retries=2, delay=120)
+def test_remote_function_df_where_series(session, dataset_id, scalars_dfs):
+ try:
+
+        # The return type has to be bool for a callable `where` condition.
+ def is_sum_positive_series(s):
+ return s["int64_col"] + s["int64_too"] > 0
+
+ is_sum_positive_series_mf = session.remote_function(
+ input_types=bigframes.series.Series,
+ output_type=bool,
+ dataset=dataset_id,
+ reuse=False,
+ cloud_function_service_account="default",
+ )(is_sum_positive_series)
+
+ scalars_df, scalars_pandas_df = scalars_dfs
+ int64_cols = ["int64_col", "int64_too"]
+
+ bf_int64_df = scalars_df[int64_cols]
+ bf_int64_df_filtered = bf_int64_df.dropna()
+ pd_int64_df = scalars_pandas_df[int64_cols]
+ pd_int64_df_filtered = pd_int64_df.dropna()
+
+ # This is for callable `other` arg in dataframe.where method.
+ def func_for_other(x):
+ return -x
+
+        # Use a callable condition (the remote function) in the
+        # DataFrame.where method.
+        bf_result = bf_int64_df_filtered.where(
+            is_sum_positive_series_mf, func_for_other
+        ).to_pandas()
+ pd_result = pd_int64_df_filtered.where(is_sum_positive_series, func_for_other)
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_frame_equal(bf_result, pd_result, check_dtype=False)
+
+ finally:
+ # Clean up the gcp assets created for the remote function.
+        cleanup_function_assets(
+            is_sum_positive_series_mf,
+            session.bqclient,
+            session.cloudfunctionsclient,
+            ignore_failures=False,
+        )
+
+
+@pytest.mark.flaky(retries=2, delay=120)
+def test_remote_function_series_where(session, dataset_id, scalars_dfs):
+ try:
+
+ def _ten_times(x):
+ return x * 10
+
+ ten_times_mf = session.remote_function(
+ input_types=float,
+ output_type=float,
+ dataset=dataset_id,
+ reuse=False,
+ cloud_function_service_account="default",
+ )(_ten_times)
+
+ scalars, scalars_pandas = scalars_dfs
+
+        bf_float64 = scalars["float64_col"]
+        bf_float64_filtered = bf_float64.dropna()
+        pd_float64 = scalars_pandas["float64_col"]
+        pd_float64_filtered = pd_float64.dropna()
+
+        # The cond is not a callable, and the other is a callable (a remote
+        # function) in the Series.where method.
+        bf_result = bf_float64_filtered.where(
+            cond=bf_float64_filtered < 0, other=ten_times_mf
+        ).to_pandas()
+        pd_result = pd_float64_filtered.where(
+            cond=pd_float64_filtered < 0, other=_ten_times
+        )
+
+ # Ignore any dtype difference.
+ pandas.testing.assert_series_equal(bf_result, pd_result, check_dtype=False)
+
+ finally:
+ # Clean up the gcp assets created for the remote function.
+        cleanup_function_assets(
+            ten_times_mf, session.bqclient, session.cloudfunctionsclient, ignore_failures=False
+        )
diff --git a/tests/system/small/engines/test_bool_ops.py b/tests/system/small/engines/test_bool_ops.py
new file mode 100644
index 0000000000..065a43c209
--- /dev/null
+++ b/tests/system/small/engines/test_bool_ops.py
@@ -0,0 +1,64 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import itertools
+
+import pytest
+
+from bigframes.core import array_value
+import bigframes.operations as ops
+from bigframes.session import polars_executor
+from bigframes.testing.engine_utils import assert_equivalence_execution
+
+pytest.importorskip("polars")
+
+# Polars is used as the reference engine because it's fast and local. Where the engines disagree, though, prefer the bq engine.
+REFERENCE_ENGINE = polars_executor.PolarsExecutor()
+
+
+def apply_op_pairwise(
+    array: array_value.ArrayValue, op: ops.BinaryOp, excluded_cols=()
+) -> array_value.ArrayValue:
+ exprs = []
+ for l_arg, r_arg in itertools.permutations(array.column_ids, 2):
+ if (l_arg in excluded_cols) or (r_arg in excluded_cols):
+ continue
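+        # Keep only the pairs whose dtypes the op supports; output_type
+        # raises TypeError for unsupported combinations.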
+ try:
+ _ = op.output_type(
+ array.get_column_type(l_arg), array.get_column_type(r_arg)
+ )
+ exprs.append(op.as_expr(l_arg, r_arg))
+ except TypeError:
+ continue
+ assert len(exprs) > 0
+ new_arr, _ = array.compute_values(exprs)
+ return new_arr
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize(
+ "op",
+ [
+ ops.and_op,
+ ops.or_op,
+ ops.xor_op,
+ ],
+)
+def test_engines_project_boolean_op(
+ scalars_array_value: array_value.ArrayValue, engine, op
+):
+    # Exclude string cols, whose dtype doesn't support boolean ops.
+    # Note: the bool col doesn't work properly with the bq engine.
+ arr = apply_op_pairwise(scalars_array_value, op, excluded_cols=["string_col"])
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
diff --git a/tests/system/small/engines/test_generic_ops.py b/tests/system/small/engines/test_generic_ops.py
index af114991eb..9fdb6bca78 100644
--- a/tests/system/small/engines/test_generic_ops.py
+++ b/tests/system/small/engines/test_generic_ops.py
@@ -59,6 +59,7 @@ def test_engines_astype_int(scalars_array_value: array_value.ArrayValue, engine)
ops.AsTypeOp(to_type=bigframes.dtypes.INT_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -73,6 +74,7 @@ def test_engines_astype_string_int(scalars_array_value: array_value.ArrayValue,
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -83,6 +85,7 @@ def test_engines_astype_float(scalars_array_value: array_value.ArrayValue, engin
ops.AsTypeOp(to_type=bigframes.dtypes.FLOAT_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -99,6 +102,7 @@ def test_engines_astype_string_float(
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -107,6 +111,7 @@ def test_engines_astype_bool(scalars_array_value: array_value.ArrayValue, engine
arr = apply_op(
scalars_array_value, ops.AsTypeOp(to_type=bigframes.dtypes.BOOL_DTYPE)
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -118,6 +123,7 @@ def test_engines_astype_string(scalars_array_value: array_value.ArrayValue, engi
ops.AsTypeOp(to_type=bigframes.dtypes.STRING_DTYPE),
excluded_cols=["float64_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -128,6 +134,7 @@ def test_engines_astype_numeric(scalars_array_value: array_value.ArrayValue, eng
ops.AsTypeOp(to_type=bigframes.dtypes.NUMERIC_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -144,6 +151,7 @@ def test_engines_astype_string_numeric(
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -154,6 +162,7 @@ def test_engines_astype_date(scalars_array_value: array_value.ArrayValue, engine
ops.AsTypeOp(to_type=bigframes.dtypes.DATE_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -170,6 +179,7 @@ def test_engines_astype_string_date(
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -180,6 +190,7 @@ def test_engines_astype_datetime(scalars_array_value: array_value.ArrayValue, en
ops.AsTypeOp(to_type=bigframes.dtypes.DATETIME_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -196,6 +207,7 @@ def test_engines_astype_string_datetime(
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -206,6 +218,7 @@ def test_engines_astype_timestamp(scalars_array_value: array_value.ArrayValue, e
ops.AsTypeOp(to_type=bigframes.dtypes.TIMESTAMP_DTYPE),
excluded_cols=["string_col"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -226,6 +239,7 @@ def test_engines_astype_string_timestamp(
for val in vals
]
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -236,6 +250,7 @@ def test_engines_astype_time(scalars_array_value: array_value.ArrayValue, engine
ops.AsTypeOp(to_type=bigframes.dtypes.TIME_DTYPE),
excluded_cols=["string_col", "int64_col", "int64_too"],
)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -256,6 +271,7 @@ def test_engines_astype_from_json(scalars_array_value: array_value.ArrayValue, e
),
]
arr, _ = scalars_array_value.compute_values(exprs)
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
@@ -265,4 +281,112 @@ def test_engines_astype_timedelta(scalars_array_value: array_value.ArrayValue, e
scalars_array_value,
ops.AsTypeOp(to_type=bigframes.dtypes.TIMEDELTA_DTYPE),
)
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_where_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.where_op.as_expr(
+ expression.deref("int64_col"),
+ expression.deref("bool_col"),
+ expression.deref("float64_col"),
+ )
+ ]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_coalesce_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.coalesce_op.as_expr(
+ expression.deref("int64_col"),
+ expression.deref("float64_col"),
+ )
+ ]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_fillna_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.fillna_op.as_expr(
+ expression.deref("int64_col"),
+ expression.deref("float64_col"),
+ )
+ ]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_casewhen_op_single_case(
+ scalars_array_value: array_value.ArrayValue, engine
+):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.case_when_op.as_expr(
+ expression.deref("bool_col"),
+ expression.deref("int64_col"),
+ )
+ ]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_casewhen_op_double_case(
+ scalars_array_value: array_value.ArrayValue, engine
+):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.case_when_op.as_expr(
+ ops.gt_op.as_expr(expression.deref("int64_col"), expression.const(3)),
+ expression.deref("int64_col"),
+ ops.lt_op.as_expr(expression.deref("int64_col"), expression.const(-3)),
+ expression.deref("int64_too"),
+ )
+ ]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_isnull_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [ops.isnull_op.as_expr(expression.deref("string_col"))]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_notnull_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [ops.notnull_op.as_expr(expression.deref("string_col"))]
+ )
+
+ assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+def test_engines_invert_op(scalars_array_value: array_value.ArrayValue, engine):
+ arr, _ = scalars_array_value.compute_values(
+ [
+ ops.invert_op.as_expr(expression.deref("bytes_col")),
+ ops.invert_op.as_expr(expression.deref("bool_col")),
+ ]
+ )
+
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
diff --git a/tests/system/small/engines/test_numeric_ops.py b/tests/system/small/engines/test_numeric_ops.py
index 7e5b85857b..7928922e41 100644
--- a/tests/system/small/engines/test_numeric_ops.py
+++ b/tests/system/small/engines/test_numeric_ops.py
@@ -71,7 +71,7 @@ def test_engines_project_sub(
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
-@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True)
def test_engines_project_mul(
scalars_array_value: array_value.ArrayValue,
engine,
@@ -80,7 +80,7 @@ def test_engines_project_mul(
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
-@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True)
def test_engines_project_div(scalars_array_value: array_value.ArrayValue, engine):
# TODO: Duration div is sensitive to zeroes
# TODO: Numeric col is sensitive to scale shifts
@@ -90,7 +90,7 @@ def test_engines_project_div(scalars_array_value: array_value.ArrayValue, engine
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
-@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True)
def test_engines_project_div_durations(
scalars_array_value: array_value.ArrayValue, engine
):
@@ -117,7 +117,7 @@ def test_engines_project_div_durations(
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
-@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True)
def test_engines_project_floordiv(
scalars_array_value: array_value.ArrayValue,
engine,
@@ -130,7 +130,7 @@ def test_engines_project_floordiv(
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine)
-@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True)
+@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True)
def test_engines_project_floordiv_durations(
scalars_array_value: array_value.ArrayValue, engine
):
diff --git a/tests/system/small/engines/test_windowing.py b/tests/system/small/engines/test_windowing.py
index f4c2b61e6f..a5f20a47cd 100644
--- a/tests/system/small/engines/test_windowing.py
+++ b/tests/system/small/engines/test_windowing.py
@@ -12,10 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+from google.cloud import bigquery
import pytest
-from bigframes.core import array_value
-from bigframes.session import polars_executor
+from bigframes.core import array_value, expression, identifiers, nodes, window_spec
+import bigframes.operations.aggregations as agg_ops
+from bigframes.session import direct_gbq_execution, polars_executor
from bigframes.testing.engine_utils import assert_equivalence_execution
pytest.importorskip("polars")
@@ -31,3 +33,30 @@ def test_engines_with_offsets(
):
result, _ = scalars_array_value.promote_offsets()
assert_equivalence_execution(result.node, REFERENCE_ENGINE, engine)
+
+
+@pytest.mark.parametrize("never_skip_nulls", [True, False])
+@pytest.mark.parametrize("agg_op", [agg_ops.sum_op, agg_ops.count_op])
+def test_engines_with_rows_window(
+ scalars_array_value: array_value.ArrayValue,
+ bigquery_client: bigquery.Client,
+ never_skip_nulls,
+ agg_op,
+):
+ window = window_spec.WindowSpec(
+ bounds=window_spec.RowsWindowBounds.from_window_size(3, "left"),
+ )
+ window_node = nodes.WindowOpNode(
+ child=scalars_array_value.node,
+ expression=expression.UnaryAggregation(agg_op, expression.deref("int64_too")),
+ window_spec=window,
+ output_name=identifiers.ColumnId("agg_int64"),
+ never_skip_nulls=never_skip_nulls,
+ skip_reproject_unsafe=False,
+ )
+
+ bq_executor = direct_gbq_execution.DirectGbqExecutor(bigquery_client)
+    bq_sqlglot_executor = direct_gbq_execution.DirectGbqExecutor(
+        bigquery_client, compiler="sqlglot"
+    )
+    assert_equivalence_execution(window_node, bq_executor, bq_sqlglot_executor)
diff --git a/tests/system/small/test_dataframe.py b/tests/system/small/test_dataframe.py
index 50989ae150..3b70dec0e9 100644
--- a/tests/system/small/test_dataframe.py
+++ b/tests/system/small/test_dataframe.py
@@ -2070,6 +2070,26 @@ def test_reset_index(scalars_df_index, scalars_pandas_df_index, drop):
pandas.testing.assert_frame_equal(bf_result, pd_result)
+@pytest.mark.parametrize(
+ ("drop",),
+ ((True,), (False,)),
+)
+def test_reset_index_inplace(scalars_df_index, scalars_pandas_df_index, drop):
+ df = scalars_df_index.copy()
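+    # inplace=True mutates the DataFrame in place and returns None.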
+ df.reset_index(drop=drop, inplace=True)
+ assert df.index.name is None
+
+ bf_result = df.to_pandas()
+ pd_result = scalars_pandas_df_index.copy()
+ pd_result.reset_index(drop=drop, inplace=True)
+
+ # Pandas uses int64 instead of Int64 (nullable) dtype.
+ pd_result.index = pd_result.index.astype(pd.Int64Dtype())
+
+ # reset_index should maintain the original ordering.
+ pandas.testing.assert_frame_equal(bf_result, pd_result)
+
+
def test_reset_index_then_filter(
scalars_df_index,
scalars_pandas_df_index,
diff --git a/tests/system/small/test_multiindex.py b/tests/system/small/test_multiindex.py
index 13b5b1886f..0c23ea97ae 100644
--- a/tests/system/small/test_multiindex.py
+++ b/tests/system/small/test_multiindex.py
@@ -101,20 +101,69 @@ def test_set_multi_index(scalars_df_index, scalars_pandas_df_index):
pandas.testing.assert_frame_equal(bf_result, pd_result)
-def test_reset_multi_index(scalars_df_index, scalars_pandas_df_index):
+@pytest.mark.parametrize(
+ ("level", "drop"),
+ [
+ (None, True),
+ (None, False),
+ (1, True),
+ ("bool_col", True),
+ (["float64_col", "int64_too"], True),
+ ([2, 0], False),
+ ],
+)
+def test_df_reset_multi_index(scalars_df_index, scalars_pandas_df_index, level, drop):
bf_result = (
- scalars_df_index.set_index(["bool_col", "int64_too"]).reset_index().to_pandas()
+ scalars_df_index.set_index(["bool_col", "int64_too", "float64_col"])
+ .reset_index(level=level, drop=drop)
+ .to_pandas()
)
pd_result = scalars_pandas_df_index.set_index(
- ["bool_col", "int64_too"]
- ).reset_index()
+ ["bool_col", "int64_too", "float64_col"]
+ ).reset_index(level=level, drop=drop)
# Pandas uses int64 instead of Int64 (nullable) dtype.
- pd_result.index = pd_result.index.astype(pandas.Int64Dtype())
+ if pd_result.index.dtype != bf_result.index.dtype:
+ pd_result.index = pd_result.index.astype(pandas.Int64Dtype())
pandas.testing.assert_frame_equal(bf_result, pd_result)
+@pytest.mark.parametrize(
+ ("level", "drop"),
+ [
+ (None, True),
+ (None, False),
+ (1, True),
+ ("bool_col", True),
+ (["float64_col", "int64_too"], True),
+ ([2, 0], False),
+ ],
+)
+def test_series_reset_multi_index(
+ scalars_df_index, scalars_pandas_df_index, level, drop
+):
+ bf_result = (
+ scalars_df_index.set_index(["bool_col", "int64_too", "float64_col"])[
+ "string_col"
+ ]
+ .reset_index(level=level, drop=drop)
+ .to_pandas()
+ )
+ pd_result = scalars_pandas_df_index.set_index(
+ ["bool_col", "int64_too", "float64_col"]
+ )["string_col"].reset_index(level=level, drop=drop)
+
+ # Pandas uses int64 instead of Int64 (nullable) dtype.
+ if pd_result.index.dtype != bf_result.index.dtype:
+ pd_result.index = pd_result.index.astype(pandas.Int64Dtype())
+
+ if drop:
+ pandas.testing.assert_series_equal(bf_result, pd_result)
+ else:
+ pandas.testing.assert_frame_equal(bf_result, pd_result)
+
+
def test_series_multi_index_idxmin(scalars_df_index, scalars_pandas_df_index):
bf_result = scalars_df_index.set_index(["bool_col", "int64_too"])[
"float64_col"
diff --git a/tests/system/small/test_series.py b/tests/system/small/test_series.py
index e94250e98f..2172962046 100644
--- a/tests/system/small/test_series.py
+++ b/tests/system/small/test_series.py
@@ -1339,6 +1339,18 @@ def test_reset_index_drop(scalars_df_index, scalars_pandas_df_index):
pd.testing.assert_series_equal(bf_result.to_pandas(), pd_result)
+def test_series_reset_index_inplace(scalars_df_index, scalars_pandas_df_index):
+ bf_result = scalars_df_index.sort_index(ascending=False)["float64_col"]
+ bf_result.reset_index(drop=True, inplace=True)
+ pd_result = scalars_pandas_df_index.sort_index(ascending=False)["float64_col"]
+ pd_result.reset_index(drop=True, inplace=True)
+
+    # BigQuery DataFrames default indices always use the nullable Int64 dtype.
+ pd_result.index = pd_result.index.astype("Int64")
+
+ pd.testing.assert_series_equal(bf_result.to_pandas(), pd_result)
+
+
@pytest.mark.parametrize(
("name",),
[
@@ -3097,6 +3109,26 @@ def test_where_with_default(scalars_df_index, scalars_pandas_df_index):
)
+def test_where_with_callable(scalars_df_index, scalars_pandas_df_index):
+ def _is_positive(x):
+ return x > 0
+
+ # Both cond and other are callable.
+ bf_result = (
+ scalars_df_index["int64_col"]
+ .where(cond=_is_positive, other=lambda x: x * 10)
+ .to_pandas()
+ )
+ pd_result = scalars_pandas_df_index["int64_col"].where(
+ cond=_is_positive, other=lambda x: x * 10
+ )
+
+ pd.testing.assert_series_equal(
+ bf_result,
+ pd_result,
+ )
+
+
@pytest.mark.parametrize(
("ordered"),
[
diff --git a/tests/unit/_config/test_threaded_options.py b/tests/unit/_config/test_threaded_options.py
index 7fc97a9f72..b16a3550bc 100644
--- a/tests/unit/_config/test_threaded_options.py
+++ b/tests/unit/_config/test_threaded_options.py
@@ -37,5 +37,5 @@ def mutate_options_threaded(options, result_dict):
assert result_dict["this_before"] == 50
assert result_dict["this_after"] == 50
- assert result_dict["other_before"] == 25
+ assert result_dict["other_before"] == 10
assert result_dict["other_after"] == 100
diff --git a/tests/unit/core/compile/sqlglot/conftest.py b/tests/unit/core/compile/sqlglot/conftest.py
index 754c19ac90..f65343fd66 100644
--- a/tests/unit/core/compile/sqlglot/conftest.py
+++ b/tests/unit/core/compile/sqlglot/conftest.py
@@ -89,6 +89,7 @@ def scalar_types_table_schema() -> typing.Sequence[bigquery.SchemaField]:
bigquery.SchemaField("string_col", "STRING"),
bigquery.SchemaField("time_col", "TIME"),
bigquery.SchemaField("timestamp_col", "TIMESTAMP"),
+ bigquery.SchemaField("duration_col", "INTEGER"),
]
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_numeric/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_numeric/out.sql
new file mode 100644
index 0000000000..03d48276a0
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_numeric/out.sql
@@ -0,0 +1,122 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `bool_col` AS `bfcol_0`,
+ `int64_col` AS `bfcol_1`,
+ `float64_col` AS `bfcol_2`,
+ `rowindex` AS `bfcol_3`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_3` AS `bfcol_8`,
+ `bfcol_1` AS `bfcol_9`,
+ `bfcol_0` AS `bfcol_10`,
+ `bfcol_2` AS `bfcol_11`,
+ IEEE_DIVIDE(`bfcol_1`, `bfcol_1`) AS `bfcol_12`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ `bfcol_8` AS `bfcol_18`,
+ `bfcol_9` AS `bfcol_19`,
+ `bfcol_10` AS `bfcol_20`,
+ `bfcol_11` AS `bfcol_21`,
+ `bfcol_12` AS `bfcol_22`,
+ IEEE_DIVIDE(`bfcol_9`, 1) AS `bfcol_23`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_18` AS `bfcol_30`,
+ `bfcol_19` AS `bfcol_31`,
+ `bfcol_20` AS `bfcol_32`,
+ `bfcol_21` AS `bfcol_33`,
+ `bfcol_22` AS `bfcol_34`,
+ `bfcol_23` AS `bfcol_35`,
+ IEEE_DIVIDE(`bfcol_19`, 0.0) AS `bfcol_36`
+ FROM `bfcte_2`
+), `bfcte_4` AS (
+ SELECT
+ *,
+ `bfcol_30` AS `bfcol_44`,
+ `bfcol_31` AS `bfcol_45`,
+ `bfcol_32` AS `bfcol_46`,
+ `bfcol_33` AS `bfcol_47`,
+ `bfcol_34` AS `bfcol_48`,
+ `bfcol_35` AS `bfcol_49`,
+ `bfcol_36` AS `bfcol_50`,
+ IEEE_DIVIDE(`bfcol_31`, `bfcol_33`) AS `bfcol_51`
+ FROM `bfcte_3`
+), `bfcte_5` AS (
+ SELECT
+ *,
+ `bfcol_44` AS `bfcol_60`,
+ `bfcol_45` AS `bfcol_61`,
+ `bfcol_46` AS `bfcol_62`,
+ `bfcol_47` AS `bfcol_63`,
+ `bfcol_48` AS `bfcol_64`,
+ `bfcol_49` AS `bfcol_65`,
+ `bfcol_50` AS `bfcol_66`,
+ `bfcol_51` AS `bfcol_67`,
+ IEEE_DIVIDE(`bfcol_47`, `bfcol_45`) AS `bfcol_68`
+ FROM `bfcte_4`
+), `bfcte_6` AS (
+ SELECT
+ *,
+ `bfcol_60` AS `bfcol_78`,
+ `bfcol_61` AS `bfcol_79`,
+ `bfcol_62` AS `bfcol_80`,
+ `bfcol_63` AS `bfcol_81`,
+ `bfcol_64` AS `bfcol_82`,
+ `bfcol_65` AS `bfcol_83`,
+ `bfcol_66` AS `bfcol_84`,
+ `bfcol_67` AS `bfcol_85`,
+ `bfcol_68` AS `bfcol_86`,
+ IEEE_DIVIDE(`bfcol_63`, 0.0) AS `bfcol_87`
+ FROM `bfcte_5`
+), `bfcte_7` AS (
+ SELECT
+ *,
+ `bfcol_78` AS `bfcol_98`,
+ `bfcol_79` AS `bfcol_99`,
+ `bfcol_80` AS `bfcol_100`,
+ `bfcol_81` AS `bfcol_101`,
+ `bfcol_82` AS `bfcol_102`,
+ `bfcol_83` AS `bfcol_103`,
+ `bfcol_84` AS `bfcol_104`,
+ `bfcol_85` AS `bfcol_105`,
+ `bfcol_86` AS `bfcol_106`,
+ `bfcol_87` AS `bfcol_107`,
+ IEEE_DIVIDE(`bfcol_79`, CAST(`bfcol_80` AS INT64)) AS `bfcol_108`
+ FROM `bfcte_6`
+), `bfcte_8` AS (
+ SELECT
+ *,
+ `bfcol_98` AS `bfcol_120`,
+ `bfcol_99` AS `bfcol_121`,
+ `bfcol_100` AS `bfcol_122`,
+ `bfcol_101` AS `bfcol_123`,
+ `bfcol_102` AS `bfcol_124`,
+ `bfcol_103` AS `bfcol_125`,
+ `bfcol_104` AS `bfcol_126`,
+ `bfcol_105` AS `bfcol_127`,
+ `bfcol_106` AS `bfcol_128`,
+ `bfcol_107` AS `bfcol_129`,
+ `bfcol_108` AS `bfcol_130`,
+ IEEE_DIVIDE(CAST(`bfcol_100` AS INT64), `bfcol_99`) AS `bfcol_131`
+ FROM `bfcte_7`
+)
+SELECT
+ `bfcol_120` AS `rowindex`,
+ `bfcol_121` AS `int64_col`,
+ `bfcol_122` AS `bool_col`,
+ `bfcol_123` AS `float64_col`,
+ `bfcol_124` AS `int_div_int`,
+ `bfcol_125` AS `int_div_1`,
+ `bfcol_126` AS `int_div_0`,
+ `bfcol_127` AS `int_div_float`,
+ `bfcol_128` AS `float_div_int`,
+ `bfcol_129` AS `float_div_0`,
+ `bfcol_130` AS `int_div_bool`,
+ `bfcol_131` AS `bool_div_int`
+FROM `bfcte_8`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_timedelta/out.sql
new file mode 100644
index 0000000000..6e05302fc9
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_div_timedelta/out.sql
@@ -0,0 +1,21 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `int64_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`,
+ `timestamp_col` AS `bfcol_2`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_1` AS `bfcol_6`,
+ `bfcol_2` AS `bfcol_7`,
+ `bfcol_0` AS `bfcol_8`,
+ CAST(FLOOR(IEEE_DIVIDE(86400000000, `bfcol_0`)) AS INT64) AS `bfcol_9`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_6` AS `rowindex`,
+ `bfcol_7` AS `timestamp_col`,
+ `bfcol_8` AS `int64_col`,
+ `bfcol_9` AS `timedelta_div_numeric`
+FROM `bfcte_1`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_numeric/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_numeric/out.sql
new file mode 100644
index 0000000000..c38bc18523
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_numeric/out.sql
@@ -0,0 +1,154 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `bool_col` AS `bfcol_0`,
+ `int64_col` AS `bfcol_1`,
+ `float64_col` AS `bfcol_2`,
+ `rowindex` AS `bfcol_3`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_3` AS `bfcol_8`,
+ `bfcol_1` AS `bfcol_9`,
+ `bfcol_0` AS `bfcol_10`,
+ `bfcol_2` AS `bfcol_11`,
+ CASE
+ WHEN `bfcol_1` = CAST(0 AS INT64)
+ THEN CAST(0 AS INT64) * `bfcol_1`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_1`, `bfcol_1`)) AS INT64)
+ END AS `bfcol_12`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ `bfcol_8` AS `bfcol_18`,
+ `bfcol_9` AS `bfcol_19`,
+ `bfcol_10` AS `bfcol_20`,
+ `bfcol_11` AS `bfcol_21`,
+ `bfcol_12` AS `bfcol_22`,
+ CASE
+ WHEN 1 = CAST(0 AS INT64)
+ THEN CAST(0 AS INT64) * `bfcol_9`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_9`, 1)) AS INT64)
+ END AS `bfcol_23`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_18` AS `bfcol_30`,
+ `bfcol_19` AS `bfcol_31`,
+ `bfcol_20` AS `bfcol_32`,
+ `bfcol_21` AS `bfcol_33`,
+ `bfcol_22` AS `bfcol_34`,
+ `bfcol_23` AS `bfcol_35`,
+ CASE
+ WHEN 0.0 = CAST(0 AS INT64)
+ THEN CAST('Infinity' AS FLOAT64) * `bfcol_19`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_19`, 0.0)) AS INT64)
+ END AS `bfcol_36`
+ FROM `bfcte_2`
+), `bfcte_4` AS (
+ SELECT
+ *,
+ `bfcol_30` AS `bfcol_44`,
+ `bfcol_31` AS `bfcol_45`,
+ `bfcol_32` AS `bfcol_46`,
+ `bfcol_33` AS `bfcol_47`,
+ `bfcol_34` AS `bfcol_48`,
+ `bfcol_35` AS `bfcol_49`,
+ `bfcol_36` AS `bfcol_50`,
+ CASE
+ WHEN `bfcol_33` = CAST(0 AS INT64)
+ THEN CAST('Infinity' AS FLOAT64) * `bfcol_31`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_31`, `bfcol_33`)) AS INT64)
+ END AS `bfcol_51`
+ FROM `bfcte_3`
+), `bfcte_5` AS (
+ SELECT
+ *,
+ `bfcol_44` AS `bfcol_60`,
+ `bfcol_45` AS `bfcol_61`,
+ `bfcol_46` AS `bfcol_62`,
+ `bfcol_47` AS `bfcol_63`,
+ `bfcol_48` AS `bfcol_64`,
+ `bfcol_49` AS `bfcol_65`,
+ `bfcol_50` AS `bfcol_66`,
+ `bfcol_51` AS `bfcol_67`,
+ CASE
+ WHEN `bfcol_45` = CAST(0 AS INT64)
+ THEN CAST('Infinity' AS FLOAT64) * `bfcol_47`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_47`, `bfcol_45`)) AS INT64)
+ END AS `bfcol_68`
+ FROM `bfcte_4`
+), `bfcte_6` AS (
+ SELECT
+ *,
+ `bfcol_60` AS `bfcol_78`,
+ `bfcol_61` AS `bfcol_79`,
+ `bfcol_62` AS `bfcol_80`,
+ `bfcol_63` AS `bfcol_81`,
+ `bfcol_64` AS `bfcol_82`,
+ `bfcol_65` AS `bfcol_83`,
+ `bfcol_66` AS `bfcol_84`,
+ `bfcol_67` AS `bfcol_85`,
+ `bfcol_68` AS `bfcol_86`,
+ CASE
+ WHEN 0.0 = CAST(0 AS INT64)
+ THEN CAST('Infinity' AS FLOAT64) * `bfcol_63`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_63`, 0.0)) AS INT64)
+ END AS `bfcol_87`
+ FROM `bfcte_5`
+), `bfcte_7` AS (
+ SELECT
+ *,
+ `bfcol_78` AS `bfcol_98`,
+ `bfcol_79` AS `bfcol_99`,
+ `bfcol_80` AS `bfcol_100`,
+ `bfcol_81` AS `bfcol_101`,
+ `bfcol_82` AS `bfcol_102`,
+ `bfcol_83` AS `bfcol_103`,
+ `bfcol_84` AS `bfcol_104`,
+ `bfcol_85` AS `bfcol_105`,
+ `bfcol_86` AS `bfcol_106`,
+ `bfcol_87` AS `bfcol_107`,
+ CASE
+ WHEN CAST(`bfcol_80` AS INT64) = CAST(0 AS INT64)
+ THEN CAST(0 AS INT64) * `bfcol_79`
+ ELSE CAST(FLOOR(IEEE_DIVIDE(`bfcol_79`, CAST(`bfcol_80` AS INT64))) AS INT64)
+ END AS `bfcol_108`
+ FROM `bfcte_6`
+), `bfcte_8` AS (
+ SELECT
+ *,
+ `bfcol_98` AS `bfcol_120`,
+ `bfcol_99` AS `bfcol_121`,
+ `bfcol_100` AS `bfcol_122`,
+ `bfcol_101` AS `bfcol_123`,
+ `bfcol_102` AS `bfcol_124`,
+ `bfcol_103` AS `bfcol_125`,
+ `bfcol_104` AS `bfcol_126`,
+ `bfcol_105` AS `bfcol_127`,
+ `bfcol_106` AS `bfcol_128`,
+ `bfcol_107` AS `bfcol_129`,
+ `bfcol_108` AS `bfcol_130`,
+ CASE
+ WHEN `bfcol_99` = CAST(0 AS INT64)
+ THEN CAST(0 AS INT64) * CAST(`bfcol_100` AS INT64)
+ ELSE CAST(FLOOR(IEEE_DIVIDE(CAST(`bfcol_100` AS INT64), `bfcol_99`)) AS INT64)
+ END AS `bfcol_131`
+ FROM `bfcte_7`
+)
+SELECT
+ `bfcol_120` AS `rowindex`,
+ `bfcol_121` AS `int64_col`,
+ `bfcol_122` AS `bool_col`,
+ `bfcol_123` AS `float64_col`,
+ `bfcol_124` AS `int_div_int`,
+ `bfcol_125` AS `int_div_1`,
+ `bfcol_126` AS `int_div_0`,
+ `bfcol_127` AS `int_div_float`,
+ `bfcol_128` AS `float_div_int`,
+ `bfcol_129` AS `float_div_0`,
+ `bfcol_130` AS `int_div_bool`,
+ `bfcol_131` AS `bool_div_int`
+FROM `bfcte_8`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_timedelta/out.sql
new file mode 100644
index 0000000000..bc4f94d306
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_floordiv_timedelta/out.sql
@@ -0,0 +1,18 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `date_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`,
+ `timestamp_col` AS `bfcol_2`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ 43200000000 AS `bfcol_6`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_1` AS `rowindex`,
+ `bfcol_2` AS `timestamp_col`,
+ `bfcol_0` AS `date_col`,
+ `bfcol_6` AS `timedelta_div_numeric`
+FROM `bfcte_1`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_numeric/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_numeric/out.sql
new file mode 100644
index 0000000000..a9c81f4744
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_numeric/out.sql
@@ -0,0 +1,54 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `bool_col` AS `bfcol_0`,
+ `int64_col` AS `bfcol_1`,
+ `rowindex` AS `bfcol_2`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_2` AS `bfcol_6`,
+ `bfcol_1` AS `bfcol_7`,
+ `bfcol_0` AS `bfcol_8`,
+ `bfcol_1` * `bfcol_1` AS `bfcol_9`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ `bfcol_6` AS `bfcol_14`,
+ `bfcol_7` AS `bfcol_15`,
+ `bfcol_8` AS `bfcol_16`,
+ `bfcol_9` AS `bfcol_17`,
+ `bfcol_7` * 1 AS `bfcol_18`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_14` AS `bfcol_24`,
+ `bfcol_15` AS `bfcol_25`,
+ `bfcol_16` AS `bfcol_26`,
+ `bfcol_17` AS `bfcol_27`,
+ `bfcol_18` AS `bfcol_28`,
+ `bfcol_15` * CAST(`bfcol_16` AS INT64) AS `bfcol_29`
+ FROM `bfcte_2`
+), `bfcte_4` AS (
+ SELECT
+ *,
+ `bfcol_24` AS `bfcol_36`,
+ `bfcol_25` AS `bfcol_37`,
+ `bfcol_26` AS `bfcol_38`,
+ `bfcol_27` AS `bfcol_39`,
+ `bfcol_28` AS `bfcol_40`,
+ `bfcol_29` AS `bfcol_41`,
+ CAST(`bfcol_26` AS INT64) * `bfcol_25` AS `bfcol_42`
+ FROM `bfcte_3`
+)
+SELECT
+ `bfcol_36` AS `rowindex`,
+ `bfcol_37` AS `int64_col`,
+ `bfcol_38` AS `bool_col`,
+ `bfcol_39` AS `int_mul_int`,
+ `bfcol_40` AS `int_mul_1`,
+ `bfcol_41` AS `int_mul_bool`,
+ `bfcol_42` AS `bool_mul_int`
+FROM `bfcte_4`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_timedelta/out.sql
new file mode 100644
index 0000000000..c8a8cf6cbf
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_mul_timedelta/out.sql
@@ -0,0 +1,43 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `int64_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`,
+ `timestamp_col` AS `bfcol_2`,
+ `duration_col` AS `bfcol_3`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_1` AS `bfcol_8`,
+ `bfcol_2` AS `bfcol_9`,
+ `bfcol_0` AS `bfcol_10`,
+ INTERVAL `bfcol_3` MICROSECOND AS `bfcol_11`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ `bfcol_8` AS `bfcol_16`,
+ `bfcol_9` AS `bfcol_17`,
+ `bfcol_10` AS `bfcol_18`,
+ `bfcol_11` AS `bfcol_19`,
+ CAST(FLOOR(`bfcol_11` * `bfcol_10`) AS INT64) AS `bfcol_20`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_16` AS `bfcol_26`,
+ `bfcol_17` AS `bfcol_27`,
+ `bfcol_18` AS `bfcol_28`,
+ `bfcol_19` AS `bfcol_29`,
+ `bfcol_20` AS `bfcol_30`,
+ CAST(FLOOR(`bfcol_18` * `bfcol_19`) AS INT64) AS `bfcol_31`
+ FROM `bfcte_2`
+)
+SELECT
+ `bfcol_26` AS `rowindex`,
+ `bfcol_27` AS `timestamp_col`,
+ `bfcol_28` AS `int64_col`,
+ `bfcol_29` AS `duration_col`,
+ `bfcol_30` AS `timedelta_mul_numeric`,
+ `bfcol_31` AS `numeric_mul_timedelta`
+FROM `bfcte_3`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_obj_make_ref/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_obj_make_ref/out.sql
new file mode 100644
index 0000000000..e3228feaaa
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_obj_make_ref/out.sql
@@ -0,0 +1,15 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `rowindex` AS `bfcol_0`,
+ `string_col` AS `bfcol_1`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ OBJ.MAKE_REF(`bfcol_1`, 'bigframes-dev.test-region.bigframes-default-connection') AS `bfcol_4`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_0` AS `rowindex`,
+ `bfcol_4` AS `string_col`
+FROM `bfcte_1`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql
index 41e45d3333..460f941d1b 100644
--- a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql
@@ -2,59 +2,81 @@ WITH `bfcte_0` AS (
SELECT
`date_col` AS `bfcol_0`,
`rowindex` AS `bfcol_1`,
- `timestamp_col` AS `bfcol_2`
+ `timestamp_col` AS `bfcol_2`,
+ `duration_col` AS `bfcol_3`
FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
), `bfcte_1` AS (
SELECT
*,
- `bfcol_1` AS `bfcol_6`,
- `bfcol_2` AS `bfcol_7`,
- `bfcol_0` AS `bfcol_8`,
- TIMESTAMP_SUB(CAST(`bfcol_0` AS DATETIME), INTERVAL 86400000000 MICROSECOND) AS `bfcol_9`
+ `bfcol_1` AS `bfcol_8`,
+ `bfcol_2` AS `bfcol_9`,
+ `bfcol_0` AS `bfcol_10`,
+ INTERVAL `bfcol_3` MICROSECOND AS `bfcol_11`
FROM `bfcte_0`
), `bfcte_2` AS (
SELECT
*,
- `bfcol_6` AS `bfcol_14`,
- `bfcol_7` AS `bfcol_15`,
`bfcol_8` AS `bfcol_16`,
`bfcol_9` AS `bfcol_17`,
- TIMESTAMP_SUB(`bfcol_7`, INTERVAL 86400000000 MICROSECOND) AS `bfcol_18`
+ `bfcol_11` AS `bfcol_18`,
+ `bfcol_10` AS `bfcol_19`,
+ TIMESTAMP_SUB(CAST(`bfcol_10` AS DATETIME), INTERVAL `bfcol_11` MICROSECOND) AS `bfcol_20`
FROM `bfcte_1`
), `bfcte_3` AS (
SELECT
*,
- `bfcol_14` AS `bfcol_24`,
- `bfcol_15` AS `bfcol_25`,
`bfcol_16` AS `bfcol_26`,
`bfcol_17` AS `bfcol_27`,
`bfcol_18` AS `bfcol_28`,
- TIMESTAMP_DIFF(CAST(`bfcol_16` AS DATETIME), CAST(`bfcol_16` AS DATETIME), MICROSECOND) AS `bfcol_29`
+ `bfcol_19` AS `bfcol_29`,
+ `bfcol_20` AS `bfcol_30`,
+ TIMESTAMP_SUB(`bfcol_17`, INTERVAL `bfcol_18` MICROSECOND) AS `bfcol_31`
FROM `bfcte_2`
), `bfcte_4` AS (
SELECT
*,
- `bfcol_24` AS `bfcol_36`,
- `bfcol_25` AS `bfcol_37`,
`bfcol_26` AS `bfcol_38`,
`bfcol_27` AS `bfcol_39`,
`bfcol_28` AS `bfcol_40`,
`bfcol_29` AS `bfcol_41`,
- TIMESTAMP_DIFF(`bfcol_25`, `bfcol_25`, MICROSECOND) AS `bfcol_42`
+ `bfcol_30` AS `bfcol_42`,
+ `bfcol_31` AS `bfcol_43`,
+ TIMESTAMP_DIFF(CAST(`bfcol_29` AS DATETIME), CAST(`bfcol_29` AS DATETIME), MICROSECOND) AS `bfcol_44`
FROM `bfcte_3`
), `bfcte_5` AS (
SELECT
*,
- 0 AS `bfcol_50`
+ `bfcol_38` AS `bfcol_52`,
+ `bfcol_39` AS `bfcol_53`,
+ `bfcol_40` AS `bfcol_54`,
+ `bfcol_41` AS `bfcol_55`,
+ `bfcol_42` AS `bfcol_56`,
+ `bfcol_43` AS `bfcol_57`,
+ `bfcol_44` AS `bfcol_58`,
+ TIMESTAMP_DIFF(`bfcol_39`, `bfcol_39`, MICROSECOND) AS `bfcol_59`
FROM `bfcte_4`
+), `bfcte_6` AS (
+ SELECT
+ *,
+ `bfcol_52` AS `bfcol_68`,
+ `bfcol_53` AS `bfcol_69`,
+ `bfcol_54` AS `bfcol_70`,
+ `bfcol_55` AS `bfcol_71`,
+ `bfcol_56` AS `bfcol_72`,
+ `bfcol_57` AS `bfcol_73`,
+ `bfcol_58` AS `bfcol_74`,
+ `bfcol_59` AS `bfcol_75`,
+ `bfcol_54` - `bfcol_54` AS `bfcol_76`
+ FROM `bfcte_5`
)
SELECT
- `bfcol_36` AS `rowindex`,
- `bfcol_37` AS `timestamp_col`,
- `bfcol_38` AS `date_col`,
- `bfcol_39` AS `date_sub_timedelta`,
- `bfcol_40` AS `timestamp_sub_timedelta`,
- `bfcol_41` AS `timestamp_sub_date`,
- `bfcol_42` AS `date_sub_timestamp`,
- `bfcol_50` AS `timedelta_sub_timedelta`
-FROM `bfcte_5`
\ No newline at end of file
+ `bfcol_68` AS `rowindex`,
+ `bfcol_69` AS `timestamp_col`,
+ `bfcol_70` AS `duration_col`,
+ `bfcol_71` AS `date_col`,
+ `bfcol_72` AS `date_sub_timedelta`,
+ `bfcol_73` AS `timestamp_sub_timedelta`,
+ `bfcol_74` AS `timestamp_sub_date`,
+ `bfcol_75` AS `date_sub_timestamp`,
+ `bfcol_76` AS `timedelta_sub_timedelta`
+FROM `bfcte_6`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_fetch_metadata/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_fetch_metadata/out.sql
new file mode 100644
index 0000000000..134fdc363b
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_fetch_metadata/out.sql
@@ -0,0 +1,25 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `rowindex` AS `bfcol_0`,
+ `string_col` AS `bfcol_1`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ OBJ.MAKE_REF(`bfcol_1`, 'bigframes-dev.test-region.bigframes-default-connection') AS `bfcol_4`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ OBJ.FETCH_METADATA(`bfcol_4`) AS `bfcol_7`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_7`.`version` AS `bfcol_10`
+ FROM `bfcte_2`
+)
+SELECT
+ `bfcol_0` AS `rowindex`,
+ `bfcol_10` AS `version`
+FROM `bfcte_3`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_get_access_url/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_get_access_url/out.sql
new file mode 100644
index 0000000000..4a963b4972
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_obj_get_access_url/out.sql
@@ -0,0 +1,25 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `rowindex` AS `bfcol_0`,
+ `string_col` AS `bfcol_1`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ OBJ.MAKE_REF(`bfcol_1`, 'bigframes-dev.test-region.bigframes-default-connection') AS `bfcol_4`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ OBJ.GET_ACCESS_URL(`bfcol_4`) AS `bfcol_7`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ JSON_VALUE(`bfcol_7`, '$.access_urls.read_url') AS `bfcol_10`
+ FROM `bfcte_2`
+)
+SELECT
+ `bfcol_0` AS `rowindex`,
+ `bfcol_10` AS `string_col`
+FROM `bfcte_3`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_to_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_to_timedelta/out.sql
index b89056d65f..01ebebc455 100644
--- a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_to_timedelta/out.sql
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_unary_compiler/test_to_timedelta/out.sql
@@ -1,13 +1,37 @@
WITH `bfcte_0` AS (
SELECT
- `int64_col` AS `bfcol_0`
+ `int64_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`
FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
), `bfcte_1` AS (
SELECT
*,
- INTERVAL `bfcol_0` SECOND AS `bfcol_1`
+ `bfcol_1` AS `bfcol_4`,
+ `bfcol_0` AS `bfcol_5`,
+ INTERVAL `bfcol_0` MICROSECOND AS `bfcol_6`
FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *,
+ `bfcol_4` AS `bfcol_10`,
+ `bfcol_5` AS `bfcol_11`,
+ `bfcol_6` AS `bfcol_12`,
+ INTERVAL (`bfcol_5` * 1000000) MICROSECOND AS `bfcol_13`
+ FROM `bfcte_1`
+), `bfcte_3` AS (
+ SELECT
+ *,
+ `bfcol_10` AS `bfcol_18`,
+ `bfcol_11` AS `bfcol_19`,
+ `bfcol_12` AS `bfcol_20`,
+ `bfcol_13` AS `bfcol_21`,
+ INTERVAL (`bfcol_11` * 604800000000) MICROSECOND AS `bfcol_22`
+ FROM `bfcte_2`
)
SELECT
- `bfcol_1` AS `int64_col`
-FROM `bfcte_1`
\ No newline at end of file
+ `bfcol_18` AS `rowindex`,
+ `bfcol_19` AS `int64_col`,
+ `bfcol_20` AS `duration_us`,
+ `bfcol_21` AS `duration_s`,
+ `bfcol_22` AS `duration_w`
+FROM `bfcte_3`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py b/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
index 05d9c26945..49426fe6c3 100644
--- a/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
+++ b/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
@@ -82,6 +82,57 @@ def test_add_unsupported_raises(scalar_types_df: bpd.DataFrame):
_apply_binary_op(scalar_types_df, ops.add_op, "int64_col", "string_col")
+def test_div_numeric(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["int64_col", "bool_col", "float64_col"]]
+
+ bf_df["int_div_int"] = bf_df["int64_col"] / bf_df["int64_col"]
+ bf_df["int_div_1"] = bf_df["int64_col"] / 1
+ bf_df["int_div_0"] = bf_df["int64_col"] / 0.0
+
+ bf_df["int_div_float"] = bf_df["int64_col"] / bf_df["float64_col"]
+ bf_df["float_div_int"] = bf_df["float64_col"] / bf_df["int64_col"]
+ bf_df["float_div_0"] = bf_df["float64_col"] / 0.0
+
+ bf_df["int_div_bool"] = bf_df["int64_col"] / bf_df["bool_col"]
+ bf_df["bool_div_int"] = bf_df["bool_col"] / bf_df["int64_col"]
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_div_timedelta(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["timestamp_col", "int64_col"]]
+ timedelta = pd.Timedelta(1, unit="d")
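+    # One day is 86,400,000,000 microseconds; the constant shows up verbatim
+    # in the compiled SQL snapshot.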
+ bf_df["timedelta_div_numeric"] = timedelta / bf_df["int64_col"]
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_floordiv_numeric(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["int64_col", "bool_col", "float64_col"]]
+
+ bf_df["int_div_int"] = bf_df["int64_col"] // bf_df["int64_col"]
+ bf_df["int_div_1"] = bf_df["int64_col"] // 1
+ bf_df["int_div_0"] = bf_df["int64_col"] // 0.0
+
+ bf_df["int_div_float"] = bf_df["int64_col"] // bf_df["float64_col"]
+ bf_df["float_div_int"] = bf_df["float64_col"] // bf_df["int64_col"]
+ bf_df["float_div_0"] = bf_df["float64_col"] // 0.0
+
+ bf_df["int_div_bool"] = bf_df["int64_col"] // bf_df["bool_col"]
+ bf_df["bool_div_int"] = bf_df["bool_col"] // bf_df["int64_col"]
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_floordiv_timedelta(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["timestamp_col", "date_col"]]
+ timedelta = pd.Timedelta(1, unit="d")
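+    # timedelta // 2 folds to the constant 43,200,000,000 microseconds
+    # (12 hours) in the compiled SQL.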
+
+ bf_df["timedelta_div_numeric"] = timedelta // 2
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
def test_json_set(json_types_df: bpd.DataFrame, snapshot):
bf_df = json_types_df[["json_col"]]
sql = _apply_binary_op(
@@ -104,14 +155,14 @@ def test_sub_numeric(scalar_types_df: bpd.DataFrame, snapshot):
def test_sub_timedelta(scalar_types_df: bpd.DataFrame, snapshot):
- bf_df = scalar_types_df[["timestamp_col", "date_col"]]
- timedelta = pd.Timedelta(1, unit="d")
+ bf_df = scalar_types_df[["timestamp_col", "duration_col", "date_col"]]
+ bf_df["duration_col"] = bpd.to_timedelta(bf_df["duration_col"], unit="us")
- bf_df["date_sub_timedelta"] = bf_df["date_col"] - timedelta
- bf_df["timestamp_sub_timedelta"] = bf_df["timestamp_col"] - timedelta
+ bf_df["date_sub_timedelta"] = bf_df["date_col"] - bf_df["duration_col"]
+ bf_df["timestamp_sub_timedelta"] = bf_df["timestamp_col"] - bf_df["duration_col"]
bf_df["timestamp_sub_date"] = bf_df["date_col"] - bf_df["date_col"]
bf_df["date_sub_timestamp"] = bf_df["timestamp_col"] - bf_df["timestamp_col"]
- bf_df["timedelta_sub_timedelta"] = timedelta - timedelta
+ bf_df["timedelta_sub_timedelta"] = bf_df["duration_col"] - bf_df["duration_col"]
snapshot.assert_match(bf_df.sql, "out.sql")
@@ -122,3 +173,30 @@ def test_sub_unsupported_raises(scalar_types_df: bpd.DataFrame):
with pytest.raises(TypeError):
_apply_binary_op(scalar_types_df, ops.sub_op, "int64_col", "string_col")
+
+
+def test_mul_numeric(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["int64_col", "bool_col"]]
+
+ bf_df["int_mul_int"] = bf_df["int64_col"] * bf_df["int64_col"]
+ bf_df["int_mul_1"] = bf_df["int64_col"] * 1
+
+ bf_df["int_mul_bool"] = bf_df["int64_col"] * bf_df["bool_col"]
+ bf_df["bool_mul_int"] = bf_df["bool_col"] * bf_df["int64_col"]
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_mul_timedelta(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["timestamp_col", "int64_col", "duration_col"]]
+ bf_df["duration_col"] = bpd.to_timedelta(bf_df["duration_col"], unit="us")
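+    # to_timedelta turns the integer microseconds into an
+    # INTERVAL ... MICROSECOND expression in the compiled SQL.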
+
+ bf_df["timedelta_mul_numeric"] = bf_df["duration_col"] * bf_df["int64_col"]
+ bf_df["numeric_mul_timedelta"] = bf_df["int64_col"] * bf_df["duration_col"]
+
+ snapshot.assert_match(bf_df.sql, "out.sql")
+
+
+def test_obj_make_ref(scalar_types_df: bpd.DataFrame, snapshot):
+ blob_df = scalar_types_df["string_col"].str.to_blob()
+ snapshot.assert_match(blob_df.to_frame().sql, "out.sql")
diff --git a/tests/unit/core/compile/sqlglot/expressions/test_unary_compiler.py b/tests/unit/core/compile/sqlglot/expressions/test_unary_compiler.py
index 0a930d68ae..4a5b586c77 100644
--- a/tests/unit/core/compile/sqlglot/expressions/test_unary_compiler.py
+++ b/tests/unit/core/compile/sqlglot/expressions/test_unary_compiler.py
@@ -405,6 +405,18 @@ def test_normalize(scalar_types_df: bpd.DataFrame, snapshot):
snapshot.assert_match(sql, "out.sql")
+def test_obj_fetch_metadata(scalar_types_df: bpd.DataFrame, snapshot):
+ blob_s = scalar_types_df["string_col"].str.to_blob()
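+    # blob.version() compiles to OBJ.FETCH_METADATA with the `version`
+    # field extracted (see the snapshot).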
+ sql = blob_s.blob.version().to_frame().sql
+ snapshot.assert_match(sql, "out.sql")
+
+
+def test_obj_get_access_url(scalar_types_df: bpd.DataFrame, snapshot):
+ blob_s = scalar_types_df["string_col"].str.to_blob()
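+    # blob.read_url() compiles to OBJ.GET_ACCESS_URL with JSON_VALUE
+    # extracting access_urls.read_url (see the snapshot).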
+ sql = blob_s.blob.read_url().to_frame().sql
+ snapshot.assert_match(sql, "out.sql")
+
+
def test_pos(scalar_types_df: bpd.DataFrame, snapshot):
bf_df = scalar_types_df[["float64_col"]]
sql = _apply_unary_op(bf_df, ops.pos_op, "float64_col")
@@ -587,9 +599,11 @@ def test_to_timestamp(scalar_types_df: bpd.DataFrame, snapshot):
def test_to_timedelta(scalar_types_df: bpd.DataFrame, snapshot):
bf_df = scalar_types_df[["int64_col"]]
- sql = _apply_unary_op(bf_df, ops.ToTimedeltaOp("s"), "int64_col")
+ bf_df["duration_us"] = bpd.to_timedelta(bf_df["int64_col"], "us")
+ bf_df["duration_s"] = bpd.to_timedelta(bf_df["int64_col"], "s")
+ bf_df["duration_w"] = bpd.to_timedelta(bf_df["int64_col"], "W")
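+    # Each unit scales to microseconds: "us" passes through, "s" multiplies
+    # by 1,000,000, and "W" by 604,800,000,000 (see the snapshot).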
- snapshot.assert_match(sql, "out.sql")
+ snapshot.assert_match(bf_df.sql, "out.sql")
def test_unix_micros(scalar_types_df: bpd.DataFrame, snapshot):
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readtable/test_compile_readtable/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readtable/test_compile_readtable/out.sql
index 34fc8e3c49..10c2a2088a 100644
--- a/tests/unit/core/compile/sqlglot/snapshots/test_compile_readtable/test_compile_readtable/out.sql
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_readtable/test_compile_readtable/out.sql
@@ -13,7 +13,8 @@ WITH `bfcte_0` AS (
`rowindex_2` AS `bfcol_10`,
`string_col` AS `bfcol_11`,
`time_col` AS `bfcol_12`,
- `timestamp_col` AS `bfcol_13`
+ `timestamp_col` AS `bfcol_13`,
+ `duration_col` AS `bfcol_14`
FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
)
SELECT
@@ -31,5 +32,6 @@ SELECT
`bfcol_10` AS `rowindex_2`,
`bfcol_11` AS `string_col`,
`bfcol_12` AS `time_col`,
- `bfcol_13` AS `timestamp_col`
+ `bfcol_13` AS `timestamp_col`,
+ `bfcol_14` AS `duration_col`
FROM `bfcte_0`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_groupby_rolling/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_groupby_rolling/out.sql
new file mode 100644
index 0000000000..beb3caa073
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_groupby_rolling/out.sql
@@ -0,0 +1,76 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `bool_col` AS `bfcol_0`,
+ `int64_col` AS `bfcol_1`,
+ `rowindex` AS `bfcol_2`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ `bfcol_2` AS `bfcol_6`,
+ `bfcol_0` AS `bfcol_7`,
+ `bfcol_1` AS `bfcol_8`,
+ `bfcol_0` AS `bfcol_9`
+ FROM `bfcte_0`
+), `bfcte_2` AS (
+ SELECT
+ *
+ FROM `bfcte_1`
+ WHERE
+ NOT `bfcol_9` IS NULL
+), `bfcte_3` AS (
+ SELECT
+ *,
+ CASE
+ WHEN SUM(CAST(NOT `bfcol_7` IS NULL AS INT64)) OVER (
+ PARTITION BY `bfcol_9`
+ ORDER BY `bfcol_9` IS NULL ASC NULLS LAST, `bfcol_9` ASC NULLS LAST, `bfcol_2` IS NULL ASC NULLS LAST, `bfcol_2` ASC NULLS LAST
+ ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
+ ) < 3
+ THEN NULL
+ ELSE COALESCE(
+ SUM(CAST(`bfcol_7` AS INT64)) OVER (
+ PARTITION BY `bfcol_9`
+ ORDER BY `bfcol_9` IS NULL ASC NULLS LAST, `bfcol_9` ASC NULLS LAST, `bfcol_2` IS NULL ASC NULLS LAST, `bfcol_2` ASC NULLS LAST
+ ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
+ ),
+ 0
+ )
+ END AS `bfcol_15`
+ FROM `bfcte_2`
+), `bfcte_4` AS (
+ SELECT
+ *
+ FROM `bfcte_3`
+ WHERE
+ NOT `bfcol_9` IS NULL
+), `bfcte_5` AS (
+ SELECT
+ *,
+ CASE
+ WHEN SUM(CAST(NOT `bfcol_8` IS NULL AS INT64)) OVER (
+ PARTITION BY `bfcol_9`
+ ORDER BY `bfcol_9` IS NULL ASC NULLS LAST, `bfcol_9` ASC NULLS LAST, `bfcol_2` IS NULL ASC NULLS LAST, `bfcol_2` ASC NULLS LAST
+ ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
+ ) < 3
+ THEN NULL
+ ELSE COALESCE(
+ SUM(`bfcol_8`) OVER (
+ PARTITION BY `bfcol_9`
+ ORDER BY `bfcol_9` IS NULL ASC NULLS LAST, `bfcol_9` ASC NULLS LAST, `bfcol_2` IS NULL ASC NULLS LAST, `bfcol_2` ASC NULLS LAST
+ ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
+ ),
+ 0
+ )
+ END AS `bfcol_21`
+ FROM `bfcte_4`
+)
+SELECT
+ `bfcol_9` AS `bool_col`,
+ `bfcol_6` AS `rowindex`,
+ `bfcol_15` AS `bool_col_1`,
+ `bfcol_21` AS `int64_col`
+FROM `bfcte_5`
+ORDER BY
+ `bfcol_9` ASC NULLS LAST,
+ `bfcol_2` ASC NULLS LAST
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_range_rolling/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_range_rolling/out.sql
new file mode 100644
index 0000000000..581c81c6b4
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_range_rolling/out.sql
@@ -0,0 +1,30 @@
+WITH `bfcte_0` AS (
+ SELECT
+ *
+ FROM UNNEST(ARRAY>[STRUCT(CAST('2025-01-01T00:00:00+00:00' AS TIMESTAMP), 0, 0), STRUCT(CAST('2025-01-01T00:00:01+00:00' AS TIMESTAMP), 1, 1), STRUCT(CAST('2025-01-01T00:00:02+00:00' AS TIMESTAMP), 2, 2), STRUCT(CAST('2025-01-01T00:00:03+00:00' AS TIMESTAMP), 3, 3), STRUCT(CAST('2025-01-01T00:00:04+00:00' AS TIMESTAMP), 0, 4), STRUCT(CAST('2025-01-01T00:00:05+00:00' AS TIMESTAMP), 1, 5), STRUCT(CAST('2025-01-01T00:00:06+00:00' AS TIMESTAMP), 2, 6), STRUCT(CAST('2025-01-01T00:00:07+00:00' AS TIMESTAMP), 3, 7), STRUCT(CAST('2025-01-01T00:00:08+00:00' AS TIMESTAMP), 0, 8), STRUCT(CAST('2025-01-01T00:00:09+00:00' AS TIMESTAMP), 1, 9), STRUCT(CAST('2025-01-01T00:00:10+00:00' AS TIMESTAMP), 2, 10), STRUCT(CAST('2025-01-01T00:00:11+00:00' AS TIMESTAMP), 3, 11), STRUCT(CAST('2025-01-01T00:00:12+00:00' AS TIMESTAMP), 0, 12), STRUCT(CAST('2025-01-01T00:00:13+00:00' AS TIMESTAMP), 1, 13), STRUCT(CAST('2025-01-01T00:00:14+00:00' AS TIMESTAMP), 2, 14), STRUCT(CAST('2025-01-01T00:00:15+00:00' AS TIMESTAMP), 3, 15), STRUCT(CAST('2025-01-01T00:00:16+00:00' AS TIMESTAMP), 0, 16), STRUCT(CAST('2025-01-01T00:00:17+00:00' AS TIMESTAMP), 1, 17), STRUCT(CAST('2025-01-01T00:00:18+00:00' AS TIMESTAMP), 2, 18), STRUCT(CAST('2025-01-01T00:00:19+00:00' AS TIMESTAMP), 3, 19)])
+), `bfcte_1` AS (
+ SELECT
+ *,
+ CASE
+ WHEN SUM(CAST(NOT `bfcol_1` IS NULL AS INT64)) OVER (
+ ORDER BY UNIX_MICROS(`bfcol_0`) ASC NULLS LAST
+ RANGE BETWEEN 2999999 PRECEDING AND CURRENT ROW
+ ) < 1
+ THEN NULL
+ ELSE COALESCE(
+ SUM(`bfcol_1`) OVER (
+ ORDER BY UNIX_MICROS(`bfcol_0`) ASC NULLS LAST
+ RANGE BETWEEN 2999999 PRECEDING AND CURRENT ROW
+ ),
+ 0
+ )
+ END AS `bfcol_6`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_0` AS `ts_col`,
+ `bfcol_6` AS `int_col`
+FROM `bfcte_1`
+ORDER BY
+ `bfcol_0` ASC NULLS LAST,
+ `bfcol_2` ASC NULLS LAST
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_skips_nulls_op/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_skips_nulls_op/out.sql
new file mode 100644
index 0000000000..6d779a40ac
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_w_skips_nulls_op/out.sql
@@ -0,0 +1,30 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `int64_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ CASE
+ WHEN SUM(CAST(NOT `bfcol_0` IS NULL AS INT64)) OVER (
+ ORDER BY `bfcol_1` IS NULL ASC NULLS LAST, `bfcol_1` ASC NULLS LAST
+ ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
+ ) < 3
+ THEN NULL
+ ELSE COALESCE(
+ SUM(`bfcol_0`) OVER (
+ ORDER BY `bfcol_1` IS NULL ASC NULLS LAST, `bfcol_1` ASC NULLS LAST
+ ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
+ ),
+ 0
+ )
+ END AS `bfcol_4`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_1` AS `rowindex`,
+ `bfcol_4` AS `int64_col`
+FROM `bfcte_1`
+ORDER BY
+ `bfcol_1` ASC NULLS LAST
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_wo_skips_nulls_op/out.sql b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_wo_skips_nulls_op/out.sql
new file mode 100644
index 0000000000..1d5d9a9e45
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/snapshots/test_compile_window/test_compile_window_wo_skips_nulls_op/out.sql
@@ -0,0 +1,27 @@
+WITH `bfcte_0` AS (
+ SELECT
+ `int64_col` AS `bfcol_0`,
+ `rowindex` AS `bfcol_1`
+ FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+ SELECT
+ *,
+ CASE
+ WHEN COUNT(CAST(NOT `bfcol_0` IS NULL AS INT64)) OVER (
+ ORDER BY `bfcol_1` IS NULL ASC NULLS LAST, `bfcol_1` ASC NULLS LAST
+ ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
+ ) < 5
+ THEN NULL
+ ELSE COUNT(`bfcol_0`) OVER (
+ ORDER BY `bfcol_1` IS NULL ASC NULLS LAST, `bfcol_1` ASC NULLS LAST
+ ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
+ )
+ END AS `bfcol_4`
+ FROM `bfcte_0`
+)
+SELECT
+ `bfcol_1` AS `rowindex`,
+ `bfcol_4` AS `int64_col`
+FROM `bfcte_1`
+ORDER BY
+ `bfcol_1` ASC NULLS LAST
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/test_compile_window.py b/tests/unit/core/compile/sqlglot/test_compile_window.py
new file mode 100644
index 0000000000..1fc70dc30f
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/test_compile_window.py
@@ -0,0 +1,70 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+
+import numpy as np
+import pandas as pd
+import pytest
+
+import bigframes.pandas as bpd
+
+pytest.importorskip("pytest_snapshot")
+
+
+if sys.version_info < (3, 12):
+ pytest.skip(
+ "Skipping test due to inconsistent SQL formatting on Python < 3.12.",
+ allow_module_level=True,
+ )
+
+
+def test_compile_window_w_skips_nulls_op(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["int64_col"]].sort_index()
+ # The SumOp's skips_nulls is True
+ result = bf_df.rolling(window=3).sum()
+ snapshot.assert_match(result.sql, "out.sql")
+
+
+def test_compile_window_wo_skips_nulls_op(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["int64_col"]].sort_index()
+ # The CountOp's skips_nulls is False
+ result = bf_df.rolling(window=5).count()
+ snapshot.assert_match(result.sql, "out.sql")
+
+
+def test_compile_window_w_groupby_rolling(scalar_types_df: bpd.DataFrame, snapshot):
+ bf_df = scalar_types_df[["bool_col", "int64_col"]].sort_index()
+ result = (
+ bf_df.groupby(scalar_types_df["bool_col"])
+ .rolling(window=3, closed="both")
+ .sum()
+ )
+ snapshot.assert_match(result.sql, "out.sql")
+
+
+def test_compile_window_w_range_rolling(compiler_session, snapshot):
+ # TODO: use `duration_col` instead.
+ values = np.arange(20)
+ pd_df = pd.DataFrame(
+ {
+ "ts_col": pd.Timestamp("20250101", tz="UTC") + pd.to_timedelta(values, "s"),
+ "int_col": values % 4,
+ "float_col": values / 2,
+ }
+ )
+ bf_df = compiler_session.read_pandas(pd_df)
+ bf_series = bf_df.set_index("ts_col")["int_col"].sort_index()
+ result = bf_series.rolling(window="3s").sum()
+ snapshot.assert_match(result.to_frame().sql, "out.sql")
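As a sanity check on the range-window snapshot, the expected values can be reproduced in plain pandas (assuming bigframes mirrors pandas offset-window semantics, which these snapshots are meant to pin down):

```python
import numpy as np
import pandas as pd

values = np.arange(20)
s = pd.Series(
    values % 4,
    index=pd.Timestamp("20250101", tz="UTC") + pd.to_timedelta(values, "s"),
)
# min_periods defaults to 1 for offset windows, so the head is
# 0.0, 1.0, 3.0, 6.0, 5.0 rather than leading NaNs.
print(s.rolling("3s").sum().head())
```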
diff --git a/tests/unit/display/test_html.py b/tests/unit/display/test_html.py
new file mode 100644
index 0000000000..fcf1455362
--- /dev/null
+++ b/tests/unit/display/test_html.py
@@ -0,0 +1,151 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import datetime
+
+import pandas as pd
+import pyarrow as pa
+import pytest
+
+import bigframes as bf
+import bigframes.display.html as bf_html
+
+
+@pytest.mark.parametrize(
+ ("data", "expected_alignments", "expected_strings"),
+ [
+ pytest.param(
+ {
+ "string_col": ["a", "b", "c"],
+ "int_col": [1, 2, 3],
+ "float_col": [1.1, 2.2, 3.3],
+ "bool_col": [True, False, True],
+ },
+ {
+ "string_col": "left",
+ "int_col": "right",
+ "float_col": "right",
+ "bool_col": "left",
+ },
+ ["1.100000", "2.200000", "3.300000"],
+ id="scalars",
+ ),
+ pytest.param(
+ {
+ "timestamp_col": pa.array(
+ [
+ datetime.datetime.fromisoformat(value)
+ for value in [
+ "2024-01-01 00:00:00",
+ "2024-01-01 00:00:01",
+ "2024-01-01 00:00:02",
+ ]
+ ],
+ pa.timestamp("us", tz="UTC"),
+ ),
+ "datetime_col": pa.array(
+ [
+ datetime.datetime.fromisoformat(value)
+ for value in [
+ "2027-06-05 04:03:02.001",
+ "2027-01-01 00:00:01",
+ "2027-01-01 00:00:02",
+ ]
+ ],
+ pa.timestamp("us"),
+ ),
+ "date_col": pa.array(
+ [
+ datetime.date(1999, 1, 1),
+ datetime.date(1999, 1, 2),
+ datetime.date(1999, 1, 3),
+ ],
+ pa.date32(),
+ ),
+ "time_col": pa.array(
+ [
+ datetime.time(11, 11, 0),
+ datetime.time(11, 11, 1),
+ datetime.time(11, 11, 2),
+ ],
+ pa.time64("us"),
+ ),
+ },
+ {
+ "timestamp_col": "left",
+ "datetime_col": "left",
+ "date_col": "left",
+ "time_col": "left",
+ },
+ [
+ "2024-01-01 00:00:00",
+ "2027-06-05 04:03:02.001",
+ "1999-01-01",
+ "11:11:01",
+ ],
+ id="datetimes",
+ ),
+ pytest.param(
+ {
+ "array_col": pd.Series(
+ [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
+ dtype=pd.ArrowDtype(pa.list_(pa.int64())),
+ ),
+ },
+ {
+ "array_col": "left",
+ },
+ ["[1, 2, 3]", "[4, 5, 6]", "[7, 8, 9]"],
+ id="array",
+ ),
+ pytest.param(
+ {
+ "struct_col": pd.Series(
+ [{"v": 1}, {"v": 2}, {"v": 3}],
+ dtype=pd.ArrowDtype(pa.struct([("v", pa.int64())])),
+ ),
+ },
+ {
+ "struct_col": "left",
+ },
+ ["{'v': 1}", "{'v': 2}", "{'v': 3}"],
+ id="struct",
+ ),
+ ],
+)
+def test_render_html_alignment_and_precision(
+ data, expected_alignments, expected_strings
+):
+ df = pd.DataFrame(data)
+ html = bf_html.render_html(dataframe=df, table_id="test-table")
+
+ for _, align in expected_alignments.items():
+ assert 'th style="text-align: left;"' in html
+        assert f'<td style="text-align: {align};"' in html
+
+    for expected in expected_strings:
+        assert expected in html
diff --git a/tests/unit/functions/test_remote_function_utils.py b/tests/unit/functions/test_remote_function_utils.py
+def test_package_existed():
+    """Tests _package_existed matches requirements by distribution name."""
+    reqs = ["pandas==1.0", "numpy>=1.2.0"]
+
+ # Exact match
+ assert _utils._package_existed(reqs, "pandas==1.0")
+ # Different version
+ assert _utils._package_existed(reqs, "pandas==2.0")
+ # No version specified
+ assert _utils._package_existed(reqs, "numpy")
+ # Not in list
+ assert not _utils._package_existed(reqs, "xgboost")
+ # Empty list
+ assert not _utils._package_existed([], "pandas")
+
+
+def test_has_conflict_output_type_no_conflict():
+ """Tests has_conflict_output_type with type annotation."""
+ # Helper functions with type annotation for has_conflict_output_type.
+ def _func_with_return_type(x: int) -> int:
+ return x
+
+ signature = inspect.signature(_func_with_return_type)
+
+ assert _utils.has_conflict_output_type(signature, output_type=float)
+ assert not _utils.has_conflict_output_type(signature, output_type=int)
+
+
+def test_has_conflict_output_type_no_annotation():
+ """Tests has_conflict_output_type without type annotation."""
+ # Helper functions without type annotation for has_conflict_output_type.
+ def _func_without_return_type(x):
+ return x
+
+ signature = inspect.signature(_func_without_return_type)
+
+ assert not _utils.has_conflict_output_type(signature, output_type=int)
+ assert not _utils.has_conflict_output_type(signature, output_type=float)
+
+
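The tests above pin down the matching rule for `_package_existed`: a requirement already "exists" when its distribution name matches, regardless of the version specifier. A hypothetical re-implementation of that rule (the real `_utils._package_existed` may differ in details):

```python
import re


def package_existed(requirements: list[str], package: str) -> bool:
    """Return True if `package` is already listed, matching by name only."""

    def dist_name(req: str) -> str:
        # Strip any version specifier, e.g. "numpy>=1.2.0" -> "numpy".
        return re.split(r"[<>=!~]", req, maxsplit=1)[0].strip()

    return any(dist_name(req) == dist_name(package) for req in requirements)


assert package_existed(["pandas==1.0", "numpy>=1.2.0"], "pandas==2.0")
assert not package_existed([], "pandas")
```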
@pytest.mark.parametrize(
["metadata_options", "metadata_string"],
(
@@ -54,6 +222,7 @@
),
)
def test_get_bigframes_metadata(metadata_options, metadata_string):
+
assert _utils.get_bigframes_metadata(**metadata_options) == metadata_string
@@ -72,6 +241,7 @@ def test_get_bigframes_metadata(metadata_options, metadata_string):
def test_get_bigframes_metadata_array_type_not_serializable(output_type):
with pytest.raises(ValueError) as context:
_utils.get_bigframes_metadata(python_output_type=output_type)
+
assert str(context.value) == (
f"python_output_type {output_type} is not serializable. {constants.FEEDBACK_LINK}"
)
@@ -125,6 +295,7 @@ def test_get_bigframes_metadata_array_type_not_serializable(output_type):
def test_get_python_output_type_from_bigframes_metadata(
metadata_string, python_output_type
):
+
assert (
_utils.get_python_output_type_from_bigframes_metadata(metadata_string)
== python_output_type
@@ -135,4 +306,5 @@ def test_metadata_roundtrip_supported_array_types():
for array_of in function_typing.RF_SUPPORTED_ARRAY_OUTPUT_PYTHON_TYPES:
ser = _utils.get_bigframes_metadata(python_output_type=list[array_of]) # type: ignore
deser = _utils.get_python_output_type_from_bigframes_metadata(ser)
+
assert deser == list[array_of] # type: ignore
diff --git a/tests/unit/test_dataframe_polars.py b/tests/unit/test_dataframe_polars.py
index 2070b25d66..a6f5c3d1ef 100644
--- a/tests/unit/test_dataframe_polars.py
+++ b/tests/unit/test_dataframe_polars.py
@@ -1657,13 +1657,11 @@ def test_reset_index_with_unnamed_index(
pandas.testing.assert_frame_equal(bf_result, pd_result)
-def test_reset_index_with_unnamed_multiindex(
- scalars_df_index,
- scalars_pandas_df_index,
-):
+def test_reset_index_with_unnamed_multiindex(session):
bf_df = dataframe.DataFrame(
([1, 2, 3], [2, 5, 7]),
index=pd.MultiIndex.from_tuples([("a", "aa"), ("a", "aa")]),
+ session=session,
)
pd_df = pd.DataFrame(
([1, 2, 3], [2, 5, 7]),
diff --git a/third_party/bigframes_vendored/ibis/expr/operations/numeric.py b/third_party/bigframes_vendored/ibis/expr/operations/numeric.py
index 174de5ab7f..384323c596 100644
--- a/third_party/bigframes_vendored/ibis/expr/operations/numeric.py
+++ b/third_party/bigframes_vendored/ibis/expr/operations/numeric.py
@@ -326,7 +326,7 @@ class Tan(TrigonometricUnary):
class BitwiseNot(Unary):
"""Bitwise NOT operation."""
- arg: Integer
+ arg: Value[dt.Integer | dt.Binary]
dtype = rlz.numeric_like("args", operator.invert)
diff --git a/third_party/bigframes_vendored/ibis/expr/types/binary.py b/third_party/bigframes_vendored/ibis/expr/types/binary.py
index ba6140a49f..08fea31a1c 100644
--- a/third_party/bigframes_vendored/ibis/expr/types/binary.py
+++ b/third_party/bigframes_vendored/ibis/expr/types/binary.py
@@ -32,6 +32,9 @@ def hashbytes(
"""
return ops.HashBytes(self, how).to_expr()
+ def __invert__(self) -> BinaryValue:
+ return ops.BitwiseNot(self).to_expr()
+
@public
class BinaryScalar(Scalar, BinaryValue):
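Taken together, the two vendored changes let `~` reach BYTES expressions: `BitwiseNot` now accepts `Binary` arguments, and `BinaryValue` gains `__invert__`. A sketch of the intended user-facing behavior, assuming the bigframes Series layer forwards `__invert__` on a bytes column through to this op:

```python
import bigframes.pandas as bpd

s = bpd.Series([b"\x00\x0f", b"\xff"])
# Bitwise NOT flips every bit of each BYTES value, e.g. b"\xff" -> b"\x00".
print((~s).to_pandas())
```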
diff --git a/third_party/bigframes_vendored/pandas/core/config_init.py b/third_party/bigframes_vendored/pandas/core/config_init.py
index 51d056a2c8..3425674e4f 100644
--- a/third_party/bigframes_vendored/pandas/core/config_init.py
+++ b/third_party/bigframes_vendored/pandas/core/config_init.py
@@ -84,6 +84,9 @@
memory_usage (bool):
This specifies if the memory usage of a DataFrame should be displayed when
df.info() is called. Valid values True,False,
+ precision (int):
+ Controls the floating point output precision, similar to
+ `pandas.options.display.precision`.
blob_display (bool):
Whether to display the blob content in notebook DataFrame preview. Default True.
blob_display_width (int or None):
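The new `precision` option documented above should behave like its pandas namesake; a usage sketch, assuming it is exposed alongside the other display options:

```python
import bigframes.pandas as bpd

# Render floats with two decimal places in DataFrame previews,
# mirroring pandas.options.display.precision.
bpd.options.display.precision = 2
```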
diff --git a/third_party/bigframes_vendored/pandas/core/frame.py b/third_party/bigframes_vendored/pandas/core/frame.py
index 1f79c428c1..00984935a4 100644
--- a/third_party/bigframes_vendored/pandas/core/frame.py
+++ b/third_party/bigframes_vendored/pandas/core/frame.py
@@ -1601,8 +1601,10 @@ def droplevel(self, level, axis: str | int = 0):
def reset_index(
self,
+ level=None,
*,
drop: bool = False,
+ inplace: bool = False,
) -> DataFrame | None:
"""Reset the index.
@@ -1696,9 +1698,14 @@ class name speed max
Args:
+ level (int, str, tuple, or list, default None):
+ Only remove the given levels from the index. Removes all levels by
+ default.
drop (bool, default False):
Do not try to insert index into dataframe columns. This resets
the index to the default integer index.
+ inplace (bool, default False):
+ Whether to modify the DataFrame rather than creating a new one.
Returns:
bigframes.pandas.DataFrame: DataFrame with the new index.
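A usage sketch for the new `level` and `inplace` parameters, based on the class/name example in the docstring above:

```python
import pandas as pd

import bigframes.pandas as bpd

index = pd.MultiIndex.from_tuples(
    [("bird", "falcon"), ("mammal", "lion")], names=["class", "name"]
)
df = bpd.DataFrame({"speed": [389.0, 80.5]}, index=index)

# Move only the "class" level into a column, modifying df in place.
df.reset_index(level="class", inplace=True)
```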
@@ -5972,22 +5979,18 @@ def melt(self, id_vars, value_vars, var_name, value_name):
Using `melt` without optional arguments:
>>> df.melt()
- variable value
- 0 A 1.0
-      1        A   <NA>
- 2 A 3.0
- 3 A 4.0
- 4 A 5.0
- 5 B 1.0
- 6 B 2.0
- 7 B 3.0
- 8 B 4.0
- 9 B 5.0
-      10       C   <NA>
- 11 C 3.5
-      12       C   <NA>
- 13 C 4.5
- 14 C 5.0
+ variable value
+ 0 A 1.0
+      1        A  <NA>
+ 2 A 3.0
+ 3 A 4.0
+ 4 A 5.0
+ 5 B 1.0
+ 6 B 2.0
+ 7 B 3.0
+ 8 B 4.0
+ 9 B 5.0
+ ...
[15 rows x 2 columns]
diff --git a/third_party/bigframes_vendored/pandas/core/series.py b/third_party/bigframes_vendored/pandas/core/series.py
index 0160a7eb50..7b420cf6e3 100644
--- a/third_party/bigframes_vendored/pandas/core/series.py
+++ b/third_party/bigframes_vendored/pandas/core/series.py
@@ -321,9 +321,11 @@ def transpose(self) -> Series:
def reset_index(
self,
+ level=None,
*,
drop: bool = False,
name=pd_ext.no_default,
+ inplace: bool = False,
) -> DataFrame | Series | None:
"""
Generate a new DataFrame or Series with the index reset.
@@ -399,6 +401,9 @@ def reset_index(
[4 rows x 3 columns]
Args:
+ level (int, str, tuple, or list, default optional):
+ For a Series with a MultiIndex, only remove the specified levels
+ from the index. Removes all levels by default.
drop (bool, default False):
Just reset the index, without inserting it as a column in
the new DataFrame.
@@ -406,6 +411,8 @@ def reset_index(
The name to use for the column containing the original Series
values. Uses ``self.name`` by default. This argument is ignored
when `drop` is True.
+ inplace (bool, default False):
+ Modify the Series in place (do not create a new object).
Returns:
bigframes.pandas.Series or bigframes.pandas.DataFrame or None:
diff --git a/third_party/bigframes_vendored/version.py b/third_party/bigframes_vendored/version.py
index 7aff17a40d..6b84e2eb1d 100644
--- a/third_party/bigframes_vendored/version.py
+++ b/third_party/bigframes_vendored/version.py
@@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-__version__ = "2.15.0"
+__version__ = "2.16.0"
# {x-release-please-start-date}
-__release_date__ = "2025-08-11"
+__release_date__ = "2025-08-20"
# {x-release-please-end}
| |