Senzing integration in Databricks Spark workflows.
If you are beginning your journey with Senzing, please start with Senzing Quick Start guides.
You are in the Senzing Garage where projects are "tinkered" on. Although this GitHub repository may help you understand an approach to using Senzing, it's not considered to be "production ready" and is not considered to be part of the Senzing product. Heck, it may not even be appropriate for your application of Senzing!
This repository contains two Jupyter notebooks which demonstrate Senzing entity resolution integrated with Apache Spark from Databricks:
spark_quickstart.ipynb:
An introductory tutorial showing how to integrate Senzing with Spark DataFrames for running entity resolution in a batch mode. This covers loading multiple datasets, configuring Senzing, processing records, and enriching DataFrames with entity resolution results.
spark_streaming.ipynb:
Demonstrates real-time entity resolution using Spark Structured Streaming.
This shows how to process streaming data through Spark and send
PII features to Senzing for continuous entity resolution, simulating a real-time
data processing pipeline.
Both tutorials require a Docker container running for the Senzing gRPC server whenever you run the Jupyter notebooks:
Start with the spark_quickstart.ipynb tutorial for detailed
explanations of the core concepts.
These tutorials were developed on MacOS, though they should run fine on Linux as well.
Platform requirements are:
- Python 3.13
- Spark requires Java 17 or 21
You can find more details about the Java requirements in the Apache Spark documentation.
To set up the Python environment:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel setuptools
python3 -m pip install .You also need to pull the latest Docker container for Senzing:
docker pull senzing/serve-grpc:latestThen launch this container and have it running in the background:
docker run -it --publish 8261:8261 --rm senzing/serve-grpcThen launch Jupyter to run the notebooks:
./venv/bin/jupyter-labBe aware that running pip install senzing without specifying a
version number will not give you the correct version. Make sure to
use the versions specified in the pyproject.toml file.
If you experience gRPC errors, these may be due to an outdated Docker container.
Kudos for their help with this tutorial:
@brianmacy, @docktermj