Cross-Venue Semantic Matcher

A GPU-accelerated semantic matching pipeline for identifying equivalent contracts across prediction market venues (Polymarket, Kalshi). Two-stage retrieve-then-rerank (Qwen3-Embedding-4B + DeBERTa-v3 NLI) over Qdrant, with venue-agnostic discovery, micro-batch GPU scheduling, and esports market latency analysis.

Archived as a PolyEdge infrastructure exploration. Built to validate cross-venue arbitrage feasibility: the matching pipeline works, but spread economics don't support profitable execution. Published as a reference implementation from that iterative exploration.

Architecture

Venue-agnostic 7-stage pipeline that ingests prediction market contracts from any supported venue and identifies semantically equivalent markets across platforms. Adding a new venue requires only a connector and a text builder — the matching, verification, and persistence layers are fully venue-independent.

Venue Connectors (Kalshi WS, Polymarket REST/WS)
    |
    v
[1] Canonicalization ── venue-specific markdown normalization (CPU)
    |
    v
[2] Batch Embedding ── Qwen3-Embedding-4B, 2048-dim, instruction-guided (GPU)
    |
    v
[3] Retrieval ── Qdrant top-K similarity search, cross-venue only (CPU + Vector DB)
    |
    v
[4] Reranking ── DeBERTa-v3 NLI cross-encoder, bidirectional scoring (GPU)
    |
    v
[5] ContractSpec Extraction ── entity/date/threshold parsing (CPU, optional LLM fallback)
    |
    v
[6] Pair Verification ── 5-factor weighted fusion (CPU)
    |
    v
[7] Persistence ── PostgreSQL micro-batch writer + LISTEN/NOTIFY (async)

Key Technical Decisions

Retrieve-then-rerank. Recall-heavy embedding retrieval (threshold 0.5, 20 candidates) followed by precision cross-encoder reranking. Bidirectional NLI scoring catches superset/subset false positives.
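
A minimal sketch of the two stages, assuming cosine similarity for retrieval and a "weaker direction wins" rule for combining the two NLI scores. The `Market` class, the function names, and the `min` fusion are illustrative assumptions, not the repository's actual code; only the 0.5 threshold and the 20-candidate limit come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class Market:
    market_id: str
    venue: str
    vector: list[float]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, threshold=0.5, top_k=20):
    """Recall-heavy stage: keep cross-venue candidates above the threshold."""
    scored = [
        (cosine(query.vector, m.vector), m)
        for m in corpus
        if m.venue != query.venue  # cross-venue only
    ]
    scored = [(s, m) for s, m in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def bidirectional_score(entail_ab, entail_ba):
    """Precision stage: require entailment in both directions.

    A superset/subset pair entails strongly one way and weakly the other,
    so taking the minimum (an assumed fusion rule) pulls its score down.
    """
    return min(entail_ab, entail_ba)
```

For example, "Team A wins the series" is entailed by "Team A wins the series 3-0" but not the reverse, so a pair scoring 0.95 in one direction and 0.30 in the other would be scored 0.30 here.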

Micro-batch GPU scheduling. Workers accumulate events (1s timeout) into batches before GPU forward pass. GPU concurrency semaphore (default: 1) serializes passes to prevent VRAM fragmentation. CUDA cache defragmentation every 10 batches.
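
The accumulation loop can be sketched with `asyncio` as below. This is an illustration under assumptions: `collect_batch`, `worker`, and the `forward` callable are hypothetical names, and the real workers may be structured differently. The 1-second timeout and the concurrency limit of 1 mirror the defaults described above.

```python
import asyncio


async def collect_batch(queue, max_size, timeout=1.0):
    """Wait for one event, then gather more until full or the timeout elapses."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def worker(queue, gpu_semaphore, forward, results, max_size=8, timeout=1.0):
    """Accumulate micro-batches and run them through the GPU one at a time."""
    while True:
        batch = await collect_batch(queue, max_size, timeout)
        # With a semaphore of size 1, only one forward pass runs at a time.
        async with gpu_semaphore:
            # Run the blocking forward pass off the event loop thread.
            results.extend(await asyncio.to_thread(forward, batch))
        for _ in batch:
            queue.task_done()
```

Several workers can share one `asyncio.Semaphore(1)`: they accumulate batches concurrently but take turns on the forward pass.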

Dynamic batch sizing. Post-model-load VRAM probing computes optimal batch size. INT8 quantization for cross-encoder (0.5 GB vs 1.7 GB FP32). Runs both models (the 4B embedder and the DeBERTa-v3 cross-encoder) on an 8 GB GPU without OOM.
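
One way the batch size could be derived from a VRAM probe is sketched below. The function name, the 20% safety margin, the power-of-two rounding, and the size limits are all assumptions made for illustration. In a PyTorch setup the free-byte figure could come from `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes; the per-item cost would have to be measured for the model in use.

```python
def auto_batch_size(free_bytes, per_item_bytes, safety_margin=0.2,
                    min_size=1, max_size=64):
    """Pick the largest power-of-two batch that fits in free VRAM.

    free_bytes:     free device memory measured after the models are loaded
    per_item_bytes: estimated activation memory for one input (an assumption
                    the caller must measure or estimate)
    """
    # Hold back a fraction of free memory as headroom for allocator overhead.
    usable = int(free_bytes * (1.0 - safety_margin))
    fit = max(min_size, min(max_size, usable // per_item_bytes))
    # Round down to a power of two.
    return 1 << (fit.bit_length() - 1)
```

With 2 GiB free and roughly 100 MiB per item, this keeps 20% in reserve, finds that 16 items fit, and returns 16.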

Venue abstraction. Pluggable discovery strategies (normal, esports, hybrid). Per-venue text builders normalize heterogeneous API schemas to identical markdown. TTL-based deduplication across WebSocket reconnects.
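
TTL-based deduplication can be illustrated with a small in-memory structure. This is a sketch, not the contents of `dedup.py`: the class name, the 300-second default, and the `(venue, market_id)` key are assumptions.

```python
import time


class TTLDeduplicator:
    """Remember recently seen markets so reconnect replays are ignored."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable for testing
        self._seen = {}      # (venue, market_id) -> first-seen timestamp

    def is_new(self, venue, market_id):
        now = self._clock()
        # Drop entries older than the TTL before checking.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        key = (venue, market_id)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A market replayed after a WebSocket reconnect is rejected while its entry is live, and accepted again once the TTL has passed.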

5-factor verification fusion. Cross-encoder NLI (50%) + threshold tolerance (20%) + entity fuzzy match (15%) + date tolerance (10%) + data source compatibility (5%). Returns EQUIVALENT / PARTIAL / REJECTED.
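
The weighting above amounts to a weighted sum over five per-factor scores in [0, 1]. In the sketch below the weights are the ones listed; the factor key names and the 0.85 / 0.60 decision cutoffs are placeholders chosen for illustration, since the actual thresholds are not stated here.

```python
WEIGHTS = {
    "nli": 0.50,        # cross-encoder NLI score
    "threshold": 0.20,  # numeric threshold tolerance
    "entity": 0.15,     # entity fuzzy match
    "date": 0.10,       # date tolerance
    "source": 0.05,     # data source compatibility
}


def fuse(scores, equivalent_at=0.85, partial_at=0.60):
    """Combine per-factor scores into a weighted total and a verdict.

    The cutoffs are assumed values, not the pipeline's configured ones.
    """
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if total >= equivalent_at:
        verdict = "EQUIVALENT"
    elif total >= partial_at:
        verdict = "PARTIAL"
    else:
        verdict = "REJECTED"
    return total, verdict
```

A pair with a strong NLI score of 0.9 but only 0.5 on every other factor totals 0.45 + 0.10 + 0.075 + 0.05 + 0.025 = 0.70, which lands in PARTIAL under these assumed cutoffs.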

Directory Structure

semantic_pipeline/
├── orchestrator/
│   ├── core.py              # 7-stage pipeline coordinator
│   ├── metrics.py           # Thread-safe per-stage metrics
│   └── discovery/           # Discovery strategy pattern
├── discovery/
│   ├── base_connector.py    # Abstract venue connector (reconnection, heartbeats)
│   ├── kalshi_poller.py     # Kalshi REST bootstrap + WebSocket streaming
│   ├── polymarket_poller.py # Polymarket Gamma API + WebSocket
│   ├── dedup.py             # Market deduplication (TTL-based)
│   └── strategies/          # Esports mode, hybrid mode
├── canonicalization/
│   ├── text_builder.py      # Kalshi/Polymarket-specific markdown builders
│   ├── hasher.py            # Identity hash (dedup) + content hash (change detection)
│   └── contract_spec.py     # ContractSpec dataclass
├── embedding/
│   ├── encoder.py           # Qwen3-Embedding-4B with instruction-guided encoding
│   ├── index.py             # Qdrant async client
│   └── cache/               # In-memory LRU embedding cache
├── matching/
│   ├── retriever.py         # Qdrant top-K cross-venue retrieval
│   ├── cross_encoder.py     # DeBERTa-v3-large-mnli bidirectional scorer
│   ├── reranker.py          # Candidate reranking with asymmetry detection
│   └── pair_verifier.py     # Multi-factor weighted verification
├── extraction/
│   └── spec_extractor.py    # Entity/date/threshold extraction (regex + optional LLM)
├── persistence/
│   └── writer.py            # PostgreSQL batch writer + LISTEN/NOTIFY
├── monitoring/
│   ├── compliance/          # Audit logger, circuit breaker
│   ├── core/                # Latency engine, orderbook manager
│   └── feeds/               # WebSocket feeds (Kalshi, Polymarket, Riot API)
├── spread_scanner.py        # Spread arb exploration script
├── spread_scanner_ws.py     # WebSocket-based spread scanner
├── spread_scanner_ws_esports.py  # Esports-specific spread scanner
├── config.py                # 70+ tuneable parameters from env vars
└── requirements.txt
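
The two hashes noted for `hasher.py` serve different purposes: the identity hash answers "have we seen this market?", the content hash answers "has its text changed?". A minimal sketch, assuming SHA-256 and whitespace normalization (the function names and inputs are hypothetical, not the module's actual API):

```python
import hashlib


def identity_hash(venue, market_id):
    """Stable key for a market, independent of its current text."""
    return hashlib.sha256(f"{venue}:{market_id}".encode("utf-8")).hexdigest()


def content_hash(canonical_text):
    """Fingerprint of the canonical text; changes when the wording changes."""
    # Collapse whitespace so formatting-only differences do not trigger re-embedding.
    normalized = " ".join(canonical_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

A market whose identity hash is known but whose content hash differs from the stored one would be a candidate for re-embedding.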

Setup

Prerequisites

Python 3.10+
PostgreSQL
Qdrant (cloud or self-hosted)
GPU recommended (MPS/CUDA for local model inference)

Installation

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configuration

Create a .env file:

# Qdrant (vector database for similarity search)
QDRANT_URL=https://your-qdrant-instance.cloud.qdrant.io/
QDRANT_API_KEY=your_qdrant_api_key
QDRANT_COLLECTION_NAME=market_embeddings
QDRANT_VECTOR_SIZE=2048

# Embedding Model
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-4B
EMBEDDING_DEVICE=mps          # mps (Apple Silicon), cuda, or cpu
EMBEDDING_BATCH_SIZE=8         # overridden by AUTO_BATCH_SIZE if enabled
EMBEDDING_DIM=2048
EMBEDDING_QUANTIZATION=true    # INT8 quantization (~50% VRAM reduction)
EMBEDDING_INSTRUCTION=Given a prediction market contract, represent its core event, resolution conditions, and timeframe for matching equivalent contracts across different prediction market platforms

# Cross-Encoder (NLI-based reranking)
CROSS_ENCODER_QUANTIZATION=true

# GPU Scheduling
AUTO_BATCH_SIZE=true           # dynamic batch sizing from free VRAM
EMBEDDING_GPU_CONCURRENCY=1    # serialized GPU access prevents fragmentation
CROSS_ENCODER_GPU_CONCURRENCY=1
ORCHESTRATOR_NUM_WORKERS=2
WORKER_BATCH_TIMEOUT=1.0       # micro-batch accumulation timeout (seconds)

# Cache
EMBEDDING_CACHE_MAX_SIZE=10000
EXTRACTION_CACHE_MAX_SIZE=1000
VERIFICATION_CACHE_MAX_SIZE=10000

# PostgreSQL
DATABASE_URL=postgresql://user:pass@host:port/database

# Kalshi API
KALSHI_API_KEY_ID=your_kalshi_key_id
KALSHI_PRIVATE_KEY_PATH=/path/to/kalshi.pem

# Bootstrap (initial market fetch)
BOOTSTRAP_ENABLED=true
BOOTSTRAP_MAX_MARKETS_PER_VENUE=0  # 0 = unlimited, set to 50 for dev/testing
BOOTSTRAP_FETCH_TIMEOUT=600
POLYMARKET_GAMMA_API_URL=https://gamma-api.polymarket.com

# Discovery mode
DISCOVERY_MODE=esports         # normal | esports | hybrid
ESPORTS_POLYMARKET_CATEGORIES=esports,gaming,video-games
ESPORTS_KALSHI_KEYWORDS=LOL,LEAGUE,DOTA,VALORANT,CSGO

# Riot API (optional, for esports game event latency measurement)
RIOT_API_KEY=your_riot_api_key
GAME_EVENT_POLL_INTERVAL_MS=1000

# LLM fallback (optional)
EXTRACTION_USE_LLM_FALLBACK=false
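
For illustration, a handful of these variables could be read into a typed settings object as below. This is a sketch only: `PipelineSettings`, `load_settings`, and the fallback defaults are assumptions and do not reflect how `config.py` is actually written.

```python
import os
from dataclasses import dataclass

VALID_DISCOVERY_MODES = {"normal", "esports", "hybrid"}


def _as_bool(raw):
    """Interpret common truthy spellings from environment strings."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PipelineSettings:
    embedding_device: str
    embedding_batch_size: int
    auto_batch_size: bool
    worker_batch_timeout: float
    discovery_mode: str


def load_settings(env=None):
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    mode = env.get("DISCOVERY_MODE", "normal")
    if mode not in VALID_DISCOVERY_MODES:
        raise ValueError(
            f"DISCOVERY_MODE must be one of {sorted(VALID_DISCOVERY_MODES)}, got {mode!r}"
        )
    return PipelineSettings(
        embedding_device=env.get("EMBEDDING_DEVICE", "cpu"),
        embedding_batch_size=int(env.get("EMBEDDING_BATCH_SIZE", "8")),
        auto_batch_size=_as_bool(env.get("AUTO_BATCH_SIZE", "true")),
        worker_batch_timeout=float(env.get("WORKER_BATCH_TIMEOUT", "1.0")),
        discovery_mode=mode,
    )
```

Validating `DISCOVERY_MODE` at load time surfaces a typo immediately instead of partway through a pipeline run.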

Running

# Full pipeline
python -m semantic_pipeline.orchestrator

# Esports spread scanner (exploration)
python spread_scanner_ws_esports.py

# Esports arb monitor (single match)
python monitor_dk_vs_t1.py

Exploration Scripts

These scripts document the research process — investigating whether cross-venue esports prediction markets have exploitable latency or spread divergence.

Script	Purpose
`spread_scanner.py`	REST-based cross-venue spread analysis
`spread_scanner_ws.py`	WebSocket real-time spread monitoring
`spread_scanner_ws_esports.py`	Esports-focused spread scanner with Riot API integration
`monitor_dk_vs_t1.py`	Live arb monitor for DK vs T1 LoL match with latency measurement
`verify_market_equivalence.py`	Manual market pair verification tool
`check_esports_pairs.py`	Inspect matched esports pairs in DB
`check_db_for_polymarket_market.py`	Debug tool for Polymarket market lookup

Status

Archived. This was an iterative exploration into cross-venue arbitrage feasibility and its integration into the PolyEdge platform. The semantic matching pipeline works — cross-venue pairs are correctly identified and verified with high precision. The arb thesis was invalidated: prediction market spreads are too thin and venue-to-venue latency too high for profitable execution after fees. The venue discovery and market resolution infrastructure evolved into PolyEdge's production data API.
