Hello,
I closed issue #1419; this discussion is its logical continuation. Implementation is deferred until after VSR (Viewstamped Replication) lands, since the two are tightly coupled and the design only becomes concrete with VSR in place. Until then, this thread is where we refine the design.
Motivation
Local disk alone does not scale to terabyte workloads, and is overkill when most reads target the tail. Tiered storage keeps hot data on fast media and offloads cold data to cheap object storage (S3, GCS, Azure Blob, MinIO).
Proposed tiers: RAM (cache, active tail), NVMe (hot segments), object storage (bulk cold data). Each independently enabled.
Open Design Areas
Tier-aware reads - resolve an offset range across tiers and stitch results (e.g. 0..9999 on S3, 10000..19999 on disk, 20000..20990 in RAM). RAM + disk already works.
Migration - move segments outward by age, size, or explicit trigger. Message archiver covers part of this today.
Object storage I/O - no LIST scans (maintain index objects), write batches of 8 to 16 MiB, never one-object-per-message. Reuse offset + timestamp indexes, stored as objects.
Backend abstraction - OpenDAL is the obvious candidate (broad coverage, one API). Open question: expose directly, wrap behind an Iggy trait, or both.
Runtime portability - must work on compio (io_uring). OpenDAL's framework is runtime-agnostic (Execute trait, compfs/monoiofs services exist), but the S3/GCS HTTP path goes through reqwest which is tokio-bound. Needs a plan (separate executor, IPC sidecar, or upstream HTTP client abstraction).
Format migration - design a translation path for stored objects across breaking version changes.
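To make the tier-aware read item concrete, here is a minimal sketch of splitting a requested offset range into per-tier sub-ranges. Everything in it (`Tier`, `TierExtent`, `plan_read`, `example_layout`) is a hypothetical name for illustration, not an existing Iggy API. It assumes each tier holds contiguous, non-overlapping half-open offset ranges that are already sorted by start offset.

```rust
use std::ops::Range;

/// Storage tiers from the proposal, coldest to hottest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    ObjectStorage,
    Disk,
    Ram,
}

/// Hypothetical descriptor: a contiguous half-open offset range held by one tier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TierExtent {
    pub tier: Tier,
    pub offsets: Range<u64>,
}

/// Splits `request` into per-tier sub-ranges, returned in offset order.
///
/// Assumes `extents` is sorted by start offset and non-overlapping.
/// Returns `None` if any part of the request is not covered by a tier.
pub fn plan_read(extents: &[TierExtent], request: Range<u64>) -> Option<Vec<TierExtent>> {
    let mut plan = Vec::new();
    let mut cursor = request.start;

    for extent in extents {
        if cursor >= request.end {
            break; // request fully covered
        }
        if extent.offsets.end <= cursor {
            continue; // extent lies entirely before the unread part
        }
        if extent.offsets.start > cursor {
            return None; // gap between tiers
        }
        let end = extent.offsets.end.min(request.end);
        plan.push(TierExtent {
            tier: extent.tier,
            offsets: cursor..end,
        });
        cursor = end;
    }

    if cursor >= request.end {
        Some(plan)
    } else {
        None // request runs past the last known offset
    }
}

/// The layout used as the example in this thread, as half-open ranges:
/// 0..9999 on S3, 10000..19999 on disk, 20000..20990 in RAM.
pub fn example_layout() -> Vec<TierExtent> {
    vec![
        TierExtent { tier: Tier::ObjectStorage, offsets: 0..10_000 },
        TierExtent { tier: Tier::Disk, offsets: 10_000..20_000 },
        TierExtent { tier: Tier::Ram, offsets: 20_000..20_991 },
    ]
}
```

With that layout, `plan_read(&example_layout(), 9_500..20_100)` yields three sub-reads (object storage, disk, RAM) that the server would fetch and stitch in order. A request that reaches past offset 20990 returns `None` rather than a partial plan. Whether a partial result is preferable there is a design choice this sketch does not settle.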
Open questions
Real tier-sizing ratios
Strong opinions on OpenDAL vs. native backends (we use compio...)
Is "object storage only, no local disk" a real deployment for users?
Note: separate from this, an S3 sink connector PR is already open at #3103 - that targets the connector framework (sink data into S3 from a stream), not server-side tiered persistence. Different layer, but worth tracking as prior art for S3 plumbing.
Thoughts?
Sources
executors-tokio feature, services-compfs / services-monoiofs services