NoKV

v0.7.2
Published: Mar 29, 2026 License: Apache-2.0 Imports: 30 Imported by: 0

README

🚀 NoKV – High-Performance Distributed KV Engine


LSM Tree • ValueLog • MVCC • Multi-Raft Regions • Redis-Compatible

NoKV is a Go-native storage engine that mixes RocksDB-style manifest discipline with Badger-inspired value separation. You can embed it locally, drive it via multi-Raft regions, or front it with a Redis protocol gateway—all from a single topology file.

📊 CI Benchmark Snapshot

The table below is the most recent benchmark snapshot checked into the repository, taken from the latest successful YCSB run on the main CI pipeline at the time of update (run #23701742757). That run used the then-current benchmark profile: workloads A-F, records=1,000,000, ops=1,000,000, value_size=1000, value_threshold=2048, conc=16.

Methodology and harness details live in benchmark/README.md.

Engine Workload Mode Ops/s Avg Latency P95 P99
NoKV YCSB-A 50/50 read/update 175,905 5.684µs 204.039µs 307.851µs
NoKV YCSB-B 95/5 read/update 525,631 1.902µs 24.115µs 750.413µs
NoKV YCSB-C 100% read 409,136 2.444µs 15.077µs 25.658µs
NoKV YCSB-D 95% read, 5% insert (latest) 632,031 1.582µs 21.811µs 638.457µs
NoKV YCSB-E 95% scan, 5% insert 45,620 21.92µs 139.449µs 9.203945ms
NoKV YCSB-F read-modify-write 157,732 6.339µs 232.743µs 371.209µs
Badger YCSB-A 50/50 read/update 108,232 9.239µs 285.74µs 483.139µs
Badger YCSB-B 95/5 read/update 188,893 5.294µs 274.549µs 566.042µs
Badger YCSB-C 100% read 242,463 4.124µs 36.549µs 1.862803ms
Badger YCSB-D 95% read, 5% insert (latest) 284,205 3.518µs 233.414µs 479.801µs
Badger YCSB-E 95% scan, 5% insert 15,027 66.547µs 4.064653ms 7.534558ms
Badger YCSB-F read-modify-write 84,601 11.82µs 407.624µs 645.491µs
Pebble YCSB-A 50/50 read/update 169,792 5.889µs 491.322µs 1.65907ms
Pebble YCSB-B 95/5 read/update 137,483 7.273µs 658.763µs 1.415039ms
Pebble YCSB-C 100% read 90,474 11.052µs 878.733µs 1.817526ms
Pebble YCSB-D 95% read, 5% insert (latest) 198,139 5.046µs 491.515µs 1.282231ms
Pebble YCSB-E 95% scan, 5% insert 40,793 24.513µs 1.332974ms 2.301008ms
Pebble YCSB-F read-modify-write 122,192 8.183µs 760.934µs 1.71655ms

🚦 Quick Start

Start an end-to-end playground with either the local script or Docker Compose. Both spin up a three-node Raft cluster with a PD-lite service and expose the Redis-compatible gateway.


# Option A: local processes
./scripts/run_local_cluster.sh --config ./raft_config.example.json
# In another shell: launch the Redis gateway on top of the running cluster
go run ./cmd/nokv-redis \
  --addr 127.0.0.1:6380 \
  --raft-config ./raft_config.example.json \
  --metrics-addr 127.0.0.1:9100

# Option B: Docker Compose (cluster + gateway + PD)
docker compose up --build
# Tear down
docker compose down -v

Once the cluster is running you can point any Redis client at 127.0.0.1:6380 (or the address exposed by Compose).

For quick CLI checks:

# Online stats from a running node
go run ./cmd/nokv stats --expvar http://127.0.0.1:9100

# Offline forensics from a stopped node workdir
go run ./cmd/nokv stats --workdir ./artifacts/cluster/store-1

Minimal embedded snippet:

package main

import (
	"fmt"
	"log"

	NoKV "github.com/feichai0017/NoKV"
)

func main() {
	opt := NoKV.NewDefaultOptions()
	opt.WorkDir = "./workdir-demo"

	db, err := NoKV.Open(opt)
	if err != nil {
		log.Fatalf("open failed: %v", err)
	}
	defer db.Close()

	key := []byte("hello")
	if err := db.Set(key, []byte("world")); err != nil {
		log.Fatalf("set failed: %v", err)
	}

	entry, err := db.Get(key)
	if err != nil {
		log.Fatalf("get failed: %v", err)
	}
	fmt.Printf("value=%s\n", entry.Value)
}

Note:

  • DB.Get returns detached entries (do not call DecrRef).
  • DB.GetInternalEntry returns borrowed entries and callers must call DecrRef exactly once.
  • DB.SetWithTTL accepts time.Duration (relative TTL). DB.Set/DB.SetBatch/DB.SetWithTTL reject nil values; use DB.Del or DB.DeleteRange(start,end) for deletes.
  • DB.NewIterator exposes user-facing entries, while DB.NewInternalIterator scans raw internal keys (cf+user_key+ts).
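The internal-key layout the notes above refer to (cf+user_key+ts) can be sketched as follows. The 1-byte CF marker and big-endian 8-byte timestamp here are illustrative assumptions, not NoKV's exact encoding; real callers should decode via kv.SplitInternalKey.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeInternalKey builds an illustrative internal key: a 1-byte column-family
// marker, the user key, and a big-endian 8-byte timestamp suffix.
func encodeInternalKey(cf byte, userKey []byte, ts uint64) []byte {
	out := make([]byte, 0, 1+len(userKey)+8)
	out = append(out, cf)
	out = append(out, userKey...)
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ts)
	return append(out, buf[:]...)
}

// splitInternalKey reverses the layout; ok=false signals a malformed key,
// mirroring the ok=false case the NewInternalIterator docs mention.
func splitInternalKey(ik []byte) (cf byte, userKey []byte, ts uint64, ok bool) {
	if len(ik) < 9 { // 1 cf byte + empty key + 8 ts bytes is the minimum
		return 0, nil, 0, false
	}
	return ik[0], ik[1 : len(ik)-8], binary.BigEndian.Uint64(ik[len(ik)-8:]), true
}

func main() {
	ik := encodeInternalKey(0, []byte("hello"), 42)
	cf, uk, ts, ok := splitInternalKey(ik)
	fmt.Println(cf, string(uk), ts, ok)
}
```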

ℹ️ run_local_cluster.sh rebuilds nokv and nokv-config, seeds local peer catalogs via nokv-config manifest, starts PD-lite (nokv pd), streams PD/store logs to the current terminal, and also writes them under artifacts/cluster/store-<id>/server.log and artifacts/cluster/pd.log. Use Ctrl+C to exit cleanly; if the process crashes, wipe the workdir (rm -rf ./artifacts/cluster) before restarting to avoid WAL replay errors.


🧭 Topology & Configuration

Everything hangs off a single file: raft_config.example.json. A schematic excerpt:

"pd": { "addr": "127.0.0.1:2379", "docker_addr": "nokv-pd:2379" },
"stores": [
  { "store_id": 1, "listen_addr": "127.0.0.1:20170", ... },
  { "store_id": 2, "listen_addr": "127.0.0.1:20171", ... },
  { "store_id": 3, "listen_addr": "127.0.0.1:20172", ... }
],
"regions": [
  { "id": 1, "range": [-inf,"m"), peers: 101/201/301, leader: store 1 },
  { "id": 2, "range": ["m",+inf), peers: 102/202/302, leader: store 2 }
]
  • Local scripts (run_local_cluster.sh, serve_from_config.sh, bootstrap_from_config.sh) ingest the same JSON, so local runs match production layouts.
  • Docker Compose mounts the file into each container; manifests, transports, and Redis gateway all stay in sync.
  • Need more stores or regions? Update the JSON and re-run the script/Compose—no code changes required.
  • Programmatic access: import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate for a single source of truth across tools.
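A minimal sketch of loading and sanity-checking such a topology file with only the standard library; the struct shape below is a hypothetical mirror of the JSON excerpt, and real tools should import github.com/feichai0017/NoKV/config and use config.LoadFile / Validate instead.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// topology is a hand-rolled, illustrative mirror of raft_config.example.json.
// The authoritative schema lives in NoKV's config package.
type topology struct {
	PD struct {
		Addr string `json:"addr"`
	} `json:"pd"`
	Stores []struct {
		StoreID    uint64 `json:"store_id"`
		ListenAddr string `json:"listen_addr"`
	} `json:"stores"`
}

// loadTopology unmarshals the JSON and applies two basic validity checks:
// a PD address must be present, and at least one store must be declared.
func loadTopology(data []byte) (*topology, error) {
	var t topology
	if err := json.Unmarshal(data, &t); err != nil {
		return nil, err
	}
	if t.PD.Addr == "" || len(t.Stores) == 0 {
		return nil, fmt.Errorf("topology: pd.addr and at least one store are required")
	}
	return &t, nil
}

func main() {
	raw := []byte(`{"pd":{"addr":"127.0.0.1:2379"},"stores":[{"store_id":1,"listen_addr":"127.0.0.1:20170"}]}`)
	t, err := loadTopology(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(t.PD.Addr, len(t.Stores))
}
```
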
🧬 Tech Stack Snapshot
Layer Tech/Package Why it matters
Storage Core lsm/, wal/, vlog/ Hybrid log-structured design with manifest-backed durability and value separation.
Concurrency percolator/, raftstore/client Distributed 2PC, lock management, and MVCC version semantics in raft mode.
Replication raftstore/* + pd/* Multi-Raft data plane plus PD-backed control plane (routing, TSO, heartbeats).
Tooling cmd/nokv, cmd/nokv-config, cmd/nokv-redis CLI, config helper, Redis-compatible gateway share the same topology file.
Observability stats, hotring, expvar Built-in metrics, hot-key analytics, and crash recovery traces.

🧱 Architecture Overview

graph TD
    Client[Client API] -->|Set/Get| DBCore
    DBCore -->|Append| WAL
    DBCore -->|Insert| MemTable
    DBCore -->|ValuePtr| ValueLog
    MemTable -->|Flush Task| FlushMgr
    FlushMgr -->|Build SST| SSTBuilder
    SSTBuilder -->|LogEdit| Manifest
    Manifest -->|Version| LSMLevels
    LSMLevels -->|Compaction| Compactor
    FlushMgr -->|Discard Stats| ValueLog
    ValueLog -->|GC updates| Manifest
    DBCore -->|Stats/HotKeys| Observability

Key ideas:

  • Durability path – WAL first, memtable second. ValueLog writes occur before WAL append so crash replay can fully rebuild state.
  • Metadata – manifest stores SST topology, WAL checkpoints, and vlog head/deletion metadata.
  • Background workers – flush manager handles Prepare → Build → Install → Release, compaction reduces level overlap, and value log GC rewrites segments based on discard stats.
  • Distributed transactions – Percolator 2PC runs in raft mode; embedded mode exposes non-transactional DB APIs.
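The durability ordering described above can be illustrated with in-memory stand-ins; everything here is a toy, not NoKV's actual write path.

```go
package main

import "fmt"

// toyDB illustrates the write ordering from the key ideas above: the value
// log is written first, then a WAL record carrying a value pointer is
// appended, and only then does the memtable change. All three stores are toys.
type toyDB struct {
	vlog     [][]byte
	wal      []string
	memtable map[string]int // user key -> index into vlog (a toy ValuePtr)
}

func (db *toyDB) put(key string, value []byte) {
	ptr := len(db.vlog)
	db.vlog = append(db.vlog, value)                        // 1. value log write
	db.wal = append(db.wal, fmt.Sprintf("%s@%d", key, ptr)) // 2. WAL append
	db.memtable[key] = ptr                                  // 3. memtable insert
}

func main() {
	db := &toyDB{memtable: map[string]int{}}
	db.put("hello", []byte("world"))
	fmt.Println(string(db.vlog[db.memtable["hello"]]))
}
```

Because each step is durable (or replayable) before the next one runs, crash replay can always rebuild the later structures from the earlier ones.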

Dive deeper in docs/architecture.md.


🧩 Module Breakdown

Module Responsibilities Source Docs
WAL Append-only segments with CRC, rotation, replay (wal.Manager). wal/ WAL internals
LSM MemTable, flush pipeline, leveled compactions, iterator merging. lsm/ Memtable, Flush pipeline, Cache, Range filter
Manifest VersionEdit log + CURRENT handling, WAL/vlog checkpoints, value-log metadata. manifest/ Manifest semantics
ValueLog Large value storage, GC, discard stats integration. vlog.go, vlog/ Value log design
Percolator Distributed MVCC 2PC primitives (prewrite/commit/rollback/resolve/status). percolator/ Percolator transactions
RaftStore Multi-Raft Region management, hooks, metrics, transport. raftstore/ RaftStore overview
HotRing Hot key tracking, throttling helpers. hotring/ HotRing overview
Observability Periodic stats, hot key tracking, CLI integration. stats.go, cmd/nokv Stats & observability, CLI reference
Filesystem Pebble-inspired vfs abstraction + mmap-backed file helpers shared by SST/vlog, WAL, and manifest. vfs/, file/ VFS, File abstractions

Each module has a dedicated document under docs/ describing APIs, diagrams, and recovery notes.


📡 Observability & CLI

  • Stats.StartStats publishes metrics via expvar (flush backlog, WAL segments, value log GC stats, raft/region/cache/hot metrics).
  • cmd/nokv gives you:
    • nokv stats --workdir <dir> [--json] [--no-region-metrics]
    • nokv manifest --workdir <dir>
    • nokv regions --workdir <dir> [--json]
    • nokv vlog --workdir <dir>
  • hotring continuously surfaces hot keys in stats + CLI so you can pre-warm caches or debug skewed workloads.

More in docs/cli.md and docs/testing.md.


🔌 Redis Gateway

  • cmd/nokv-redis exposes a RESP-compatible endpoint. In embedded mode (--workdir) commands execute through regular DB APIs; in distributed mode (--raft-config) calls are routed through raftstore/client and committed with TwoPhaseCommit.
  • In raft mode, TTL is persisted directly in each value entry (expires_at) through the same 2PC write path as the value payload.
  • --metrics-addr exposes Redis gateway metrics under NoKV.Stats.redis via expvar. In raft mode, --pd-addr can override config.pd when you need a non-default PD endpoint.
  • A ready-to-use cluster configuration is available at raft_config.example.json, matching both scripts/run_local_cluster.sh and the Docker Compose setup.

For the complete command matrix, configuration and deployment guides, see docs/nokv-redis.md.


📄 License

Apache-2.0. See LICENSE.

Documentation

Overview

Package NoKV provides the embedded database API and engine wiring.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BatchSetItem added in v0.7.1

type BatchSetItem struct {
	Key   []byte
	Value []byte
}

BatchSetItem represents one non-transactional write in the default CF.

Ownership note: key is copied into the internal-key encoding; value is referenced directly until the write path finishes.

type CacheStatsSnapshot added in v0.6.0

type CacheStatsSnapshot struct {
	BlockL0HitRate float64 `json:"block_l0_hit_rate"`
	BlockL1HitRate float64 `json:"block_l1_hit_rate"`
	IndexHitRate   float64 `json:"index_hit_rate"`
	IteratorReused uint64  `json:"iterator_reused"`
}

CacheStatsSnapshot captures block/index/bloom hit-rate indicators.

type CompactionPolicy added in v0.7.1

type CompactionPolicy string

CompactionPolicy selects the strategy used to arrange compaction priorities.

const (
	CompactionPolicyLeveled CompactionPolicy = "leveled"
	CompactionPolicyTiered  CompactionPolicy = "tiered"
	CompactionPolicyHybrid  CompactionPolicy = "hybrid"
)

type CompactionStatsSnapshot added in v0.6.0

type CompactionStatsSnapshot struct {
	Backlog              int64   `json:"backlog"`
	MaxScore             float64 `json:"max_score"`
	LastDurationMs       float64 `json:"last_duration_ms"`
	MaxDurationMs        float64 `json:"max_duration_ms"`
	Runs                 uint64  `json:"runs"`
	IngestRuns           int64   `json:"ingest_runs"`
	MergeRuns            int64   `json:"ingest_merge_runs"`
	IngestMs             float64 `json:"ingest_ms"`
	MergeMs              float64 `json:"ingest_merge_ms"`
	IngestTables         int64   `json:"ingest_tables"`
	MergeTables          int64   `json:"ingest_merge_tables"`
	ValueWeight          float64 `json:"value_weight"`
	ValueWeightSuggested float64 `json:"value_weight_suggested,omitempty"`
}

CompactionStatsSnapshot summarizes compaction backlog, runtime, and ingest behavior.

type DB

type DB struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

DB is the global handle for the engine and owns shared resources.

func Open

func Open(opt *Options) (_ *DB, err error)

Open constructs the database and returns initialization errors instead of panicking.

func (*DB) ApplyInternalEntries added in v0.7.1

func (db *DB) ApplyInternalEntries(entries []*kv.Entry) error

ApplyInternalEntries writes pre-built internal-key entries through the regular write pipeline.

The caller must provide entries with internal keys. The entry slices must not be mutated until this call returns.

func (*DB) Close

func (db *DB) Close() error

Close stops background workers and flushes in-memory state before releasing all resources.

func (*DB) Del

func (db *DB) Del(key []byte) error

Del removes a key from the default column family by writing a tombstone.

func (*DB) DeleteRange added in v0.7.1

func (db *DB) DeleteRange(start, end []byte) error

DeleteRange removes all keys in [start, end) from the default column family.
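The half-open interval semantics can be made concrete with a small membership check; this helper is illustrative, not part of the NoKV API.

```go
package main

import (
	"bytes"
	"fmt"
)

// inRange mirrors DeleteRange's documented interval: start is inclusive,
// end is exclusive, and keys compare lexicographically as byte strings.
func inRange(key, start, end []byte) bool {
	return bytes.Compare(key, start) >= 0 && bytes.Compare(key, end) < 0
}

func main() {
	fmt.Println(inRange([]byte("m"), []byte("a"), []byte("n"))) // true: "a" <= "m" < "n"
	fmt.Println(inRange([]byte("n"), []byte("a"), []byte("n"))) // false: end is exclusive
}
```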

func (*DB) Get

func (db *DB) Get(key []byte) (*kv.Entry, error)

Get reads the latest visible value for key from the default column family.

func (*DB) GetInternalEntry added in v0.7.1

func (db *DB) GetInternalEntry(cf kv.ColumnFamily, key []byte, version uint64) (*kv.Entry, error)

GetInternalEntry retrieves one internal-key record for the provided version.

The returned entry is borrowed from internal pools and returned as-is (no clone/no copy). entry.Key remains in internal encoding (cf+user_key+ts). Callers MUST call DecrRef exactly once when finished.

func (*DB) GetValueSeparationPolicyStats added in v0.7.1

func (db *DB) GetValueSeparationPolicyStats() map[string]int64

GetValueSeparationPolicyStats returns the current value separation policy statistics. Returns nil if no policies are configured.

func (*DB) Info

func (db *DB) Info() *Stats

Info returns the live stats collector for snapshot/diagnostic access.

func (*DB) IsClosed

func (db *DB) IsClosed() bool

IsClosed reports whether Close has finished and the DB no longer accepts work.

func (*DB) NewInternalIterator added in v0.5.0

func (db *DB) NewInternalIterator(opt *utils.Options) utils.Iterator

NewInternalIterator returns an iterator over internal keys (CF marker + user key + timestamp). Callers should decode kv.Entry.Key via kv.SplitInternalKey and handle ok=false.

func (*DB) NewIterator

func (db *DB) NewIterator(opt *utils.Options) utils.Iterator

NewIterator creates a DB-level iterator over user keys in the default column family.

func (*DB) RaftLog added in v0.7.2

func (db *DB) RaftLog() RaftLog

RaftLog returns the raft peer-storage capability backed by the DB WAL.

func (*DB) RunValueLogGC

func (db *DB) RunValueLogGC(discardRatio float64) error

RunValueLogGC triggers a value log garbage collection.
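A toy version of the discard-ratio decision, under the assumption that a segment qualifies for GC once the discarded fraction of its bytes reaches the requested ratio; NoKV's real scheduler additionally weighs compaction pressure (see the ValueLogGCSkipScore/ValueLogGCSkipBacklog options).

```go
package main

import "fmt"

// shouldGC reports whether a value-log segment is a GC candidate: the
// discarded fraction of its bytes has reached the discard ratio.
func shouldGC(discardedBytes, totalBytes int64, discardRatio float64) bool {
	if totalBytes <= 0 {
		return false
	}
	return float64(discardedBytes)/float64(totalBytes) >= discardRatio
}

func main() {
	fmt.Println(shouldGC(600, 1000, 0.5)) // true: 60% of the segment is stale
	fmt.Println(shouldGC(100, 1000, 0.5)) // false: only 10% is stale
}
```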

func (*DB) Set

func (db *DB) Set(key, value []byte) error

Set writes a key/value pair into the default column family. Use Del for explicit deletion; nil values are rejected.

func (*DB) SetBatch added in v0.7.1

func (db *DB) SetBatch(items []BatchSetItem) error

SetBatch writes multiple key/value pairs into the default column family.

Semantics:

  • Non-transactional API: each entry receives a monotonically increasing internal version.
  • The batch is submitted through the regular write pipeline and commit queue.

Validation:

  • Empty batch is a no-op.
  • Every item must have a non-empty key and non-nil value.

Ownership:

  • key bytes are encoded into internal keys.
  • value slices are referenced directly until this call returns; callers must keep them immutable for the duration of this call.
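The validation rules above can be sketched as a standalone check; batchSetItem is redeclared locally so the snippet is self-contained, mirroring the BatchSetItem shape documented above.

```go
package main

import "fmt"

// batchSetItem mirrors BatchSetItem (Key/Value byte slices), redeclared here
// so this sketch needs no external imports.
type batchSetItem struct {
	Key   []byte
	Value []byte
}

// validateBatch applies SetBatch's documented rules: an empty batch is a
// no-op, and every item needs a non-empty key and a non-nil value.
func validateBatch(items []batchSetItem) error {
	for i, it := range items {
		if len(it.Key) == 0 {
			return fmt.Errorf("item %d: empty key", i)
		}
		if it.Value == nil {
			return fmt.Errorf("item %d: nil value", i)
		}
	}
	return nil
}

func main() {
	ok := []batchSetItem{{Key: []byte("k"), Value: []byte("v")}}
	fmt.Println(validateBatch(ok), validateBatch(nil))
}
```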

func (*DB) SetRegionMetrics

func (db *DB) SetRegionMetrics(rm *metrics.RegionMetrics)

SetRegionMetrics attaches region metrics recorder so Stats snapshot and expvar include region state counts.

func (*DB) SetWithTTL added in v0.7.0

func (db *DB) SetWithTTL(key, value []byte, ttl time.Duration) error

SetWithTTL writes a key/value pair into the default column family with TTL. Use Del for explicit deletion; nil values are rejected.

Ownership note: key is encoded into a new internal-key buffer, while value is referenced directly (no deep copy). Callers must keep value immutable until this method returns.

func (*DB) WAL

func (db *DB) WAL() *wal.Manager

WAL exposes the underlying WAL manager.

func (*DB) WorkDir added in v0.7.1

func (db *DB) WorkDir() string

WorkDir returns the database working directory.

type DBIterator

type DBIterator struct {
	// contains filtered or unexported fields
}

DBIterator wraps the merged LSM iterators and optionally resolves value-log pointers.

func (*DBIterator) Close

func (iter *DBIterator) Close() error

Close releases underlying iterators and returns pooled iterator context.

func (*DBIterator) Err added in v0.7.1

func (iter *DBIterator) Err() error

Err returns the error that stopped iteration, if any. Returns nil if iteration completed successfully or is still in progress. This method follows the pattern established by EntryIterator and RecordIterator.

func (*DBIterator) Item

func (iter *DBIterator) Item() utils.Item

Item returns the currently materialized item, or nil when iterator is invalid.

func (*DBIterator) Next

func (iter *DBIterator) Next()

Next advances to the next visible key/value pair.

func (*DBIterator) Rewind

func (iter *DBIterator) Rewind()

Rewind positions the iterator at the first or last key based on scan direction.

func (*DBIterator) Seek

func (iter *DBIterator) Seek(key []byte)

Seek positions the iterator at the first key >= key in default column family order.

func (*DBIterator) Valid

func (iter *DBIterator) Valid() bool

Valid reports whether the iterator currently points at a valid item.

type FlushStatsSnapshot added in v0.6.0

type FlushStatsSnapshot struct {
	Pending       int64   `json:"pending"`
	QueueLength   int64   `json:"queue_length"`
	Active        int64   `json:"active"`
	WaitMs        float64 `json:"wait_ms"`
	LastWaitMs    float64 `json:"last_wait_ms"`
	MaxWaitMs     float64 `json:"max_wait_ms"`
	BuildMs       float64 `json:"build_ms"`
	LastBuildMs   float64 `json:"last_build_ms"`
	MaxBuildMs    float64 `json:"max_build_ms"`
	ReleaseMs     float64 `json:"release_ms"`
	LastReleaseMs float64 `json:"last_release_ms"`
	MaxReleaseMs  float64 `json:"max_release_ms"`
	Completed     int64   `json:"completed"`
}

FlushStatsSnapshot summarizes flush queue depth and stage timing.

type HotKeyStat

type HotKeyStat struct {
	Key   string `json:"key"`
	Count int32  `json:"count"`
}

HotKeyStat represents one hot key and its observed touch count.

type HotStatsSnapshot added in v0.6.0

type HotStatsSnapshot struct {
	WriteKeys []HotKeyStat   `json:"write_keys,omitempty"`
	WriteRing *hotring.Stats `json:"write_ring,omitempty"`
}

HotStatsSnapshot contains write-hot keys and optional ring internals.

type Item

type Item struct {
	// contains filtered or unexported fields
}

Item is the user-facing iterator item backed by an entry and optional vlog reader.

func (*Item) Entry

func (it *Item) Entry() *kv.Entry

Entry returns the current entry view for this iterator item.

func (*Item) ValueCopy

func (it *Item) ValueCopy(dst []byte) ([]byte, error)

ValueCopy returns a copy of the current value into dst (if provided). Mirrors Badger's semantics to aid callers expecting defensive copies.
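The defensive-copy contract can be sketched in a few lines; this helper is illustrative of the semantics, not NoKV's implementation.

```go
package main

import "fmt"

// valueCopy shows the contract ValueCopy describes: reuse dst's backing
// array when it is large enough, otherwise allocate, and always return
// bytes detached from the source buffer.
func valueCopy(dst, src []byte) []byte {
	return append(dst[:0], src...)
}

func main() {
	src := []byte("world")
	out := valueCopy(nil, src)
	src[0] = 'W' // mutating the source must not affect the copy
	fmt.Println(string(out))
}
```

Detached copies like this are what let callers hold values past the iterator's lifetime, which borrowed entries (as returned by GetInternalEntry) do not allow.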

type LSMLevelStats added in v0.4.0

type LSMLevelStats struct {
	Level              int     `json:"level"`
	TableCount         int     `json:"tables"`
	SizeBytes          int64   `json:"size_bytes"`
	ValueBytes         int64   `json:"value_bytes"`
	StaleBytes         int64   `json:"stale_bytes"`
	IngestTables       int     `json:"ingest_tables"`
	IngestSizeBytes    int64   `json:"ingest_size_bytes"`
	IngestValueBytes   int64   `json:"ingest_value_bytes"`
	ValueDensity       float64 `json:"value_density"`
	IngestValueDensity float64 `json:"ingest_value_density"`
	IngestRuns         int64   `json:"ingest_runs"`
	IngestMs           float64 `json:"ingest_ms"`
	IngestTablesCount  int64   `json:"ingest_tables_compacted"`
	MergeRuns          int64   `json:"ingest_merge_runs"`
	MergeMs            float64 `json:"ingest_merge_ms"`
	MergeTables        int64   `json:"ingest_merge_tables"`
}

LSMLevelStats captures aggregated metrics per LSM level.

type LSMStatsSnapshot added in v0.6.0

type LSMStatsSnapshot struct {
	Levels            []LSMLevelStats          `json:"levels,omitempty"`
	ValueBytesTotal   int64                    `json:"value_bytes_total"`
	ValueDensityMax   float64                  `json:"value_density_max"`
	ValueDensityAlert bool                     `json:"value_density_alert"`
	RangeFilter       RangeFilterStatsSnapshot `json:"range_filter"`
}

LSMStatsSnapshot summarizes per-level storage shape and value-density signals.

type MVCCStore added in v0.7.1

type MVCCStore interface {
	ApplyInternalEntries(entries []*kv.Entry) error
	// GetInternalEntry returns a borrowed internal entry without cloning/copying.
	// entry.Key remains in internal encoding (cf+user_key+ts). Callers must
	// DecrRef exactly once.
	GetInternalEntry(cf kv.ColumnFamily, key []byte, version uint64) (*kv.Entry, error)
	NewInternalIterator(opt *utils.Options) utils.Iterator
}

MVCCStore defines MVCC/internal operations consumed by percolator and raftstore.

type MemTableEngine added in v0.4.2

type MemTableEngine string

MemTableEngine selects the in-memory index implementation used by memtables.

const (
	MemTableEngineSkiplist MemTableEngine = "skiplist"
	MemTableEngineART      MemTableEngine = "art"
)

type Options

type Options struct {
	// FS provides the filesystem implementation used by DB runtime components.
	// Nil defaults to vfs.OSFS.
	FS vfs.FS

	ValueThreshold int64
	WorkDir        string
	MemTableSize   int64
	MemTableEngine MemTableEngine
	SSTableMaxSz   int64
	// MaxBatchCount bounds the number of entries grouped into one internal
	// write batch. NewDefaultOptions exposes a concrete default; zero is only
	// interpreted as a legacy unset value during normalization.
	MaxBatchCount int64
	// MaxBatchSize bounds the size in bytes of one internal write batch.
	// NewDefaultOptions exposes a concrete default; zero is only interpreted as
	// a legacy unset value during normalization.
	MaxBatchSize       int64
	ValueLogFileSize   int
	ValueLogMaxEntries uint32
	// ValueLogBucketCount controls how many hash buckets the value log uses.
	// Values <= 1 disable bucketization.
	ValueLogBucketCount     int
	ValueSeparationPolicies []*kv.ValueSeparationPolicy

	// ValueLogGCInterval specifies how frequently to trigger a check for value
	// log garbage collection. Zero or negative values disable automatic GC.
	ValueLogGCInterval time.Duration
	// ValueLogGCDiscardRatio is the discard ratio for a value log file to be
	// considered for garbage collection. It must be in the range (0.0, 1.0).
	ValueLogGCDiscardRatio float64
	// ValueLogGCParallelism controls how many value-log GC tasks can run in
	// parallel. Values <= 0 auto-tune based on compaction workers.
	ValueLogGCParallelism int
	// ValueLogGCReduceScore lowers GC parallelism when compaction max score meets
	// or exceeds this threshold. Values <= 0 use defaults.
	ValueLogGCReduceScore float64
	// ValueLogGCSkipScore skips GC when compaction max score meets or exceeds this
	// threshold. Values <= 0 use defaults.
	ValueLogGCSkipScore float64
	// ValueLogGCReduceBacklog lowers GC parallelism when compaction backlog meets
	// or exceeds this threshold. Values <= 0 use defaults.
	ValueLogGCReduceBacklog int
	// ValueLogGCSkipBacklog skips GC when compaction backlog meets or exceeds this
	// threshold. Values <= 0 use defaults.
	ValueLogGCSkipBacklog int

	// Value log GC sampling parameters. Ratios <= 0 fall back to defaults.
	ValueLogGCSampleSizeRatio  float64
	ValueLogGCSampleCountRatio float64
	ValueLogGCSampleFromHead   bool

	// ValueLogVerbose enables verbose logging across value-log operations.
	ValueLogVerbose bool

	// WriteBatchMaxCount bounds how many requests the commit worker coalesces in
	// one pass. NewDefaultOptions exposes a concrete default; zero is only
	// interpreted as a legacy unset value during normalization.
	WriteBatchMaxCount int
	// WriteBatchMaxSize bounds the byte size the commit worker coalesces in one
	// pass. NewDefaultOptions exposes a concrete default; zero is only
	// interpreted as a legacy unset value during normalization.
	WriteBatchMaxSize int64

	DetectConflicts bool
	HotRingEnabled  bool
	HotRingBits     uint8
	HotRingTopK     int
	// HotRingRotationInterval enables dual-ring rotation for hotness tracking.
	// Zero disables rotation.
	HotRingRotationInterval time.Duration
	// HotRingNodeCap caps the number of tracked keys per ring. Zero disables the cap.
	HotRingNodeCap uint64
	// HotRingNodeSampleBits controls stable sampling once the cap is reached.
	// A value of 0 enforces a strict cap; larger values sample 1/2^N keys.
	HotRingNodeSampleBits uint8
	// HotRingDecayInterval controls how often HotRing halves its global counters.
	// Zero disables periodic decay.
	HotRingDecayInterval time.Duration
	// HotRingDecayShift determines how aggressively counters decay (count >>= shift).
	HotRingDecayShift uint32
	// HotRingWindowSlots controls the number of sliding-window buckets tracked per key.
	// Zero disables the sliding window.
	HotRingWindowSlots int
	// HotRingWindowSlotDuration sets the duration of each sliding-window bucket.
	HotRingWindowSlotDuration time.Duration
	SyncWrites                bool
	// SyncPipeline enables a dedicated sync worker goroutine that decouples
	// WAL fsync from the commit pipeline. When false (the default), the commit
	// worker performs fsync inline. Only effective when SyncWrites is true.
	SyncPipeline bool
	ManifestSync bool
	// ManifestRewriteThreshold triggers a manifest rewrite when the active
	// MANIFEST file grows beyond this size (bytes). Values <= 0 disable rewrites.
	ManifestRewriteThreshold int64
	// WriteHotKeyLimit caps how many consecutive writes a single key can issue
	// before the DB returns utils.ErrHotKeyWriteThrottle. Zero disables write-path
	// throttling.
	WriteHotKeyLimit int32
	// WriteBatchWait adds an optional coalescing delay when the commit queue is
	// momentarily empty, letting small bursts share one WAL fsync/apply pass.
	// Zero disables the delay.
	WriteBatchWait time.Duration
	// WriteThrottleMinRate is the target write admission rate in bytes/sec when
	// slowdown pressure approaches the stop threshold. NewDefaultOptions
	// exposes a concrete default; zero is only interpreted as a legacy unset
	// value during normalization.
	WriteThrottleMinRate int64
	// WriteThrottleMaxRate is the target write admission rate in bytes/sec when
	// slowdown first becomes active. NewDefaultOptions exposes a concrete
	// default; zero is only interpreted as a legacy unset value during
	// normalization.
	WriteThrottleMaxRate int64

	// BlockCacheBytes bounds the in-memory budget for cached L0/L1 data blocks.
	// Deeper levels continue to rely on the OS page cache.
	BlockCacheBytes int64
	// IndexCacheBytes bounds the in-memory budget for decoded SSTable indexes.
	IndexCacheBytes int64

	// RaftLagWarnSegments determines how many WAL segments a follower can lag
	// behind the active segment before stats surfaces a warning. Zero disables
	// the alert.
	RaftLagWarnSegments int64

	// EnableWALWatchdog enables the background WAL backlog watchdog which
	// surfaces typed-record warnings and optionally runs automated segment GC.
	EnableWALWatchdog bool
	// WALBufferSize controls the size of the in-memory write buffer used by
	// the WAL manager. Larger buffers reduce syscall frequency at the cost of
	// memory. NewDefaultOptions exposes a concrete default; zero is only
	// interpreted as a legacy unset value during normalization.
	WALBufferSize int
	// WALAutoGCInterval controls how frequently the watchdog evaluates WAL
	// backlog for automated garbage collection.
	WALAutoGCInterval time.Duration
	// WALAutoGCMinRemovable is the minimum number of removable WAL segments
	// required before an automated GC pass will run.
	WALAutoGCMinRemovable int
	// WALAutoGCMaxBatch bounds how many WAL segments are removed during a single
	// automated GC pass.
	WALAutoGCMaxBatch int
	// WALTypedRecordWarnRatio triggers a typed-record warning when raft records
	// constitute at least this fraction of WAL writes. Zero disables ratio-based
	// warnings.
	WALTypedRecordWarnRatio float64
	// WALTypedRecordWarnSegments triggers a typed-record warning when the number
	// of WAL segments containing raft records exceeds this threshold. Zero
	// disables segment-count warnings.
	WALTypedRecordWarnSegments int64
	// RaftPointerSnapshot returns store-local raft WAL checkpoints used by WAL
	// watchdogs, GC policy, and diagnostics. It must return a detached snapshot.
	// Nil disables raft-specific backlog accounting.
	RaftPointerSnapshot func() map[uint64]raftmeta.RaftLogPointer

	// DiscardStatsFlushThreshold controls how many discard-stat updates must be
	// accumulated before they are flushed back into the LSM. Zero keeps the
	// default threshold.
	DiscardStatsFlushThreshold int

	// NumCompactors controls how many background compaction workers are spawned.
	// Zero uses an auto value derived from the host CPU count.
	NumCompactors int
	// CompactionPolicy selects how compaction priorities are arranged.
	// Supported values: leveled, tiered, hybrid.
	CompactionPolicy CompactionPolicy
	// NumLevelZeroTables controls when write throttling kicks in and feeds into
	// the compaction priority calculation. NewDefaultOptions populates a concrete
	// default; normalizeInPlace only backfills zero-valued legacy configs.
	NumLevelZeroTables int
	// L0SlowdownWritesTrigger starts write pacing when L0 table count reaches
	// this threshold. Defaults are populated up front; zero is only interpreted
	// as a legacy unset value during normalization.
	L0SlowdownWritesTrigger int
	// L0StopWritesTrigger blocks writes when L0 table count reaches this
	// threshold. Defaults are populated up front; zero is only interpreted as a
	// legacy unset value during normalization.
	L0StopWritesTrigger int
	// L0ResumeWritesTrigger clears throttling only when L0 table count drops to
	// this threshold or lower. Defaults are populated up front; zero is only
	// interpreted as a legacy unset value during normalization.
	L0ResumeWritesTrigger int
	// CompactionSlowdownTrigger starts write pacing when max compaction score
	// reaches this value. Defaults are populated up front; zero is only
	// interpreted as a legacy unset value during normalization.
	CompactionSlowdownTrigger float64
	// CompactionStopTrigger blocks writes when max compaction score reaches this
	// value. Defaults are populated up front; zero is only interpreted as a
	// legacy unset value during normalization.
	CompactionStopTrigger float64
	// CompactionResumeTrigger clears throttling only when max compaction score
	// drops to this value or lower. Defaults are populated up front; zero is only
	// interpreted as a legacy unset value during normalization.
	CompactionResumeTrigger float64
	// IngestCompactBatchSize decides how many L0 tables to promote into the
	// ingest buffer per compaction cycle. NewDefaultOptions populates a concrete
	// default; normalizeInPlace only backfills zero-valued legacy configs.
	IngestCompactBatchSize int
	// IngestBacklogMergeScore triggers an ingest-merge task when the ingest
	// backlog score exceeds this threshold. Defaults are populated up front; zero
	// is only interpreted as a legacy unset value during normalization.
	IngestBacklogMergeScore float64

	// CompactionValueWeight adjusts how aggressively the scheduler prioritises
	// levels whose entries reference large value log payloads. Higher values
	// make the compaction picker favour levels with high ValuePtr density.
	CompactionValueWeight float64

	// CompactionValueAlertThreshold triggers stats alerts when a level's
	// value-density (value bytes / total bytes) exceeds this ratio.
	CompactionValueAlertThreshold float64

	// IngestShardParallelism caps how many ingest shards can be compacted in a
	// single ingest-only pass. A value <= 0 falls back to 1 (sequential).
	IngestShardParallelism int
}

Options holds the top-level database configuration.
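The stop/resume trigger pairs above form a hysteresis band: writes block at the stop threshold and only unblock once the backlog drains to the lower resume threshold. This is a minimal, self-contained sketch of that pattern; the `throttleState` type and the trigger values are illustrative, not NoKV's internal implementation.

```go
package main

import "fmt"

// throttleState gates writes on the L0 table count, using the stop/resume
// pair as a hysteresis band so the gate does not flap when the count
// hovers near a single threshold.
type throttleState struct {
	stopTrigger   int // e.g. Options.L0StopWritesTrigger
	resumeTrigger int // e.g. Options.L0ResumeWritesTrigger
	blocked       bool
}

// observe updates the state for the current L0 table count and reports
// whether writes should currently be blocked.
func (t *throttleState) observe(l0Tables int) bool {
	if !t.blocked && l0Tables >= t.stopTrigger {
		t.blocked = true
	} else if t.blocked && l0Tables <= t.resumeTrigger {
		t.blocked = false
	}
	return t.blocked
}

func main() {
	st := &throttleState{stopTrigger: 20, resumeTrigger: 12}
	for _, n := range []int{10, 20, 18, 12} {
		fmt.Printf("l0=%d blocked=%v\n", n, st.observe(n))
	}
}
```

Note that at `l0=18` writes stay blocked even though the count is below the stop trigger; only dropping to the resume trigger clears the gate. The same shape applies to the compaction-score triggers.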

func NewDefaultOptions

func NewDefaultOptions() *Options

NewDefaultOptions returns the default option set.
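The field comments above repeatedly note that defaults are populated up front and that zero is only backfilled as a legacy unset value during normalization. A minimal sketch of that two-step pattern, with a trimmed stand-in struct and invented default values (the real defaults live in NewDefaultOptions / normalizeInPlace):

```go
package main

import "fmt"

// opts is a trimmed-down stand-in for the real Options struct; only two
// trigger fields are shown.
type opts struct {
	L0StopWritesTrigger   int
	L0ResumeWritesTrigger int
}

// newDefaultOpts mirrors the documented behaviour of NewDefaultOptions:
// concrete defaults are populated up front. The numbers are illustrative.
func newDefaultOpts() *opts {
	return &opts{L0StopWritesTrigger: 20, L0ResumeWritesTrigger: 12}
}

// normalizeInPlace backfills zero values, which legacy configs used to
// mean "unset", with the current defaults; explicit non-zero settings
// are left untouched.
func (o *opts) normalizeInPlace() {
	def := newDefaultOpts()
	if o.L0StopWritesTrigger == 0 {
		o.L0StopWritesTrigger = def.L0StopWritesTrigger
	}
	if o.L0ResumeWritesTrigger == 0 {
		o.L0ResumeWritesTrigger = def.L0ResumeWritesTrigger
	}
}

func main() {
	legacy := &opts{L0StopWritesTrigger: 30} // resume trigger left unset
	legacy.normalizeInPlace()
	fmt.Println(legacy.L0StopWritesTrigger, legacy.L0ResumeWritesTrigger)
	// 30 12: the explicit value survives, the zero is backfilled
}
```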

type RaftLog added in v0.7.2

type RaftLog interface {
	Open(groupID uint64, meta *raftmeta.Store) (engine.PeerStorage, error)
}

RaftLog opens raft peer storage without exposing the underlying WAL manager.
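The point of the interface is encapsulation: callers obtain per-group peer storage without ever touching the WAL manager behind it. A toy sketch of the pattern, with stand-in types for engine.PeerStorage and raftmeta.Store (the real ones live in NoKV's engine and raftmeta packages):

```go
package main

import "fmt"

// peerStorage and metaStore are hypothetical stand-ins for
// engine.PeerStorage and raftmeta.Store.
type peerStorage interface {
	FirstIndex() uint64
}

type metaStore struct{}

// raftLog mirrors the shape of the RaftLog interface: one Open method per
// raft group, returning only the peer-storage abstraction.
type raftLog interface {
	Open(groupID uint64, meta *metaStore) (peerStorage, error)
}

// walBackedLog keeps its WAL state unexported, so the interface is the
// only surface callers can reach.
type walBackedLog struct {
	wal map[uint64][]byte // stand-in for the real WAL manager state
}

type walPeerStorage struct{ groupID uint64 }

func (p *walPeerStorage) FirstIndex() uint64 { return 1 }

func (l *walBackedLog) Open(groupID uint64, meta *metaStore) (peerStorage, error) {
	return &walPeerStorage{groupID: groupID}, nil
}

func main() {
	var rl raftLog = &walBackedLog{}
	ps, err := rl.Open(7, &metaStore{})
	fmt.Println(ps.FirstIndex(), err) // 1 <nil>
}
```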

type RaftStatsSnapshot added in v0.6.0

type RaftStatsSnapshot struct {
	GroupCount       int    `json:"group_count"`
	LaggingGroups    int    `json:"lagging_groups"`
	MinLogSegment    uint32 `json:"min_log_segment"`
	MaxLogSegment    uint32 `json:"max_log_segment"`
	MaxLagSegments   int64  `json:"max_lag_segments"`
	LagWarnThreshold int64  `json:"lag_warn_threshold"`
	LagWarning       bool   `json:"lag_warning"`
}

RaftStatsSnapshot summarizes raft log lag across tracked groups.
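One plausible way these fields relate: each group pins an oldest retained WAL segment, lag is the distance from the newest segment, and the warning fires when any group trails by more than the threshold. This is a hypothetical reconstruction for illustration, not NoKV's actual accounting:

```go
package main

import "fmt"

// raftStats mirrors the lag-related fields of RaftStatsSnapshot.
type raftStats struct {
	GroupCount     int
	LaggingGroups  int
	MinLogSegment  uint32
	MaxLogSegment  uint32
	MaxLagSegments int64
	LagWarning     bool
}

// summarize derives the snapshot from each group's oldest retained WAL
// segment, flagging groups that trail the newest segment by more than
// warnThreshold segments.
func summarize(groupSegments map[uint64]uint32, warnThreshold int64) raftStats {
	s := raftStats{GroupCount: len(groupSegments)}
	first := true
	for _, seg := range groupSegments {
		if first || seg < s.MinLogSegment {
			s.MinLogSegment = seg
		}
		if first || seg > s.MaxLogSegment {
			s.MaxLogSegment = seg
		}
		first = false
	}
	for _, seg := range groupSegments {
		lag := int64(s.MaxLogSegment) - int64(seg)
		if lag > s.MaxLagSegments {
			s.MaxLagSegments = lag
		}
		if lag > warnThreshold {
			s.LaggingGroups++
		}
	}
	s.LagWarning = s.MaxLagSegments > warnThreshold
	return s
}

func main() {
	// Group 2 is pinned six segments behind the newest; threshold is 3.
	s := summarize(map[uint64]uint32{1: 10, 2: 4, 3: 9}, 3)
	fmt.Printf("%+v\n", s)
}
```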

type RangeFilterStatsSnapshot added in v0.7.1

type RangeFilterStatsSnapshot struct {
	PointCandidates   uint64 `json:"point_candidates"`
	PointPruned       uint64 `json:"point_pruned"`
	BoundedCandidates uint64 `json:"bounded_candidates"`
	BoundedPruned     uint64 `json:"bounded_pruned"`
	Fallbacks         uint64 `json:"fallbacks"`
}

RangeFilterStatsSnapshot summarizes range-filter pruning activity on read paths.

type RegionStatsSnapshot added in v0.6.0

type RegionStatsSnapshot struct {
	Total     int64 `json:"total"`
	New       int64 `json:"new"`
	Running   int64 `json:"running"`
	Removing  int64 `json:"removing"`
	Tombstone int64 `json:"tombstone"`
	Other     int64 `json:"other"`
}

RegionStatsSnapshot reports region counts grouped by region state.

type Stats

type Stats struct {
	// contains filtered or unexported fields
}

Stats owns periodic runtime metric collection and snapshot publication.

func (*Stats) SetRegionMetrics

func (s *Stats) SetRegionMetrics(rm *metrics.RegionMetrics)

SetRegionMetrics attaches the region metrics recorder used in snapshots.

func (*Stats) Snapshot

func (s *Stats) Snapshot() StatsSnapshot

Snapshot returns a point-in-time metrics snapshot without mutating state.

func (*Stats) StartStats

func (s *Stats) StartStats()

StartStats runs periodic collection of internal backlog metrics.

type StatsSnapshot

type StatsSnapshot struct {
	Entries    int64                             `json:"entries"`
	Flush      FlushStatsSnapshot                `json:"flush"`
	Compaction CompactionStatsSnapshot           `json:"compaction"`
	ValueLog   ValueLogStatsSnapshot             `json:"value_log"`
	WAL        WALStatsSnapshot                  `json:"wal"`
	Raft       RaftStatsSnapshot                 `json:"raft"`
	Write      WriteStatsSnapshot                `json:"write"`
	Region     RegionStatsSnapshot               `json:"region"`
	Hot        HotStatsSnapshot                  `json:"hot"`
	Cache      CacheStatsSnapshot                `json:"cache"`
	LSM        LSMStatsSnapshot                  `json:"lsm"`
	Transport  transportpkg.GRPCTransportMetrics `json:"transport"`
	Redis      metrics.RedisSnapshot             `json:"redis"`
}

StatsSnapshot captures a point-in-time view of internal backlog metrics.

type ValueLogStatsSnapshot added in v0.6.0

type ValueLogStatsSnapshot struct {
	Segments       int                        `json:"segments"`
	PendingDeletes int                        `json:"pending_deletes"`
	DiscardQueue   int                        `json:"discard_queue"`
	Heads          map[uint32]kv.ValuePtr     `json:"heads,omitempty"`
	GC             metrics.ValueLogGCSnapshot `json:"gc"`
}

ValueLogStatsSnapshot reports value-log segment status and GC counters.

type WALStatsSnapshot added in v0.6.0

type WALStatsSnapshot struct {
	ActiveSegment           int64             `json:"active_segment"`
	SegmentCount            int64             `json:"segment_count"`
	ActiveSize              int64             `json:"active_size"`
	SegmentsRemoved         uint64            `json:"segments_removed"`
	RecordCounts            wal.RecordMetrics `json:"record_counts"`
	SegmentsWithRaftRecords int               `json:"segments_with_raft_records"`
	RemovableRaftSegments   int               `json:"removable_raft_segments"`
	TypedRecordRatio        float64           `json:"typed_record_ratio"`
	TypedRecordWarning      bool              `json:"typed_record_warning"`
	TypedRecordReason       string            `json:"typed_record_reason,omitempty"`
	AutoGCRuns              uint64            `json:"auto_gc_runs"`
	AutoGCRemoved           uint64            `json:"auto_gc_removed"`
	AutoGCLastUnix          int64             `json:"auto_gc_last_unix"`
}

WALStatsSnapshot captures WAL head position, record mix, and watchdog status.
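One plausible reading of TypedRecordRatio and TypedRecordWarning: the share of WAL records carrying a known type tag, with the warning raised when the ratio falls below a floor. The interpretation and the 0.5 floor here are assumptions for illustration:

```go
package main

import "fmt"

// typedRatio sketches how a typed-record ratio and its warning flag could
// be derived from running record counts. The 0.5 floor is invented.
func typedRatio(typed, total uint64) (ratio float64, warn bool) {
	if total == 0 {
		return 1, false // an empty WAL has nothing to warn about
	}
	ratio = float64(typed) / float64(total)
	return ratio, ratio < 0.5
}

func main() {
	r, w := typedRatio(30, 100)
	fmt.Println(r, w) // 0.3 true
}
```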

type WriteStatsSnapshot added in v0.6.0

type WriteStatsSnapshot struct {
	QueueDepth       int64   `json:"queue_depth"`
	QueueEntries     int64   `json:"queue_entries"`
	QueueBytes       int64   `json:"queue_bytes"`
	AvgBatchEntries  float64 `json:"avg_batch_entries"`
	AvgBatchBytes    float64 `json:"avg_batch_bytes"`
	AvgRequestWaitMs float64 `json:"avg_request_wait_ms"`
	AvgValueLogMs    float64 `json:"avg_vlog_ms"`
	AvgApplyMs       float64 `json:"avg_apply_ms"`
	AvgSyncMs        float64 `json:"avg_sync_ms"`
	AvgSyncBatch     float64 `json:"avg_sync_batch"`
	SyncCount        int64   `json:"sync_count"`
	BatchesTotal     int64   `json:"batches_total"`
	ThrottleActive   bool    `json:"throttle_active"`
	SlowdownActive   bool    `json:"slowdown_active"`
	ThrottleMode     string  `json:"throttle_mode"`
	ThrottlePressure uint32  `json:"throttle_pressure_permille"`
	ThrottleRate     uint64  `json:"throttle_rate_bytes_per_sec"`
	HotKeyLimited    uint64  `json:"hot_key_limited"`
}

WriteStatsSnapshot tracks write-path queue pressure, latency, and throttling.
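The Avg* fields suggest running totals divided by a batch count. A minimal sketch of that arithmetic; the function and its guard against zero batches are an assumption, not NoKV's actual accounting:

```go
package main

import "fmt"

// batchAverages derives per-batch averages from running totals, echoing
// the AvgBatchEntries / AvgBatchBytes fields.
func batchAverages(totalEntries, totalBytes, batches int64) (avgEntries, avgBytes float64) {
	if batches == 0 {
		return 0, 0 // avoid division by zero before any batch completes
	}
	return float64(totalEntries) / float64(batches), float64(totalBytes) / float64(batches)
}

func main() {
	e, b := batchAverages(1200, 480000, 300)
	fmt.Println(e, b) // 4 1600
}
```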

Directories

Path Synopsis
cmd
nokv command
nokv-config command
nokv-redis command
file	Package file provides low-level file and mmap primitives shared by WAL, vlog, and SST layers.
lsm
manifest	Package manifest persists storage-engine metadata such as SST layout, WAL replay position, and value-log state.
pd
tso
raftstore
kv
vfs	Package vfs provides a tiny filesystem abstraction and fault-injection wrapper.
vlog	Package vlog implements the value-log segment manager and IO helpers.
wal	Package wal implements the write-ahead log manager and replay logic.
