
Conversation

Contributor

@yaga-simha yaga-simha commented Sep 18, 2025

User description

PR Type

Enhancement / Fix

Description

This PR is the first of several changes that will improve concurrent access to the services and locks within the system. It addresses infra::local_lock, which offers a global lock to control access to a single file. The old design used a Lazy<RwLock<std::collections::HashMap<String, (Arc<String>, Arc<Mutex<bool>>)>>>, which can be replaced with a Lazy<scc::HashMap<String, ()>> that does an equivalent job.
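For concreteness, a minimal sketch of the two shapes (the old value type is reconstructed from the LockHolder fields described in the review summary below; a sketch, not the exact diff):

use std::collections::HashMap;
use std::sync::Arc;
use once_cell::sync::Lazy;
use tokio::sync::{Mutex, RwLock};

// Old shape: an outer RwLock guarding a map of per-key lock state.
static OLD_LOCKER: Lazy<RwLock<HashMap<String, (Arc<String>, Arc<Mutex<bool>>)>>> =
    Lazy::new(|| RwLock::new(HashMap::new()));

// New shape: a single concurrent map; the entry guard itself is the lock.
static LOCAL_LOCKER: Lazy<scc::HashMap<String, ()>> = Lazy::new(|| scc::HashMap::new());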


PR Type

Enhancement, Bug fix


Description

  • Replace global lock map with scc::HashMap

  • Simplify LockHolder and locking API

  • Update tests to new locking semantics

  • Add scc dependency in workspace


Diagram Walkthrough

flowchart LR
  OLD["Lazy<RwLock<HashMap<String, LockHolder>>>"] -- replaced by --> NEW["Lazy<scc::HashMap<String, ()>>"]
  APIOLD["lock(key) -> Result<LockHolder> + guard"] -- simplified to --> APINEW["lock(key) -> OccupiedEntry"]
  TESTS["Tests"] -- adapted to --> NEW
  DEPS["Cargo.toml"] -- add --> SCC["scc dependency"]

File Walkthrough

Relevant files
Enhancement
local_lock.rs
Switch to scc map and simplify locking                                     

src/infra/src/local_lock.rs

  • Replace Lazy<RwLock<HashMap<String, LockHolder>>> with Lazy<scc::HashMap<String, ()>>
  • Remove LockHolder internals and async Mutex
  • Change lock() to return OccupiedEntry via entry_async
  • Update tests to use new API and lengths
+10/-47 
schema.rs
Adapt schema handling to new local lock                                   

src/service/schema.rs

  • Update schema diff path to new lock API
  • Remove Result handling and Mutex guard acquisition
+1/-2     
Dependencies
Cargo.toml
Add scc dependency to root manifest                                           

Cargo.toml

  • Add scc dependency with equivalent feature
+1/-0     
Cargo.toml
Include scc in infra crate dependencies                                   

src/infra/Cargo.toml

  • Add scc as a workspace dependency
+1/-0     

@github-actions github-actions bot added ☢️ Bug Something isn't working Review effort 2/5 labels Sep 18, 2025
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Potential Deadlock

Returning an scc OccupiedEntry as the guard may cause tasks to hold the entry lock longer than intended if they perform awaited operations while holding it. Verify that scc entry guards are safe to hold across await points and that the critical section stays minimal, or drop the guard before awaits where appropriate (see the sketch after the snippet below).

pub async fn lock(key: &str) -> OccupiedEntry<'_, String, ()> {
    LOCAL_LOCKER
        .entry_async(key.to_string())
        .await
        .or_insert_with(|| ())
}
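One way to keep the critical section minimal, sketched against the lock() above (rewrite_file and the tokio::fs::write call are stand-ins for illustration, not code from this PR):

async fn rewrite_file(key: &str, bytes: &[u8]) -> std::io::Result<()> {
    let guard = lock(key).await;
    // ... synchronous critical work protected by `key` ...
    drop(guard); // release the entry lock before long awaited operations
    // Awaited work that no longer needs the lock.
    tokio::fs::write(key, bytes).await
}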
Unused Type

The new LockHolder struct is empty and unused. Consider removing it or implementing intended behavior to avoid confusion.

#[derive(Clone)]
pub struct LockHolder;
Map Growth

Entries are inserted but never explicitly removed, relying on guard drop semantics only. Confirm that dropping the OccupiedEntry removes the entry; otherwise, LOCAL_LOCKER may grow unbounded across distinct keys.

pub async fn lock(key: &str) -> OccupiedEntry<'_, String, ()> {
    LOCAL_LOCKER
        .entry_async(key.to_string())
        .await
        .or_insert_with(|| ())
}
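If dropping the guard turns out to only release the lock while leaving the key behind, a hypothetical unlock helper could reclaim the slot, since OccupiedEntry::remove_entry consumes the guard:

fn unlock(guard: scc::hash_map::OccupiedEntry<'_, String, ()>) {
    // Removing the entry releases the lock and deletes the key,
    // so LOCAL_LOCKER does not keep one slot per distinct key forever.
    let _ = guard.remove_entry();
}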

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR refactors the local file locking mechanism to improve concurrent access patterns by replacing a complex nested locking structure with the scc (scalable concurrent containers) library. The changes span across four files:

Core Infrastructure Changes:

  • src/infra/src/local_lock.rs: Replaces the original Lazy<RwLock<HashMap<String, LockHolder>>> structure with Lazy<scc::HashMap<String, ()>>. The old LockHolder contained Arc<String> and Arc<Mutex<bool>> with explicit lock acquisition methods, while the new implementation returns an OccupiedEntry directly from scc::HashMap that acts as the lock guard through its lifetime.

  • src/infra/Cargo.toml and Cargo.toml: Add the scc dependency to enable the new concurrent data structure.

Service Layer Changes:

  • src/service/schema.rs: Updates the schema update locking to use the simplified API, changing from a two-step process (let local_lock = infra::local_lock::lock().await?; let _guard = local_lock.lock().await) to a single call (let _guard = infra::local_lock::lock().await).
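Side by side, the call-site change that bullet describes (a sketch based on the summary; the cache_key argument is assumed from the lock(key) signature shown earlier):

// Before: two-step acquisition returning a Result.
let local_lock = infra::local_lock::lock(&cache_key).await?;
let _guard = local_lock.lock().await;

// After: a single call; the returned entry guard is the lock.
let _guard = infra::local_lock::lock(&cache_key).await;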

Design Rationale:
The original implementation had what the developer describes as "anti-concurrent design patterns" with multiple layers of synchronization: an outer RwLock protecting a HashMap containing LockHolder structs, each with their own Arc<Mutex<bool>>. This created unnecessary contention and complexity. The new approach leverages scc::HashMap's lock-free concurrent design, where the returned OccupiedEntry provides the locking semantics through RAII (Resource Acquisition Is Initialization) - the lock is held for the lifetime of the entry and automatically released when dropped.
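A sketch of that RAII behavior, assuming (as the PR does) that scc entry guards exclude concurrent entry_async calls for the same key, and reusing the lock() shown earlier:

async fn demo() {
    let a = tokio::spawn(async {
        let guard = lock("stream.json").await; // takes the entry lock
        tokio::time::sleep(std::time::Duration::from_millis(10)).await;
        drop(guard); // released here, when the guard is dropped
    });
    let b = tokio::spawn(async {
        // Resolves only once task `a` drops its guard for the same key.
        let _guard = lock("stream.json").await;
    });
    let _ = tokio::join!(a, b);
}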

This change fits into the broader codebase by simplifying the fundamental locking primitive used throughout OpenObserve for controlling file access across various services. The infra::local_lock module is used extensively for ensuring thread-safe operations on shared resources, particularly in data ingestion and schema management workflows.

Confidence score: 2/5

  • This PR has significant implementation issues that could cause production problems
  • Score reflects fundamental flaws in the locking semantics and lifetime management of the new approach
  • Critical attention needed on local_lock.rs implementation and all files using the changed API

4 files reviewed, 1 comment


src/infra/src/local_lock.rs (diff context for the inline comment below):

};
drop(w);
Ok(locker)
pub async fn lock(key: &str) -> OccupiedEntry<'_, String, ()> {
Contributor


logic: Return type changed from Result<LockHolder> to OccupiedEntry; this is a breaking API change that will require all callers to be updated

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Ensure real mutual exclusion

Returning OccupiedEntry without holding it across the critical section does not
provide mutual exclusion; _guard is unused and drops immediately, allowing all tasks
to proceed concurrently. Use the entry handle itself as the lock guard and keep it
alive for the duration of the critical section. Provide a RAII guard type that holds
the entry and releases it on drop.

src/infra/src/local_lock.rs [19-29]

 static LOCAL_LOCKER: Lazy<HashMap<String, ()>> = Lazy::new(|| HashMap::new());
 
-#[derive(Clone)]
-pub struct LockHolder;
-
-pub async fn lock(key: &str) -> OccupiedEntry<'_, String, ()> {
-    LOCAL_LOCKER
-        .entry_async(key.to_string())
-        .await
-        .or_insert_with(|| ())
+pub struct LocalLockGuard<'a> {
+    entry: OccupiedEntry<'a, String, ()>,
 }
 
+impl<'a> Drop for LocalLockGuard<'a> {
+    fn drop(&mut self) {
+        // Remove the key to allow the next waiter to acquire it.
+        let _ = self.entry.remove_entry();
+    }
+}
+
+pub async fn lock(key: &str) -> LocalLockGuard<'_> {
+    // Spin until we can insert the key and obtain an occupied entry.
+    loop {
+        match LOCAL_LOCKER.entry_async(key.to_string()).await {
+            scc::hash_map::Entry::Occupied(_) => {
+                // Someone else holds the lock; yield and retry.
+                tokio::task::yield_now().await;
+                continue;
+            }
+            scc::hash_map::Entry::Vacant(v) => {
+                let _ = v.insert(());
+                // Re-acquire as occupied to hold it.
+                if let scc::hash_map::Entry::Occupied(entry) =
+                    LOCAL_LOCKER.entry_async(key.to_string()).await
+                {
+                    return LocalLockGuard { entry };
+                }
+            }
+        }
+    }
+}
+
Suggestion importance[1-10]: 8


Why: Correctly points out that returning an OccupiedEntry and dropping it immediately (as in tests) provides no mutual exclusion and suggests a RAII guard to hold the entry, addressing a critical concurrency bug. The improved code is conceptually sound for using the entry as a lock, though it introduces nontrivial changes.

Impact: Medium
Prevent lock key leakage

The current approach leaks keys because nothing removes the entry after use, causing
LOCAL_LOCKER.len() to grow unbounded and future callers to always see an occupied
entry. Ensure the lock releases by removing the key when the guard is dropped.
Provide a dedicated guard that removes the key on drop.

src/infra/src/local_lock.rs [24-29]

-pub async fn lock(key: &str) -> OccupiedEntry<'_, String, ()> {
-    LOCAL_LOCKER
-        .entry_async(key.to_string())
-        .await
-        .or_insert_with(|| ())
+pub struct LocalLockGuard<'a> {
+    key: String,
+    _marker: std::marker::PhantomData<&'a ()>,
 }
 
+impl<'a> Drop for LocalLockGuard<'a> {
+    fn drop(&mut self) {
+        let _ = LOCAL_LOCKER.remove(&self.key);
+    }
+}
+
+pub async fn lock(key: &str) -> LocalLockGuard<'_> {
+    let key_owned = key.to_string();
+    // Acquire by inserting when absent; retry until success.
+    loop {
+        if LOCAL_LOCKER
+            .entry_async(key_owned.clone())
+            .await
+            .or_insert_with(|| ())
+            .is_vacant()
+        {
+            continue;
+        } else {
+            // We successfully inserted (or saw existing); ensure we only proceed on first insertion.
+            if LOCAL_LOCKER.read(&key_owned, |_, _| ()).is_some() {
+                return LocalLockGuard {
+                    key: key_owned,
+                    _marker: std::marker::PhantomData,
+                };
+            }
+        }
+        tokio::task::yield_now().await;
+    }
+}
+
Suggestion importance[1-10]: 7


Why: Accurately identifies that entries are never removed, which can leak keys and break future locking, and proposes a guard removing the key on drop. However, the provided implementation has questionable use of is_vacant() and semantics that may not compile as written.

Impact: Medium

Category: General
Hold guard through critical work

The _guard must be held across the entire critical section; ensure it is not dropped
early by limiting the critical section to a clear scope. Wrap the guarded operations
in a scope block so that the lock is released deterministically after the protected
work.

src/service/schema.rs [249-250]

 let cache_key = format!("{org_id}/{stream_type}/{stream_name}");
-let _guard = infra::local_lock::lock(&cache_key).await;
+{
+    let _guard = infra::local_lock::lock(&cache_key).await;
+    // guarded operations follow...
+    // check if the schema has been updated by another thread
+    let read_cache = STREAM_SCHEMAS_LATEST.read().await;
+    if let Some(updated_schema) = read_cache.get(&cache_key) {
+        // ...
+    }
+    // rest of the critical section...
+}
Suggestion importance[1-10]: 6


Why: Emphasizes keeping the lock guard alive across the critical section, which is important given the new API returns a handle that must be held; the scoped block makes the lifetime explicit. It's a reasonable correctness/readability improvement but moderate in impact.

Impact: Low

@yaga-simha yaga-simha force-pushed the fix/concurrent-local-file-locking branch 2 times, most recently from 76c65ca to ec90dc0 Compare September 18, 2025 12:50
@yaga-simha yaga-simha force-pushed the fix/concurrent-local-file-locking branch 6 times, most recently from f20e3df to 3980b4c Compare September 23, 2025 11:39
@yaga-simha yaga-simha force-pushed the fix/concurrent-local-file-locking branch from da41592 to 712bc71 Compare September 24, 2025 10:53
@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/concurrent-local-file-locking | Commit: 6b8b26b

Testdino Test Results

Status: All tests passed · Total: 376 · Passed: 351 · Failed: 0 · Skipped: 21 · Flaky: 4 · Pass Rate: 93% · Duration: 4m 30s

View Detailed Results

