Newest 'distributed-computing' Questions

Advice

0 votes

2 replies

57 views

How to make devices discover each other using WIFI

What's the best way to allow programs to discover each other on the network? Let's say we are writing a system that tracks the usage of computers over the network. We have an agent program that sends ...

Isembart

13

asked Dec 10 at 20:02

Advice

2 votes

2 replies

59 views

Efficient MPI Parallelization Strategies for Localized PDE Subproblems within a Globally Decomposed Domain

I am working on a global PDE problem that is solved using a standard domain-decomposition strategy (e.g., Scotch, METIS). This part of the computation is well balanced across all MPI processes. ...

hrx71

1

asked Dec 6 at 12:46

0 votes

1 answer

44 views

Upsert! Operation Throws "A table can't contain duplicate column names" Error

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...

RORO

1

asked Oct 24 at 9:52

0 votes

0 answers

157 views

vLLM + Ray multi-node tensor-parallel deployment completely blocked by pending placement groups and raylet heartbeat failures

Environment: Ray version: 2.x; vLLM version: 0.9.2; Python version: 3.9; OS / Container base: Linux (CentOS-based UBI8 in Kubernetes); Cloud / Infrastructure: AWS-based Kubernetes cluster (pods scheduled ...

NullUser

9

asked Aug 5 at 17:38

3 votes

1 answer

139 views

In Apache Ignite, the Replication mode and Partition mode do not work together

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SQLFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...

kushal Baldev

799

asked Jul 29 at 17:31

0 votes

0 answers

64 views

Get two different nodes to access and distribute the same SQL table in Apache spark?

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and it got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as Master. I ran the ...

Rick C. Ferreira

1

asked Jun 16 at 19:25

3 votes

2 answers

284 views

How Ray async actors handle calls to sync methods

I'm working with Ray async actors and I want to understand exactly what happens—at a deep technical level—when a synchronous method is called on such an actor. I know that calling a synchronous method ...

hegash

893

asked May 26 at 11:00

0 votes

0 answers

52 views

How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...

Bilal Jamil

27

asked Apr 30 at 2:51

0 votes

0 answers

154 views

How to set up MS-MPI multi-machine communication between two Windows 11 systems?

I'm trying to set up a multi-machine communication environment using MS-MPI on two Windows 11 laptops, but I'm encountering some issues. Here are the details of my setup: Environment Details: ...

user29094781

1

asked Apr 5 at 6:29

1 vote

1 answer

135 views

Distributed REST API Calls using SPARK with maintaining consistency

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...

uds0128

53

asked Mar 2 at 18:42

0 votes

0 answers

21 views

MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?

I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...

Syahel Razaba

1

asked Feb 16 at 22:23

0 votes

0 answers

336 views

PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...

yunjeong

1

asked Jan 31 at 7:19

0 votes

1 answer

983 views

Clearing Cached Data on Databricks Cluster

The problem I am facing is that my "used" memory is only around 16GB, however the cached memory takes up so much space, that I am forced to use a compute with higher memory (64GB). So I ...

Manav Karthikeyan

53

asked Jan 17 at 14:31

1 vote

0 answers

96 views

Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...

TGD

56

asked Jan 13 at 7:42

0 votes

1 answer

117 views

I want to use the distributed package in PyTorch for point-to-point communication between two ranks, but it fails with an error

def runTpoly(rank, size, pp, cs, pkArithmetics_evals, pkSelectors_evals, domain): init_process(rank, size) group2 = torch.distributed.new_group([1,2]) if rank == 0: device ...

wynne yin

1

asked Jan 7 at 10:59

0 votes

0 answers

82 views

Vertex AI Reduction Server returning 500 Internal Error

I am looking to finetune a pre-trained deberta model on Vertex AI with pytorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...

purpleFudge

1

asked Jan 1 at 14:59

0 votes

0 answers

66 views

How to develop multi-GPU modules in single-node single-GPU system in pytorch?

I'm developing a multi-GPU PyTorch application. Existing methods like scatter/gather in torch.distributed don't fulfill my requirements, therefore I need to develop forward/backprop steps which send ...

mirrortower

1

asked Dec 27, 2024 at 14:10

1 vote

0 answers

36 views

Distributed training with Trainer and ConstantLengthDataset classes

I have a custom ConstantLengthDataset class: class ConstantLengthDataset(IterableDataset): def __init__( self, tokenizer, dataset, infinite=False, ...

имя

11

asked Dec 25, 2024 at 16:25

0 votes

2 answers

65 views

Interrupting busy worker process

I am trying to benchmark an algorithm I developed. To this end, I run the algorithm on several instances and measure time, memory, other numbers... For each instance, I create a new process; partly ...

Bubaya

893

asked Dec 23, 2024 at 17:54

0 votes

1 answer

41 views

How to maintain synchronization between distributed python processes?

I have a number of workstations that run long processes containing sequences like this: x = wait_while_current_is_set y = read_voltage z = z + y The workstations must maintain synchronization with a ...

david

2,706

asked Dec 1, 2024 at 6:40

0 votes

1 answer

498 views

Writing to a delta table spark 3.5.3 delta lake 3.2.0

It seems I'm unable to write using the delta format from my spark job, but I'm not sure what I'm missing. I'm using spark 3.5.3 and deltalake 3.2.0. My error: Exception in thread "main" org....

William

141

asked Nov 26, 2024 at 23:00

0 votes

1 answer

232 views

How to configure Ray cluster to utilize the Full Capacity of Databricks Cluster

I have a Databricks cluster configured with a minimum of 1 worker and a maximum of 4 workers, with auto-scaling enabled. What should my Ray configuration (setup_ray_cluster) be to fully utilize the ...

question.it

3,018

asked Nov 8, 2024 at 4:40

0 votes

0 answers

383 views

Cuda failure ‘named symbol not found’ when run on 4 L4 GPUs

Can someone help me with the following error. The code works fine on the 2 T4 GPUs. But fails when run on the 4 L4 GPUs. I am extending the Gemma 2B model for a multi-label multi-class classification ...

Rakesh Jarupula

21

asked Oct 28, 2024 at 13:39

1 vote

0 answers

92 views

function to count maximum number of co-occurring entities for combination of 3.5 million ids?

Krithish Goli

11

asked Sep 27, 2024 at 18:53

1 vote

0 answers

22 views

Why am I getting "TXN_REQUEST_IGNORED ERROR 10906" in GridDB due to an unknown event during cluster operations?

I’m using GridDB for a distributed database setup and recently encountered the following error while performing operations across nodes in the cluster: from griddb_python import StoreFactory, ...

Samar Mohamed

71

asked Sep 25, 2024 at 21:39

0 votes

1 answer

73 views

Fault-tolerant queue-worker architecture in Kafka?

I am new to using queue-worker architectures and I'm interested in how to make them resilient to a worker failing. For example: we have a pool of workers Alpha that put entries onto queue A. Then the ...

Lubed Up Slug

178

asked Sep 24, 2024 at 21:08

2 votes

2 answers

226 views

Client request failure in raft

Imagine a 3-node Raft cluster. Each node is in sync and has log [1,2,3], and entry 3 is committed by the leader. Now the leader receives an entry 4 but fails to commit it because of an unreliable network and ...

Dumb_Pegasus

129

asked Sep 15, 2024 at 11:04

2 votes

0 answers

36 views

Why am I getting "SYNC_LOG_NOT_FOUND ERROR 20021" in GridDB during cluster synchronization?

I'm working with GridDB to manage a distributed cluster, and recently I’ve been encountering the following error during data synchronization across nodes in the cluster: 20021 SYNC_LOG_NOT_FOUND ERROR ...

Samar Mohamed

71

asked Sep 15, 2024 at 5:48

0 votes

1 answer

144 views

Apache ignite compute broadcast

I'm trying Apache Ignite and must say the Ignite documentation is incomplete. Anyway, I've set up a two-node cluster using Docker images 2.14.0-arm14 and exposed all ports for both Ignite containers, however ...

JUser

196

asked Sep 12, 2024 at 9:15

1 vote

0 answers

81 views

Dask erring on GridSearchCV and RandomizedSearchCV

I am trying to train an xgboost model using dask. I already have the data transformed and have prepared my data as follows: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, ...

joao pereira

65

asked Aug 26, 2024 at 1:04

2 votes

0 answers

432 views

How to Implement Distributed AutoFAISS with PySpark for Large-Scale Vector Indexing?

I’m working on a project that involves creating a vector search index for a massive dataset consisting of 1.3 trillion tokens. I want to use AutoFAISS in a distributed environment to handle the scale ...

Cauder

2,759

asked Aug 20, 2024 at 17:14

1 vote

1 answer

57 views

run a function on different nodes of a slurm cluster for different parameters

How do I call the function inner from outer, such that each call to inner runs on a different node? That is, for ij = 1, it runs on node 1 using all of its 16 cores, for ij = 2, it runs on node 2 ...

evening silver fox

135

asked Aug 15, 2024 at 12:21

0 votes

1 answer

113 views

Do I need Apache Spark to execute my Airflow DAG tasks?

I have a workflow with multiple DAGs. Every DAG has multiple tasks. These tasks are simple ETL tasks. It involves geo data in the form of kmls, csvs. An example task: We have meta data of road ...

ShariqHameed

1

asked Aug 11, 2024 at 15:46

0 votes

1 answer

56 views

Reliable protocol that guarantees complete delivery, with no in-order promise

The sender is sending N packets to receiver. I want a protocol or method that guarantees delivery, each packet is received at least once. It is ok if some packets are received more than once due to ...

Yufei Zheng

15

asked Jul 11, 2024 at 9:36

0 votes

0 answers

35 views

How to securely conduct lottery-like draws with guaranteed randomness without auditing?

Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing? Are there any libraries to do this? I searched the web ...

aguiadouro

233

asked Jun 24, 2024 at 12:51

0 votes

2 answers

552 views

Unable to run code on Multiple GPUs in PyTorch - Usage shows only 1 GPU is being utilized

I am training a Transformer Encoder-Decoder based model for Text summarization. The code works without any errors but uses only 1 GPU when checked with nvidia-smi. However, I want to run it on all the ...

Abid Meraj

11

asked Jun 19, 2024 at 5:45

1 vote

2 answers

958 views

How to reliably implement fan out write pattern?

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a twitter clone (in cassandra and nodejs). So, user A has 500k followers. When user A posts a ...

InglouriousBastard

55

asked May 27, 2024 at 13:38

1 vote

1 answer

418 views

Using torchrun with AWS sagemaker estimator on multi-GPU node

I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into a similar issue described here with significant slowdowns in training time. I understand now that I should run ...

probably45

23

asked May 24, 2024 at 19:25

2 votes

0 answers

26 views

Why am I getting a "LM_WRITE_LOG_FAILED ERROR 80000" in GridDB when writing to the log file?

I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations: 80000 LM_WRITE_LOG_FAILED ERROR Writing to log file failed. ...

omar esawy

91

asked May 24, 2024 at 12:03

2 votes

0 answers

25 views

Why am I getting a "JC_CONTAINER_NOT_OPENED ERROR 145034" in GridDB when performing operations on a container?

I'm working with GridDB to manage a distributed database and recently encountered the following error while performing operations on a container: 145034 JC_CONTAINER_NOT_OPENED ERROR Status check of ...

omar esawy

91

asked May 21, 2024 at 9:16

2 votes

0 answers

45 views

Why am I getting a "SYNC_CREATE_CONTEXT_FAILED ERROR 20037" during data synchronization in my GridDB cluster?

I'm working on a distributed system where I need to synchronize data across a cluster of nodes. However, I'm encountering an error during the synchronization process. The error message I get is: 20037 ...

omar esawy

91

asked May 17, 2024 at 11:03

3 votes

1 answer

77 views

Async leader election in unrooted spanning tree declares multiple winners

I am trying to implement the algorithm described in the image, using MPI. It is part of a University project where we are building a distributed satellite to ground station communication system. I ...

Stelios Papamichail

1,353

asked May 15, 2024 at 9:27

1 vote

0 answers

106 views

How to properly clean up non-serializable states associated with a Ray object?

Suppose I have a Ray actor that can create a Ray object that associates with some non-serializable states. In the following example, the non-serializable state is a temporary directory. class MyObject:...

Yang Bo

3,773

asked May 2, 2024 at 18:19

1 vote

1 answer

124 views

Error: invalid child spec in supervisor start_child function

In my Erlang application I have a top-level supervisor that monitors a cowboy server (gen_server): start_link() -> supervisor:start_link({local, ?SERVER}, ?MODULE, []). init([]) -> ...

salbh

71

asked May 2, 2024 at 18:01

2 votes

0 answers

104 views

How does XGBoost aggregate models being trained in a distributed fashion across n machines?

I am trying to understand how XGBoost distributed training works. The best explanation I've found so far is in this paper: https://ml-pai-learn.oss-cn-beijing.aliyuncs.com/%E6%9C%BA%E5%99%A8%E5%AD%A6%...

Altamash Rafiq

359

asked May 1, 2024 at 21:55

0 votes

2 answers

118 views

Why does decreasing partition count prevent a StageFailure due to large size of serialized results?

I am trying to generate a randomly sorted version of a large-ish dataframe on databricks. My go-to code is to use .orderBy(rand()) on the dataframe. This, however, seems to trigger a SparkException ...

Felipe

11.9k

asked Apr 23, 2024 at 0:06

0 votes

1 answer

168 views

Micrometer & Prometheus with Java subprocesses that can't expose HTTP

I have a complex product that runs like this: a parent Java process exposes an HTTP service. The parent process starts worker subprocesses (new JVMs) and manages their lifecycle. Worker ...

Joey Liu

510

asked Mar 28, 2024 at 7:16

0 votes

1 answer

110 views

Least Connection Load balancing using Grpc

least_connection.proto code Node overloaded -- starting load balancing process Traceback (most recent call last): File "D:\lab7p2\least connection\node2.py", line 73, in <module> node....

Yash Pahlani

3

asked Mar 25, 2024 at 17:23

1 vote

1 answer

2k views

How to debug ValueError: `FlatParameter` requires uniform dtype but got torch.float32 and torch.bfloat16?

I'm trying to do Pytorch Lightning Fabric distributed FSDP training with Huggingface PEFT LORA fine tuning on LLAMA 2 but my code ends up failing with: `FlatParameter` requires uniform dtype but got ...

JobHunter69

2,376

asked Mar 22, 2024 at 17:36

0 votes

1 answer

53 views

Load pre-training parameters trained on a single GPU on multi GPUS on a single machine

I tried to load the pre-training parameters trained by a single GPU on a single machine with multiple GPUs, but errors such as Missing keys and Unexpected keys occurred. backbone_cfg = dict( ...

Mingshuai Zhao

1

asked Mar 18, 2024 at 13:51
