Advice
0 votes
2 replies
57 views

What's the best way to allow programs to discover each other on the network? Let's say we are writing a system that tracks the usage of computers over the network. We have an agent program that sends ...
Isembart
Advice
2 votes
2 replies
59 views

I am working on a global PDE problem that is solved using a standard domain-decomposition strategy (e.g., Scotch, METIS). This part of the computation is well balanced across all MPI processes. ...
hrx71
  • 1
0 votes
1 answer
44 views

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...
RORO
  • 1
0 votes
0 answers
157 views

Environment: Ray version: 2.x; vLLM version: 0.9.2; Python version: 3.9; OS / Container base: Linux (CentOS-based UBI8 in Kubernetes); Cloud / Infrastructure: AWS-based Kubernetes cluster (pods scheduled ...
NullUser
3 votes
1 answer
139 views

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SQLFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...
kushal Baldev
0 votes
0 answers
64 views

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as the Master. I ran the ...
Rick C. Ferreira
3 votes
2 answers
284 views

I'm working with Ray async actors and I want to understand exactly what happens—at a deep technical level—when a synchronous method is called on such an actor. I know that calling a synchronous method ...
hegash
  • 893
0 votes
0 answers
52 views

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
Bilal Jamil
0 votes
0 answers
154 views

I'm trying to set up a multi-machine communication environment using MS-MPI on two Windows 11 laptops, but I'm encountering some issues. Here are the details of my setup: Environment Details: ...
user29094781
1 vote
1 answer
135 views

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...
uds0128
  • 53
0 votes
0 answers
21 views

I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...
Syahel Razaba
0 votes
0 answers
336 views

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...
yunjeong
0 votes
1 answer
983 views

The problem I am facing is that my "used" memory is only around 16GB, however the cached memory takes up so much space, that I am forced to use a compute with higher memory (64GB). So I ...
Manav Karthikeyan
1 vote
0 answers
96 views

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...
TGD
  • 56
0 votes
1 answer
117 views

def runTpoly(rank, size, pp, cs, pkArithmetics_evals, pkSelectors_evals, domain): init_process(rank, size) group2 = torch.distributed.new_group([1,2]) if rank == 0: device ...
wynne yin
0 votes
0 answers
82 views

I am looking to finetune a pre-trained deberta model on Vertex AI with pytorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...
purpleFudge
0 votes
0 answers
66 views

I'm developing a multi-GPU PyTorch application. Existing methods like scatter/gather in torch.distributed don't fulfill my requirements, therefore I need to develop forward/backprop steps which send ...
mirrortower
1 vote
0 answers
36 views

I have a custom ConstantLengthDataset class: class ConstantLengthDataset(IterableDataset): def __init__( self, tokenizer, dataset, infinite=False, ...
имя
  • 11
0 votes
2 answers
65 views

I am trying to benchmark an algorithm I developed. To this end, I run the algorithm on several instances and measure time, memory, other numbers... For each instance, I create a new process; partly ...
Bubaya
  • 893
0 votes
1 answer
41 views

I have a number of workstations that run long processes containing sequences like this: x = wait_while_current_is_set y = read_voltage z = z + y The workstations must maintain synchronization with a ...
david
  • 2,706
0 votes
1 answer
498 views

It seems I'm unable to write using the delta format from my spark job, but I'm not sure what I'm missing. I'm using spark 3.5.3 and deltalake 3.2.0. My error: Exception in thread "main" org....
William
  • 141
0 votes
1 answer
232 views

I have a Databricks cluster configured with a minimum of 1 worker and a maximum of 4 workers, with auto-scaling enabled. What should my Ray configuration (setup_ray_cluster) be to fully utilize the ...
question.it
  • 3,018
0 votes
0 answers
383 views

Can someone help me with the following error. The code works fine on the 2 T4 GPUs. But fails when run on the 4 L4 GPUs. I am extending the Gemma 2B model for a multi-label multi-class classification ...
Rakesh Jarupula
1 vote
0 answers
92 views

I have been given this:
x | y
1 | a,b,c,d,e
2 | a,b,c,d
3 | a,c,d
...
I'd like this:
1,2 | 4 (a,b,c,d)
1,3 | 3 (a,c,d)
2,3 | 3 (a,c,d)
I have x -> 3*10^6 such rows (3 million records) y could ...
Krithish Goli
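For the pairwise-overlap question above, one possible approach (a sketch, not taken from the thread) is to invert the table into an item-to-rows index and then collect every pair of rows that appears under the same item. The function name `pairwise_overlap` and the input shape are hypothetical, and this in-memory version is only meant to show the idea on the three sample rows; at 3 million rows the pair enumeration would likely need pruning or a distributed join.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_overlap(rows):
    """Map each pair of row ids to the sorted list of items they share.

    `rows` is a dict of row id -> iterable of items (hypothetical input shape).
    """
    # Invert the table: item -> ids of the rows that contain it.
    index = defaultdict(list)
    for row_id, items in rows.items():
        for item in set(items):
            index[item].append(row_id)

    # Every pair of rows listed under the same item shares that item.
    shared = defaultdict(list)
    for item in sorted(index):
        for a, b in combinations(sorted(index[item]), 2):
            shared[(a, b)].append(item)
    return dict(shared)


# Sample rows from the question excerpt.
rows = {
    1: "a,b,c,d,e".split(","),
    2: "a,b,c,d".split(","),
    3: "a,c,d".split(","),
}
for pair, items in sorted(pairwise_overlap(rows).items()):
    print(pair, len(items), items)
```

Pairs that share nothing never appear in the result, which keeps the output proportional to the actual overlaps rather than to all possible pairs.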
1 vote
0 answers
22 views

I’m using GridDB for a distributed database setup and recently encountered the following error while performing operations across nodes in the cluster: from griddb_python import StoreFactory, ...
Samar Mohamed
0 votes
1 answer
73 views

I am new to queue-worker architectures and I'm interested in how to make one resilient to a worker failing. For example: we have a pool of workers Alpha that put entries onto queue A. Then the ...
Lubed Up Slug
2 votes
2 answers
226 views

Imagine a 3-node Raft cluster. Each node is in sync and has log [1,2,3], and entry 3 is committed by the leader. Now the leader receives an entry 4 but fails to commit it because of an unreliable network and ...
Dumb_Pegasus
2 votes
0 answers
36 views

I'm working with GridDB to manage a distributed cluster, and recently I’ve been encountering the following error during data synchronization across nodes in the cluster: 20021 SYNC_LOG_NOT_FOUND ERROR ...
Samar Mohamed
0 votes
1 answer
144 views

I'm trying Apache Ignite and must say the Ignite documentation is incomplete. Anyway, I've set up a two-node cluster using Docker images 2.14.0-arm14 and exposed all ports for both Ignite containers, however ...
JUser
  • 196
1 vote
0 answers
81 views

I am trying to train an xgboost model using dask. I have already transformed the data and prepared it as follows: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, ...
joao pereira
2 votes
0 answers
432 views

I’m working on a project that involves creating a vector search index for a massive dataset consisting of 1.3 trillion tokens. I want to use AutoFAISS in a distributed environment to handle the scale ...
Cauder
  • 2,759
1 vote
1 answer
57 views

How do I call the function inner from outer, such that each call to inner runs on a different node? That is, for ij = 1, it runs on node 1 using all of its 16 cores, for ij = 2, it runs on node 2 ...
evening silver fox
0 votes
1 answer
113 views

I have a workflow with multiple DAGs. Every DAG has multiple tasks. These tasks are simple ETL tasks. It involves geo data in the form of kmls, csvs. An example task: We have meta data of road ...
ShariqHameed
0 votes
1 answer
56 views

The sender is sending N packets to a receiver. I want a protocol or method that guarantees delivery: each packet is received at least once. It is OK if some packets are received more than once due to ...
Yufei Zheng
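For the at-least-once delivery question above, a common pattern is acknowledge-and-retransmit: the sender keeps resending a packet until it sees an acknowledgement, and the receiver tolerates duplicates. The simulation below is a minimal sketch of that idea under stated assumptions, not a real network protocol: the function `send_at_least_once`, the random loss model, and all parameter values are hypothetical.

```python
import random


def send_at_least_once(packets, loss_rate=0.3, seed=0, max_attempts=1000):
    """Simulate stop-and-wait retransmission over a lossy channel.

    Returns (received, transmissions): `received` maps a sequence number to
    the number of copies the receiver saw. All parameters are illustrative.
    """
    rng = random.Random(seed)
    received = {}
    transmissions = 0
    for seq, _payload in enumerate(packets):
        for _ in range(max_attempts):
            transmissions += 1
            if rng.random() < loss_rate:
                continue  # data packet lost; sender times out and resends
            received[seq] = received.get(seq, 0) + 1
            if rng.random() < loss_rate:
                continue  # ack lost; sender resends, receiver sees a duplicate
            break  # ack arrived; move on to the next packet
        else:
            raise RuntimeError(f"packet {seq} was never acknowledged")
    return received, transmissions


received, transmissions = send_at_least_once(["p0", "p1", "p2", "p3", "p4"])
print(received, transmissions)
```

A lost acknowledgement is what produces duplicates here, which matches the question's allowance that some packets may arrive more than once.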
0 votes
0 answers
35 views

Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing? Are there any libraries that do this? I searched the web ...
aguiadouro
0 votes
2 answers
552 views

I am training a Transformer Encoder-Decoder based model for Text summarization. The code works without any errors but uses only 1 GPU when checked with nvidia-smi. However, I want to run it on all the ...
Abid Meraj
1 vote
2 answers
958 views

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a twitter clone (in cassandra and nodejs). So, user A has 500k followers. When user A posts a ...
InglouriousBastard
1 vote
1 answer
418 views

I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into an issue similar to the one described here, with significant slowdowns in training time. I understand now that I should run ...
probably45
2 votes
0 answers
26 views

I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations: 80000 LM_WRITE_LOG_FAILED ERROR Writing to log file failed. ...
omar esawy
2 votes
0 answers
25 views

I'm working with GridDB to manage a distributed database and recently encountered the following error while performing operations on a container: 145034 JC_CONTAINER_NOT_OPENED ERROR Status check of ...
omar esawy
2 votes
0 answers
45 views

I'm working on a distributed system where I need to synchronize data across a cluster of nodes. However, I'm encountering an error during the synchronization process. The error message I get is: 20037 ...
omar esawy
3 votes
1 answer
77 views

I am trying to implement the algorithm described in the image, using MPI. It is part of a University project where we are building a distributed satellite to ground station communication system. I ...
Stelios Papamichail
1 vote
0 answers
106 views

Suppose I have a Ray actor that can create a Ray object that associates with some non-serializable states. In the following example, the non-serializable state is a temporary directory. class MyObject:...
Yang Bo
  • 3,773
1 vote
1 answer
124 views

In my Erlang application I have a top-level supervisor that monitors a cowboy server (gen_server): start_link() -> supervisor:start_link({local, ?SERVER}, ?MODULE, []). init([]) -> ...
salbh
  • 71
2 votes
0 answers
104 views

I am trying to understand how XGBoost distributed training works. The best explanation I've found so far is in this paper: https://ml-pai-learn.oss-cn-beijing.aliyuncs.com/%E6%9C%BA%E5%99%A8%E5%AD%A6%...
Altamash Rafiq
0 votes
2 answers
118 views

I am trying to generate a randomly sorted version of a large-ish dataframe on databricks. My go-to code is to use .orderBy(rand()) on the dataframe. This, however, seems to trigger a SparkException ...
Felipe
  • 11.9k
0 votes
1 answer
168 views

I have a complex product that runs like this: a parent Java process exposes an HTTP service. The parent process starts worker subprocesses (new JVMs) and manages their lifecycle. Worker ...
Joey Liu
  • 510
0 votes
1 answer
110 views

least_connection.proto code Node overloaded -- starting load balancing process Traceback (most recent call last): File "D:\lab7p2\least connection\node2.py", line 73, in <module> node....
Yash Pahlani
1 vote
1 answer
2k views

I'm trying to do Pytorch Lightning Fabric distributed FSDP training with Huggingface PEFT LORA fine tuning on LLAMA 2 but my code ends up failing with: `FlatParameter` requires uniform dtype but got ...
JobHunter69
  • 2,376
0 votes
1 answer
53 views

I tried to load the pre-training parameters trained by a single GPU on a single machine with multiple GPUs, but errors such as Missing keys and Unexpected keys occurred. backbone_cfg = dict( ...
Mingshuai Zhao
