23,854 questions
Advice
0
votes
1
replies
69
views
Book Recommendation in PyTorch
I am looking for a book on PyTorch that is suitable for beginners. I've used sklearn in the past for ML; it's a simple workflow for me: prepare the X and Y data, fit/train a model, and make ...
0
votes
0
answers
15
views
DLL initialization routine failed on package import
I have tried loading PyTorch in IPython but get a DLL initialization error. In a normal Python console it works fine, as below. Windows 10, Miniconda installation.
Can anyone advise how I need to ...
0
votes
1
answer
54
views
How to make a 2-label confusion matrix and export it to a JSON file?
I have to train a convolutional neural network on a dataset. The NN itself works and does what it's supposed to, but now I want to make a confusion matrix and export it to a JSON file for further ...
0
votes
0
answers
46
views
Sign Language with PyTorch GRU [closed]
I'm currently training a GRU model on American Sign Language (ASL) using a Kaggle dataset.
While tweaking the parameters, I achieved a peak accuracy of 44.7% on training and 28.2% on testing,
that is ...
Advice
1
vote
1
replies
34
views
Reproducibility of Hugging Face Transformer models
If I'm using any transformer model loaded from the Hugging Face Hub with Python, is it somehow possible to reproduce all the seeds that were used for the model training/fine-tuning?
Seeds/...
Advice
0
votes
1
replies
31
views
Is the torch.fx traced graph topologically sorted?
Dependency layer: a layer whose output is passed to the current layer. In other words, the current layer depends on the outputs of its dependency layers.
For a project, I need to know if the ...
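The "dependency layer" definition above can be made concrete with a small ordering check. This is only a sketch in plain Python and does not call torch.fx; the `order` list and `deps` mapping are hypothetical stand-ins for node names and their inputs, as they might be extracted from a traced graph.

```python
from typing import Dict, List


def is_topologically_sorted(order: List[str], deps: Dict[str, List[str]]) -> bool:
    """Return True if every node appears after all of its dependency nodes."""
    position = {name: index for index, name in enumerate(order)}
    for name in order:
        for dep in deps.get(name, []):
            # A dependency that is missing or placed later breaks the ordering.
            if dep not in position or position[dep] >= position[name]:
                return False
    return True


# Hypothetical graph: x -> conv -> relu, then add(conv, relu)
deps = {"conv": ["x"], "relu": ["conv"], "add": ["conv", "relu"]}
print(is_topologically_sorted(["x", "conv", "relu", "add"], deps))  # True
print(is_topologically_sorted(["x", "relu", "conv", "add"], deps))  # False
```

The same check could be applied to any node list, whatever tool produced it, once each node's inputs are known.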
0
votes
1
answer
48
views
How to use NeuralForecast and PyTorch Lightning on Intel GPU (XPU / torch.xpu)?
PyTorch supports Intel GPU through torch.xpu, but PyTorch Lightning does not currently have built-in XPU accelerator support.
Because NeuralForecast uses Lightning under the hood, that also blocks ...
Best practices
1
vote
0
replies
26
views
How do you effectively reuse helper functions and training pipelines across multiple PyTorch projects?
I’ve been working on multiple machine learning projects using PyTorch, and I keep running into the same issue: a lot of code ends up being repeated across projects.
This includes things like:
...
0
votes
0
answers
35
views
Torch C++ binding core dump with torch C++ extension binding
I found that torch-text is archived, but I still want to use it because there is a course that uses it. Since it was archived, I keep running into a missing-signature problem, so I want to fork and fix it for ...
0
votes
0
answers
58
views
XTTS v2 produces hallucinations when running multiple inferences sequentially, but works fine individually
I'm using XTTS v2 fine-tuned for Vietnamese (vnTTS).
Problem:
- Running inference on a single sentence → perfect output
- Running inference on multiple sentences in a loop → weird sounds/...
-1
votes
0
answers
98
views
Converting .h5 model weights (no architecture) to .pth
I have an .h5 file that contains only model weights, not the model architecture. I want to use these weights in a PyTorch model and convert them into a .pth file.
Some context:
The .h5 file does not ...
3
votes
0
answers
36
views
How to convert the MLP in MoE to 4-bit quantization?
I'm doing some research on information encoding with LLMs and need to find a way to quantize the weights of the MLP layers (MoE) to 4 bits, or even to a customized mixed precision. Consider
from ...
-2
votes
0
answers
133
views
ONNXRuntimeError: CUDA error: cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device
I'm trying to run a model on the GPU:
clf2 = PunctCapSegModelONNX.from_pretrained(
    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase",
    ort_provider=["CUDAExecutionProvider...
Advice
0
votes
4
replies
86
views
Converting an XLSX file to FASTA format in Python
I need to extract data (Peptide_Sequence) from an XLSX output file into a FASTA file. I'm using pandas. A FASTA file is a standard plain-text format in bioinformatics, used to store nucleotide or amino acid ...
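As a rough illustration of the target format, the sketch below turns (identifier, sequence) pairs into FASTA text using only the standard library. The function name `to_fasta`, the `peptide_N` identifiers, and the 60-character line width are assumptions made for this example; reading the Peptide_Sequence column from the XLSX file (for instance with pandas) is deliberately left out.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (identifier, sequence) pairs as FASTA text."""
    lines = []
    for identifier, sequence in records:
        sequence = sequence.strip().upper()
        if not sequence:
            continue  # skip empty cells
        lines.append(f">{identifier}")  # header line starts with ">"
        # Wrap long sequences at a fixed width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Hypothetical rows; in practice the sequences would come from the spreadsheet column.
rows = [("peptide_1", "ACDEFGHIK"), ("peptide_2", "lmnpqrstvwy")]
print(to_fasta(rows), end="")
```

Each record becomes a header line followed by one or more sequence lines, which is all that most downstream tools expect.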
0
votes
1
answer
60
views
Is there a way to prioritise --index-url but still look at other places in pip install / download?
I want to set up a CPU-only install (which saves space, for example). For a single package this is described in, e.g., How to install torch without nvidia?: pip install p --index-url https://download.pytorch.org/whl/cpu. Now I ...
3
votes
1
answer
82
views
How to robustly intercept PyTorch GPU OOM in a Python subprocess and dynamically adjust batch_size within an autonomous AI Agent loop?
I am building an autonomous AI Agent (managing training workflows) that automatically generates PyTorch/OpenMMLab training scripts and executes them in a background subprocess.
One of the common ...
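One possible shape for such a loop is sketched below, using only the standard library. It assumes the child process lets the PyTorch error reach stderr and that the message contains the text "out of memory"; the names `run_with_oom_backoff`, `fake_command`, and `FAKE_TRAINER` are hypothetical, and the fake trainer merely simulates an OOM above batch size 4.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def run_with_oom_backoff(
    make_command: Callable[[int], List[str]],
    batch_size: int,
    min_batch_size: int = 1,
) -> Optional[int]:
    """Run a training command, halving batch_size whenever stderr reports OOM.

    Returns the batch size that succeeded, or None if none did.
    """
    while batch_size >= min_batch_size:
        result = subprocess.run(
            make_command(batch_size), capture_output=True, text=True
        )
        if result.returncode == 0:
            return batch_size
        # Assumption: an OOM failure leaves "out of memory" in the child's stderr.
        if "out of memory" not in result.stderr.lower():
            raise RuntimeError(f"training failed for another reason:\n{result.stderr}")
        batch_size //= 2
    return None


# Stand-in for a real training script: it "runs out of memory" above batch size 4.
FAKE_TRAINER = (
    "import sys\n"
    "bs = int(sys.argv[1])\n"
    "if bs > 4:\n"
    "    sys.stderr.write('CUDA out of memory. Tried to allocate ...')\n"
    "    sys.exit(1)\n"
)


def fake_command(batch_size: int) -> List[str]:
    return [sys.executable, "-c", FAKE_TRAINER, str(batch_size)]


print(run_with_oom_backoff(fake_command, batch_size=32))  # 4
```

Running each attempt in a fresh subprocess means GPU memory is released between attempts, at the cost of restarting the script every time.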
Best practices
0
votes
2
replies
50
views
Can hyperparameters tuned for fp32 be transferred directly to half-precision fp16?
I use Google Colab to train my model.
I have trained the model in fp32 and used a random grid search to tune the hyperparameters. The training phase is slow; it runs at around 3.24 it/s.
I want to ask whether I can use ...
Best practices
5
votes
5
replies
132
views
Optimizing a Gaussian penalty function for HSL color compatibility in PyTorch/NumPy
I am currently developing an AI-driven fashion recommendation system, specifically focusing on a "Multi-modal Context-Aware Decision Model." A critical component of my recommendation engine ...
Advice
0
votes
1
replies
58
views
Can V-JEPA be used to detect audience engagement during a seminar from live video?
I am experimenting with the V-JEPA model developed by Meta for video understanding.
My goal is to analyze a live video stream of people attending a seminar and determine their engagement level (for ...
0
votes
0
answers
65
views
Sentence Transformer Stuck at Loading (Google Cloud Instance)
I use this code to load a sentence transformer on a GCP VM instance (no GPU). This is a Dask plugin used on a Dask worker:
class NLPSetup(WorkerPlugin):
    def __init__(self, bucket_uri):
        self....
Advice
2
votes
0
replies
141
views
Is clothing-invariant person recognition possible using still images only?
I am working on a person recognition system for learning purposes.
My goal is:
Maintain a small gallery of known people (multiple images per person)
Given a new query image, return the most similar ...
3
votes
0
answers
98
views
Implemented PPO algorithm fails to train
I wrote a PPO-based reinforcement learning code for the Gymnasium CarRacing-v3 environment.
(The code was generated with the help of Gemini)
However, even after 200,000 frames, the training does not ...
0
votes
0
answers
58
views
Metrics freeze after the first epoch
I encountered a problem with metrics freezing after the first training epoch. During the first epoch, the model training proceeds normally. The loss metrics decrease, and the accuracy increases. The ...
4
votes
2
answers
153
views
ModelCheckpoint not saving last validating checkpoint when save_last=True
I am using PyTorch Lightning to train my model. I use the Lightning callback ModelCheckpoint with the following settings:
ModelCheckpoint(
    dirpath="path/to/dir",
    monitor="...
5
votes
0
answers
164
views
CUDA error: CUBLAS_STATUS_INVALID_VALUE in cublasGemmEx() with PyTorch, fp16=False
I am using an RTX 3060 (12GB VRAM) and implementing a RAG pipeline with the BGE-M3 embedding model.
Initially, I installed PyTorch with the CUDA 12.8 wheel (my NVIDIA driver supports CUDA 12.9). ...
2
votes
1
answer
67
views
PyTorch ValueError: optimizer got an empty parameter list when building a Logistic Regression Model
I tried making a logistic regression model using nn.Module
class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim=None) -> None:
        super().__init__()
        if input_dim ...
Advice
1
vote
3
replies
90
views
What does it mean that PyTorch's torch.mul is "unbound"?
Running help(torch.Tensor.mul) gives:
Help on method_descriptor:
mul(...) unbound torch._C.TensorBase method
mul(value) -> Tensor
See :func:`torch.mul`.
What does unbound mean in this ...
3
votes
2
answers
124
views
Why can't I pass input and target tensors directly to nn.CrossEntropyLoss?
On Python 3.13 with torch 2.10.0+cu130:
import torch
from torch import nn
loss = nn.CrossEntropyLoss()
loss(torch.tensor((.1, .2)), torch.tensor((.3, .4)))
returns
tensor(0.4811)
but why does
nn.CrossEntropyLoss(torch....
1
vote
0
answers
70
views
Why does BatchNorm1d fail with batch size 1 in training mode?
I am training a small PyTorch model and want to use nn.BatchNorm1d.
When the batch size is 1 and the model is in training mode, I get the error below:
ValueError: Expected more than 1 value per ...
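A plain-Python sketch may help show why a single sample is a degenerate case. In training mode, batch normalization normalizes each feature with the mean and variance of the current batch; the function below mimics that for one feature column. It is not PyTorch's implementation: the learnable scale and shift are omitted, and the `eps` value of 1e-5 is an assumption.

```python
import math
from typing import List


def batch_norm_1d(values: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one feature column with its own batch mean and (biased) variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


print(batch_norm_1d([3.0]))       # [0.0]: a single value always normalizes to zero
print(batch_norm_1d([250.0]))     # [0.0]: the input itself no longer matters
print(batch_norm_1d([1.0, 3.0]))  # two values give a meaningful spread
```

With one value per feature, the batch mean equals the value and the variance is zero, so the normalized output carries no information about the input. Common ways around the error are to avoid size-1 batches during training (for example with `drop_last=True` on the `DataLoader`) or to call `model.eval()` when only inference is needed.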
0
votes
0
answers
45
views
Trained and loaded CycleGAN model is giving distorted output images
I trained a CycleGAN model on Google Colab using this repository - https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
The model should enhance dark images. I tested the model on my test dataset ...
1
vote
1
answer
69
views
GluonTS DeepAREstimator fails to load checkpoint in PyTorch 2.6
I am currently working on a project where I have to use GluonTS (the DeepAREstimator and DLinearEstimator). At the beginning it worked well. But now, even when I use the example code from the GluonTS ...
2
votes
2
answers
60
views
"from torch_geometric.data import Data" throwing an error
If I run a py module with only these imports (no additional code) it works fine and the output is Process finished with exit code 0:
import torch.utils.data
from torch.utils.data.dataloader import ...
0
votes
0
answers
32
views
Post-training quantized model raises the error "Copying from quantized Tensor to non-quantized Tensor is not allowed" even though I'm not copying a tensor
I got a pretrained ResNet-18 model from this lane detection repo in order to use it as an ADAS (advanced driver assistance systems) function for an electric car building competition. My current goal is ...
0
votes
0
answers
37
views
How to properly handle LSTM states during training with SliceSampler in TorchRL?
I am implementing a Reinforcement Learning environment using torchrl where the agent uses an LSTM-based policy. My goal is to train the agent on sequences sampled from a replay buffer. While I have ...
0
votes
0
answers
46
views
Why does .view() fail after permuting dimensions for a GRU?
I'm trying to train a character-level GRU on Linux kernel source but the training loop keeps crashing with this error:
RuntimeError: view size is not compatible with input tensor's size and stride (...
1
vote
0
answers
72
views
PyTorch and NVIDIA FLARE are taking all computing resources in machine learning experiments
I am using PyTorch for federated experiments. Because my experiments involve 50 datasets with models, I have to run multiple ML model experiments in parallel.
The code for training the ML model is ...
0
votes
0
answers
40
views
torch dataloader next-method when using multiple workers
I have a Dataset based on IterableDataset that looks like this:
class MyDataSet(torch.utils.data.IterableDataset):
    def __init__(self):
        # doing init stuff here
    def __iter__(self):
...
Advice
2
votes
1
replies
55
views
Why do we use requires_grad=True in the input here?
# Example of target with class indices
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True) <=============== WHY ?
target = torch.empty(3, dtype=torch.long).random_(5)
output =...
6
votes
0
answers
136
views
Docker load fails with wrong diff id calculated on extraction for large CUDA/PyTorch image (Ubuntu 22.04 + CUDA 12.8 + PyTorch 2.8)
About
I am trying to create a Docker image with the same Dockerfile with Python 3.10, CUDA 12.8, and PyTorch 2.8 that is portable between two machines:
Local Machine: NVIDIA RTX 5070 (Blackwell ...
0
votes
0
answers
244
views
Memory access fault by GPU node-1 (Agent handle: 0x26f5dbf0) on address 0x7749d0333000. Reason: Write access to a read-only page
I am currently working on a project to segment 3D-LSM images using a self-supervised model, and I have been trying to perform a dry run (testing pre-training) on an AMD GPU droplet on DigitalOcean. The configs of ...
1
vote
1
answer
57
views
Why does PyTorch GPU matmul give correct results without torch.cuda.synchronize()?
I'm learning GPU programming with PyTorch and I'm confused about when torch.cuda.synchronize() is actually necessary.
I have this code that compares CPU and GPU matrix multiplication:
import torch
...
1
vote
1
answer
101
views
torch.matmul(S, v) where S is symmetric and v is a vector: how to speed up computations?
Let S be an n×n symmetric matrix and v an n×1 vector.
We need to compute the vector (S x v) efficiently inside a PyTorch loss function.
Do you know if there is a way to keep ...
0
votes
0
answers
45
views
Proper utilization of sliding window inferer
I am training an Encoder-Decoder network to reconstruct brain CT images. Due to OOM (Out of Memory) errors with full-sized images, I implemented a sliding window approach for training and inference.
...
2
votes
1
answer
76
views
PyTorch: trying to create a joint dataset with different transforms results in both datasets having same transform
I'm very new to PyTorch and am attempting to create a dataset for which a given sample has both unmasked and masked data associated with it, or in other words, the first piece of data is just the ...
3
votes
1
answer
68
views
How can the backward function of a tensor influence the matrix in the model?
class SoftmaxRegission(torch.nn.Module):
    linear: torch.nn.Linear
    def __init__(self, num_features: int, num_classes: int):
        super(SoftmaxRegission, self).__init__()
        self.linear =...
0
votes
0
answers
46
views
Parameter count difference between UNETR paper and MONAI implementation
I am comparing many deep learning models to each other, including UNETR, on the BTCV dataset and noticed a discrepancy in the reported number of parameters.
In their paper titled "UNETR: ...
0
votes
0
answers
71
views
In PyTorch how do I perform bf16 or f16 matmul with accumulation in f32 explicitly?
I need fast compute but the resulting sums need higher precision for downstream tasks just for one specific op out of many in the model.
torch.bmm has an out_dtype parameter but the documentation does ...
0
votes
2
answers
137
views
Difference between torch.nn.Module and torch.Tensor?
I’m learning PyTorch and I see two common classes: torch.Tensor and torch.nn.Module. I’m a bit confused about their differences and when to use each.
Here’s what I understand so far:
torch.Tensor ...
0
votes
0
answers
91
views
Onnx cannot be read with Microsoft.ML on Windows 10 (19045.5854)
Everything I describe here works perfectly fine on my computer that has Windows 11 (Version 10.0.26200). However, on a computer that has Windows 10 (10.0.19045), it does not work. This is a client's ...
0
votes
0
answers
57
views
Issue with converting mobile_sam.pt to ONNX format (decode part)
I want to use the mobile_sam.pt model in a web browser, so I need it in ONNX format.
I tried these methods, but I always get the same error.
segment-anything -
samexporter
The error above ...