2,596 questions
0
votes
0
answers
72
views
Custom GRPO Trainer not Learning
I am new to reinforcement learning, so as an educational exercise I am implementing GRPO from scratch with PyTorch. My goal is to mimic how TRL works, but boil it down to just the loss function and ...
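For anyone doing the same exercise, here is a minimal sketch of the core GRPO objective: group-normalized advantages combined with a PPO-style clipped ratio. The per-token formulation and KL penalty that TRL adds are omitted, and all tensor names are illustrative.

import torch

def grpo_loss(logp_new, logp_old, rewards, group_size, clip_eps=0.2):
    # Group-relative advantages: normalize each reward within its sampled group.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-8)
    adv = adv.view(-1)

    # PPO-style clipped surrogate on the sequence-level likelihood ratio.
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()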
0
votes
1
answer
47
views
When using PyTorch torchrl TD0Estimator, how to handle the "done" and "terminated" flag
Based on the TD0Estimator documentation, it uses two TensorDict keys to flag whether an episode has ended. But I can't seem to find any indication of when and how to use them.
As an example, let's ...
Advice
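For reference, a minimal sketch of how the two flags are usually supplied to torchrl's TD0Estimator, assuming the current default keys (the network, shapes, and key names are illustrative and may differ between torchrl versions): "terminated" marks a true episode end, so the next-state value is not bootstrapped, while "done" is terminated-or-truncated and marks where a trajectory stops.

import torch
from tensordict import TensorDict
from torchrl.modules import ValueOperator
from torchrl.objectives.value import TD0Estimator

# Toy value network over a 4-dimensional observation.
value_net = ValueOperator(torch.nn.Linear(4, 1), in_keys=["observation"])
estimator = TD0Estimator(gamma=0.99, value_network=value_net)

batch = 8
td = TensorDict({
    "observation": torch.randn(batch, 4),
    "next": TensorDict({
        "observation": torch.randn(batch, 4),
        "reward": torch.randn(batch, 1),
        # done = terminated OR truncated; terminated alone disables bootstrapping.
        "done": torch.zeros(batch, 1, dtype=torch.bool),
        "terminated": torch.zeros(batch, 1, dtype=torch.bool),
    }, [batch]),
}, [batch])

estimator(td)  # writes "advantage" and "value_target" into the tensordict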
0
votes
1
replies
95
views
How to read a large Python project (for example, a project of Deep Learning or Reinforcement Learning)
I've downloaded many Python projects about Reinforcement Learning from GitHub, but each takes me too much time to read.
It's easy to comprehend a simple Python project with only a few *.py files, but ...
Advice
0
votes
0
replies
31
views
When using TensorDictPrioritizedReplayBuffer, should I apply the priority weight manually or not?
With Prioritized Experience Replay (PER), we use the beta parameter to compute importance-sampling weights that offset the bias introduced by PER. Now, with PyTorch's TensorDictPrioritizedReplayBuffer, I ...
Advice
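For context, a sketch of how the importance-sampling weight typically surfaces when sampling from TensorDictPrioritizedReplayBuffer; the "_weight" key and the constructor arguments below are assumptions based on recent torchrl versions, so check the docs for your version.

import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, TensorDictPrioritizedReplayBuffer

# alpha controls how strongly priorities skew sampling, beta the bias correction.
buffer = TensorDictPrioritizedReplayBuffer(
    alpha=0.6, beta=0.4,
    storage=LazyTensorStorage(10_000),
    priority_key="td_error",
)
buffer.extend(TensorDict({"obs": torch.randn(32, 4), "td_error": torch.rand(32)}, [32]))

sample = buffer.sample(16)
per_sample_loss = torch.randn(16)  # placeholder for your elementwise TD loss
# The importance-sampling weight is applied manually by the user:
loss = (sample["_weight"] * per_sample_loss).mean()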
1
vote
0
replies
34
views
How can I design “story-driven NPCs” in a reinforcement-learned environment? Looking for development directions and architectural advice
I’m working on a thesis about "story-driven NPCs in a reinforcement-learning world", and I’m building a small multi-agent RL environment as a prototype. However, I’m unsure how to push the ...
3
votes
0
answers
69
views
KeyError: 'advantages' in PPO MARL using Ray RLLib
I am using Ray 2.50.1 to implement a MARL model with PPO.
However, I am running into the following problem:
'advantages'
KeyError: 'advantages'
During handling of the above exception, another exception occurred:
...
0
votes
0
answers
83
views
Trouble configuring R-group substitution in REINVENT 4 (AstraZeneca) — validation errors for RLConfig and ScorerConfig
I’m using AstraZeneca’s REINVENT 4 (v4.6.27) to generate SMILES from a scaffold via R-group substitution, optimizing for 5-HT2A / D2 / 5-HT1A (maximize) and minimizing H1 / M1 / α1A, with DockStream ...
1
vote
0
answers
79
views
Pytorch on Import torchrl.data: AttributeError: __provides__. Did you mean: '__providedBy__'?
I'm new at this and I'm trying to dabble in PyTorch and TorchRL. However, as the title states, before I can even load the model, I get that AttributeError message. This is the full error message ...
0
votes
1
answer
95
views
PermissionError: [Errno 13] Permission denied: 'Qwen3-0.6B-SFT'
I am getting the following error when running training using the TRL library in the following Hugging Face Space: vishaljoshi24/trl-4-dnd.
My SDK is Docker, and as far as I'm aware there are not ...
0
votes
1
answer
177
views
Getting different results across different machines while training RL
While training my RL algorithm using SBX, I am getting different results across my HPC cluster and my PC. However, I did find that results are consistently the same within the same machine. They just diverge ...
1
vote
0
answers
46
views
Does Stable_Baselines3 store the seed rng while saving?
I was wondering whether a model might give different performance if we load it at different times while running a stochastic program, because depending on when the model is loaded, various functions (...
0
votes
0
answers
66
views
Agent instantly teleports or moves too fast to capture the flag
I'm creating a Capture-the-Flag style game in Unity using ML-Agents. The setup includes:
2 Agents (one per team)
Each team has a flag and a base
A NavMesh is also added to the floor, and NavMesh agents ...
1
vote
1
answer
54
views
How to resolve the type error in pomegranate?
I am trying to set up dummy code for pomegranate (below), but for some reason I am getting an error when I try to run ConditionalCategorical(). How do I resolve it?
from pomegranate....
1
vote
0
answers
51
views
Actor loss can't decrease in TD3-HER
I am trying to train a TD3-HER-based agent in CARLA; the training environment is Endless-v0, but the actor and critic loss curves during training look like this:
The curves
and the videos of agents show ...
1
vote
0
answers
42
views
Difference between tokens generated on a configuration in two different contexts
I have a model that, given a configuration or state (of a Rubik's cube, but in any case a sequence of integers), generates a move (from 0 to 5). This move can be used to bring the ...
1
vote
1
answer
142
views
How can I properly add seed/options to a dmc2gym environment with Gymnasium? [closed]
import gymnasium as gym
import dmc2gym
gymenv = gym.make("CartPole-v0")
gymenv.reset(seed=42, options=None) # This works fine, no problem
dmcenv = dmc2gym.make(domain_name="quadruped"...
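If it helps, a sketch of the usual workaround, assuming the dmc2gym.make signature from the widely used denisyarats/dmc2gym package (the task name and seed below are illustrative): dmc2gym builds old-Gym-API environments, so the Gymnasium-style reset(seed=..., options=...) is not supported and the seed is instead passed at construction time.

import dmc2gym

# Assumption: dmc2gym.make accepts a seed argument; the returned env follows
# the old Gym API, so reset() takes no seed/options.
dmcenv = dmc2gym.make(domain_name="quadruped", task_name="walk", seed=42)
obs = dmcenv.reset()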
1
vote
1
answer
136
views
Error Raised with SAC for Centralized Training, Decentralized Execution in Ray RLlib
I'm using a slight variant of the RockPaperScissors multi-agent environment from the Ray RLlib documentation as a test environment to verify that a custom RLModule for Centralized Training, ...
1
vote
1
answer
92
views
DQN fails to learn good policy for Atari Pong
I'm trying to implement the findings from this DeepMind DQN paper (2015) from scratch in PyTorch using the Atari Pong environment.
I've tested my Deep Q-Network on a simple test environment, where ...
0
votes
1
answer
96
views
How does TorchRL deal with multiple trajectories in the same batch?
I am trying to understand how the DQN algorithm with RNNs works in PyTorch's RL API through this tutorial. However, the way some of the classes handle episodes and batches during training is unclear ...
1
vote
0
answers
67
views
My RandomSampler() is always generating the same parameters
I used TPESampler and set it as follows while optimizing with Optuna: sampler=optuna.samplers.TPESampler(multivariate=True, n_startup_trials=10, seed=None). But during the 10 startup trials, it ...
0
votes
0
answers
85
views
RL Trading Agent Can't Learn Sensible Behavior Even on a Simple Sine Wave — What Am I Doing Wrong?
I’ve been building a reinforcement learning trading agent using a synthetic sine wave as the price series — basically the simplest dataset I could imagine to test whether an agent can learn to buy low ...
0
votes
0
answers
52
views
Unable to reproduce training results in a DummyVecEnv using Stable-Baselines3
I created a custom Gymnasium environment and trained an agent using Stable-Baselines3 with DummyVecEnv and VecNormalize. The agent performs well during training and consistently reaches the goal. ...
0
votes
1
answer
82
views
When using TensorDictPrioritizedReplayBuffer, should I store the "td_error" field in the TensorDict data?
Let's say you are going to train DDPG or any algorithm that uses a Prioritized Replay Buffer. When using torchrl's TensorDictPrioritizedReplayBuffer, after you calculate the td_error, you are going to use it to call ...
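For reference, a sketch of this workflow under the assumption that torchrl exposes update_tensordict_priority (names and arguments follow recent torchrl versions; verify against your install): td_error does not need to be stored at insert time, since new samples receive a default priority; you write it into the sampled batch and push it back.

import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, TensorDictPrioritizedReplayBuffer

buffer = TensorDictPrioritizedReplayBuffer(
    alpha=0.6, beta=0.4,
    storage=LazyTensorStorage(10_000),
    priority_key="td_error",
)
# Insert transitions without td_error: they receive the default (max) priority.
buffer.extend(TensorDict({"obs": torch.randn(32, 4)}, [32]))

sample = buffer.sample(16)
td_error = torch.rand(16)  # placeholder: |Q(s, a) - TD target| from your update step
sample.set("td_error", td_error)
buffer.update_tensordict_priority(sample)  # refresh priorities for the sampled indices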
0
votes
0
answers
150
views
RecurrentPPO model from sb3-contrib always gives me policy gradient loss and explained variance close to 0
I'm working on training an RPPO agent to handle a temperature control system. Here's a snippet of the code.
class TempControlSeqEnv(gym.Env):
def __init__(self, curriculum_phase, time_steps=5):
...
0
votes
0
answers
49
views
Trouble training DQN agent in CarRacing environment — poor learning progress
I'm trying to train a DQN agent on my CarRacing environment, but I'm struggling to make the agent learn anything meaningful — the total episode reward stays very low and doesn't improve over time.
def ...
0
votes
0
answers
137
views
Why does the PyBullet physics sim behave unrealistically in a simple experiment?
I want to know if PyBullet is accurate enough to simulate physics environments for training machine learning models. I want to create a line-following robot that follows a line based on what it ...
0
votes
0
answers
54
views
Soft Actor Critic -- actor forward pass outputs NaN immediately
I have the following code for the Actor network in my Soft Actor-Critic. However, once the batch is full and it starts to do backprop (batch size: 128), after about 5 iterations of updating the ...
0
votes
0
answers
31
views
Why can't I implement ENERGYM?
I am trying to implement the ENERGYM library in Python as per the following paper: https://www.mdpi.com/2076-3417/11/8/3518 and https://bsl546.github.io/energym-pages/sources/install_min.html
After ...
0
votes
0
answers
43
views
Multiple environment resets per iteration using Ray RLlib DQN (D3QN) with a simple custom Gym environment
Hello,
I’m experimenting with Ray RLlib’s DQN (Dueling Double DQN) on a minimal custom environment, but I keep seeing many resets in a single training iteration, even though each episode completes ...
0
votes
0
answers
20
views
wandb Sweep Agent Not Injecting ${args} into Command (Hyperparameters Never Passed)
I’m trying to run a W&B hyperparameter sweep against my Python training script, but none of the sweep’s hyperparameter flags ever get passed to the script. Instead, the agent simply runs my base ...
0
votes
0
answers
27
views
How could I get the images in custom policy in metadrive?
I am trying to customize a policy using the MetaDrive framework. To make a decision, I want to get the rgb_camera information from the vehicle. I've already checked the official files and I only know I can get frame ...
0
votes
0
answers
53
views
Reinforcement learning Atari video recording crashes
I am learning Reinforcement Learning and picked the Atari Breakout environment to learn with. I have trained the neural nets and am now trying to create some videos to visualize the results.
I am using ...
2
votes
1
answer
193
views
I keep getting this error (CUDA is available): 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu'
I'm training a transformer model using RLlib's PPO algorithm, but I encounter a device mismatch error:
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, ...
0
votes
1
answer
59
views
Why does the CityLearn tutorial keep throwing me a ValueError?
I am following this CityLearn tutorial.
I got through the first part (RBC) without incident. However, when I implement the second part (Q-learning, literally copy and paste from the site), I keep ...
2
votes
1
answer
379
views
Reproducibility of JAX calculations
I am using JAX to run Reinforcement Learning (RL) & Multi-Agent Reinforcement Learning (MARL) calculations. I have noticed the following behaviour:
In RL, my results are always fully ...
1
vote
2
answers
311
views
Ray rllib episode_reward_mean not showing
Can anyone explain to me why the episode_reward_mean is NOT part of the results dictionary?
Is it replaced by a different key in the latest API?
I see env_runners/episode_return_mean and env_runners/...
1
vote
0
answers
30
views
MATLAB Reinforcement Learning: issue obtaining the gradient from a Q-value critic using dlfeval, dlgradient, dlarray
I'm trying to implement a custom agent, and inside my agent I'm running into issues with obtaining the gradient of the Q value with respect to my actor network parameters. I have my code below, main ...
0
votes
0
answers
31
views
Error when training the model for sensor data using Kafka and RDL: index 4096 is out of bounds for axis 0 with size 4096
This is my code using Python, RDL (Reinforcement Deep Learning), and Kafka. Both the ZooKeeper and Kafka servers are working with no issues. When I run the following code in Jupyter:
import json
import ...
2
votes
1
answer
33
views
How can I import the internal `VectorSARTTrajectory` from ReinforcementLearningCore.jl?
I'm trying to use the VectorSARTTrajectory type from the ReinforcementLearningCore.jl package, because it is also mentioned in the Introduction to RL.jl. I found the implementation of ...
1
vote
0
answers
68
views
In Gymnasium, how can one run a vector environment's function in parallel similar to how step() can be run in parallel?
I have a custom Gymnasium environment, RLToy-v0 from the library MDP Playground. It separates out the transition function and reward function from the step function and calls them individually inside ...
0
votes
0
answers
67
views
Performance of my MountainCar gym environment does not improve at all
I am new to reinforcement learning and I am trying to solve different environments in the Gymnasium library. But no matter what I try, the reward for the MountainCar env turns out to be -200 both during ...
0
votes
0
answers
50
views
DQN agent: loss decreases, cumul. reward stagnates, q-values are very similar over all actions and get higher and higher
My loss is decreasing, but my agent isn't learning in a very easy environment... The cumulative reward stagnates, and when I print the Q-values they are identical or at least very similar across all ...
0
votes
0
answers
49
views
I created a very simple RL model for trading stocks but its output is the same regardless of input
I've been trying to build a basic Neural Net that analyzes a stock's price and chooses whether to buy, sell, or hold.
# Importing Libraries
import torch
import torch.nn as nn
import torch.optim as ...
0
votes
0
answers
53
views
Cartpole with Q-Learning not Learning Anything
So I was trying to solve the cartpole problem. This is a common problem when dealing with reinforcement learning. Essentially, you have a cart that is balancing a pole. The cart can move left or right....
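For reference, the tabular Q-learning update this setup relies on, sketched with an illustrative state discretization (the bin count, alpha, and gamma are arbitrary example values; CartPole's continuous observation must be binned before a Q-table applies):

import numpy as np

n_bins, n_actions = 10, 2
Q = np.zeros((n_bins,) * 4 + (n_actions,))  # one bin per observation dimension

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s + (a,)] += alpha * (target - Q[s + (a,)])

# s and s_next are 4-tuples of bin indices, e.g. s = (3, 5, 4, 6).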
1
vote
0
answers
92
views
Calculate correlation on dict type variables
I have a dataframe named hyperparam_df which looks like the following:
repo_name file_name \
0 DeepCoMP deepcomp/util/simulation.py ...
0
votes
0
answers
40
views
IndexError: too many indices for array in reinforcement learning model using Pennylane and PyTorch
I am working on a quantum reinforcement learning model using PennyLane, PyTorch, and a stock trading environment from the FinRL library. When I run my training function, I get the following error. Any help is ...
1
vote
1
answer
417
views
How to use jax.vmap with a tuple of flax TrainStates as input?
I am setting up a Deep MARL framework and I need to assess my actor policies. Ideally, this would entail using jax.vmap over a tuple of actor flax TrainStates. I have tried the following:
import jax
...
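One common pattern, sketched under the assumption that every agent's network shares the same structure: instead of vmapping over a Python tuple of TrainStates, stack the per-agent parameter pytrees along a new leading axis and vmap over that axis together with the per-agent observations. The parameter shapes and policy function below are purely illustrative.

import jax
import jax.numpy as jnp

# Illustrative per-agent policy: params is a small pytree, obs a feature vector.
def policy_apply(params, obs):
    return jnp.tanh(obs @ params["w"] + params["b"])

# Per-agent parameter pytrees (in practice: [ts.params for ts in train_states]).
params_list = [{"w": jnp.ones((4, 2)) * i, "b": jnp.zeros(2)} for i in range(3)]

# Stack each leaf along a new leading "agent" axis so one pytree holds all agents.
stacked = jax.tree_util.tree_map(lambda *xs: jnp.stack(xs), *params_list)

obs = jnp.ones((3, 4))                          # one observation per agent
actions = jax.vmap(policy_apply)(stacked, obs)  # maps over the agent axis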
0
votes
0
answers
97
views
Using Pytorch Sequential for reinforcement model without gym
I'm trying to play a game with AI, but I want to do it in real time. Because of that, I'm not using gym to create an environment.
I want to take a screenshot, preprocess it, then pass it through the ...
0
votes
1
answer
209
views
EvalCallback hangs in stable-baselines3
I'm trying to train an A2C model in stable-baselines3 and the EvalCallback appears to freeze when it is called. I cannot figure out why. Below you will find a script that recreates this problem. ...
-1
votes
1
answer
69
views
Actor-Critic behaved strange on cliff walking [closed]
0. Background
I was trying to train an AC-based agent for a task with a large observation space. The task is similar to a huge cliff-walking task. The agent starts at some random point on a 20 * 20 * 5 * 9 ...