Outperforms TD3, SAC, and DDPG across every MuJoCo and Box2D environment tested.
Created by Mohammad Asadolahi — Senior Agentic AI Engineer | Agentic AI Architectures In The Wild
State-of-the-art continuous control algorithms — TD3, SAC, DDPG — all share a fundamental limitation: blind exploration. They inject fixed, state-independent noise into agent decisions, hoping randomness alone will uncover optimal policies. This is catastrophically sample-inefficient and, in many real-world domains, simply infeasible.
Exploration should not be random. It should be learned.
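TDS introduces a stochastic policy that outputs both a mean action μ and a state-dependent standard deviation σ, so the scale of exploration is learned through gradients rather than fixed in advance.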
Detailed per-environment learning curves (including zoomed views) are available in the Plots/ directory.
TDS combines three proven design principles with one novel contribution:
| Component | Design Choice | Rationale |
|---|---|---|
| Twin Critics | Two independent Q-networks, take the minimum | Mitigates overestimation bias |
| Delayed Actor Updates | Actor updated every 2 critic steps | Stabilizes training by reducing policy lag |
| Stochastic Policy (novel) | Actor outputs a mean μ and a standard deviation σ | Enables gradient-driven, state-dependent exploration |
| Target Networks | Polyak-averaged target actor + critics | Smooth bootstrapping targets |
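The mechanics behind the twin critics, delayed actor updates, and target networks in the table can be sketched in plain Python, independent of any deep learning library. This is a minimal illustration that assumes TD3-style conventions; the function names (`td_target`, `should_update_actor`, `polyak_update`) are hypothetical and are not taken from the repository's Agent.py.

```python
from typing import List


def td_target(reward: float, done: bool, q1_next: float, q2_next: float,
              gamma: float = 0.99) -> float:
    """Bootstrapped target built from the minimum of the two target critics."""
    # Taking the smaller estimate counters overestimation bias.
    min_q = min(q1_next, q2_next)
    # Do not bootstrap past a terminal transition.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min_q


def should_update_actor(critic_step: int, interval: int = 2) -> bool:
    """Delayed actor updates: True on every `interval`-th critic step."""
    return critic_step % interval == 0


def polyak_update(target: List[float], online: List[float],
                  tau: float = 0.005) -> List[float]:
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


if __name__ == "__main__":
    print(td_target(1.0, False, q1_next=10.0, q2_next=8.0))
    print([s for s in range(1, 7) if should_update_actor(s)])
    print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

With a small `tau`, the target parameters trail the online parameters slowly, which is what keeps the bootstrapping targets smooth.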
State → Linear(obs_dim, 256) → LayerNorm → ReLU
→ Linear(256, 256) → LayerNorm → ReLU
→ μ: Linear(256, action_dim) → Tanh × max_action
→ σ: Linear(256, action_dim) → Sigmoid → Clamp[0.1, 1.0] × max_action
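The actor diagram above could be written in PyTorch roughly as follows. This is a sketch, not the contents of Actor.py: the class name `StochasticActor`, the `sample` method, the Gaussian sampling rule, and the final clipping to the action range are all assumptions. The σ head follows the diagram literally, clamping to [0.1, 1.0] before scaling by `max_action`, which is assumed to be a scalar.

```python
import torch
import torch.nn as nn


class StochasticActor(nn.Module):
    """Two-headed actor: a shared trunk feeding a mean head and a std head."""

    def __init__(self, obs_dim: int, action_dim: int, max_action: float = 1.0):
        super().__init__()
        self.max_action = max_action
        # Shared trunk: Linear -> LayerNorm -> ReLU, twice.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
        )
        self.mu_head = nn.Linear(256, action_dim)
        self.sigma_head = nn.Linear(256, action_dim)

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        # Mean action, squashed into [-max_action, max_action].
        mu = torch.tanh(self.mu_head(h)) * self.max_action
        # State-dependent std, kept in [0.1, 1.0] before scaling.
        sigma = torch.clamp(torch.sigmoid(self.sigma_head(h)), 0.1, 1.0)
        return mu, sigma * self.max_action

    def sample(self, state: torch.Tensor) -> torch.Tensor:
        # Assumption: Gaussian exploration around mu, clipped to valid actions.
        mu, sigma = self(state)
        action = mu + sigma * torch.randn_like(mu)
        return action.clamp(-self.max_action, self.max_action)


if __name__ == "__main__":
    actor = StochasticActor(obs_dim=27, action_dim=8)  # Ant-v4 dimensions
    mu, sigma = actor(torch.zeros(4, 27))
    print(mu.shape, sigma.shape)
```

Because σ is a differentiable function of the state, gradients from the critic can shrink or widen exploration per state, which is the behavior the stochastic policy is meant to provide.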
The actor's
[State, Action] → Linear(obs_dim + action_dim, 256) → ReLU
→ Linear(256, 256) → ReLU
→ Linear(256, 1) → Q-value
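A PyTorch sketch of the critic diagram above, with two copies wrapped in one module so both Q-values come from a single call. The names `make_q_network`, `TwinCritic`, and `min_q` are hypothetical, and bundling both networks in one class is an assumption about how Critic.py might be organized.

```python
import torch
import torch.nn as nn


def make_q_network(obs_dim: int, action_dim: int) -> nn.Sequential:
    """One Q-network: [state, action] -> 256 -> 256 -> scalar Q-value."""
    return nn.Sequential(
        nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )


class TwinCritic(nn.Module):
    """Two independently initialized Q-networks over the same input."""

    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.q1 = make_q_network(obs_dim, action_dim)
        self.q2 = make_q_network(obs_dim, action_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        # Concatenate state and action along the feature dimension.
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)

    def min_q(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Element-wise minimum, used to build pessimistic targets.
        q1, q2 = self(state, action)
        return torch.min(q1, q2)


if __name__ == "__main__":
    critic = TwinCritic(obs_dim=27, action_dim=8)  # Ant-v4 dimensions
    q1, q2 = critic(torch.zeros(5, 27), torch.zeros(5, 8))
    print(q1.shape, q2.shape)
```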
pip install torch gymnasium[mujoco] numpy pandas matplotlib
python Main.py
Evaluation runs every 5,000 steps over 10 episodes. Training logs and results are saved to ./data/{env}/{algorithm}/seed{n}/.
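As a rough illustration of the evaluation schedule and output layout described above, the helpers below decide when to evaluate and where a run's results would go. Both function names are hypothetical, and skipping evaluation at step 0 is an assumption; Main.py may handle these details differently.

```python
from pathlib import Path

EVAL_INTERVAL = 5_000   # evaluate every 5,000 environment steps
EVAL_EPISODES = 10      # episodes averaged per evaluation


def is_eval_step(step: int, interval: int = EVAL_INTERVAL) -> bool:
    """True on steps 5000, 10000, ... (assumes no evaluation at step 0)."""
    return step > 0 and step % interval == 0


def results_dir(env: str, algorithm: str, seed: int,
                root: str = "./data") -> Path:
    """Build the ./data/{env}/{algorithm}/seed{n}/ path for one run."""
    return Path(root) / env / algorithm / f"seed{seed}"


if __name__ == "__main__":
    print(results_dir("Ant-v4", "TDS", 0))
    print([s for s in range(0, 20_001, 2_500) if is_eval_step(s)])
```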
| Parameter | Value |
|---|---|
| Actor Learning Rate | 3e-4 |
| Critic Learning Rate | 3e-4 |
| Discount Factor | 0.99 |
| Soft Update Rate | 0.005 |
| Batch Size | 100 |
| Replay Buffer Size | 1,000,000 |
| Hidden Layers | 256 × 256 |
| Actor Update Interval | Every 2 critic steps |
| Random Exploration Steps | 10,000 |
| Target Policy Noise | 0.2 × max_action |
| Noise Clipping | ±0.5 × max_action |
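The hyperparameters in the table can be collected into a single configuration object, shown here with a sketch of how the target policy noise and noise clipping values might be applied. The `TDSConfig` class and `smoothed_target_action` function are hypothetical names. The sketch assumes TD3-style target smoothing applied to the target actor's mean, which the table suggests but does not spell out.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TDSConfig:
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    gamma: float = 0.99                 # discount factor
    tau: float = 0.005                  # soft update rate
    batch_size: int = 100
    buffer_size: int = 1_000_000
    hidden_sizes: Tuple[int, int] = (256, 256)
    actor_update_interval: int = 2      # critic steps per actor step
    random_exploration_steps: int = 10_000
    target_policy_noise: float = 0.2    # multiplied by max_action
    noise_clip: float = 0.5             # multiplied by max_action


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def smoothed_target_action(mu: List[float], max_action: float,
                           cfg: TDSConfig, rng: random.Random) -> List[float]:
    """Add clipped Gaussian noise to a target action, then clip to bounds."""
    noise_std = cfg.target_policy_noise * max_action
    limit = cfg.noise_clip * max_action
    out = []
    for m in mu:
        noise = clip(rng.gauss(0.0, noise_std), -limit, limit)
        out.append(clip(m + noise, -max_action, max_action))
    return out


if __name__ == "__main__":
    cfg = TDSConfig()
    print(smoothed_target_action([0.9, -0.9, 0.0], 1.0, cfg, random.Random(0)))
```

Scaling both the noise and its clip range by `max_action` keeps the smoothing proportional to each environment's action bounds.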
├── Actor.py # Stochastic actor network (μ + σ heads)
├── Critic.py # Twin Q-network architecture
├── Agent.py # TDS agent — action selection, learning, target updates
├── Replay_Buffer # Standard experience replay
├── Main.py # Training loop — Ant-v4 with evaluation
├── Requirements.py # Dependency list
├── TDS_solving_gymnasium_Ant_v4.ipynb # Interactive notebook — full TDS training run
├── TDS.png # Architecture diagram
├── TDS_policy_architecture.png # Policy architecture diagram
├── Benchmarks/
│ ├── DDPG_ANT_V3.ipynb # DDPG baseline comparison
│ ├── SAC_ANT_V3.ipynb # SAC baseline comparison
│ └── TD3_ANT_V3.ipynb # TD3 baseline comparison
└── Plots/ # All learning curves + benchmark tables
| Environment | Domain | Action Dim | Observation Dim |
|---|---|---|---|
| Ant-v4 | MuJoCo | 8 | 27 |
| Humanoid-v4 | MuJoCo | 17 | 376 |
| Walker2d-v4 | MuJoCo | 6 | 17 |
| Hopper-v4 | MuJoCo | 3 | 11 |
| BipedalWalker-v3 | Box2D | 4 | 24 |
| LunarLander-v2 | Box2D | 2 | 8 |
@article{asadolahi2023tds,
title={TDS: A Novel Stochastic Off-Policy Actor-Critic Algorithm
for Continuous Reinforcement Learning},
author={Asadolahi, Mohammad},
journal={Research Square (Preprint)},
year={2023},
url={https://www.researchsquare.com/article/rs-3041837/v1}
}
Questions? Feature requests? Open an Issue. Contributions and discussions are welcome.
This is a research project. The author assumes no liability for deployment in production environments.


