Reinforcement Learning
Reinforcement learning plays a vital role in the era of large language models, particularly in reasoning and alignment. From RLHF to Chain-of-Thought reasoning, RL provides key technical foundations for enhancing LLM capabilities.
Recommended Learning Resources
Prof. Shiyu Zhao's RL Course (Westlake University)
Highlights: Builds the mathematical foundations of reinforcement learning from scratch up to a deep understanding
Resources:
- Book & Slides: GitHub repository
- Video lectures: Full course on Bilibili
- Why this course: Rigorous math derivations with a solid theoretical foundation
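For orientation before starting the course: its derivations are organized around the Bellman equations. For reference, the standard state-value form under a policy π is:

```latex
% Bellman expectation equation for the state value of policy \pi
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \,\bigl[ r + \gamma \, v_\pi(s') \bigr]
```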
RethinkFun RL Series
Recommended creator: @RethinkFun
Core videos:
- RL fundamentals explained — plain-language explanations with illustrated principles and formula derivations
- PPO and GRPO algorithm walkthroughs — illustrated algorithm internals
Beginner Tutorials
- Minimal Introduction to Reinforcement Learning — an intuitive walkthrough of MDP, DP/MC/TD, Q-learning, policy gradients, and PPO
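The DP step of that tutorial can be seen in a few lines: value iteration repeatedly applies the Bellman optimality backup until the values stop changing. The two-state MDP below is invented purely for illustration.

```python
# Value iteration on a made-up 2-state MDP.
# States: 0, 1; actions: "stay", "go". Transitions are deterministic here.
GAMMA = 0.9

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a E[r + gamma * V(s')]
            q = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V = value_iteration()  # state 1 should keep collecting reward 2 forever
```

Here the optimal policy stays in state 1 (value 2 / (1 − 0.9) = 20) and moves there from state 0 (value 1 + 0.9 × 20 = 19).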
RL Papers and Projects (Pending Review)
📊 Paper shortlist: see the full list of RL papers pending review
Status categories: Not Started, Evaluating, Completed, Not Recommended, Not Open-Sourced
GRPO Reproduction References
TRL Framework
- Project: TRL (Transformer Reinforcement Learning)
- Highlights: HuggingFace's official RL framework
- Supported algorithms: GRPO, PPO, DPO, and more
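As a quick orientation to what GRPO computes, the sketch below shows its group-relative advantage: each sampled completion's reward is normalized against the other completions drawn for the same prompt. The function name, `eps`, and reward values are illustrative, not TRL's API.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: A_i = (r_i - mean(group)) / std(group)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, scored by a reward model:
advs = group_advantages([1.0, 0.5, 0.0, 0.5])
```

Because the baseline is the group mean rather than a learned value function, GRPO avoids training a separate critic.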
Chain-of-Thought (CoT)
Core Concept
Chain-of-Thought reasoning is a key technique for exposing an LLM's reasoning process, improving both interpretability and reasoning capability.
Notable Papers and Projects
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
- Highlights: A technique for tuning Chain-of-Thought reasoning with compressible length
- Source: HuggingFace Daily Papers
MCoT (Multi-Chain-of-Thought)
- Project: Awesome-MCoT
- Highlights: Multi-chain reasoning that improves performance on complex reasoning tasks
Latent CoT
- Project: Awesome-Latent-CoT
- Core idea: Move reasoning from linguistic symbols into the latent space to capture richer and more complex thought processes
Multimodal CoT
Chain-of-Thought reasoning that combines visual and textual information — showing strong capability on multimodal tasks.
Survey on Latent-Space Reasoning
- Key survey: HIT's first survey on latent-space reasoning
- Core argument: Reshapes the boundaries of LLM reasoning by exploring reasoning mechanisms in the latent space
DeepSeek-R1 Deep Dive
DeepSeek-R1 stands out for its reasoning capabilities, and its technical details are worth studying in depth:
- Reasoning mechanism design
- Training strategy analysis
- Performance evaluation methodology
Suggested Learning Path
Foundations
- Math prerequisites: probability, dynamic programming, optimization theory
- Core concepts: MDP, value functions, policies, returns
- Classical algorithms: Q-learning, policy gradients, Actor-Critic
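To make the classical-algorithm step concrete, here is a minimal tabular Q-learning sketch. The corridor environment, rewards, and hyperparameters are all made up for illustration.

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..3, actions -1/+1,
# reward 1 for reaching the terminal goal state 3.
random.seed(0)
N_STATES, GOAL = 4, 3
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
ACTIONS = (-1, 1)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)  # clip to the corridor
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else GAMMA * max(Q[(s2, x)] for x in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # TD update
        s = s2

# Greedy policy after training: move right toward the goal from every state.
policy = {s: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES - 1)}
```

The TD update line is the whole algorithm; everything else is environment plumbing.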
Intermediate
- Modern algorithms: PPO, TRPO, SAC, TD3
- LLM applications: RLHF, Constitutional AI
- Chain-of-Thought techniques: CoT, MCoT, Latent CoT
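Of the modern algorithms above, PPO's central trick, the clipped surrogate objective, fits in a few lines. The sketch below computes the per-sample loss; the log-probabilities and advantage are illustrative numbers.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Negative clipped surrogate: -min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)

# A large positive update gets clipped at 1 + eps, limiting the policy step:
loss = ppo_clip_loss(logp_new=0.5, logp_old=0.0, advantage=1.0)  # -> -1.2
```

The clip removes the incentive to move the policy ratio outside [1 − ε, 1 + ε], which is what makes PPO stable without TRPO's constrained optimization.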
Practice
- Frameworks: OpenAI Gym, Stable Baselines3, TRL
- Hands-on projects: game AI, dialogue system optimization
- Paper reproduction: reproducing and improving upon key algorithms
Application Areas
LLM Alignment
- RLHF: Learning from human feedback to improve output quality
- Constitutional AI: Principle-based approach to AI alignment
- DPO: Direct Preference Optimization — a simplified alternative to RLHF
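DPO's simplification is visible in its loss: it needs only log-probabilities of the chosen and rejected completions under the policy and a frozen reference model, with no reward model or RL loop. A minimal sketch (all numbers invented):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Chosen completion gained log-prob vs. the reference, rejected one lost it,
# so the margin is positive and the loss is below -log(0.5):
loss = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
```

beta controls how strongly the policy is pulled away from the reference; at zero margin the loss is −log(0.5) ≈ 0.693.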
Reasoning Capability
- Chain-of-Thought reasoning: Improves performance on complex reasoning tasks
- Tool use: Training models to invoke external tools
- Code generation: Improves programming capability
Multi-Agent Systems
- Cooperative learning: Multiple agents solve problems together
- Competitive learning: Individual capabilities improve through competition
- Social learning: Agents learn from the behavior of other agents
Frontier Trends
- Offline RL: Learning policies from static datasets
- Meta-learning: Algorithms that adapt quickly to new tasks
- Safe RL: Ensuring the safety of both the learning process and the learned policy
- Explainable RL: Improving the interpretability of decision-making
Additional Notes
- Chain-of-Thought / Multi-step CoT (MCoT) / Latent CoT
- GRPO learning resources:
  - Bilibili playlist: https://space.bilibili.com/18235884/search?keyword=GRPO
  - PPO/GRPO algorithm explainer: https://www.bilibili.com/video/BV15cZYYvEhz/
  - Paper and resource collection: https://github.com/yaotingwangofficial/Awesome-MCoT
- RL math foundations textbook: GitHub https://github.com/MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning
Involution Hell © 2026 by Community, under CC BY-NC-SA 4.0