unsloth-grpo
unsloth-grpo trains reasoning models with Group Relative Policy Optimization (GRPO). The technique replaces PPO's separate reward and value models with group statistics, achieving 8x m
Also installable via the skills CLI:
npx skills add cuba6112/skillfactory/skills/unsloth-grpo
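
The "group statistics" in the description can be illustrated with a minimal sketch of the group-relative advantage that GRPO is built around: several completions are sampled for one prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group instead of a learned value estimate. This is not Unsloth's implementation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's statistics.

    Hypothetical helper: `rewards` holds one scalar score per completion
    sampled for the same prompt. `eps` (an assumed constant) keeps the
    division finite when every completion gets the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative scores for four completions of one prompt,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical scores carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Completions that score above their group's mean get a positive advantage and are reinforced, while those below the mean are pushed down, so no separate value network is needed to supply a baseline.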