unsloth-dpo
Direct Preference Optimization (DPO) in Unsloth aligns models with human preferences using paired data (a chosen and a rejected response for each prompt). Unsloth optimizes this process by allowing ref_model=None, sig
Also installable via skills CLI
npx skills add cuba6112/skillfactory/skills/unsloth-dpo
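
To make the chosen/rejected idea in the description concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It is not Unsloth's or TRL's implementation. The function name dpo_loss, the example record, and the log-probability values are all hypothetical, and beta=0.1 is just an assumed illustrative setting. The reference log-probabilities are passed in as plain numbers; as far as we understand, with ref_model=None and a LoRA adapter the trainer obtains them from the base weights with the adapter disabled instead of loading a second model copy.

```python
import math

# One preference record in the paired format (hypothetical example data).
example_pair = {
    "prompt": "Explain what DPO is in one sentence.",
    "chosen": "DPO fine-tunes a model directly on preference pairs.",
    "rejected": "I don't know.",
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    Each argument is the summed log-probability of a full response given
    the prompt, under either the policy being trained or the frozen reference.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    x = beta * (policy_margin - ref_margin)
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities, for illustration only.
# The policy matches the reference, so the loss sits at its starting value.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# The policy prefers the chosen answer more strongly than the reference does.
loss_improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
# The policy prefers the rejected answer relative to the reference.
loss_worse = dpo_loss(-16.0, -11.0, -12.0, -15.0)
```

The loss falls when the policy widens the gap between chosen and rejected responses beyond the reference model's gap, and rises when it narrows or inverts that gap. That is the signal a DPO trainer optimizes over a dataset of such pairs.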