JiwaniZakir left a comment
The Muon-specific flags (`--train_muon_adjust_lr_fn`, `--train_muon_nesterov`, `--train_muon_ns_steps`) have no apparent validation that they are only meaningful when `--train_optim muon` is selected; a user who accidentally sets `--train_muon_ns_steps 10` while using Adam gets a silent no-op with no warning. A guard in the option parsing or in `register_optimizer` that raises or warns when a Muon-specific option is set alongside a non-Muon optimizer would prevent this confusion, e.g. something like the sketch below.
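A minimal sketch of such a guard, assuming an argparse-style `opt` namespace; the helper name `check_muon_options` is hypothetical and the defaults are taken from the docs as written, so adjust to the actual registered defaults:

```python
import warnings

# Muon-specific options and their (assumed) registered defaults.
MUON_ONLY_OPTIONS = {
    "train_muon_adjust_lr_fn": None,
    "train_muon_nesterov": True,
    "train_muon_ns_steps": 5,
}

def check_muon_options(opt):
    """Warn when a Muon-specific flag is set but another optimizer is active."""
    if opt.train_optim == "muon":
        return
    for name, default in MUON_ONLY_OPTIONS.items():
        if getattr(opt, name, default) != default:
            warnings.warn(
                f"--{name} is set but --train_optim is '{opt.train_optim}'; "
                "the option is ignored unless --train_optim muon is selected."
            )
```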
The switch from `self.netG_A.parameters()` to `self.get_named_parameters(("G_A", self.netG_A))` in `b2b_model.py` is the right approach for Muon, since it needs to distinguish 2D parameters (eligible for Newton-Schulz orthogonalization) from 1D parameters (biases, norms) that fall back to AdamW. The diff is truncated, though, so it is not possible to verify whether `build_named_parameters` in `util/optimizer_factory.py` actually implements this split correctly; if 1D parameters such as LayerNorm weights are accidentally routed through the Nesterov/orthogonalization path, training will silently diverge. The split I would expect is sketched below.
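For reference, a hedged sketch of the split that `build_named_parameters` would need to perform; the real helper may differ since the diff is truncated, and `split_params_for_muon` is an illustrative name:

```python
def split_params_for_muon(named_parameters):
    """Partition PyTorch named parameters into Muon- and AdamW-eligible groups."""
    muon_params, adamw_params = [], []
    for name, p in named_parameters:
        if not p.requires_grad:
            continue
        # Weight matrices (ndim >= 2) can go through Newton-Schulz
        # orthogonalization; 1D tensors (biases, LayerNorm/BatchNorm
        # weights) must fall back to AdamW.
        if p.ndim >= 2:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params
```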
The default `--train_muon_ns_steps 5` is reasonable, but `--train_muon_nesterov` defaulting to True is unusual for a flag-type option: in most argparse setups a flag implies `store_true` with a default of False, so documenting it as defaulting to True in both `options.md` and `options.rst` may be misleading or incorrect, depending on how the option is actually registered (see the sketch below).
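A minimal sketch of the ambiguity, using plain argparse (the project's own option-registration machinery may differ; the flag name is reused only for illustration):

```python
import argparse

parser = argparse.ArgumentParser()

# Flag style: absent -> False, `--train_muon_nesterov` -> True.
# With this registration, documenting a default of True would be incorrect.
# parser.add_argument("--train_muon_nesterov", action="store_true")

# Value-taking style: absent -> True, and it can be disabled explicitly
# with `--train_muon_nesterov False`. Only this style supports a True default.
parser.add_argument(
    "--train_muon_nesterov",
    type=lambda s: s.lower() in ("true", "1", "yes"),
    default=True,
)

opt = parser.parse_args(["--train_muon_nesterov", "False"])
assert opt.train_muon_nesterov is False
```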