:============================================================================:
╱ │││││ │ │ │││││││││ ││││││││││ │ ││ │││││││╱│ ││││││ │││ ││││││││ ││ ││
│ ││╱ │◆│ │ •│││╲│││││ ╱│││││╱│││ │ ││ ││││╲││╱╲ ││││││ │╱│ ││││││││ ││ ││
│◆ ╱╱│ │ │ •│◆ ╲│││╲││││◆││╱││╱│╲╲│ │ │╲ │││╱││╱╱│ │╱││╱╱•╲│╱ ◆│╲││││││ ││ │╲
│╲╱╱╲│╱╱ │ ╲ •││││╱╱│││╱╱╲╲│╱╲│╱││╲│•││╲││╱│╱╱│╱│╱││││││╲╱╱╲◆ ╱╲││││││ ││ ││
│╱╱│╲│╱│╱│◆•│╲◆│││╲│╱││╱╱││╲│╱│╱│╲│╲│ ││╲╱╱│╱╲╱╱││╱│_ - \.╱╱│╲╱╱╲│╱│╱╱╱•││╲││
│╱╱│╲╱╲╱╱│ ╱│╲╲╱││╱│╱│╲╱╱╱╲╱╱│╲╱╱│╱╱│╱╱│╲╱╱│╱╲╱│││╲││╱╱╲╱╱│╱╲╱╲││╱│╱╱╱│╱││╲╱│
\ │││╱_ - \.╲╲╲││╲│╲╲╱╱│╲│╱╱╱│╲│╲╲││││╲╲│╲│╱ ╲╲│╲╱╱╲╲╲╲ │ ╱╱ ╱│╱ │╱╲╲╱│╲╲╱│
│╲││╱╲╱•╱│╱ ╱╲╲│╲│╱│╱││╲╱╲│╱╲ │││││││ ╲╲╱││╲││╲╲╲╲╲╲╲││╲╱│╱╱╱││ ││╲│╱╲││││╱
╲╱│╱•╱╲ ││◆ │ ╲╲││╱╱││││╱ ╲││•│╲│││││◆◆│╲││╱╱││╲╲╲ ╲│╲╲╲││╱╱╱││• ◆││╲│•│╲│││╱
│╱│╱ ││ ╱│ │ ╲╲╲╱│╱╲│╱│◆ ╱╲╲ │╲│││╱│ ││╲│◆││││╲╲ │││╲│││╱│•││ ╱│╲│ ││││││
││││◆││•╱│ │ ╲│╱││╲│╱│ ││╲◆│││││││ ││╲│ ╱│││◆╲╲│╲│╲╲╱││╱ ││ ││││•││││││
││││ ││ ││ │ │││││ │││ ││╲ ││╱╱│││ ││││ ││││ ╲ ╲││││││││ ││ ││││ ││││││
:============================================================================:
e88~-_ 888 e e 888~-_
d888 \ 888 d8b d8b 888 \
8888 888 d888bdY88b 888 |
8888 888 / Y88Y Y888b 888 /
Y888 / 888 / YY Y888b 888_-~
"88_-~ 888____ / Y888b 888 ~-_
----
crumb ( she | xe | fae | it )
02/26/2026
I have two tasks that share a single model, which I'm training here with
rsLoRA; for now, call them the generator and the discriminator. The generator
task receives a prefix, a short snippet of text, and is asked to generate a
completion after some deliberation. Generator outputs earn high reward when
the discriminator assigns a high probability that the output came from the
ground-truth set of completions (and when the completion is of the desired
length). The discriminator task is implemented as a layer normalization and
linear head on top of the underlying model's last hidden states, trained with
a cross-entropy loss to output class scores for generated vs. ground truth.
For simplicity, I use an exponential moving average of rewards as a baseline.
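Concretely, the head and baseline are small; a sketch of how I'd write them
down (PyTorch, names mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminatorHead(nn.Module):
    """LayerNorm + linear head over the backbone's last hidden states,
    producing two class scores: generated vs. ground truth."""

    def __init__(self, hidden_size: int, num_classes: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size), e.g. the final token's state
        return self.head(self.norm(last_hidden))

class EmaBaseline:
    """Exponential moving average of rewards, used as the policy baseline."""

    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.value = None

    def update(self, reward: float) -> float:
        if self.value is None:
            self.value = reward  # initialize on the first reward
        else:
            self.value = self.decay * self.value + (1 - self.decay) * reward
        return self.value

# discriminator loss, with labels 0 = generated, 1 = ground truth:
#   loss = F.cross_entropy(head(last_hidden), labels)
```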
=================================================================
| reasoning tokens | completion tokens | kl coeff | model tuned |
| :--------------: | :---------------: | :------: | :---------: |
| 256 | 256 | 0.05 | qwen3-8b |
======================================================================
| grad clip | rank | alpha | bs | optim | beta1 | beta2 | schedule |
| :--------: | :--: | :---: | :-: | :---: | :---: | :---: | :------: |
| 1.0 | 64 | 8 | 4 | adam | 0.95 | 0.95 | linear |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| id | steps | g-lr | d-lr | d-scale | mauve |
| :-: | :---: | :--: | :--: | :-----: | :---: |
| 00 | 800 | 2e-5 | 1e-4 | 10.0 | 0.119 |
| 01 | 800 | 2e-5 | 5e-6 | 0.1 | 0.073 |
| 02 | 800 | 5e-6 | 1e-4 | 0.1 | 0.072 |
| 03 | 800 | 5e-6 | 5e-6 | 10.0 | 0.075 |
| ---------------------------------------------
| 04  |  800  | 1e-5 | 2e-5 |   1.0   | 0.090 |
| ---------------------------------------------
| 05 | 1600 | 3e-5 | 2e-4 | 30.0 | 0.375 |
I calculate the final run's hyperparameters using steepest ascent in log-space
with standard orthogonal coding, then round to the nearest "clean" values (so
it's pretty). The step size is normalized by the upper bound of each interval,
to handle the minor asymmetry in `d-lr` (0.6 vs. 0.7 decades from center). I
probably should have seen it coming that "just make them bigger" was going to
do better.
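For reference, the coded-effects arithmetic. The scalar step size is the part
I'm hand-waving (here it's fitted after the fact so the proposal lands near
run 05), so treat this as a sketch, not the exact procedure:

```python
import math

# runs 00-03: coded levels (-1/+1) for g-lr, d-lr, d-scale, plus observed MAUVE
runs = [
    (+1, +1, +1, 0.119),  # 00: g=2e-5, d=1e-4, s=10.0
    (+1, -1, -1, 0.073),  # 01: g=2e-5, d=5e-6, s=0.1
    (-1, +1, -1, 0.072),  # 02: g=5e-6, d=1e-4, s=0.1
    (-1, -1, +1, 0.075),  # 03: g=5e-6, d=5e-6, s=10.0
]

# half-effects: average change in MAUVE per unit of coded factor
effects = [sum(r[i] * r[3] for r in runs) / len(runs) for i in range(3)]
# all three come out positive: steepest ascent says "make everything bigger"

# center point (run 04) and upper half-interval width, in log10 decades
center = [math.log10(1e-5), math.log10(2e-5), math.log10(1.0)]
upper = [math.log10(2e-5 / 1e-5),  # ~0.30 decades
         math.log10(1e-4 / 2e-5),  # ~0.70 decades (vs ~0.60 below center)
         math.log10(10.0 / 1.0)]   # 1.00 decade

step = 120.0  # assumed scalar step size, fitted after the fact
proposal = [10 ** (c + step * e * u) for c, e, u in zip(center, effects, upper)]
# rounding to "clean" values gives roughly g-lr=3e-5, d-lr=2e-4, d-scale=30
```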
"MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the
two distributions in a quantized embedding space of a foundation model." I
leave c at 5 and set num_buckets = 2000 // 10, measuring on the last 2000
generated sequences (corresponding to the last 500 steps... yes, not ideal,
but I'm rushing this!). All other settings are left as the defaults in:
https://krishnap25.github.io/mauve/
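The call, assuming the `mauve-text` package behind that page:

```python
# Assumes `pip install mauve-text`, the package behind the linked page.

# settings from the text: c (`mauve_scaling_factor`) is left at its default
# of 5, and num_buckets = 2000 // 10 = 200
MAUVE_SETTINGS = {
    "mauve_scaling_factor": 5,  # "c" in the MAUVE paper
    "num_buckets": 2000 // 10,  # one bucket per 10 of the last 2000 sequences
}

def score(ground_truth, generated):
    """MAUVE between the last 2000 ground-truth and generated sequences."""
    import mauve  # heavy import (pulls in a featurization model), kept local
    out = mauve.compute_mauve(
        p_text=ground_truth, q_text=generated, **MAUVE_SETTINGS
    )
    return out.mauve
```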
Data is taken from the Pile, first filtered to texts longer than 2x the
completion-token budget; random windows are then selected. The last
`completion tokens` tokens of each window become the ground-truth completion,
and the remaining tokens become the prefix, which goes into the prompt:
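A sketch of the windowing; the window length (2x the completion budget here)
and the integer-token stand-ins are placeholders of mine:

```python
import random

def make_example(token_ids, completion_tokens=256, window_len=512, rng=random):
    """Split a random window of a Pile document into (prefix, completion).

    Documents not longer than 2x the completion budget are filtered out
    (returns None). The last `completion_tokens` tokens of the window are
    the ground-truth completion; the rest become the prompt prefix.
    """
    if len(token_ids) <= 2 * completion_tokens:
        return None  # too short: filtered
    window_len = min(window_len, len(token_ids))
    start = rng.randrange(len(token_ids) - window_len + 1)
    window = token_ids[start:start + window_len]
    return window[:-completion_tokens], window[-completion_tokens:]
```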
<|im_start|>system
You act as a causal model of language.
Example:
"""
user:
<prefix>
The user will present a prefix they wish for you to continue,
then you'll start at the last few words
</prefix>
assistant:
...last few words, then continuation starts,
and you just keep generating, without any other commentary.
"""
<|im_end|>
<|im_start|>user
<prefix>
{text}
</prefix>
<|im_end|>
<|im_start|>assistant
<think>
Before the discriminator was a linear head, I had a reasoning task for it. In
that setup, the best-performing generator reward (the one that increased
MAUVE the most) was based on a value-function baseline for the discriminator.
The reward took inspiration from Absolute Zero Reasoner (analogs: proposer =
generator, solver = discriminator), where the proposer's reward is (1 -
solver pass-rate) if the pass-rate is above zero, otherwise zero. I don't
have the compute to run the discriminator multiple times for each sequence,
but the pass-rate is essentially the baseline for the solver, so I
substituted my value function, giving g_reward = ((1 - d_value) if d_value >
t else 0), where t simulates group dynamics: with 8 attempts, any pass-rate
below 1/8 would have been exactly zero, so I set t to 1/8. A lot of the time
the discriminator task would stop using reasoning. My guess for why: the
value function gave the model such a useful estimate of how realistic a
sample was that it tied that estimate directly to its answer. I need to
investigate this more for solid evidence and test using separate models for
the separate tasks, but I'm moving quickly.
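Written out, that reward was just:

```python
def generator_reward(d_value: float, t: float = 1 / 8) -> float:
    """AZR-inspired proposer reward, with the solver pass-rate replaced
    by the discriminator's value estimate `d_value`.

    t = 1/8 mimics a group of 8 attempts: any pass-rate below 1/8 would
    have been exactly zero, which zeroes the reward.
    """
    return (1.0 - d_value) if d_value > t else 0.0
```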
All of the rollouts generated here will feed a synthetic-data pipeline in
which ground-truth completions are given plausible reasoning traces; I'll
finetune on that before running back through the trainer.
:============================================================================:
╱ │││││ │ │ │││││││││ ││││││││││ │ ││ │││││││╱│ ││││││ │││ ││││││││ ││ ││
│ ││╱ │◆│ │ •│││╲│││││ ╱│││││╱│││ │ ││ ││││╲││╱╲ ││││││ │╱│ ││││││││ ││ ││
│◆ ╱╱│ │ │ •│◆ ╲│││╲││││◆││╱││╱│╲╲│ │ │╲ │││╱││╱╱│ │╱││╱╱•╲│╱ ◆│╲││││││ ││ │╲
│╲╱╱╲│╱╱ │ ╲ •││││╱╱│││╱╱╲╲│╱╲│╱││╲│•││╲││╱│╱╱│╱│╱││││││╲╱╱╲◆ ╱╲││││││ ││ ││
│╱╱│╲│_ - \.╱│╱│◆•│╲◆│││╲│╱││╱╱││╲│╱│╱│╲│╲│ ││╲╱╱│╱╲╱╱││╱│╱╱│╲╱╱╲│╱│╱╱╱•││╲││
│╱╱│╲╱╲╱╱│ ╱│╲╲╱││╱│╱│╲╱╱╱╲╱╱│╲╱╱│╱╱│╱╱│╲╱╱│╱╲╱│││╲││╱╱╲╱╱│╱╲╱╲││╱│╱╱╱│╱││╲╱│
\ │││╱╲╲╲││╲│╲╲╱╱│╲│╱╱╱│╲│╲╲││││╲╲│╲│╱ ╲╲│╲╱╱╲╲╲╲ │_ - \. ╱╱ ╱│╱ │╱╲╲╱│╲╲╱│
│╲││╱╲╱•╱│╱ ╱╲╲│╲│╱│╱││╲╱╲│╱╲ │││││││ ╲╲╱││╲││╲╲╲╲╲╲╲││╲╱│╱╱╱││ ││╲│╱╲││││╱
╲╱│╱•╱╲ ││◆ │ ╲╲││╱╱││││╱ ╲││•│╲│││││◆◆│╲││╱╱││╲╲╲ ╲│╲╲╲││╱╱╱││• ◆││╲│•│╲│││╱
│╱│╱ ││ ╱│ │ ╲╲╲╱│╱╲│╱│◆ ╱╲╲ │╲│││╱│ ││╲│◆││││╲╲ │││╲│││╱│•││ ╱│╲│ ││││││
││││◆││•╱│ │ ╲│╱││╲│╱│ ││╲◆│││││││ ││╲│ ╱│││◆╲╲│╲│╲╲╱││╱ ││ ││││•││││││
││││ ││ ││ │ │││││ │││ ││╲ ││╱╱│││ ││││ ││││ ╲ ╲││││││││ ││ ││││ ││││││
:============================================================================: