:============================================================================:
 ╱  │││││ │  │  │││││││││ ││││││││││ │ ││ │││││││╱│ ││││││ │││  ││││││││ ││ ││
 │  ││╱ │◆│  │ •│││╲│││││ ╱│││││╱│││ │ ││ ││││╲││╱╲ ││││││ │╱│  ││││││││ ││ ││
 │◆ ╱╱│ │ │ •│◆ ╲│││╲││││◆││╱││╱│╲╲│ │ │╲ │││╱││╱╱│ │╱││╱╱•╲│╱ ◆│╲││││││ ││ │╲
 │╲╱╱╲│╱╱ │  ╲ •││││╱╱│││╱╱╲╲│╱╲│╱││╲│•││╲││╱│╱╱│╱│╱││││││╲╱╱╲◆ ╱╲││││││ ││ ││
 │╱╱│╲│╱│╱│◆•│╲◆│││╲│╱││╱╱││╲│╱│╱│╲│╲│ ││╲╱╱│╱╲╱╱││╱│_ - \.╱╱│╲╱╱╲│╱│╱╱╱•││╲││
 │╱╱│╲╱╲╱╱│ ╱│╲╲╱││╱│╱│╲╱╱╱╲╱╱│╲╱╱│╱╱│╱╱│╲╱╱│╱╲╱│││╲││╱╱╲╱╱│╱╲╱╲││╱│╱╱╱│╱││╲╱│
\  │││╱_ - \.╲╲╲││╲│╲╲╱╱│╲│╱╱╱│╲│╲╲││││╲╲│╲│╱ ╲╲│╲╱╱╲╲╲╲ │  ╱╱ ╱│╱  │╱╲╲╱│╲╲╱│
 │╲││╱╲╱•╱│╱ ╱╲╲│╲│╱│╱││╲╱╲│╱╲ │││││││ ╲╲╱││╲││╲╲╲╲╲╲╲││╲╱│╱╱╱││   ││╲│╱╲││││╱
 ╲╱│╱•╱╲ ││◆ │ ╲╲││╱╱││││╱ ╲││•│╲│││││◆◆│╲││╱╱││╲╲╲ ╲│╲╲╲││╱╱╱││• ◆││╲│•│╲│││╱
 │╱│╱ ││ ╱│  │ ╲╲╲╱│╱╲│╱│◆ ╱╲╲ │╲│││╱│  ││╲│◆││││╲╲ │││╲│││╱│•││   ╱│╲│ ││││││
 ││││◆││•╱│  │  ╲│╱││╲│╱│  ││╲◆│││││││  ││╲│ ╱│││◆╲╲│╲│╲╲╱││╱ ││   ││││•││││││
 ││││ ││ ││  │  │││││ │││  ││╲ ││╱╱│││  ││││ ││││ ╲ ╲││││││││ ││   ││││ ││││││
:============================================================================:

        
                                                                               
                 e88~-_  888          e    e           888~-_                  
                d888   \ 888         d8b  d8b          888   \                 
                8888     888        d888bdY88b         888    |                
                8888     888       / Y88Y Y888b        888   /                 
                Y888   / 888      /   YY   Y888b       888_-~                  
                 "88_-~  888____ /          Y888b      888 ~-_                 
                                                  ----          

                                    crumb              ( she | xe | fae | it )
                                  02/26/2026          

I have two tasks that share a model, which I'm training here with rslora; for
now we can call them the generator and the discriminator. The generator task
receives a prefix, a short snippet of text for which it is asked to generate 
a completion after some deliberation. Outputs from the generator task should 
correspond to high reward when the discriminator assigns a high probability 
that the output came from the ground truth set of completions (and when the 
completion is of desired length). The discriminator task is implemented as a 
layer normalization and linear head on top of the last hidden states of the 
underlying model, trained with a cross-entropy loss to output class scores 
for generated vs. ground-truth. I use an exponential moving average 
of rewards as a baseline for simplicity.
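For concreteness, a minimal numpy sketch of those pieces: the layer-norm + 
linear discriminator head, its cross-entropy loss, and the EMA reward 
baseline. Hidden size, init scale, and the decay constant are illustrative 
assumptions, not values from my runs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class DiscriminatorHead:
    """LayerNorm + linear on the last hidden states -> 2 class logits
    (generated vs. ground-truth)."""
    def __init__(self, hidden_size, rng):
        self.W = rng.normal(0.0, 0.02, size=(hidden_size, 2))  # init scale assumed
        self.b = np.zeros(2)

    def __call__(self, h_last):              # h_last: (batch, hidden_size)
        return layer_norm(h_last) @ self.W + self.b

def cross_entropy(logits, labels):
    """Mean CE over the batch; labels are 0 (generated) / 1 (ground-truth)."""
    z = logits - logits.max(-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

class EMABaseline:
    """Exponential moving average of rewards, subtracted as the baseline."""
    def __init__(self, decay=0.99):          # decay is an assumption
        self.decay, self.value = decay, 0.0

    def update(self, mean_reward):
        self.value = self.decay * self.value + (1 - self.decay) * mean_reward
        return self.value
```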


       =================================================================
       | reasoning tokens | completion tokens | kl coeff | model tuned |
       | :--------------: | :---------------: | :------: | :---------: |
       | 256              | 256               | 0.05     | qwen3-8b    |
     ======================================================================
     | grad clip  | rank | alpha | bs  | optim | beta1 | beta2 | schedule |
     | :--------: | :--: | :---: | :-: | :---: | :---: | :---: | :------: |
     | 1.0        | 64   | 8     | 4   | adam  | 0.95  | 0.95  | linear   |
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                | id  | steps | g-lr | d-lr | d-scale | mauve |
                | :-: | :---: | :--: | :--: | :-----: | :---: |
                | 00  | 800   | 2e-5 | 1e-4 | 10.0    | 0.119 |
                | 01  | 800   | 2e-5 | 5e-6 | 0.1     | 0.073 |
                | 02  | 800   | 5e-6 | 1e-4 | 0.1     | 0.072 |
                | 03  | 800   | 5e-6 | 5e-6 | 10.0    | 0.075 |
                | ---------------------------------------------
                | 04  | 800   | 1e-5 | 2e-5 | 1.0     | 0.090 |
                | ---------------------------------------------
                | 05  | 1600  | 3e-5 | 2e-4 | 30.0    | 0.375 |

I calculate the final run's hyperparameters using steepest ascent in log-space
with standard orthogonal coding, then round to the nearest "clean" values (so
it's pretty). The step size is normalized by the upper-bound interval to
handle the minor asymmetry in `d-lr` (0.6 decades below center vs. 0.7 above).
I probably should have seen it coming that "just make them bigger" was going
to do better.
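The calculation looks roughly like this. A sketch: the ±1 coding of runs
00-03 is read off the table (it's a half-fraction, so `d-scale` is aliased
with the `g-lr`×`d-lr` interaction), but the exact step length is a made-up
illustration value.

```python
import numpy as np

# Coded design for runs 00-03 (+1 = high level, -1 = low), columns in the
# order (g-lr, d-lr, d-scale); y holds the resulting mauve scores.
X = np.array([[+1, +1, +1],    # run 00: 2e-5, 1e-4, 10.0
              [+1, -1, -1],    # run 01: 2e-5, 5e-6,  0.1
              [-1, +1, -1],    # run 02: 5e-6, 1e-4,  0.1
              [-1, -1, +1]])   # run 03: 5e-6, 5e-6, 10.0
y = np.array([0.119, 0.073, 0.072, 0.075])

# With orthogonal coding X.T @ X = 4*I, so least squares is just X.T @ y / 4.
beta = X.T @ y / len(y)        # gradient estimate in coded units

# Center point (run 04) in log10, and per-factor half-widths in decades;
# d-lr uses the upper-bound interval (0.7 decades), per the note above.
center = np.log10([1e-5, 2e-5, 1.0])
width = np.array([np.log10(2e-5) - np.log10(1e-5),   # 0.30
                  np.log10(1e-4) - np.log10(2e-5),   # 0.70
                  np.log10(10.0)])                   # 1.00

# Step along the ascent direction, scaled so the largest factor moves
# `step` decades; the step length is an assumption for illustration.
step = 1.5
move = beta * width
proposal = 10 ** (center + step * move / np.abs(move).max())
```

With these choices the proposal comes out around (2.6e-5, 1.7e-4, 32), which
rounds to run 05's clean values.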

"MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the 
two distributions in a quantized embedding space of a foundation model." I 
leave c at 5 and use num_buckets = (2000//10) = 200, measuring on the last 
2000 sequences generated (the last 500 steps at batch size 4... yes, not 
ideal, but I'm rushing this!); all other settings are left at the defaults in:
https://krishnap25.github.io/mauve/
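As a toy illustration of that quote (not the real library, which quantizes
foundation-model embeddings with k-means and builds the curve more carefully),
the divergence-curve computation over two already-quantized histograms:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions over the same buckets;
    buckets with p_i = 0 contribute nothing."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mauve_from_histograms(p, q, c=5, n_mix=99):
    """Trace the divergence curve of mixtures r = lam*p + (1-lam)*q and
    return the area under it; (1,0) and (0,1) close the curve at the ends."""
    pts = [(1.0, 0.0)]
    for lam in np.linspace(1e-6, 1 - 1e-6, n_mix):
        r = lam * p + (1 - lam) * q
        pts.append((np.exp(-c * kl(q, r)), np.exp(-c * kl(p, r))))
    pts.append((0.0, 1.0))
    xs, ys = np.array(pts).T
    return float(abs(np.trapz(ys, xs)))
```

Identical histograms score 1; disjoint ones score near 0, and c=5 controls
how sharply the curve hugs the axes in between.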

Data is taken from the Pile, which is first filtered for texts with a length 
greater than 2x the completion tokens, then random windows are selected. The 
last `completion tokens` tokens from the random window are taken as the 
ground truth completion and the remaining tokens are used as a prefix which 
goes into the prompt:


           <|im_start|>system
           You act as a causal model of language.
           Example:
           """
           user:
           <prefix>
           The user will present a prefix they wish for you to continue, 
           then you'll start at the last few words
           </prefix>
           assistant:
           ...last few words, then continuation starts, 
           and you just keep generating, without any other commentary.
           """
           <|im_start|>user
           <prefix>
           {text}
           </prefix>
           <|im_start|>assistant
           <think>
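
The windowing step above can be sketched as follows, assuming the window
length equals 2x the completion tokens (which is what the length filter
suggests) and that documents arrive pre-tokenized:

```python
import random

def make_example(tokens, completion_len=256, rng=random):
    """Cut a random window from a document and split it into
    (prefix, ground-truth completion). Assumes the document already
    passed the length filter: len(tokens) > 2 * completion_len."""
    window = 2 * completion_len              # window size is an assumption
    start = rng.randrange(len(tokens) - window + 1)
    chunk = tokens[start:start + window]
    return chunk[:-completion_len], chunk[-completion_len:]
```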


Before the discriminator was a linear head, I had a reasoning task for it. In
that setup the best performing (increasing mauve the most) generator reward
was based on a value function baseline for the discriminator. The reward took 
inspiration from absolute zero reasoner (analogs: proposer=generator, 
solver=discriminator) where the reward for the proposer is (1 - solver pass-
rate) if the pass-rate is above zero, otherwise zero. I don't have the compute
to run the discriminator multiple times for each sequence but the pass-rate is
~essentially the baseline for the solver, so I substituted it with my value
function giving g_reward = ((1 - d_value) if d_value > t else 0), where t is
used to simulate group dynamics: with 8 attempts, if fewer than 1 in 8 get
it, the pass-rate would be zero, so I set t to 1/8. A lot of the time the
discriminator task would stop using reasoning. My guess for why is that the
value function would give the model such a useful estimate for how realistic a
sample was that it just tied that directly to its answer. Need to investigate
this more for solid evidence and test using separate models for separate tasks
but am moving quickly.
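Written out, with d_value standing in for the scalar value-function estimate:

```python
def generator_reward(d_value, t=1 / 8):
    """Proposer-style reward: (1 - d_value) when the value estimate clears
    the simulated pass-rate threshold t, else zero."""
    return (1.0 - d_value) if d_value > t else 0.0
```

So samples the discriminator finds trivially fake (d_value at or below 1/8)
get nothing, and reward shrinks again as d_value approaches 1.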

All of the rollouts generated here will be used to create a synthetic data 
pipeline where ground truth completions are given plausible reasoning traces,
which I'll do a finetune on before running back through the trainer.


:============================================================================:
 ╱  │││││ │  │  │││││││││ ││││││││││ │ ││ │││││││╱│ ││││││ │││  ││││││││ ││ ││
 │  ││╱ │◆│  │ •│││╲│││││ ╱│││││╱│││ │ ││ ││││╲││╱╲ ││││││ │╱│  ││││││││ ││ ││
 │◆ ╱╱│ │ │ •│◆ ╲│││╲││││◆││╱││╱│╲╲│ │ │╲ │││╱││╱╱│ │╱││╱╱•╲│╱ ◆│╲││││││ ││ │╲
 │╲╱╱╲│╱╱ │  ╲ •││││╱╱│││╱╱╲╲│╱╲│╱││╲│•││╲││╱│╱╱│╱│╱││││││╲╱╱╲◆ ╱╲││││││ ││ ││
 │╱╱│╲│_ - \.╱│╱│◆•│╲◆│││╲│╱││╱╱││╲│╱│╱│╲│╲│ ││╲╱╱│╱╲╱╱││╱│╱╱│╲╱╱╲│╱│╱╱╱•││╲││
 │╱╱│╲╱╲╱╱│ ╱│╲╲╱││╱│╱│╲╱╱╱╲╱╱│╲╱╱│╱╱│╱╱│╲╱╱│╱╲╱│││╲││╱╱╲╱╱│╱╲╱╲││╱│╱╱╱│╱││╲╱│
\  │││╱╲╲╲││╲│╲╲╱╱│╲│╱╱╱│╲│╲╲││││╲╲│╲│╱ ╲╲│╲╱╱╲╲╲╲ │_ - \.  ╱╱ ╱│╱  │╱╲╲╱│╲╲╱│
 │╲││╱╲╱•╱│╱ ╱╲╲│╲│╱│╱││╲╱╲│╱╲ │││││││ ╲╲╱││╲││╲╲╲╲╲╲╲││╲╱│╱╱╱││   ││╲│╱╲││││╱
 ╲╱│╱•╱╲ ││◆ │ ╲╲││╱╱││││╱ ╲││•│╲│││││◆◆│╲││╱╱││╲╲╲ ╲│╲╲╲││╱╱╱││• ◆││╲│•│╲│││╱
 │╱│╱ ││ ╱│  │ ╲╲╲╱│╱╲│╱│◆ ╱╲╲ │╲│││╱│  ││╲│◆││││╲╲ │││╲│││╱│•││   ╱│╲│ ││││││
 ││││◆││•╱│  │  ╲│╱││╲│╱│  ││╲◆│││││││  ││╲│ ╱│││◆╲╲│╲│╲╲╱││╱ ││   ││││•││││││
 ││││ ││ ││  │  │││││ │││  ││╲ ││╱╱│││  ││││ ││││ ╲ ╲││││││││ ││   ││││ ││││││
:============================================================================: