2025-09-09.

During Q1 and Q2 of 2025, I was building a model for product property extraction from textual product descriptions. With the same training data as a BERT-CRF baseline, the fine-tuned Qwen2.5-7B models failed to surpass BERT-CRF on supply-side product descriptions (which are longer and more complex) but did surpass it on demand-side product descriptions (which are shorter and more succinct).

The fine-tuning recipe is simple, as follows; a concrete sketch of the data format appears after the list.

- Method: Supervised Fine-Tuning (SFT)
- Data format:
  - Input: {Instructions (task description + multiple requirements + output format)}
  - Output: {JSON object of properties as PropName → PropValList key-value pairs}
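To make the data format concrete, here is a minimal sketch of how a single SFT record could be assembled. The field names (`instruction`, `output`), the property names, and the prompt wording are illustrative assumptions, not the exact production schema.

```python
import json

# Hypothetical property schema, for illustration only.
PROP_NAMES = ["Brand", "Material", "Color"]

def build_sft_record(product_description: str, gold_props: dict) -> dict:
    """Assemble one SFT example: instruction-style input, JSON string as output."""
    instruction = (
        "Extract product properties from the description below.\n"                 # task description
        f"Only use these property names: {', '.join(PROP_NAMES)}. "                # multiple requirements
        "If a property is absent, omit it.\n"
        "Return a JSON object mapping each property name to a list of values.\n"   # output format
        f"Description: {product_description}"
    )
    # Output is the gold annotation serialized as PropName -> [PropValue, ...]
    return {"instruction": instruction, "output": json.dumps(gold_props, ensure_ascii=False)}

if __name__ == "__main__":
    record = build_sft_record(
        "Lightweight stainless-steel water bottle, 500 ml, matte black finish.",
        {"Material": ["stainless steel"], "Color": ["matte black"]},
    )
    print(json.dumps(record, ensure_ascii=False, indent=2))
```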

Phenomenon: Overfitting to position bias

xxx