LLM Adaptation for Legal Writing Assessment
Context
About six months before Rubrics as Rewards came out, my mom, a law professor, asked me why ChatGPT couldn't provide useful feedback on her students' assignments, even with a custom prompt and a detailed evaluation rubric. This question made me curious about zero-shot generalization in data-sparse, high-precision domains. I also began to wonder how much few-shot prompting or parameter-efficient fine-tuning could push accuracy and alignment.
I got to explore that question during my capstone at Brown, supervised by Professor Stephen Bach, whose work on SFT for zero-shot task generalization continues to shape my research interests. This project used Mistral-7B-Instruct's zero-shot behavior as a baseline, then compared two few-shot adaptation paths for legal writing feedback: in-context learning via prompting and in-distribution adaptation via LoRA fine-tuning. I also separately evaluated the SoftSRV synthetic generation framework as a way to stretch a very small dataset of roughly 30 legal memoranda.
Problem
LRW (Legal Research & Writing) classes cap at 15 to 25 students, so professor time is scarce. Students who need help with fundamentals like syntax, argument structure, and citations lose precious 1-on-1 time that could go toward advancing their legal reasoning. Accurate, just-in-time feedback has the potential to meaningfully reduce this gap.
Setup
I used Mistral-7B-Instruct for both fine-tuning (quantized to 8-bit with BitsAndBytes) and few-shot prompting. My dataset consisted of 30 legal memoranda created by law school instructors based on authentic 1L student work, all responding to the same legal research prompt: whether a fictional client's conduct counted as "tumultuous" behavior under Massachusetts law. I segmented each memo into IREAC components (Issue Statement, Rule Explanation, Rule Application, Conclusion). Of the 30 samples, 15 included expert feedback and annotated rubrics for training. I used the full set for the synthetic data experiments.
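For reference, the loading step looked roughly like the sketch below. This is a minimal sketch assuming a recent Hugging Face transformers stack; the checkpoint revision and exact arguments are illustrative rather than my exact configuration.

```python
# Minimal sketch: Mistral-7B-Instruct loaded in 8-bit via bitsandbytes.
# The checkpoint revision and arguments below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed revision

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
    torch_dtype=torch.float16,
)
```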
I compared three main configurations: zero-shot as a baseline, few-shot in-context learning via 1/3/5-shot prompting, and few-shot in-distribution adaptation via LoRA fine-tuning (r=8, α=16, dropout=0.1, 100 epochs). I also evaluated a modified SoftSRV (SSMC variant) using 64 MLPs with both BERT and Legal-BERT encoders. Compute constraints shaped the design: compared to the SoftSRV paper, I cut the MLP count from 128 to 64 and the training steps from 20,000 to 10,000.
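The LoRA setup, continuing from the loading sketch above, looked roughly like the following. The target modules are an assumption on my part (a common choice for Mistral's attention projections), not necessarily the exact set I used.

```python
# Hedged sketch of the LoRA configuration (r=8, alpha=16, dropout=0.1) via PEFT.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # needed when the base model is loaded in 8-bit

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters train; the 7B base stays frozen
```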
For the SoftSRV generation runs, the prompt included a compact case packet: the Massachusetts disorderly conduct rule, a short definition of tumultuous conduct, brief summaries of four Massachusetts state cases, and a sample memo.
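Assembling that packet into a single prompt is straightforward; here is an illustrative sketch, where the function name, argument names, and template wording are hypothetical rather than my exact prompt.

```python
# Hypothetical sketch of case-packet assembly for the generation prompt.
def build_case_packet(rule_text, tumultuous_definition, case_summaries, sample_memo):
    cases = "\n\n".join(case_summaries)  # brief summaries of the four Massachusetts cases
    return (
        "Massachusetts disorderly conduct rule:\n" + rule_text + "\n\n"
        "Definition of tumultuous conduct:\n" + tumultuous_definition + "\n\n"
        "Relevant Massachusetts cases:\n" + cases + "\n\n"
        "Sample memorandum:\n" + sample_memo
    )
```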
I annotated individual claims in model outputs for accuracy and cross-validated them against professor feedback on the corresponding memos.
One note: I dropped the Rule Statement IREAC component from my analysis because of a preprocessing error I made during data preparation. I didn't identify "tumultuous" as a sub-rule of "disorderly conduct" when splitting the memos. Mistral later made the same mistake repeatedly during evaluation.
Results
The useful split is by task shape: fine-tuning handled checklist-style issue statements, while few-shot prompting held up better on fact-heavy rule applications.
| Section | Method | Correct Ratings | Hallucination Rate |
|---|---|---|---|
| Issue Statements | Fine-tuning | 100% | 11.1% |
| Issue Statements | Few-shot | 16.6% | 12.65% |
| Rule Explanations | Fine-tuning | 33% | 5.5% |
| Rule Explanations | Few-shot | 16.6% | 22.9% |
| Rule Applications | Fine-tuning | 66.7% | 49.5% |
| Rule Applications | Few-shot | 83.3% | 6.4% |
Fine-tuning was strongest on Issue Statements. The fine-tuned Mistral reliably completed structured checklists of whether key elements were present, achieving 100% accuracy, and its supporting reasoning was accurate 95.8% of the time. The weak point was prose feedback: it sometimes mischaracterized legal hierarchy or hallucinated. Few-shot prompting was much worse on the same task, with 16.6% rate-sheet accuracy and frequently unsupported legal reasoning.
A 1-shot evaluation of an issue statement, on a memo set in Massachusetts:
> Under Ohio Revised Code § 2905.01(B), a person commits disorderly conduct if they engage in conduct that is likely to cause inconvenience, annoyance, or alarm to another person. In this case, the issue is whether Ms. Michelle Archer's actions during the protest at City Hall on March 15, 2022, constituted tumultuous behavior under Ohio law, thereby supporting her conviction for disorderly conduct. Specifically, the prosecution argues that Ms. Archer's loud and disruptive speech, along with her refusal to comply with police orders to disperse, demonstrate tumultuous behavior. However, defense counsel contends that Ms. Archer's actions were protected by her First Amendment rights to free speech and assembly, and therefore cannot be considered tumultuous within the meaning of Ohio law.
In the 1-shot evaluation above, Mistral relocated the case from Massachusetts to Ohio, invented a new statute, and supplied procedural details, prosecution arguments, and defense contentions that were not in the source memo.
Neither fine-tuning nor few-shot prompting consistently evaluated rule explanations correctly. Fine-tuning completed structured checklists correctly 37.5% of the time and aligned with professor reasoning 33.3% of the time. Its hallucination rate stayed low at 5.5%. Few-shot prompting did worse on both fronts: 16.6% checklist accuracy and a much higher hallucination rate of 22.9%.
Both approaches fared much better on the rule application component. Few-shot prompting completed structured checklists correctly 83.3% of the time with a hallucination rate of 6.4%. Fine-tuning completed structured checklists correctly 66.7% of the time but hallucinated on approximately half of the samples; in two-thirds of evaluations, the fine-tuned Mistral abandoned the rate sheet it was provided and invented new evaluation criteria. This comparison is confounded by sample order: the rule application samples I received earlier, during the fine-tuning experiments, were lower quality than those I received later for the few-shot experiments.
Increasing shot count (1 to 3 to 5) did not consistently improve quality and often made feedback worse. On one rule explanation example, incorrect feedback rose from 25.0% at 1-shot to 63.6% at 5-shot.
Neither few-shot nor fine-tuning reliably handled legal hierarchy or identified when key legal precedent was missing from the analysis. Both approaches also struggled with structural writing issues like overly narrow topic sentences, mixing precedent and client facts, and citation issues. Modern systems with citation verification or web search tools would likely mitigate some of these issues.
Synthetic Data
I was interested in experimenting with SoftSRV to see whether I could bootstrap more usable data out of my small training set. I adapted the SSMC variant by training separate MLPs to transform BERT and Legal-BERT embeddings into soft prompts for a frozen Mistral model. Ultimately, I trained eight soft-prompt generators, one per combination of IREAC section and encoder, each paired with the frozen Mistral backbone, and used each to generate five synthetic samples.
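Conceptually, each generator mapped a pooled encoder embedding of the context into a fixed number of soft-prompt vectors in Mistral's embedding space. A simplified sketch follows; the dimensions, pooling, and one-MLP-per-position layout are assumptions on my part, and the actual SSMC parameterization differs in its details.

```python
# Simplified sketch: a bank of small MLPs maps a pooled BERT / Legal-BERT
# embedding of a memo section into soft-prompt vectors that get prepended to
# the frozen Mistral input embeddings. Only these MLPs receive gradients.
import torch
import torch.nn as nn

class SoftPromptGenerator(nn.Module):
    def __init__(self, enc_dim=768, lm_dim=4096, num_mlps=64, hidden_dim=512):
        super().__init__()
        # 64 MLPs (down from 128 in the SoftSRV paper), one per soft-prompt position.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(enc_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, lm_dim),
            )
            for _ in range(num_mlps)
        )

    def forward(self, context_embedding: torch.Tensor) -> torch.Tensor:
        # context_embedding: (batch, enc_dim) pooled encoder vector for one section
        soft_tokens = [mlp(context_embedding) for mlp in self.mlps]
        return torch.stack(soft_tokens, dim=1)  # (batch, num_mlps, lm_dim)
```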
For synthetic data generation, I paired the learned soft prompts with explicit instructions, a compact case background, brief precedent summaries, and a sample memo. I did not generate a single usable memo. I found hallucinations in 80% of generated rule explanation sections, including non-existent cases and fabricated citations. The SoftSRV models also consistently transplanted client facts from the case into invented precedent. The learned soft prompts kept overriding the written instructions and reproducing training-set patterns, including common student errors. The generators built on the Legal-BERT encoder provided no measurable advantage over standard BERT in any section. I enforced conservative token limits for each section to force a concise academic style, but this caused 70% of issue statement generations to end mid-sentence.
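For context, a single generation run roughly combined the learned soft prompt, the embedded hard prompt, and a per-section token cap as sketched below. Here `generator` and `build_case_packet` refer to the earlier sketches, `model` is the base (non-LoRA) Mistral from the loading sketch, `context_embedding` stands in for a pooled BERT/Legal-BERT vector, and the token budgets are illustrative rather than my exact limits.

```python
# Hedged sketch of one synthetic generation call against the frozen Mistral model.
import torch

section_token_limits = {"issue_statement": 200, "rule_explanation": 450}  # illustrative caps

packet = build_case_packet(rule_text, tumultuous_definition, case_summaries, sample_memo)
prompt_ids = tokenizer(packet, return_tensors="pt").input_ids.to(model.device)
prompt_embeds = model.get_input_embeddings()(prompt_ids)

with torch.no_grad():
    soft_prompt = generator(context_embedding).to(prompt_embeds.dtype)  # (1, 64, lm_dim)
    inputs_embeds = torch.cat([soft_prompt, prompt_embeds], dim=1)
    output_ids = model.generate(
        inputs_embeds=inputs_embeds,
        max_new_tokens=section_token_limits["issue_statement"],  # hard caps caused mid-sentence cutoffs
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```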
Takeaways
Performance depended heavily on task structure. Fine-tuning produced reliable structured assessment on formulaic sections, but became brittle once the task required flexible, fact-sensitive reasoning. Few-shot prompting performed better on open-ended sections like rule applications but failed to generalize to structured tasks.