
🏆 Model Leaderboard

Definitions of the Metrics

Stance distance (Stance Δ) metric:

• |stance(human message) − stance(LLM message)|

• Interpretation: the smaller the distance, the closer the LLM's generated message is to the human's in stance.
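The Stance Δ metric above can be sketched as follows (the function name and the numeric example are illustrative, not from the paper; stance values are on the 6-point scale described below):

```python
def stance_delta(human_stance: float, llm_stance: float) -> float:
    """Absolute distance between the stance of a human message and the
    stance of the corresponding LLM-generated message."""
    return abs(human_stance - llm_stance)

# Example: human at "Probably agree" (+1.5), LLM at "Lean disagree" (-0.5)
print(stance_delta(1.5, -0.5))  # 2.0
```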

Opinion diversity (SDtweet(g) or SDprivate(g)):

• Standard deviation of the stances of the people within the same group g.

• Interpretation: the smaller the value, the more similar the group members' stances are to one another.
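A minimal sketch of this metric (whether the paper uses the population or sample standard deviation is an assumption here; the population version is shown):

```python
from statistics import pstdev

def opinion_diversity(group_stances: list[float]) -> float:
    """Population standard deviation of stances within one group g.
    0.0 means every member of the group holds the same stance."""
    return pstdev(group_stances)

# A unanimous group has zero diversity; a maximally split pair does not.
print(opinion_diversity([0.5, 0.5, 0.5]))   # 0.0
print(opinion_diversity([-2.5, 2.5]))       # 2.5
```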

Change in Opinion diversity (ΔSDtweet(g) or ΔSDprivate(g)):

• The change in Opinion Diversity from the start of the debate to the end.

• ΔSDtweet(g) = SDtweet_final(g) − SDtweet_init(g)

• ΔSDprivate(g) = SDprivate_final(g) − SDprivate_init(g)

• Interpretation: negative values indicate opinion convergence after the debate, and positive values indicate opinion divergence.
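Putting the two definitions above together (a sketch; the population standard deviation is an assumption, and the example groups are invented for illustration):

```python
from statistics import pstdev

def delta_sd(init_stances: list[float], final_stances: list[float]) -> float:
    """SD_final(g) - SD_init(g) for one group g.
    Negative => the group's opinions converged during the debate;
    positive => they diverged."""
    return pstdev(final_stances) - pstdev(init_stances)

# A spread-out group that ends unanimous yields a negative delta (convergence).
print(delta_sd([-1.5, 0.5, 2.5], [0.5, 0.5, 0.5]))
```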

Note: The stance labels are based on classification onto a 6-point scale: {Certainly disagree (−2.5), Probably disagree (−1.5), Lean disagree (−0.5), Lean agree (+0.5), Probably agree (+1.5), Certainly agree (+2.5)} (see the paper for details).
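The 6-point scale in the note, written out as a lookup table (the constant name is illustrative; the labels and values are taken from the note above):

```python
# Label -> numeric stance on the symmetric 6-point scale
STANCE_SCALE = {
    "Certainly disagree": -2.5,
    "Probably disagree": -1.5,
    "Lean disagree": -0.5,
    "Lean agree": 0.5,
    "Probably agree": 1.5,
    "Certainly agree": 2.5,
}

print(STANCE_SCALE["Lean agree"])  # 0.5
```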

Utterance-level Evaluation of Role-playing LLM Agents

Depth Topics - Avg Stance Δ (Full Conversation Simulation)

Average stance distance (Stance Δ) for Depth topics in Full Conversation Simulation. Bars are sorted in ascending order; lower values indicate closer alignment between the stances of LLM-generated and human messages.

Depth topics average stance delta (mode 2)

Group-Level Alignment in Opinion Dynamics: LLM Groups vs. Human Groups (Depth Topics)

Change in within-group opinion diversity for humans and RPLA simulations on Depth Topics. Bars show averages over groups; more negative values indicate stronger convergence and positive values indicate divergence. Error bars denote the standard error of the mean across groups.

Change in within-group opinion diversity figure (depth)

Fine-Tuning with DPO on the Dataset Increases Group-Level Alignment: LLM Opinion Dynamics Align with Humans on Unseen Topics

Change in within-group opinion diversity after SFT/DPO post-training.

Note: Results are shown only for breadth topics, as only the breadth-topic data is large enough to support proper fine-tuning.

Change in within-group opinion diversity after SFT/DPO (breadth)