
Gender Classification Quality

This system evaluates gender classification quality based on inter-model agreement. When multiple models agree on a classification, confidence increases. As new models are added, the consensus strengthens.

Total classified: -
High confidence: -
Contested: -
Models used: -
Confidence distribution
Consensus gender | Unanimous | Strong | Majority | Contested | Total
Methodology & formula
How it works: For each author-article row, all available LLM gender classifications are read. The "consensus" is the gender chosen by the majority of models.
Confidence tiers:
  • Unanimous: all models agree (100%)
  • Strong: at least 75% of models agree (e.g. 3 out of 4)
  • Majority: over 50% agree but below 75%
  • Contested: perfect tie (50/50), no clear consensus
As new models are added, the consensus becomes more robust.
SELECT "gender", "ministral", "llama", "qwen" FROM article_authors
-- For each row: count M votes vs F votes
-- consensus = majority gender
-- confidence = unanimous / strong / majority / contested
-- Scanned in chunks of 5M rows by id range
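The per-row consensus logic described above can be sketched in Python. The column names (ministral, llama, qwen) come from the query; the function itself is an illustrative sketch, not the production implementation:

```python
from collections import Counter

def consensus_and_tier(votes):
    """Classify one author row from per-model gender votes.

    votes: list of "M"/"F" labels, e.g. the ministral / llama / qwen
    columns from the query above. Returns (consensus, tier), tier being
    one of "unanimous", "strong", "majority", "contested".
    """
    counts = Counter(v for v in votes if v)  # ignore missing votes
    total = sum(counts.values())
    if total == 0:
        return None, None
    gender, top = counts.most_common(1)[0]
    share = top / total
    if share == 1.0:
        return gender, "unanimous"   # all models agree (100%)
    if share >= 0.75:
        return gender, "strong"      # e.g. 3 out of 4
    if share > 0.5:
        return gender, "majority"    # over 50% but below 75%
    return None, "contested"         # perfect tie, no consensus

print(consensus_and_tier(["M", "M", "M"]))       # ('M', 'unanimous')
print(consensus_and_tier(["M", "M", "F", "M"]))  # ('M', 'strong')
print(consensus_and_tier(["M", "F"]))            # (None, 'contested')
```

Missing votes are skipped rather than counted, so a model that abstains on a row does not dilute the consensus of the models that did answer.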
Model reliability (consensus agreement)
Model | Agreement rate | Agrees | Disagrees | % Male | % Female | Bias
Methodology & formula
What it shows: for each model, the percentage of rows where its vote matches the consensus gender. Bias indicates whether a model classifies more authors as male or as female than the consensus does.
Formula: reliability = agrees / (agrees + disagrees) × 100
Bias: (model % male) − (consensus % male). Positive = leans male; negative = leans female.
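The two formulas above can be sketched as follows. The row structure (a dict per row holding the model's vote and the consensus gender) is illustrative, not the real schema:

```python
def model_reliability(rows, model):
    """Compute (agreement rate %, bias %) for one model.

    rows: dicts like {"llama": "M", "consensus": "F"} (illustrative
    field names). Rows where either value is missing are skipped.
    """
    agrees = disagrees = 0
    model_male = consensus_male = compared = 0
    for row in rows:
        vote, consensus = row.get(model), row.get("consensus")
        if vote is None or consensus is None:
            continue
        compared += 1
        if vote == consensus:
            agrees += 1
        else:
            disagrees += 1
        model_male += vote == "M"
        consensus_male += consensus == "M"
    # reliability = agrees / (agrees + disagrees) × 100
    rate = agrees / (agrees + disagrees) * 100
    # bias = (model % male) − (consensus % male)
    bias = (model_male - consensus_male) / compared * 100
    return rate, bias
```

A model that always votes "M" against a half-male consensus would show 50% agreement and a +50 bias, i.e. a strong male lean.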
Confidence over time
Methodology & formula
What it shows: how the confidence tier distribution varies over time. A rising "unanimous" share indicates the models agree more often in that period.
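The per-period distribution described above can be sketched as a simple group-and-normalize step. The (period, tier) pair format is an assumption for illustration:

```python
from collections import Counter, defaultdict

def tier_share_by_period(rows):
    """rows: iterable of (period, tier) pairs, e.g. ("2021", "unanimous").

    Returns {period: {tier: percentage}} — the confidence tier
    distribution per period (illustrative structure).
    """
    by_period = defaultdict(Counter)
    for period, tier in rows:
        by_period[period][tier] += 1
    return {
        period: {tier: n / sum(c.values()) * 100 for tier, n in c.items()}
        for period, c in by_period.items()
    }

shares = tier_share_by_period([
    ("2021", "unanimous"),
    ("2021", "contested"),
    ("2022", "unanimous"),
])
print(shares)  # {'2021': {'unanimous': 50.0, 'contested': 50.0}, '2022': {'unanimous': 100.0}}
```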