Gender Classification Quality
This system evaluates gender classification quality based on inter-model agreement. When multiple models agree on a classification, confidence increases. As new models are added, the consensus strengthens.
Total classified: -
High confidence: -
Contested: -
Models used: -
Confidence distribution
| Consensus gender | Unanimous | Strong | Majority | Contested | Total |
|---|---|---|---|---|---|
Methodology & formula
How it works: For each author-article row, all available LLM gender classifications are read.
The "consensus" is the gender chosen by the majority of models.
Confidence tiers:
- Unanimous: all models agree (100%)
- Strong: at least 75% of models agree (e.g. 3 out of 4)
- Majority: over 50% agree but below 75%
- Contested: perfect tie (50/50), no clear consensus
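The tier rules above can be sketched in Python. This is a minimal illustration, not the system's actual implementation; the function name and the `"M"`/`"F"` vote labels are assumptions.

```python
from collections import Counter

def consensus_and_tier(votes):
    """Pick the majority gender and a confidence tier from per-model votes.

    votes: list of "M"/"F" labels, one per model (missing values dropped).
    Returns (consensus, tier); a perfect tie is "contested" with no consensus.
    """
    votes = [v for v in votes if v in ("M", "F")]
    if not votes:
        return None, None
    top, top_n = Counter(votes).most_common(1)[0]
    share = top_n / len(votes)
    if share == 1.0:
        return top, "unanimous"   # all models agree
    if share >= 0.75:
        return top, "strong"      # e.g. 3 out of 4
    if share > 0.5:
        return top, "majority"    # e.g. 2 out of 3
    return None, "contested"      # 50/50 tie
```

For example, four models voting `["M", "M", "M", "F"]` give exactly 75% agreement, which lands in the "strong" tier.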
```sql
SELECT "gender", "ministral", "llama", "qwen"
FROM article_authors
-- For each row: count M votes vs F votes
-- consensus = majority gender
-- confidence = unanimous / strong / majority / contested
-- Scanned in chunks of 5M rows by id range
```
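The chunked scan mentioned in the query comments could look roughly like this. A hypothetical sketch using SQLite for brevity (the real store and the 5M chunk size are assumptions; the table and column names follow the query above):

```python
import sqlite3

def scan_chunks(conn, chunk=5_000_000):
    """Yield batches of article_authors rows, reading by fixed-size id ranges
    so the full table never has to fit in memory at once."""
    lo, hi = conn.execute(
        "SELECT MIN(id), MAX(id) FROM article_authors"
    ).fetchone()
    if lo is None:          # empty table
        return
    start = lo
    while start <= hi:
        yield conn.execute(
            "SELECT gender, ministral, llama, qwen FROM article_authors "
            "WHERE id >= ? AND id < ?",
            (start, start + chunk),
        ).fetchall()
        start += chunk
```

Scanning by id range (rather than OFFSET) keeps each batch an index-range lookup, so cost stays flat as the scan advances.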
Model reliability (consensus agreement)
| Model | Agreement rate | Agrees | Disagrees | % Male | % Female | Bias |
|---|---|---|---|---|---|---|
Methodology & formula
What it shows: for each model, the percentage of times it agrees with the consensus gender.
The bias indicates whether a model tends to classify more males or females compared to consensus.
Formula:
reliability = agrees / (agrees + disagrees) × 100
Bias:
(model % male) - (consensus % male). Positive = leans male, negative = leans female.
Confidence over time
Methodology & formula
What it shows: how the distribution of confidence tiers varies over time.
An increase in the "unanimous" percentage indicates that the models agreed more often in that period.
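Computing the per-period tier distribution amounts to a group-by with normalized counts. A minimal sketch, assuming each classified row has already been reduced to a (period, tier) pair; names are illustrative:

```python
from collections import Counter, defaultdict

def tier_shares_by_period(rows):
    """rows: iterable of (period, tier) pairs, e.g. ("2023-01", "unanimous").
    Returns {period: {tier: share}}, with shares summing to 1 per period."""
    counts = defaultdict(Counter)
    for period, tier in rows:
        counts[period][tier] += 1
    return {
        period: {tier: n / sum(c.values()) for tier, n in c.items()}
        for period, c in counts.items()
    }
```

Plotting these shares per period gives the "confidence over time" chart: a rising unanimous share means stronger inter-model agreement in that window.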