
Gender Classification Quality

This system evaluates gender classification quality based on inter-model agreement. When multiple models agree on a classification, confidence increases. As new models are added, the consensus strengthens.

Total classified: -
High confidence: -
Contested: -
Models used: -
Confidence distribution
Consensus gender | Unanimous | Strong | Majority | Contested | Total
Methodology & formula
How it works: For each author-article row, all available LLM gender classifications are read. The "consensus" is the gender chosen by the majority of models.
Confidence tiers:
  • Unanimous: all models agree (100%)
  • Strong: at least 75% of models agree (e.g. 3 out of 4)
  • Majority: over 50% agree but below 75%
  • Contested: perfect tie (50/50), no clear consensus
As new models are added, the consensus becomes more robust.
SELECT "gender", "ministral", "llama", "qwen" FROM article_authors
-- For each row: count M votes vs F votes
-- consensus = majority gender
-- confidence = unanimous / strong / majority / contested
-- Scanned in chunks of 5M rows by id range
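The per-row consensus logic described above can be sketched in Python. The column names (ministral, llama, qwen) come from the query; the function itself is an illustrative sketch, not the production implementation:

```python
from collections import Counter

def consensus_and_tier(votes):
    """Classify one author row from per-model gender votes.

    votes: list of "M"/"F" labels, e.g. the ministral / llama / qwen
    columns from the query above. Returns (consensus, tier), tier being
    one of "unanimous", "strong", "majority", "contested".
    """
    counts = Counter(v for v in votes if v)  # ignore missing votes
    total = sum(counts.values())
    if total == 0:
        return None, None
    gender, top = counts.most_common(1)[0]
    share = top / total
    if share == 1.0:
        return gender, "unanimous"   # all models agree (100%)
    if share >= 0.75:
        return gender, "strong"      # e.g. 3 out of 4
    if share > 0.5:
        return gender, "majority"    # over 50% but below 75%
    return None, "contested"         # perfect tie, no consensus

print(consensus_and_tier(["M", "M", "M"]))       # ('M', 'unanimous')
print(consensus_and_tier(["M", "M", "F", "M"]))  # ('M', 'strong')
print(consensus_and_tier(["M", "F"]))            # (None, 'contested')
```

Missing votes are skipped rather than counted, so a model that abstains on a row does not dilute the consensus of the models that did answer.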
Model reliability (consensus agreement)
Model | Agreement rate | Agrees | Disagrees | % Male | % Female | Bias
Methodology & formula
What it shows: for each model, the percentage of rows where its vote matches the consensus gender. Bias indicates whether a model classifies more authors as male or as female than the consensus does.
Formula: reliability = agrees / (agrees + disagrees) × 100
Bias: (model % male) − (consensus % male). Positive = leans male; negative = leans female.
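The two formulas above can be sketched as follows. The row structure (a dict per row holding the model's vote and the consensus gender) is illustrative, not the real schema:

```python
def model_reliability(rows, model):
    """Compute (agreement rate %, bias %) for one model.

    rows: dicts like {"llama": "M", "consensus": "F"} (illustrative
    field names). Rows where either value is missing are skipped.
    """
    agrees = disagrees = 0
    model_male = consensus_male = compared = 0
    for row in rows:
        vote, consensus = row.get(model), row.get("consensus")
        if vote is None or consensus is None:
            continue
        compared += 1
        if vote == consensus:
            agrees += 1
        else:
            disagrees += 1
        model_male += vote == "M"
        consensus_male += consensus == "M"
    # reliability = agrees / (agrees + disagrees) × 100
    rate = agrees / (agrees + disagrees) * 100
    # bias = (model % male) − (consensus % male)
    bias = (model_male - consensus_male) / compared * 100
    return rate, bias
```

A model that always votes "M" against a half-male consensus would show 50% agreement and a +50 bias, i.e. a strong male lean.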
Confidence over time
Methodology & formula
What it shows: how the confidence tier distribution varies over time. A rising "unanimous" share indicates the models agree more often in that period.
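The per-period distribution described above can be sketched as a simple group-and-normalize step. The (period, tier) pair format is an assumption for illustration:

```python
from collections import Counter, defaultdict

def tier_share_by_period(rows):
    """rows: iterable of (period, tier) pairs, e.g. ("2021", "unanimous").

    Returns {period: {tier: percentage}} — the confidence tier
    distribution per period (illustrative structure).
    """
    by_period = defaultdict(Counter)
    for period, tier in rows:
        by_period[period][tier] += 1
    return {
        period: {tier: n / sum(c.values()) * 100 for tier, n in c.items()}
        for period, c in by_period.items()
    }

shares = tier_share_by_period([
    ("2021", "unanimous"),
    ("2021", "contested"),
    ("2022", "unanimous"),
])
print(shares)  # {'2021': {'unanimous': 50.0, 'contested': 50.0}, '2022': {'unanimous': 100.0}}
```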