
Dataset Overview

This analysis covers the entire PubMed database, examining over 32 million articles and 140 million author-article pairs from 1945 to 2024. It represents the largest dataset ever analysed in this field, approximately 320 times larger than previous studies.

Total articles: --
Author-article pairs: --
Unique authors: ~18.7M
Time period: --
Medical disciplines: 32
LLM models used: 4
% Female in 2024: --
Leaky pipeline gap: --
Male / Female
Methodology & formula
What it shows: the overall gender distribution for the latest year (2024), calculated from the DeepSeek v3 column. The LLM classifies each author as male (stored as 'm') or female (stored as 'f') based on forename and surname. Unclassifiable or ambiguous cases are excluded from the donut. The query returns counts for every year; the donut uses only the 2024 rows.
SELECT py.year, aa."gender", COUNT(*)
FROM article_authors aa
JOIN pmid_year py ON aa.pmid = py.pmid
WHERE aa."gender" IN ('m','f')
GROUP BY py.year, aa."gender"
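As a rough sketch of how the donut values could be derived from this query's result, assuming the rows arrive in Python as (year, gender, count) tuples. The function name latest_year_shares and the sample counts are hypothetical and not taken from the dashboard's code or data.

```python
from collections import defaultdict


def latest_year_shares(rows):
    """Return (latest_year, {gender: percent}) from (year, gender, count) rows.

    Assumes rows are already restricted to 'm' and 'f', as in the query above.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for year, gender, n in rows:
        counts[year][gender] += n
    if not counts:
        raise ValueError("no rows to aggregate")

    latest = max(counts)  # most recent year present in the result
    total = sum(counts[latest].values())
    if total == 0:
        raise ValueError("latest year has zero classified authors")

    shares = {g: 100.0 * n / total for g, n in counts[latest].items()}
    return latest, shares


# Made-up counts, for illustration only.
sample_rows = [(2023, "m", 60), (2023, "f", 40), (2024, "m", 55), (2024, "f", 45)]
year, shares = latest_year_shares(sample_rows)
```

Because ambiguous cases are filtered out in SQL, the two shares sum to 100 by construction.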
% Female authors per year
Methodology & formula
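Formula: % F = female / (male + female + other) × 100 per year, where "other" covers every non-empty gender value that is neither 'm' nor 'f'.
The chart shows the trend in the percentage of female authors from 1945 to 2024. Data comes from the article_authors table joined with pmid_year (deduplicated by PMID). Each author-article pair is counted once. Unlike the donut, the denominator here includes the "other" category.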
SELECT py.year, aa."gender", COUNT(*)
FROM article_authors aa
JOIN pmid_year py ON aa.pmid = py.pmid
WHERE aa."gender" IS NOT NULL AND aa."gender" != ''
GROUP BY py.year, aa."gender"
-- Aggregated in Python: % female = f / (m+f+other) × 100
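The Python aggregation step mentioned in the comment might look roughly like the following. This is a minimal sketch, assuming the query result is a list of (year, gender, count) tuples and that any gender value other than 'f' belongs in the denominator only. The name percent_female_by_year, the 'other' label, and the sample counts are illustrative assumptions.

```python
from collections import defaultdict


def percent_female_by_year(rows):
    """Compute % female = f / (m + f + other) * 100 for each year.

    rows: iterable of (year, gender, count) tuples, one per year/gender group.
    """
    totals = defaultdict(int)   # all classified authorships per year
    females = defaultdict(int)  # female authorships per year
    for year, gender, n in rows:
        totals[year] += n
        if gender == "f":
            females[year] += n

    # Skip years with no authorships to avoid dividing by zero.
    return {
        year: 100.0 * females[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Made-up counts, for illustration only.
sample_rows = [
    (1945, "m", 90), (1945, "f", 10),
    (2024, "m", 50), (2024, "f", 40), (2024, "other", 10),
]
trend = percent_female_by_year(sample_rows)
```

Keeping the SQL to raw counts and computing the ratio afterwards means the same result set can feed both the per-year trend and other summaries without a second query.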