Gender Gap in Biomedical Research

<h3>Data Source</h3> <p>The complete PubMed baseline was obtained from the National Center for Biotechnology Information (NCBI) FTP servers in February 2025. XML files were parsed using custom Python scripts (lxml library) to extract: article metadata (PMID, title, abstract, publication date), author information (forename, surname, author position), journal data, Medical Subject Headings (MeSH) terms, and citation data.</p>
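The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: the function names `extract_articles` and `author_position`, the output field names, and the first/middle/last position labels are assumptions, and the element paths follow the standard PubMed XML layout. The fallback to the standard library parser is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library named in the text
except ImportError:  # assumption: stdlib fallback so the sketch still runs
    import xml.etree.ElementTree as etree


def author_position(index, total):
    """Label byline position (hypothetical scheme: first/middle/last)."""
    if index == 0:
        return "first"  # a sole author is also labelled "first"
    if index == total - 1:
        return "last"
    return "middle"


def extract_articles(xml_bytes):
    """Return one dict per <PubmedArticle> found in a PubMed XML document."""
    root = etree.fromstring(xml_bytes)
    records = []
    for art in root.iter("PubmedArticle"):
        cit = art.find("MedlineCitation")
        title_el = cit.find("Article/ArticleTitle")
        authors = cit.findall("Article/AuthorList/Author")
        records.append({
            "pmid": cit.findtext("PMID"),
            # Titles may contain inline markup, so join all text nodes.
            "title": "".join(title_el.itertext()) if title_el is not None else None,
            "journal": cit.findtext("Article/Journal/Title"),
            "year": cit.findtext("Article/Journal/JournalIssue/PubDate/Year"),
            "authors": [
                {
                    "forename": a.findtext("ForeName"),
                    "surname": a.findtext("LastName"),
                    "position": author_position(i, len(authors)),
                }
                for i, a in enumerate(authors)
            ],
            "mesh": [
                d.text
                for d in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")
            ],
        })
    return records


# Tiny made-up record used only to exercise the function.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>An <i>example</i> title.</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

records = extract_articles(SAMPLE)
```

For full baseline files, a streaming parser would likely be preferable to loading each document into memory at once; the in-memory version is used here only to keep the sketch short.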


<h3>Gender Classification</h3> <p>Author gender was inferred with multiple large language models (LLMs) queried through a REST API. Each model received the complete author name (forename and surname) together with a custom prompt requesting a binary classification (male/female) inferred from the name and its cultural context.</p> <p><strong>Models used:</strong></p> <ul> <li><strong>DeepSeek v3</strong> - Column "gender" (primary classification)</li> <li><strong>Ministral 3B</strong> - Column "mistralai/ministral-3b-2512"</li> <li><strong>LLaMA 3.1 8B</strong> - Column "llama-3.1-8b"</li> <li><strong>Qwen3 VL 8B</strong> - Column "qwen/qwen3-vl-8b"</li> </ul> <p>Prior validation studies have reported approximately 97% accuracy for LLM-based gender classification from names.</p>
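The per-author classification step might look something like the sketch below. The prompt wording, the helper names (`build_prompt`, `normalize_reply`, `classify_author`), and the `ask` callable that stands in for the REST call are all assumptions made for illustration; the actual prompt and API client are not given in the text. Only the column IDs and model labels are taken from the list above.

```python
# Column ID -> model label, as listed in the text.
MODEL_COLUMNS = {
    "gender": "DeepSeek v3",
    "mistralai/ministral-3b-2512": "Ministral 3B",
    "llama-3.1-8b": "LLaMA 3.1 8B",
    "qwen/qwen3-vl-8b": "Qwen3 VL 8B",
}

# Hypothetical prompt; the project's real prompt is not shown in the source.
PROMPT_TEMPLATE = (
    "Classify the likely gender of the author named '{forename} {surname}'. "
    "Consider the name and its cultural context. "
    "Answer with exactly one word: male or female."
)


def build_prompt(forename, surname):
    """Fill the prompt template with the complete author name."""
    return PROMPT_TEMPLATE.format(forename=forename, surname=surname)


def normalize_reply(reply):
    """Map a free-text model reply to 'male', 'female', or None."""
    token = reply.strip().lower().strip(".\"'")
    return token if token in {"male", "female"} else None


def classify_author(forename, surname, ask):
    """Return {column_id: label} for one author.

    `ask(model_label, prompt)` is a placeholder for whatever function
    sends the prompt to a model over REST and returns its text reply.
    """
    prompt = build_prompt(forename, surname)
    return {
        column: normalize_reply(ask(label, prompt))
        for column, label in MODEL_COLUMNS.items()
    }


def fake_ask(model_label, prompt):
    """Stub used in place of a real API call, for demonstration only."""
    return " Female. " if model_label == "DeepSeek v3" else "unsure"


result = classify_author("Jane", "Doe", fake_ask)
```

Keeping the network call behind a single callable makes it straightforward to store one output column per model and to treat any reply outside the two expected labels as missing rather than guessing.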


<h3>Discipline Classification</h3> <p>MeSH terms were mapped to 32 predefined medical specialty categories using DeepSeek v3. Each article's MeSH terms were submitted with a prompt requesting assignment to one or more specialty categories. Each category was counted at most once per article, though articles could contribute to multiple categories.</p>
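The counting rule above (each category at most once per article, with articles allowed in several categories) can be illustrated with a short sketch. The function name `count_disciplines` and the category names in the example are hypothetical; the 32 actual categories are not listed in the text.

```python
from collections import Counter


def count_disciplines(article_categories):
    """Count articles per category, at most once per article.

    `article_categories` maps a PMID to the list of categories the model
    assigned to that article; the list may contain repeats.
    """
    counts = Counter()
    for categories in article_categories.values():
        # set() removes repeats so one article adds at most 1 per category.
        counts.update(set(categories))
    return counts


# Made-up assignments with hypothetical category names.
example = {
    "1001": ["Cardiology", "Cardiology", "Oncology"],
    "1002": ["Oncology"],
    "1003": [],
}
counts = count_disciplines(example)
```

Because an article can contribute to several categories, the per-category counts can sum to more than the number of articles.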


LLMs used

Column ID	Label
`gender`	DeepSeek v3
`mistralai/ministral-3b-2512`	Ministral 3B
`llama-3.1-8b`	LLaMA 3.1 8B
`qwen/qwen3-vl-8b`	Qwen3 VL 8B

Methodology