Analyzing Country Influence in Research Fields Using DTM (with BERTopic) and Labeled LDA
Introduction
This project focuses on analyzing the evolution of research topics and assessing the influence of countries in research fields. It utilizes Dynamic Topic Modeling (DTM) and Labeled Latent Dirichlet Allocation (LDA) to understand thematic trends, compute influence metrics, and measure topic and country-specific impact. By integrating both models, this analysis provides insights into the dynamic interplay between research topics and countries’ contributions over time.
Objectives
Topic Evolution Analysis: Utilize DTM with a sliding window approach to track how topics evolve in prominence and composition over five-year intervals (1990–2020).
Influence Metrics: Quantify the influence of countries on research topics using cosine similarity and influence ratios, distinguishing self-similarity (a topic's relation to its own past) from external influence.
Topic Concentration: Assess the concentration of research efforts on specific topics using the Hirschman-Herfindahl Index (HHI) and weighted HHI to identify generalist versus specialist topics.
Collaboration Networks: Evaluate country-specific contributions by analyzing term distributions generated through Labeled LDA.
Integrated Insights: Connect DTM's temporal view of topic evolution with Labeled LDA's country-specific term distributions. Linking the time labels from DTM with the country labels from Labeled LDA shows how specific countries influence and contribute to evolving research trends across time periods.
Methodology
1. Dynamic Topic Modeling (DTM)
Purpose: Captures the evolution of research topics using a sliding window approach for five-year intervals.
Implementation:
Topic Generation: Topics are modeled using BERTopic, which integrates UMAP for dimensionality reduction and HDBSCAN for clustering.
Term Alignment: Ensures consistent term representation across time windows by aligning terms using cosine similarity.
Sliding Windows: Analyzes topics over time by dividing data into overlapping windows, e.g., 1990–1994, 1991–1995.
Outputs:
Term distributions for each time slice to identify key topics.
Topic trajectories to track changes in prevalence and composition.
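The sliding-window step can be illustrated with a short sketch. It only shows how overlapping five-year windows over 1990–2020 might be generated and used to select documents; the function names and the `year` field are hypothetical, and the one-year step is an assumption inferred from the 1990–1994, 1991–1995 example. Fitting BERTopic on each window is not shown.

```python
from typing import Dict, List, Tuple


def sliding_windows(start: int, end: int, width: int = 5, step: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (first_year, last_year) windows that fit inside start..end."""
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive")
    windows = []
    first = start
    while first + width - 1 <= end:
        windows.append((first, first + width - 1))
        first += step
    return windows


def docs_in_window(docs: List[Dict], window: Tuple[int, int]) -> List[Dict]:
    """Select documents whose publication year falls inside the window."""
    first, last = window
    return [doc for doc in docs if first <= doc["year"] <= last]


# Assumed setup: five-year windows that advance by one year.
windows = sliding_windows(1990, 2020, width=5, step=1)

# Toy documents; real input would be abstracts with publication years.
docs = [{"year": 1990, "text": "..."}, {"year": 1996, "text": "..."}]
first_window_docs = docs_in_window(docs, windows[0])
```

Each window's documents would then be passed to the topic model separately, so that consecutive windows share four of their five years.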
2. Labeled LDA
Purpose: Assigns research topics to countries based on term distributions.
Implementation:
Research abstracts from OpenAlex were used to construct country-specific term distributions.
Normalization: Term distributions are normalized so that they represent proportional contributions across countries.
Outputs: Term-level probability distributions per country, for temporal comparison with DTM topics.
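The shape of the per-country output can be sketched with plain token counts. This is a count-based stand-in rather than Labeled LDA inference, and it assumes that normalization means each country's term weights sum to 1. The function name and the toy records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def country_term_distributions(
    records: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Dict[str, float]]:
    """Aggregate token counts per country label and normalize each to sum to 1."""
    counts: Dict[str, Counter] = {}
    for country, tokens in records:
        counts.setdefault(country, Counter()).update(tokens)

    distributions: Dict[str, Dict[str, float]] = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        if total == 0:
            continue  # skip countries with no tokens
        distributions[country] = {term: n / total for term, n in counter.items()}
    return distributions


# Toy input: (country label, tokenized abstract) pairs.
records = [
    ("US", ["galaxy", "galaxy", "star", "dust"]),
    ("DE", ["star", "planet"]),
]
dists = country_term_distributions(records)
```

Because every country's distribution sums to 1, countries with very different publication volumes can be compared on the same scale.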
3. Cosine Similarity & Influence Ratios
Cosine Similarity: Measures alignment between:
DTM topic term distributions (in the current and preceding intervals).
Country-specific term distributions derived from Labeled LDA.
Influence Ratios: Quantify how much a country's thematic alignment has shaped, or been shaped by, the evolution of a topic:
High Self-Similarity: Indicates influence of a topic's own history.
High Country Similarity: Reflects a country’s influence in shaping that topic.
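The comparison can be sketched as follows. The cosine similarity is standard; the ratio definition is an assumption made for illustration, since this document does not give a formula. Here each similarity is expressed as its share of the total of the self-similarity and all country similarities. All names and the toy distributions are hypothetical.

```python
import math
from typing import Dict, Tuple

TermDist = Dict[str, float]


def cosine_similarity(p: TermDist, q: TermDist) -> float:
    """Cosine similarity of two sparse term distributions.

    Terms missing from one distribution count as zero, which is equivalent
    to aligning both vectors on the union of their vocabularies.
    """
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def influence_ratios(
    topic_now: TermDist, topic_prev: TermDist, countries: Dict[str, TermDist]
) -> Tuple[float, Dict[str, float]]:
    """Return (self_ratio, country_ratios), each a share of total similarity.

    Assumed definition: ratio = similarity / (self-similarity + sum of
    country similarities).
    """
    self_sim = cosine_similarity(topic_now, topic_prev)
    country_sims = {c: cosine_similarity(topic_now, d) for c, d in countries.items()}
    total = self_sim + sum(country_sims.values())
    if total == 0.0:
        return 0.0, {c: 0.0 for c in country_sims}
    return self_sim / total, {c: s / total for c, s in country_sims.items()}


# Toy distributions for one topic in two consecutive windows.
topic_prev = {"galaxy": 0.5, "star": 0.5}
topic_now = {"galaxy": 0.6, "star": 0.3, "exoplanet": 0.1}
countries = {
    "US": {"galaxy": 0.7, "star": 0.3},
    "DE": {"exoplanet": 0.8, "star": 0.2},
}
self_ratio, country_ratios = influence_ratios(topic_now, topic_prev, countries)
```

Under this definition the ratios sum to 1, so a large self-ratio indicates that the topic mostly follows its own history, while a large country ratio indicates close alignment with that country's vocabulary.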
4. Hirschman-Herfindahl Index (HHI)
Purpose: Measures topic concentration, distinguishing between:
Generalist Topics: Broadly distributed across countries.
Specialist Topics: Concentrated contributions from a few countries.
Weighted HHI: Incorporates influence ratios to refine concentration metrics.
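As a minimal sketch, the HHI of a topic is the sum of squared country shares: it equals 1/N when N countries contribute equally and approaches 1 when a single country dominates. The weighted variant below is an assumption, since this document does not define it. It scales each country's contribution by a weight (for example an influence ratio) before the shares are computed. Function names and numbers are hypothetical.

```python
from typing import Dict


def hhi(contributions: Dict[str, float]) -> float:
    """Sum of squared shares; contributions are normalized to shares first."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must have a positive total")
    return sum((value / total) ** 2 for value in contributions.values())


def weighted_hhi(contributions: Dict[str, float], weights: Dict[str, float]) -> float:
    """HHI after scaling each country's contribution by its weight.

    Assumed definition: countries without a weight are treated as weight 0.
    """
    weighted = {c: v * weights.get(c, 0.0) for c, v in contributions.items()}
    return hhi(weighted)


# A generalist topic: four countries contributing equally.
generalist = hhi({"US": 1, "DE": 1, "UK": 1, "CN": 1})

# A specialist topic: one country contributes most of the work.
specialist = hhi({"US": 9, "DE": 1})

# Equal contributions, but unequal (hypothetical) influence weights.
weighted = weighted_hhi({"US": 1, "DE": 1}, {"US": 3, "DE": 1})
```

A low value marks a generalist topic and a high value a specialist one; where to draw the line between them is a modeling choice.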
Key Findings
Topic Trends
Shifts in Prominence: Topics evolve in alignment with global research priorities, reflecting broader collaborations and shifting foci.
Emerging Specializations: The prevalence of niche topics increases in recent intervals, aligning with technological advancements in Astronomy and Astrophysics.
Country Influence
Dominance of Research Leaders: Countries like the USA, Germany, and the UK exhibit high influence ratios and centrality measures across a larger number of topics, which highlights their leading role in shaping the field.
Rising Contributors: Countries such as China and India show growing contributions, as evidenced by rising influence ratios and increasing topic-specific presence.
Concentration Metrics
Diverse Focus: Generalist topics like "Galaxies" exhibit high prevalence and broad contributions across countries.
Niche Focus: Specialist topics demonstrate significant contributions from a select few countries, underscoring regional research strengths.
Conclusion
This integrated analysis using DTM and Labeled LDA highlights the dynamic nature of research landscapes. By combining temporal modeling with country-specific insights, it captures the dual narratives of topic evolution and country influence, offering a comprehensive view of global research in Astronomy and Astrophysics. This methodology can serve as a blueprint for analyzing similar domains. The project is ongoing, and its progress can be tracked on [GitHub link].