AI Lund Lunch seminar: Tracing Industrial Change Over Time: Harmonizing Industry Classifications in Swedish Register Data for Machine Learning
Topic: Tracing Industrial Change Over Time: Harmonizing Industry Classifications in Swedish Register Data for Machine Learning
When: 29 April, 2026 12.00 to 13.00 CET
Where: Online
Speaker: Philipp Stark, Human Geography, Lund University
Abstract
Swedish register data offer a uniquely powerful resource for longitudinal research, providing population-wide, decades-long coverage of individuals, workplaces, and firms with detailed information. Yet preparing these data for modern AI and machine learning pipelines requires confronting a fundamental problem in the nature of these mostly categorical records. Individuals and firms are assigned one of hundreds of fine-grained industry codes, but the Swedish Standard Industrial Classification (SNI) was substantially revised in 2007, expanding, switching, and reorganizing its industry categories. Critically, the mapping between the old and new systems is non-invertible: a single old code can correspond to multiple new codes, and vice versa. Statistics Sweden (SCB), which provides the data, acknowledges that an exact backward translation is in principle impossible. Probabilistic harmonization is therefore a necessary precondition for any serious longitudinal analysis spanning the revision.
Working with over 7 million individual records across years, we develop a simple yet effective method based on maximum likelihood estimation that provides a directly implementable solution for backward code inference, achieving above 90% accuracy. Although we tested more complex models for finer-grained inference, we found no meaningful gain from increased model complexity, which nicely illustrates Occam's razor and the fundamental limits imposed by high-cardinality categorical data. Our solution provides a robust harmonization ready for use in AI-based analyses on Swedish register data, resulting in a consistent 30+ year panel for all Swedish workplaces. This unlocks longitudinal ML applications, such as transformer-based sequence models and time-series analyses, that currently cannot span the 2007 break. We release the methodology as an open-source package, applicable to future classification revisions.
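To give a rough idea of what maximum likelihood backward inference can look like, here is a minimal Python sketch. It is not the speaker's package or method as implemented. It assumes that some records carry both an old and a new code, estimates P(old code | new code) by relative frequency (the maximum likelihood estimate for a categorical distribution), and picks the most probable old code. The function names and the codes "A1", "X1", etc. are made up for illustration and are not real SNI codes.

```python
from collections import Counter, defaultdict


def fit_backward_mapping(pairs):
    """Estimate P(old | new) from (old_code, new_code) pairs.

    Relative frequencies are the maximum likelihood estimate
    of a categorical distribution.
    """
    counts = defaultdict(Counter)
    for old_code, new_code in pairs:
        counts[new_code][old_code] += 1
    return {
        new: {old: n / sum(c.values()) for old, n in c.items()}
        for new, c in counts.items()
    }


def infer_old_code(probs, new_code):
    """Return (most likely old code, estimated probability), or None if unseen."""
    dist = probs.get(new_code)
    if dist is None:
        return None
    old = max(dist, key=dist.get)
    return old, dist[old]


# Hypothetical double-coded records: (old_code, new_code).
pairs = [
    ("A1", "X1"), ("A1", "X1"), ("A1", "X1"), ("A2", "X1"),
    ("A1", "X2"),
    ("A2", "X3"), ("A2", "X3"),
]

probs = fit_backward_mapping(pairs)
best = infer_old_code(probs, "X1")  # new code "X1" maps back to two old codes
```

In this toy data the new code "X1" corresponds to two old codes, so the backward translation is a probability distribution and not a lookup, which is the non-invertibility the abstract describes.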
Bio: Philipp Stark is a Postdoctoral Fellow at the Department of Human Geography, Lund University, where he also initiated a computational research infrastructure supporting AI-driven and data-intensive projects within the department. His work sits at the intersection of AI and social science, applying machine learning methods to foundational social science questions. Among other projects, his current research focuses on using large-scale administrative register data and deep learning to study labor market dynamics, human capital, regional development, and mobility.
Before joining Lund, Philipp completed his doctorate in Computer Science (Dr. rer. nat.) at the University of Tübingen in 2024, where he was based at the Hector Research Institute of Education Sciences and Psychology. His dissertation examined cognitive processing in virtual reality learning environments, combining signal processing, machine learning, and experimental methods. His academic interests are guided by a curiosity to understand human systems by combining rigorous scientific design with modern AI methods.
About the event
Location:
Online - link by registration
Contact:
jonas [dot] wisbrant [at] control [dot] lth [dot] se