7 Cleaning, Consolidating, and Sharing Data

By So Hee Hyun, Abhijnya Menakur, Emma Dums, and Kim Spencer (disclosure at the end of the chapter)

“The consistent management of research data is crucial for the success of long-term and large-scale collaborative research, as it forms the basis for efficiency, continuity, and quality of the research.” (Finkel et al., 2020)

Introduction

To support the broader community in making meaningful use of National Research Mentoring Network (NRMN) Phase II research data, the NRMN Coordination Center set out to create a common measures dataset that could facilitate ongoing mentorship research within the biomedical workforce (see Chapter 4 for more information on common measures). This effort involved two components: 1) developing a single-study common measures dataset for each project using data collected across all intervention groups and time points during the grant; and 2) compiling a multi-study common measures dataset by integrating data from all 11 single-study common measures datasets.

As introduced in previous chapters, the development of these datasets supported several key objectives:

  • Establish accountability for data management
  • Promote transparency for data handling and usage
  • Provide clarity on our role as the NRMN Coordination Center in only using data from required and common measures
  • Explain our goal to move beyond investigation of single interventions to assess the collective efficacy of programming and research outcomes
  • Create a multi-study common measures dataset that is available for research

Together, these efforts laid the groundwork for a coordinated, collaborative, and data-informed approach to mentorship research during NRMN Phase II. These efforts also contributed toward building a multi-study common measures dataset that could be used for future research.

The Role of the NRMN Coordination Center in Data Management

The NRMN Coordination Center played an active role in data management. Since each research study was responsible for its own data collection, a thorough data management system was needed to review, clean, and consolidate shared data. The team members engaged in this work included a co-investigator who oversaw all data management activities, a researcher/statistician who led data cleaning and consolidation, and several additional team members who supported data cleaning, tracking, and organization.

Starting in year 2, we began requesting data from each research team. Once data was received, we needed to review shared data files to ensure accuracy, inquire about any data concerns, and share regular updates with research managers. As noted in Chapter 5, each research study had a unique research design and used different methods for collecting data. Additionally, the experience levels of research managers varied widely. These factors contributed to variability in reporting practices as well as codebook and dataset preparation. While we had limited involvement in overseeing individual research study data processes, we established clear expectations and consistent workflows to produce reliable datasets.

As part of our role as the coordination center, we prioritized identifying a long-term, sustainable solution for storing the common data and managing future data access. To ensure that the dataset would remain accessible to both NRMN research community members and external researchers, we began evaluating data repository options during the final year of the grant. Our decision to deposit the data with the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan was guided by several key considerations: ICPSR’s reputation for preserving social science data, its support for detailed metadata and documentation, and its long-term commitment to data accessibility. Alongside the repository transfer, we focused on creating thorough documentation to support transparency, usability, and proper interpretation of the data by future researchers. Together, these efforts ensured that the dataset would remain a reliable and valuable resource beyond the lifespan of the grant.

Overview of the Data Management Process

A well-structured data management process with intentional organization that promotes transparency and accessibility is important for annual progress and ongoing collaboration (Michener, 2015). To facilitate this, we used a combination of tools, including Box, Google Drive, and IBM SPSS Statistics, for storage and processing. Intentional organization meant that files were clearly named, version-controlled, and systematically archived, making them easy to locate and interpret across research teams. Prioritizing accessibility meant that team members could reliably access the correct data and supporting materials, even as roles shifted or new staff joined. This approach reduced confusion and enabled consistent progress across project years.

To streamline data organization, we created 11 dedicated Box folders, one for each research study. These folders served as the primary storage locations for both raw and cleaned data submitted by each research team. Access to each study’s folder was restricted to the research study principal investigators, research managers, and designated members of the coordination center team, ensuring the shared data remained secure and appropriately managed. Within these folders, the data was organized by survey type, cohort, and the year of data transfer. Research teams uploaded their data in various formats, including SPSS, Excel, CSV, and DTA files. These submissions typically included both raw data (as originally exported from survey platforms) and cleaned datasets that reflected any processing or re-coding done by the research team. Supporting materials such as codebooks and survey files were also included to document variable definitions and response options. To maintain cross-study consistency, we downloaded the datasets from each research team folder and converted them into SPSS format. Raw data for each research team was stored in that team’s dedicated folder, while SPSS files containing common measure data were uploaded to a shared “common measure data” subfolder within each team’s Box space. This structure maintained uniformity across research studies and facilitated collaboration by giving each team access to a standardized version of their own data aligned with the common measures.

For single-study data shared with our team, we reviewed the data to assess consistency and alignment with the common measures. This included reviewing question prompts, scale items, and scoring. In some cases, additional follow-up with the research team was needed to resolve data issues, either through email or online meetings. During this review process, we also examined the data for any discrepancies or issues that needed addressing before compiling it into the NRMN Phase II multi-study common measures dataset that included data from all 11 research teams. To verify that all necessary common measures were included, we used the common measure maps and tracking spreadsheets provided by the research teams (see Chapter 5 for more information). These spreadsheets also included the number of survey respondents by cohort and survey, which we compared with the data to ensure that all responses had been shared. Throughout the review process, we reached out to research teams with follow-up questions, allowing them to share any additional data or clarification. This collaborative process helped confirm that the multi-study common measures dataset accurately reflected the common measures collected across research teams.
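
The respondent-count comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the coordination center's actual script: the field names `cohort` and `survey`, the function name, and the example counts are all hypothetical.

```python
from collections import Counter

def count_mismatches(records, expected_counts):
    """Compare respondent counts in shared data against a tracking sheet.

    records: list of dicts with hypothetical 'cohort' and 'survey' keys.
    expected_counts: dict mapping (cohort, survey) to the count the
        research team reported in its tracking spreadsheet.
    Returns {(cohort, survey): (expected, observed)} for every mismatch.
    """
    observed = Counter((r["cohort"], r["survey"]) for r in records)
    mismatches = {}
    # Check every pair that appears in either source.
    for key in set(expected_counts) | set(observed):
        exp = expected_counts.get(key, 0)
        obs = observed.get(key, 0)
        if exp != obs:
            mismatches[key] = (exp, obs)
    return mismatches

# Hypothetical example: the tracking sheet reports two post-survey
# respondents, but only one post-survey record was shared.
records = [
    {"cohort": "2020", "survey": "pre"},
    {"cohort": "2020", "survey": "pre"},
    {"cohort": "2020", "survey": "post"},
]
expected = {("2020", "pre"): 2, ("2020", "post"): 2}
mismatches = count_mismatches(records, expected)
```

Any pair returned by such a check would be a prompt for a follow-up question to the research team rather than an automatic correction.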

In cases where data files lacked clear labels or necessary documentation, we requested that research teams provide their codebooks and upload them to Box. We reviewed these codebooks and used the provided information to ensure the dataset values were aligned with the survey designs. If research teams had not yet created codebooks or did not respond to our request, we used the survey files stored on Google Drive (shared at the beginning of data collection) to match measures to the data and interpret variables accurately. Because some teams shared raw, uncleaned data while still actively collecting responses, variable names or coding schemes were not always finalized. When changes or additions were made to surveys during the study, we requested research teams share updated versions to maintain an accurate log of files.

To track the progress and workflow of the data management process, we established a timeline for data transfers and related tasks. We began yearly data transfer requests in grant year 2 and continued through the end of the grant. We sent the final data transfer request two months before the official end of the 5-year grant period, at which point we merged all data into the unified multi-study common measures dataset. While the overall data transfer timeline provided some standardization, we also made adjustments as needed to accommodate data collection schedules, data readiness, and capacity. Some research teams experienced delays in data collection due to shifts in program schedules or recruitment timelines. Others faced limitations in staffing or technical capacity, affecting their ability to prepare and submit datasets on schedule. We adjusted by offering deadline extensions, clarifying submission expectations, and providing technical support to help teams meet the requirements while accommodating their varying circumstances.

Figure 9 illustrates the full data management process, from initial data requests to data consolidation. Key steps included:

  1. Requesting and receiving data from research teams
  2. Reviewing and validating submissions for alignment with common measures
  3. Following up with research teams to clarify submissions or request missing information
  4. Cleaning and merging data at the research team level
  5. Uploading finalized, study-specific datasets to each research team’s Box folder
  6. Compiling a unified common measures dataset across all research teams for NRMN Coordination Center use
  7. Uploading the multi-study common measures dataset to Box for NRMN Coordination Center use

 

Flowchart of workflow as described in main text.

Figure 9. Common measures data management workflow. A flowchart outlining the steps involved in managing common measures data across multiple research studies. The process includes reviewing and validating submitted data for alignment with common measures, following up to resolve discrepancies or request missing elements, and cleaning and merging data at the team level. Finalized datasets are uploaded to research team-specific Box folders. The process concludes with the consolidation of a unified common measures dataset for NRMN Coordination Center use.

A more detailed description of the processing methods used to support data cleaning and consolidation is provided in the following section.

Data Cleaning and Consolidation: Resolving Data Issues

Data cleaning and consolidation are fundamental to transforming raw data into a reliable and unified dataset (Van Den Broeck et al., 2005). In the early years of the grant, our approach to data cleaning and consolidation was relatively unstructured. Given the exploratory nature of the work and the need to remain responsive to evolving project needs, we initially prioritized flexibility over formal systems. However, as data from multiple research teams began accumulating, it became increasingly clear that the lack of structure was creating inefficiencies, particularly when it came to identifying and resolving duplicate records or inconsistencies in the shared data from the research teams. As data had been collected from multiple research teams across cohorts, sites, and surveys, issues such as discrepancies, duplicates, and inconsistencies were common. These challenges were inherent, given the diverse sources of data and the unpredictability of what would surface during the process. For example, duplicate participant records often arose because participants completed surveys at different time points or across cohorts within a study, but their records were inconsistently linked due to variations in the participant ID formats used by the research team. These factors compounded the difficulty of accurately identifying and merging duplicate records and required follow-up efforts from our team to resolve.
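
To make the ID-format problem concrete, the sketch below normalizes participant IDs before grouping records that may belong to the same person. It is an illustration only: the normalization rule (uppercase, keep letters and digits) and the `participant_id` field name are assumptions, and real ID schemes may need different rules agreed upon with each research team.

```python
import re
from collections import defaultdict

def normalize_id(raw_id):
    """Reduce an ID to a comparable form (assumed rule: uppercase,
    letters and digits only). Returns None for empty IDs."""
    if raw_id is None:
        return None
    cleaned = re.sub(r"[^A-Za-z0-9]", "", str(raw_id)).upper()
    return cleaned or None

def group_possible_duplicates(records):
    """Group records whose normalized IDs collide.

    Returns {normalized_id: [records]} for IDs seen more than once.
    These are candidates for review, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for rec in records:
        key = normalize_id(rec.get("participant_id"))
        if key is not None:
            groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical records with inconsistent ID formatting across cohorts.
records = [
    {"participant_id": "ab-0012", "cohort": 1},
    {"participant_id": " AB0012 ", "cohort": 2},
    {"participant_id": "AB0013", "cohort": 1},
    {"participant_id": "", "cohort": 1},
]
candidates = group_possible_duplicates(records)
```

Records with empty IDs are left out of the grouping here so they can be handled separately.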

Another challenge was that research teams shared data in various forms, with some providing raw datasets that included both complete and partial responses and others submitting cleaned datasets with varying levels of completeness. This resulted in the multi-study common measures dataset including a mix of complete and incomplete records. We therefore recognized the need for a more organized and proactive approach midway through the grant and developed a set of internal tracking materials and documented processes to support data cleaning and consolidation. We also conducted careful reviews and maintained ongoing coordination with research teams to clarify what was included in each dataset and to apply consistent handling during the consolidation. These materials and processes significantly improved our ability to manage quality across waves of incoming data.

During this more proactive phase, we resolved a range of issues, including missing participant IDs, absence of data in all common measures, participant duplicates, and demographic discrepancies. We also ensured that data collected across different sources and time points were formatted consistently and could be accurately integrated. To carry out this work efficiently and comprehensively, we used a combination of automated tools (e.g., SPSS and Python) and manual review. To promote the accuracy of changes made during the data cleaning and consolidation process, we implemented a multi-step internal review process. Initial cleaning was typically conducted by one coordination center team member using standardized tools and scripts. Ambiguous cases, such as conflicting demographic values or potential duplicate entries, were flagged for discussion and reviewed collaboratively. When needed, we consulted with the research teams to confirm details or resolve inconsistencies. This layered review process ensured that any modifications were deliberate and agreed upon by multiple team members.

The initial multi-study common measures dataset contained over 17,000 records across 11 research studies. After applying a series of exclusion criteria, including removal of records with missing participant IDs, incomplete cases without common measures, and unresolved duplicates, the dataset was reduced to approximately 16,000 valid survey responses following the initial data processing phase.

Cases were excluded based on the following criteria:

  • Missing participant ID from single-study datasets
  • No data provided for any common measures
  • Duplicates that could not be resolved or merged
  • Records identified as test entries or placeholder data
  • Records flagged as duplicate responses submitted by the same individual (e.g., partial or blank entries generated by reopening a survey)
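
As a rough sketch, criteria like these can be applied as a single pass that records why each case was dropped. Everything named here is hypothetical: the measure field names, the `is_test_entry` and `unresolved_duplicate` flags (assumed to have been set during earlier review), and the reason labels. It is not the script used for the NRMN dataset.

```python
# Hypothetical common measure columns; a real dataset would list many more.
COMMON_MEASURE_FIELDS = ["measure_a", "measure_b"]

def exclusion_reason(record):
    """Return a short reason label if the record should be excluded,
    or None if it should be kept."""
    if not record.get("participant_id"):
        return "missing_participant_id"
    if record.get("is_test_entry"):
        return "test_or_placeholder"
    # A value of 0 counts as data; only None or empty strings are blank.
    if all(record.get(f) in (None, "") for f in COMMON_MEASURE_FIELDS):
        return "no_common_measures"
    if record.get("unresolved_duplicate"):
        return "unresolved_duplicate"
    return None

def apply_exclusions(records):
    """Split records into kept cases and (reason, record) exclusions."""
    kept, excluded = [], []
    for rec in records:
        reason = exclusion_reason(rec)
        if reason is None:
            kept.append(rec)
        else:
            excluded.append((reason, rec))
    return kept, excluded
```

Keeping the reason alongside each excluded record makes it straightforward to report how many cases each criterion removed.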

Throughout this process, we worked closely with the research teams to verify participant uniqueness, particularly in cases where individuals may have appeared in multiple cohorts or research studies. While complete verification was not always feasible due to limitations in available identifying information, this step was essential for minimizing duplication and striving to represent each participant only once at the end of initial cleaning. Identifiable information such as participant IDs, names, and email addresses was available only when research teams shared these data with us. Such identifiers were essential for verifying participant uniqueness and accurately merging records. Access to identifiable data was strictly controlled and limited to authorized personnel under data use agreements and institutional review board (IRB) protocols. The multi-study common measures dataset used for papers and working groups contained de-identified data to protect participant privacy.

Because some research teams shared raw data before conducting their own internal cleaning, the data we received contained extraneous cases, such as blank or partial responses that appeared as duplicate entries. These were likely the result of participants opening a survey, closing it, and then reopening it later, which created multiple entries with minimal or no data. Such records were systematically identified, reviewed by the research team, and removed during our data review. We also found instances where participants had been enrolled in multiple cohorts within the same research study, further complicating the verification process. To address this, we reached out to each research team, sharing a list of participants with duplicate records across cohorts. Each research team then confirmed which record should be included in the common dataset.
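
The reopened-survey pattern described above can be surfaced by comparing how many items each entry answers. The sketch below flags, for each participant and survey, every entry except the most complete one. The field names and the "most answered items wins" rule are assumptions for illustration; the flagged list is meant to be reviewed (for example, by the research team), not deleted automatically.

```python
from collections import defaultdict

def answered_count(record, item_fields):
    """Number of items with a non-blank response."""
    return sum(1 for f in item_fields if record.get(f) not in (None, ""))

def flag_reopened_entries(records, item_fields):
    """Flag likely blank or partial repeats of the same survey.

    For each (participant_id, survey) pair with more than one entry,
    the entry with the most answered items is treated as the primary
    response and all others are returned for review. Ties keep the
    first entry encountered and should be resolved manually.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["participant_id"], rec["survey"])].append(rec)
    flagged = []
    for entries in groups.values():
        if len(entries) < 2:
            continue
        best = max(entries, key=lambda r: answered_count(r, item_fields))
        flagged.extend(r for r in entries if r is not best)
    return flagged
```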

While we were able to identify unique individuals within each cohort of a research study, fully identifying individuals who participated across multiple research studies was challenging, as most research teams did not capture data on cross-study participation. Ultimately, this was only possible when research teams provided participant names and emails in the data shared with our NRMN Coordination Center team, which was not a requirement. Because not all studies shared this information, we were unable to determine cross-study participation and did not report on this outcome.

Because research teams tailored measures to their specific research study objectives and designs, not every case included data for every common measure. This applied both to required measures (see Chapter 4 for more information on which measures were required) and to measures that were used in common by more than one team even though they were not required. Incomplete surveys, skipped questions, and variations in whether certain questions were mandatory also contributed to missing data. While all research studies included required common measures, the variations in how these measures were applied sometimes affected the consistency and completeness of the data. For example, one research study collected demographic data at the end of the intervention, while others gathered it at the start. As a result, demographic data was not available for every participant and some variables were missing for certain individuals.

Beyond cleaning the data from the research teams, we also created a set of categorical variables to enhance interpretability and support downstream analysis. These included general context based on distribution and tracking spreadsheets provided by the research teams, such as research team name, survey type, survey start and end dates, cohort or site information, and intervention status at the respondent level. In addition, we created individual-level classification variables to support stratified analysis. These included role-based identifiers (e.g., mentor vs. mentee) and a career stage classification variable that grouped participants into categories such as undergraduate students, postdoctoral scholars, faculty, and medical students. To develop this classification, we used participants’ self-reported academic rank and the target population data provided by each research team.
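
A derived career stage variable of this kind might be built roughly as follows. This is a hedged sketch: the rank labels in the lookup table and the fallback order (self-reported rank first, then the research team's target population) are assumptions made for illustration, not the documented NRMN coding rules.

```python
# Hypothetical lookup from self-reported rank to a career stage category.
RANK_TO_STAGE = {
    "assistant professor": "faculty",
    "associate professor": "faculty",
    "professor": "faculty",
    "postdoc": "postdoctoral scholar",
    "postdoctoral fellow": "postdoctoral scholar",
    "undergraduate": "undergraduate student",
    "medical student": "medical student",
}

def career_stage(self_reported_rank, team_target_population=None):
    """Derive a career stage category for one participant.

    Uses the self-reported rank when it maps to a known category;
    otherwise falls back to the target population reported by the
    research team (which may itself be None, i.e., missing).
    """
    key = (self_reported_rank or "").strip().lower()
    if key in RANK_TO_STAGE:
        return RANK_TO_STAGE[key]
    return team_target_population
```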

To prepare the data for transfer to the data repository and improve the granularity of career stage and intervention information, we followed up with research teams during the no-cost extension year to request additional details for participants, specifically to determine their intervention status and whether they were early-career, mid-career, or senior faculty. While unanticipated, this follow-up was crucial for refining the classification and ensuring accurate categorization of both career stage and intervention status by participant. This additional information, also included as categorical variables in the dataset, played a pivotal role in segmenting the data and supporting more detailed analyses across different study designs.

Reflecting on the full data cleaning experience, it was clear that having an even more structured and proactive approach from the beginning could have alleviated much of the extra effort required later. Early planning and a well-defined strategy would have streamlined the resolution of data issues, particularly the time-consuming task of identifying duplicate records at various stages. This experience highlights the importance of upfront organization and the significant efficiency it brings to the data management process. It provides an essential lesson for those in coordination center roles, where meticulous planning can significantly reduce the need for backtracking and rework.

Handling Text Responses in Common Measures

Cleaning and standardizing free-text responses, such as participants’ reported home institution and organization information, presents unique and complex challenges that are common in multi-source datasets collected across diverse research teams and cohorts. These responses often contain inconsistencies, ambiguities, and a variety of formatting issues that can significantly impact the accuracy and usability of the data if not carefully addressed. This section provides a detailed example of how we approached these challenges by combining automated processing with rigorous manual review. We highlight broader lessons learned in our data management efforts, illustrating the importance of thoughtful, multi-layered strategies, including text response standardization, that contribute to producing an accurate, consistent dataset that is ready for in-depth analysis.

Current Institution/Organization

All 11 research studies collected information on their participants’ organization or institution. While teams that recruited at specific sites provided us with this information in a uniform format for all their participants, most studies collected this information by asking participants to enter their institution or organization names through open-text fields. A few teams used a drop-down list for participants to select their institution/organization; during the initial data cleaning process, these values were converted to text entries to match the open-text responses. These varied text responses across participants and studies led to inconsistencies in the multi-study common measures dataset such as abbreviation differences, typographical errors, unclear entries, and names of institutions with no branch/campus information. To keep institution and organization information consistent and reliable, we developed a step-by-step cleaning process to standardize the text responses to a consistent format.

The initial cleaning process included correcting typographical errors and removing special characters, as well as removing unclear words or names when we could not confidently standardize the response. We matched abbreviations with their standard forms (e.g., “Univ.” to “University”) using Python scripts with fuzzy matching algorithms (Kleshch, 2024; Kostanyan & Harmandayan, 2022). These algorithms were also used to identify and reconcile similar institution and organization names across the dataset (such as “University of Texas” and “University of Texas at Austin”). Python scripts using regular expressions (regex) applied previously created dictionaries to flag unclear cases for manual review. All the cleaned and flagged entries were verified and confirmed by cross-referencing official websites of institutions and organizations. Most institution and organization names were cleaned and standardized using the scripts, but entries that still could not be verified after manual review were marked as missing.
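
The sketch below illustrates abbreviation expansion followed by fuzzy matching against a list of standard names. It uses `difflib` from the Python standard library as a stand-in; the chapter does not specify which fuzzy matching library was used. The abbreviation dictionary, the canonical name list, and the 0.85 similarity cutoff are all assumptions chosen for the example.

```python
import re
from difflib import get_close_matches

# Hypothetical dictionaries; real ones would be much larger.
ABBREVIATIONS = {"univ": "university", "coll": "college"}
CANONICAL_NAMES = [
    "University of Texas at Austin",
    "Arizona State University",
    "University of Detroit Mercy",
]

def normalize(text):
    """Lowercase, replace special characters with spaces, expand abbreviations."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

def match_institution(raw, canonical=CANONICAL_NAMES, cutoff=0.85):
    """Return the best-matching standard name, or None when no
    candidate is similar enough (i.e., send to manual review)."""
    lookup = {normalize(name): name for name in canonical}
    hits = get_close_matches(normalize(raw), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

With these settings, an entry such as "Arizona State Univ." resolves to the standard name, while the campus-less "University of Texas" falls below the cutoff and returns None, which mirrors the kind of case that was flagged for manual review.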

Additionally, where possible, we completed missing institution information by extracting institutional details from participants’ email addresses using a Python script. This script utilized regular expressions (regex) to isolate the domain of email addresses (e.g., extracting university.edu from name@university.edu). We then manually reviewed the extracted domains to identify potential institutional affiliations, provided the domain was clearly institutional (i.e., not from generic email providers like Gmail or Yahoo). Email domains were also used to verify and validate unclear institution entries when available. For example, when a participant entered “asu” or “udm” but their email domains were “@asu.edu” and “@udmercy.edu,” we could confirm the entries corresponded to Arizona State University and University of Detroit Mercy, respectively.
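
A minimal version of the domain extraction step might look like the following. The regular expression and the short list of generic providers are assumptions for illustration; a production list of generic providers would be longer, and extracted domains would still need the manual review described above.

```python
import re

# Illustrative, not exhaustive.
GENERIC_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}

# Capture everything after the "@" that looks like a domain name.
EMAIL_DOMAIN = re.compile(r"@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\s*$")

def institutional_domain(email):
    """Return the lowercased email domain, or None when the address is
    missing, malformed, or from a generic provider."""
    if not email:
        return None
    match = EMAIL_DOMAIN.search(email.strip())
    if not match:
        return None
    domain = match.group(1).lower()
    return None if domain in GENERIC_PROVIDERS else domain
```

For example, `institutional_domain("name@university.edu")` yields `university.edu`, while a Gmail address yields None.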

After cleaning and standardizing institution and organization names, we classified each academic institution into types such as public four-year, private four-year, community college, liberal arts, technical schools, Ivy League, Minority Serving Institution (MSI), and others. Organizations were also classified into categories such as healthcare, non-profit, government, corporations, and research centers. First, we separated academic institutions (universities and colleges) from organizations using a Python script with the Pandas and regex packages, along with manual checking of all entries. For example, entries that had indicators like “university” and “college” were flagged as academic institutions, while those with terms like “clinic,” “LLC,” or “hospital” were flagged as non-academic organizations.
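
The keyword-based separation can be sketched with two regular expressions built from the indicator terms mentioned above. This example uses only the standard library rather than Pandas, and the choice to send entries that match both patterns, or neither, to manual review is an assumption.

```python
import re

ACADEMIC = re.compile(r"\b(university|college)\b", re.IGNORECASE)
NON_ACADEMIC = re.compile(r"\b(clinic|llc|hospital)\b", re.IGNORECASE)

def flag_entry(name):
    """Return 'academic', 'non_academic', or 'manual_review'."""
    is_academic = bool(ACADEMIC.search(name))
    is_organization = bool(NON_ACADEMIC.search(name))
    if is_academic and not is_organization:
        return "academic"
    if is_organization and not is_academic:
        return "non_academic"
    # Both patterns matched (e.g., a university hospital) or neither did.
    return "manual_review"
```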

Next, we used the fuzzy matching method to classify academic institutions with keyword-based rules and dictionaries. For instance, “Wayne Community College” was classified as a community college, and a list of Ivy League schools from U.S. News was used to identify Ivy League institutions. When the custom lists and dictionaries were insufficient, we referred to the Carnegie classification database, the Integrated Postsecondary Education Data System (IPEDS), and official institution websites to confirm institution types. To retain accurate information, we implemented three classification fields to capture multiple types when applicable. For example, a public community college that is also a Minority Serving Institution (MSI) would be assigned both categories. MSI-designated institutions were carefully identified through manual review using sources such as the 2023-2024 MSI List developed by the National Aeronautics and Space Administration (NASA) Minority University Research and Education Project (MUREP) and official institution websites.
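
One way to represent multiple types per institution is a fixed number of classification fields, filled in order and padded with missing values. In the sketch below, the keyword rules, the MSI name set, and the institution "Example Community College" are hypothetical placeholders; a real MSI set would come from a curated source such as the list cited above.

```python
MAX_TYPE_FIELDS = 3  # mirrors the three classification fields described above

def classify_institution(name, msi_names, keyword_rules):
    """Return exactly three type fields for an institution name.

    keyword_rules: list of (keyword, label) pairs matched against the
        lowercased name.
    msi_names: set of standardized names designated as MSIs.
    Unused fields are returned as None.
    """
    types = []
    lowered = name.lower()
    for keyword, label in keyword_rules:
        if keyword in lowered and label not in types:
            types.append(label)
    if name in msi_names:
        types.append("Minority Serving Institution")
    types = types[:MAX_TYPE_FIELDS]
    return types + [None] * (MAX_TYPE_FIELDS - len(types))

# Hypothetical inputs for illustration only.
rules = [("community college", "community college")]
msi = {"Example Community College"}
fields = classify_institution("Example Community College", msi, rules)
```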

A similar approach was used to classify organizations. A Python script using keyword-matching dictionaries flagged entries including keywords like “Medical Center” or “Hospital” as healthcare organizations. Unclear entries were flagged for manual review, and those were verified using official organization websites. Two classification fields were included to capture multiple types for non-academic organizations if more than one category applied. Once entries were cleaned and classified by institution and organization type, we added location information such as city, state, and country for each standardized entry. We developed a Python script that used pattern recognition algorithms and regex to extract location information from names. For example, “University of California, Berkeley” was processed to assign “Berkeley” as the city, “California” as the state, and “US” as the country. For international institutions or organizations, state data was marked as missing since not all countries follow a state system. City and country information were retained, using standardized country abbreviations like UK for United Kingdom.
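
The location extraction for names that follow a "University of State, City" pattern can be sketched as below. The pattern covers only that one naming convention, and the state set is deliberately truncated for the example; both are assumptions, and names that do not match are left with missing location fields for later review.

```python
import re

# Truncated for illustration; a full script would list every U.S. state.
US_STATES = {"California", "Texas", "Wisconsin", "Michigan"}

# Matches names such as "University of California, Berkeley".
STATE_CITY = re.compile(
    r"^University of (?P<state>[A-Za-z ]+),\s*(?P<city>[A-Za-z .'-]+)$"
)

def extract_location(name):
    """Return a dict with city, state, and country; None where unknown."""
    location = {"city": None, "state": None, "country": None}
    match = STATE_CITY.match(name.strip())
    if match and match.group("state").strip() in US_STATES:
        location.update(
            city=match.group("city").strip(),
            state=match.group("state").strip(),
            country="US",
        )
    return location
```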

We also used a Large Language Model (LLM) Application Programming Interface (API) to suggest possible location information when it was not clearly stated in the entries, such as “Harvard University” or “Mayo Clinic.” All suggestions were verified manually by checking the official websites of the institutions and organizations, with extra attention given to entries that were flagged by the script. The flagged entries included those with generic or non-branch specific institution names, such as “University of California” and “University of Wisconsin System,” where no specific branch or campus was mentioned. When a branch or campus could not be identified, only confirmed location information was included. For example, in the entry “University of Wisconsin System,” the state was listed as “Wisconsin” and the country as “US,” while the city was left blank. All location suggestions from the API were reviewed manually to ensure they were accurate and consistent.

This cleaning process helped us standardize all open-text entries of institutions and organizations from participants. The Python scripts were used along with rigorous manual review and regular team check-ins to confirm and validate the institution and organization details. A diagram of the data cleaning process for institution and organization responses can be found in Figure 10.

Flowchart of workflow as described in main text.

Figure 10. Cleaning process for institution/organization names. A flowchart showing the systematic cleaning of institution data through four connected processes: 1) Institution name processing (extracting from email domains when missing); 2) Standardization and review (formatting, resolving variations, handling ambiguous cases); 3) Location validation (verifying geographic data integrity); and 4) Classification (categorizing as academic/non-academic) before final mapping back to the original dataset.

By developing a structured process for data cleaning, we demonstrate how a coordination center might approach handling a collective dataset drawn from multiple research studies, each with its own unique survey design, time points, and participant group. Aggregating data from different sources into a unified common dataset requires careful attention to inconsistencies, varying formats, and missing or incomplete information. The strategies and tools developed for this specific dataset can be applied to similar large-scale data aggregation efforts, ensuring that diverse data sources are harmonized and standardized in a way that facilitates meaningful and accurate analysis across research studies.

Data Sharing: Making the Data Accessible

One of the core principles of the data management process was ensuring that the data was not only cleaned and consolidated but also made accessible for future research, while maintaining privacy and adhering to ethical standards. This section outlines how we shared data with the research teams, external users, and future researchers, as well as the strategies employed to ensure ongoing access to the dataset.

Sharing Common Measures Data with Each Research Team

Over the course of the grant, we coordinated yearly data transfers through Box, where each research team could upload data files to their respective folders. This process provided research teams with consistent access to the data they contributed to the coordination center while also allowing the coordination center to easily track updates. Sharing the cleaned data back with contributing research teams was an essential part of maintaining transparency, honoring data ownership, and enabling teams to verify how their data had been consolidated and used. Several months before the end of the grant, we shared the final single-study common measures dataset with each research team and requested quality checks to confirm the accuracy and consistency of the data. This step was essential for resolving any remaining discrepancies and ensuring each dataset was finalized before compiling the multi-study common measures dataset for wider distribution.

Sharing Unified Common Measures Data

Once the data was cleaned and consolidated across the 11 research teams, we stored the multi-study common measures dataset in Box, ensuring it was easily accessible to our coordination center team. To protect privacy and preserve the integrity of the data, we implemented strict access controls. These measures ensured that only authorized personnel could access the data, in compliance with ethical standards and to safeguard participant confidentiality. The compiled dataset was then organized and prepared for internal use across the NRMN Phase II community, including the coordination center and research teams, with the goal of supporting cross-team collaboration and enabling broader analyses across studies. Review and approval for the process of sharing the consolidated dataset was granted by the University of Wisconsin–Madison Institutional Review Board (IRB protocol number 2019-0956), ensuring that all data-sharing practices complied with ethical guidelines.

Sharing Data Information via the Data Portal

Toward the latter part of the grant period, the NRMN Coordination Center launched the NRMN Phase II Data Site, a public portal designed to support the NRMN research community. The site includes the list of common measures, descriptors of the multi-study common measures dataset, and a list of publications from NRMN Phase II, offering comprehensive resources to facilitate further research and analysis using the multi-study common measures dataset (see Chapter 4 for more information).

Data Requests to the NRMN Coordination Center

The NRMN Coordination Center established a process for managing internal requests for access to the multi-study common measures dataset until the data was permanently transferred to the data repository. By the end of the grant period, we received three formal internal data requests within the NRMN Phase II research community: two requests from the coordination center, which allowed us to test and refine the request process, and one request from a research team. Each request was carefully reviewed to ensure it met the established criteria for access, which included ethical considerations, the intended use of the data, compliance with relevant regulations, and confirmation from each PI regarding the use of their data and any measures that require permission. For the research team’s request, we also held meetings to discuss their objectives and how they planned to use the multi-study common measures dataset. This request was not fulfilled, as the team requested additional time to review the measures and ultimately rescinded their request. We created a step-by-step guide to standardize the process for responding to each data request (see Appendix 14: Data Request Process) and used a Google Form to manage data request inquiries (see Appendix 15: Common Data Inquiry Form).

Disseminating Data and Research Impact

Data dissemination played a vital role in sharing the findings and showcasing the impact of the NRMN initiative. In the last years of the grant, we prioritized dissemination work, including writing a descriptive paper, exploring data subsets, and conducting additional analyses. One of these efforts was the publication of a descriptive data paper (Hyun et al., 2025), which provided an in-depth summary of participants’ characteristics and mentoring experiences. The paper serves as a valuable resource for researchers looking to use the NRMN Phase II common data in future research studies, focusing on key aspects of the dataset that can contribute to the development of their work.

As mentioned in earlier chapters, we were able to extend our work into a sixth year due to a no-cost extension of the grant period. Other data work is set to continue after the no-cost extension year. A COVID-19 working group, for example, contributed to increased interest in the dataset. Researchers studying the effects of the pandemic on various populations found the common data to be a valuable resource for their investigations. A bibliometric analysis of NRMN Phase II studies has also served as a valuable tool to track the reach and influence of the research community (McDaniels & Sorkness, unpublished). By monitoring citations, this analysis will provide insights into how NRMN Phase II is contributing to our broader understanding of mentoring interventions. Additionally, a frequency of mentorship analysis paper (Hyun et al., unpublished) examined mentorship across a large, diverse sample of mentors and mentees, finding that time spent on mentoring activities varied widely and was shaped by demographics, mentoring role, career stage, and institutional context. As of the writing of this book, the outcomes of these research studies are still emerging, and the impact of this work is yet to be fully assessed.

This no-cost extension played a significant role in both data management and dissemination. It gave us additional time to complete tasks, finalize deliverables, and refine the dataset so that it was of the highest quality before being distributed more widely. It also allowed more opportunities for research teams to engage with the data, refine their analyses, and contribute to the overall body of research. Our experience with the extension highlights the importance of having sufficient resources to support ongoing efforts and ensure comprehensive data management.

Ensuring Future Access

To ensure the long-term accessibility of the data for future researchers, we focused on creating comprehensive documentation that would facilitate transparency and understanding. This documentation provided a detailed explanation of the dataset’s structure, variables, and context, enabling future users to fully comprehend the data and its intended use, as well as its limitations. Given the dataset’s complexity and its potential application across diverse research contexts, transparency and clarity were essential to maintaining its usability over time. As part of our planning, we consulted with colleagues who had prior experience depositing data with ICPSR. Their most consistent advice was to initiate contact with ICPSR as early as possible, as their team could then provide tailored guidance based on the characteristics of our dataset. Acting on this recommendation, we met with ICPSR staff during our no-cost extension year in early spring 2025, ahead of the grant’s conclusion in June, to present an overview of the multi-study common measures dataset, discuss whether it aligned with their deposit standards, and determine a feasible timeline for the deposit process.

These conversations proved instrumental. They not only helped us identify the documentation and materials we would need to prepare, but also gave ICPSR insight into the complexity and diversity of the dataset. Because the data were aggregated across multiple research studies, each involving different applications of the common measures, distinct interventions, and varied time points, ICPSR recommended organizing the documentation according to each research team. This approach will help future users more effectively navigate the dataset and understand the context in which different components were collected. The dataset transfer is scheduled for late 2025. By combining early engagement with ICPSR, careful planning, and clear, structured documentation, we aimed to ensure that the dataset remains a valuable resource well beyond the conclusion of the grant. These efforts reflect our broader commitment to data stewardship and to supporting future research that builds on this foundational work.

Conclusion

Coordination centers play a vital role in managing complex, multi-study datasets by facilitating consistent data collection, cleaning, consolidation, and sharing across diverse research teams. The NRMN Coordination Center’s experience with the common measures dataset was challenging but rewarding. While perfect harmonization across all datasets is nearly impossible, we developed evolving structures and strategies to manage the multi-study common measures dataset effectively, which helped maintain its reliability and usability.

For those planning to establish coordination centers, our experience highlights the importance of proactive planning, strong communication, and robust documentation to streamline workflows and reduce inconsistencies. Clear processes for data sharing, ethical oversight, and participant privacy are essential to foster collaboration while maintaining trust.

Engaging early with data repositories and providing comprehensive, well-organized documentation ensures long-term stewardship and accessibility. These strategies reflect best practices in responsible data management and demonstrate the critical role coordination centers play in supporting collaborative research infrastructures. The lessons learned from the NRMN Coordination Center offer a valuable roadmap for managing multi-study data effectively and sustainably.

Lessons Learned

  • Standardize survey design and data collection methods when feasible. Extra care should be taken to ensure that common measures used across surveys follow the same prompts, scale items, and scoring. Different survey versions, formats, response scales, and variable definitions across research teams created challenges for data consolidation and comparison. While perfect standardization can be difficult, especially in multi-study consortia, encouraging research teams to use consistent survey instruments and response options from the outset can make data cleaning and analysis considerably easier. Even partial alignment in key measures can reduce ambiguity, enhance data quality, and improve the interpretability of results.
  • Consider intervention and data collection timeline alignment early on and how it will impact data cleaning and consolidation. Aligning time points across studies would ideally facilitate smoother data merging and longitudinal comparisons. However, in many Request for Applications (RFA)-driven consortia, differences in study designs, funding timelines, and objectives make perfect alignment nearly impossible. Nonetheless, early communication about timing and efforts to harmonize time points where possible can mitigate complexity in later data consolidation.
  • Implement a robust system for tracking participants to minimize duplicates. Issues with duplicates arose from blank or partial responses by the same participants, as well as double-counting due to overlap across cohorts and research studies. This duplication occurred because research teams shared raw data before completing their internal cleaning, as their studies were still ongoing during the timeframe of the NRMN Coordination Center’s requests for data. To improve data integrity, reduce duplication, and accurately identify unique individuals, implementing a more robust system for tracking participants is essential.
  • Develop a strategy for reviewing and resolving data issues early. Lack of standardized coding for missing data and inconsistent documentation of multiple survey versions limited usability. Without knowing which measures appeared on each survey version, distinguishing truly missing data was difficult. Establishing structured processes and tools early to track survey versions, variable inclusion, and missing data classification can improve data quality and streamline analysis.
  • Consider user accessibility. We anticipate that users with limited data analysis experience may face challenges navigating the dataset. Greater focus on clear documentation, comprehensive metadata, and user-friendly tools will make the data more accessible for a broad range of users.
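The duplicate-tracking and missing-data lessons above can be illustrated with a minimal sketch. It is not the procedure the coordination center used. It assumes each row carries a shared participant ID, a time point, and a survey version label, and that a lookup of which items appeared on each survey version is available. The field names, item names, and the codes -8 and -9 are all hypothetical.

```python
# Hypothetical sentinel codes; the chapter does not specify a coding scheme.
NOT_ASKED = -8   # item did not appear on this survey version
NO_ANSWER = -9   # item appeared on the survey but was left blank


def completeness(row, items):
    """Count how many of the listed items have a non-blank answer."""
    return sum(1 for item in items if row.get(item) not in (None, ""))


def drop_duplicates(rows, items):
    """Keep one row per (participant_id, time_point): the most complete one."""
    best = {}
    for row in rows:
        key = (row["participant_id"], row["time_point"])
        if key not in best or completeness(row, items) > completeness(best[key], items):
            best[key] = row
    return list(best.values())


def classify_missing(row, items, items_by_version):
    """Replace blanks with NOT_ASKED or NO_ANSWER based on the survey version."""
    asked = items_by_version[row["survey_version"]]
    coded = dict(row)  # copy so the raw row is left untouched
    for item in items:
        if coded.get(item) in (None, ""):
            coded[item] = NO_ANSWER if item in asked else NOT_ASKED
    return coded


# Illustrative data only; not drawn from the NRMN dataset.
ITEMS = ["self_efficacy", "sense_of_belonging"]
ITEMS_BY_VERSION = {
    "v1": {"self_efficacy"},
    "v2": {"self_efficacy", "sense_of_belonging"},
}
rows = [
    # Blank first attempt by P01, followed by a partial response.
    {"participant_id": "P01", "time_point": 1, "survey_version": "v2",
     "self_efficacy": None, "sense_of_belonging": None},
    {"participant_id": "P01", "time_point": 1, "survey_version": "v2",
     "self_efficacy": 4, "sense_of_belonging": None},
    # P02 took a version that did not include sense_of_belonging.
    {"participant_id": "P02", "time_point": 1, "survey_version": "v1",
     "self_efficacy": 5, "sense_of_belonging": None},
]

cleaned = [classify_missing(r, ITEMS, ITEMS_BY_VERSION)
           for r in drop_duplicates(rows, ITEMS)]
for row in cleaned:
    print(row)
```

In this sketch, P01's blank attempt is dropped in favor of the more complete response, and the two blanks for sense_of_belonging receive different codes because only one of the survey versions included that item. Catching duplicates across cohorts or studies would additionally require an identifier shared across research teams, which this sketch simply assumes exists.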

Disclosure: During an earlier stage of the writing process, author AM used Quillbot to edit their writing. As a team, we differed in our views on, and agreement with, the use of large language model (LLM) tools, a form of artificial intelligence (AI), in our writing. To honor the views of each author and remove the influence of AI tools, the author rewrote the sections in which Quillbot had been used. In the spirit of sharing lessons learned, we recommend that teams establish guidelines on the use of LLMs in the early stages of any writing process.

License

Running a National Research Coordination Center: Lessons Learned from NRMN Phase II Copyright © 2026 by Taylor Ajamian, Emma Dums, Jada Holmes, Julie Hau, Krystina Karcz, Melissa McDaniels, Abhijnya Menakur, Christine Pfund, Fátima Sancheznieto, Lisette Serrano, Christine Sorkness, Kim Spencer, and Emily Utzerath is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, except where otherwise noted.