Background: The NYU Health Science Library maintains a data catalog to help researchers locate data for reuse and make their data discoverable by others. Data and supporting resources (e.g., software code) are added to the data catalog through outreach to researchers and periodic review of new publications. The NYU Data Catalog employs “subject domains,” a set of high-level descriptors, to assist users who prefer to browse rather than perform targeted searches. These subject domains were developed from the initial datasets entered in the catalog, with a focus on clinical and population health research. To accommodate growth in the size and scope of the data catalog, the team needed to expand the subject domains to include more concepts.
Description: After consulting librarians who are addressing similar issues at their institutions, a project coordinator determined that an output-informed approach, based on assessing publications by the institution's researchers, would best support growth. Using R to access the NCBI Entrez API, the coordinator developed a methodology to determine which subject domains should be added to the data catalog to facilitate user browsing. Through the API, the coordinator compiled a list of PubMed publication identifiers for authors affiliated with the institution, extracted MeSH terms, ordered the terms by frequency of use, and then judged their applicability as subject domains as opposed to other metadata fields that already existed in the data catalog. The data catalog team continued the curation process by evaluating the appropriateness of each term's granularity, whether MeSH term meanings would be apparent to researchers, and possible overlap between terms. The team reviewed these findings and reached consensus on each subject domain term.
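The retrieval and ranking steps described above can be illustrated with a minimal sketch. This is not the coordinator's code: the original work used R, whereas this sketch uses Python's standard library against the public NCBI E-utilities endpoints (`esearch.fcgi` and `efetch.fcgi`). The function names, the `retmax` default, and the choice to count each MeSH descriptor once per article are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

# Base URL of the public NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_pmids(affiliation, retmax=200):
    """Return PubMed IDs matching an affiliation string (network call).

    The affiliation text is supplied by the caller; the query form
    "<text>[Affiliation]" is an assumed search strategy.
    """
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f"{affiliation}[Affiliation]",
        "retmax": retmax,
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{query}") as resp:
        return json.load(resp)["esearchresult"]["idlist"]


def fetch_pubmed_xml(pmids):
    """Download PubMed records for the given IDs as raw XML bytes (network call)."""
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "id": ",".join(pmids),
        "retmode": "xml",
    })
    with urllib.request.urlopen(f"{EUTILS}/efetch.fcgi?{query}") as resp:
        return resp.read()


def mesh_terms_by_article(pubmed_xml):
    """Map each PMID to the set of MeSH descriptor names on its record."""
    root = ET.fromstring(pubmed_xml)
    terms_by_article = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Assumption: a descriptor counts once per article, so use a set.
        terms = {d.text for d in article.iter("DescriptorName") if d.text}
        terms_by_article[pmid] = terms
    return terms_by_article


def rank_terms(terms_by_article):
    """Return (term, article_count) pairs, most frequent first, ties alphabetical."""
    counts = Counter()
    for terms in terms_by_article.values():
        counts.update(terms)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

A typical run would chain `search_pmids`, `fetch_pubmed_xml`, `mesh_terms_by_article`, and `rank_terms`, fetching records in batches for large result sets. The ranked list is only the input to the manual curation the team describes; judging granularity, clarity, and overlap remains a human step.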
Conclusion: Out of 9,412 extracted MeSH terms, the top 214 terms (representing 49.9% of all publications) were further screened. Five new MeSH terms were selected and three prior terms were combined, increasing the pool of existing subject domains by 28% overall. Using the newly expanded controlled vocabulary, the data catalog team will re-catalog over 300 datasets, then assess user adoption through web analytics on browsing versus searching trends. This approach to creating a controlled vocabulary has other potential applications in the development of local data discovery platforms.