The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow easy publication of datasets. So far, however, platforms for data sharing offer limited functions for distributing and interacting with evolving datasets: those that continue to grow with time as more records are added, errors are fixed, and new data structures are created. In this article, we describe a workflow for maintaining and distributing successive versions of an evolving dataset, allowing users to retrieve and load different versions directly into the R platform. Our workflow uses the tools and platforms developed for building and distributing successive versions of an open source software program, including version control, GitHub, and semantic versioning, and applies them to the analogous process of developing successive versions of an open source dataset. Moreover, we argue that this model allows individual research groups to achieve a dynamic and versioned model of data delivery at no cost.

Sharing of a high-quality dataset (a collection of measurements, stored in one or several files) is now considered a first-class scientific output. Increasingly, funding bodies, publishers, and scientific social norms are recognizing the value of sharing datasets, including as stand-alone products without any accompanying analyses. Evidence of this trend is seen in the increasing numbers of stand-alone "data papers" appearing in both standard domain-level journals and specialized data journals. Yet, while the last decade has witnessed a rapid and exciting change in attitudes towards data sharing, the scientific community is still grappling with how to effectively maintain and distribute open source datasets. In particular, in some areas, such as our own area of ecology and evolution, we are only starting to accommodate the fact that some high-quality datasets may be evolving entities.

An evolving (or "living") dataset is one that is subject to occasional or recurrent change. Typical changes may include improving the quality of existing data, adding new data, restructuring the dataset content, or integrating with other datasets. For example, a dataset on biological organisms might be expanded through the addition of new records or improved through the correction of spelling mistakes in taxonomic names. In some cases, datasets may be expected to continue to evolve over extended periods. Evolving datasets are never "finished," and as such there is no "master" or "canonical" version. Rather, as research around a data product grows, many valid versions might be produced. Even datasets that are not initially envisioned as evolving may become so as minor errors are identified and corrected during use.
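To make the idea of applying semantic versioning to a dataset more concrete, the following is a minimal sketch in Python, not the workflow's own implementation. The mapping from kinds of dataset change to version bumps (`BUMP_FOR_CHANGE`) and the function names `next_version` and `latest` are illustrative assumptions, not part of the article or of any library.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version, compared numerically field by field."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"


# Assumed mapping (hypothetical): which kind of dataset change triggers which bump.
BUMP_FOR_CHANGE = {
    "restructure": "major",   # layout changed; existing analysis code may break
    "add_records": "minor",   # new data added; existing structure preserved
    "fix_errors": "patch",    # corrections such as misspelled taxonomic names
}


def next_version(current, change):
    """Return the version string that follows `current` for a given change."""
    v = Version.parse(current)
    bump = BUMP_FOR_CHANGE[change]
    if bump == "major":
        return str(Version(v.major + 1, 0, 0))
    if bump == "minor":
        return str(Version(v.major, v.minor + 1, 0))
    return str(Version(v.major, v.minor, v.patch + 1))


def latest(versions):
    """Pick the most recent release from a list of version strings."""
    return max(versions, key=Version.parse)
```

Parsing the version into integers before comparing matters here: a plain string comparison would rank "1.9.0" above "1.10.0", whereas the numeric comparison orders releases correctly.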