Analyzing data properties using statistical sampling:
illustrated on scientific file formats

Kunkel, Julian M.

Text (Open Access)
- Published Version
· Available under License Creative Commons Attribution Non-commercial.

Kunkel, J. M. (2016) Analyzing data properties using statistical sampling: illustrated on scientific file formats. Supercomputing Frontiers and Innovations, 3 (3). pp. 34-39. ISSN 2409-6008 doi: 10.14529/jsfi160304

Abstract/Summary

Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing which file formats are relevant allows those formats to be optimized, and it also helps in procurement to define benchmarks that cover them. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a subset of the data. Some of those studies claim that the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many such experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest with respect to the number of files and also to the occupied storage space. It demonstrates that on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the costs of such studies.
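The idea of estimating format shares from a 1% sample can be illustrated with a minimal Python sketch. It is not the paper's implementation: the file population is synthetic, the format names, proportions, and mean sizes are made up, and all function names are hypothetical. The sketch assumes that shares by file count are estimated from a uniform random sample of files, and that shares by occupied space are estimated by drawing files with probability proportional to their size.

```python
import random
from collections import Counter


def make_synthetic_population(n, rng):
    """Build a made-up list of (format, size_in_bytes) records for illustration."""
    formats = ["netcdf", "grib", "text", "other"]   # assumed labels
    probs = [0.2, 0.1, 0.5, 0.2]                    # assumed mix by file count
    mean_size = {"netcdf": 5e8, "grib": 2e8, "text": 1e5, "other": 1e7}
    files = []
    for _ in range(n):
        fmt = rng.choices(formats, weights=probs, k=1)[0]
        size = int(rng.expovariate(1.0 / mean_size[fmt])) + 1
        files.append((fmt, size))
    return files


def true_shares(files):
    """Compute exact shares by file count and by volume (the full scan)."""
    by_count, by_volume = Counter(), Counter()
    for fmt, size in files:
        by_count[fmt] += 1
        by_volume[fmt] += size
    total_bytes = sum(by_volume.values())
    return (
        {f: c / len(files) for f, c in by_count.items()},
        {f: b / total_bytes for f, b in by_volume.items()},
    )


def estimate_by_count(files, fraction, rng):
    """Estimate shares by file count from a uniform sample without replacement."""
    k = max(1, int(len(files) * fraction))
    sample = rng.sample(files, k)
    counts = Counter(fmt for fmt, _size in sample)
    return {fmt: c / k for fmt, c in counts.items()}


def estimate_by_volume(files, fraction, rng):
    """Estimate shares by volume by drawing files (with replacement)
    with probability proportional to their size."""
    k = max(1, int(len(files) * fraction))
    weights = [size for _fmt, size in files]
    sample = rng.choices(files, weights=weights, k=k)
    counts = Counter(fmt for fmt, _size in sample)
    return {fmt: c / k for fmt, c in counts.items()}


rng = random.Random(42)  # fixed seed so the run is reproducible
files = make_synthetic_population(100_000, rng)
count_true, volume_true = true_shares(files)
count_est = estimate_by_count(files, 0.01, rng)    # scan 1% of files
volume_est = estimate_by_volume(files, 0.01, rng)  # 1% as many draws, size-weighted

for fmt in sorted(count_true):
    print(
        f"{fmt:7s} count: true={count_true[fmt]:.3f} est={count_est.get(fmt, 0.0):.3f}"
        f"  volume: true={volume_true[fmt]:.3f} est={volume_est.get(fmt, 0.0):.3f}"
    )
```

In the size-weighted draw, the probability that a sampled file belongs to a format equals that format's share of the total volume, so the fraction of draws per format is a direct estimate of its share of occupied space. Note that this sketch needs every file's size to build the weights; sizes are typically available from file system metadata without reading file contents.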

Item Type	Article
URI	https://centaur.reading.ac.uk/id/eprint/77674
Identification Number/DOI	10.14529/jsfi160304
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Publisher	Publishing Center of South Ural State University

Date Deposited	19 Jun 2018 15:59
Last Modified	15 Jun 2025 09:22