Related papers: Technical Report: CSVM Ecosystem

Technical Report: CSVM format for scientific tabular data

The CSVM (CSV with metadata) format is derived from the CSV format and is used for storing experimental data, models, and specifications. CSVM allows the storage of tabular data with a limited but extensible amount of metadata. This increases the…

Quantitative Methods · Quantitative Biology 2012-07-25 Gérôme Beyries, Frédéric Rodriguez

Technical report: CSVM dictionaries

CSVM (CSV with Metadata) is a simple file format for tabular data. The possible application domain is the same as that of typical spreadsheet files, but CSVM is well suited for long-term storage and the inter-conversion of raw data. CSVM embeds…

Computational Engineering, Finance, and Science · Computer Science 2012-08-13 Frédéric Rodriguez

Specification-based CSV Support in VDM

CSV is a widely used format for data representing systems control, information exchange and processing, logging, etc. Nevertheless, the format is riddled with tricky corner cases and inconsistencies, which can make input data unreliable,…

Software Engineering · Computer Science 2023-03-29 Leo Freitas, Aaron John Buhagiar

RawArray: A Simple, Fast, and Extensible Archival Format for Numeric Data

Raw data sizes are growing and proliferating in scientific research, driven by the success of data-hungry computational methods, such as machine learning. The preponderance of proprietary and shoehorned data formats makes computations slower…

Databases · Computer Science 2022-01-02 David S. Smith

Data management to support reproducible research

We describe the current state and future plans for a set of tools for scientific data management (SDM) designed to support scientific transparency and reproducible research. SDM has been in active use at our MRI Center for more than two…

Quantitative Methods · Quantitative Biology 2015-02-25 B. A. Wandell, A. Rokem, L. M. Perry, G. Schaefer, R. F. Dougherty

SAVIME: A Multidimensional System for the Analysis and Visualization of Simulation Data

Scientific applications produce a huge amount of data, which imposes serious management and analysis challenges. In particular, limitations in current database management systems prevent their adoption in simulation applications, in which…

Databases · Computer Science 2019-03-18 Hermano Lustosa, Fabio Porto

The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis

Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science with its increasingly numerous and complex types of…

Databases · Computer Science 2020-04-10 Ingo Müller, Renato Marroquín, Dimitrios Koutsoukos, Mike Wawrzoniak, Sabir Akhadov, Gustavo Alonso

SimFS: A Simulation Data Virtualizing File System Interface

Nowadays simulations can produce petabytes of data to be stored in parallel filesystems or large-scale databases. This data is accessed over the course of decades often by thousands of analysts and scientists. However, storing these volumes…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-11 Salvatore Di Girolamo, Pirmin Schmid, Thomas Schulthess, Torsten Hoefler

DataFed: Towards Reproducible Research via Federated Data Management

The increasingly collaborative, globalized nature of scientific research combined with the need to share data and the explosion in data volumes present an urgent need for a scientific data management system (SDMS). An SDMS presents a…

Databases · Computer Science 2020-04-09 Dale Stansberry, Suhas Somnath, Jessica Breet, Gregory Shutt, Mallikarjun Shankar

Combining data and metadata: hybrid tabular file formats

When working with astronomical data, metadata is also important. A general-purpose file format for transmission, processing and archiving large datasets should facilitate, among other things, both efficient processing of bulk data and…

Instrumentation and Methods for Astrophysics · Physics 2026-03-17 Mark Taylor

A Multi-Media Exchange Format for Time-Series Dataset Curation

Exchanging data as character-separated values (CSV) is slow, cumbersome and error-prone. Especially for time-series data, which is common in Activity Recognition, synchronizing several independently recorded sensors is challenging. Adding…

Databases · Computer Science 2019-08-05 Philipp M. Scholl, Benjamin Völker, Bernd Becker, Kristof Van Laerhoven

Wrangling Messy CSV Files by Detecting Row and Type Patterns

It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a…

Databases · Computer Science 2019-07-29 Gerrit J. J. van den Burg, Alfredo Nazabal, Charles Sutton

General Scaled Support Vector Machines

Support Vector Machines (SVMs) are popular tools for data mining tasks such as classification, regression, and density estimation. However, original SVM (C-SVM) only considers local information of data points on or over the margin.…

Artificial Intelligence · Computer Science 2010-09-28 Xin Liu, Ying Ding, Forrest Sheng Bao

Convolutional Support Vector Machine

The support vector machine (SVM) and deep learning (e.g., convolutional neural networks (CNNs)) are the two most famous algorithms in small and big data, respectively. Nonetheless, smaller datasets may be very important, costly, and not…

Machine Learning · Computer Science 2020-02-19 Wei-Chang Yeh

SasCsvToolkit -- A versatile parallel 'bag-of-tasks' job submission application on heterogeneous and homogeneous platforms for Big Data Analytics such as for Biomedical Informatics

Background: The need for big data analysis requires being able to process large data which are being held fine-tuned for usage by corporate. It is only very recently that the need for big data has caught attention for low budget corporate…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-08 Abhishek Narain Singh

Simulation and evaluation of cloud storage caching for data intensive science

A common task in scientific computing is the derivation of data. This workflow extracts the most important information from large input data and stores it in smaller derived data objects. The derived data objects can then be used for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-10 Tobias Wegner, Mario Lassnig, Peer Ueberholz, Christian Zeitnitz

Digitizing scientific data and data retrieval techniques

Storing data is easy, but finding and using data is not. It is desirable that the data is stored in a structured format, which can be preserved and retrieved in the future. Creating metadata for the data is one way of creating structured data…

Information Theory · Computer Science 2011-01-04 Ranjeet Devarakonda, Giri Palanisamy, Jim Green

A simple C++ library for manipulating scientific data sets as structured data

Representing scientific data sets efficiently on external storage usually involves converting them to a byte string representation using specialized reader/writer routines. The resulting storage files are frequently difficult to interpret…

Computational Engineering, Finance, and Science · Computer Science 2007-05-23 Christoph Best

Towards a Common Format for Computational Material Science Data

Information and data exchange is an important aspect of scientific progress. In computational materials science, a prerequisite for smooth data exchange is standardization, which means using agreed conventions for, e.g., units, zero base…

Materials Science · Physics 2016-07-19 Luca M. Ghiringhelli, Christian Carbogno, Sergey Levchenko, Fawzi Mohamed, Georg Huhs, Martin Lueders, Micael Oliveira, Matthias Scheffler

A Subset of the CERN Virtual Machine File System: Fast Delivering of Complex Software Stacks for Supercomputing Resources

Delivering a reproducible environment along with complex and up-to-date software stacks on thousands of distributed and heterogeneous worker nodes is a critical task. The CernVM-File System (CVMFS) has been designed to help various…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-30 Alexandre F Boyer, Christophe Haen, Federico Stagni, David R C Hill