Related papers: Guidelines for data analysis scripts
Data analysis is at the core of scientific studies, a prominent task that researchers and practitioners typically undertake by programming their own set of automated scripts. While there is no shortage of tools and languages available for…
AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have…
It is intended in this document to introduce a handy systematic method for enumerating all possible data dependency cases that could occur between any two instructions that might happen to be processed at the same time at different stages…
The development and approval of new treatments generates large volumes of results, such as summaries of efficacy and safety. However, it is commonly overlooked that analyzing clinical study data also produces data in the form of results.…
Data preparation, especially data cleaning, is very important to ensure data quality and to improve the output of automated decision systems. Since there is no single tool that covers all steps required, a combination of tools -- namely a…
Software needs to be secure, in particular, when deployed to critical infrastructures. Secure coding guidelines capture practices in industrial software engineering to ensure the security of code. This study aims to assess the level of…
Fully automated astronomical data calibration and imaging pipelines are difficult to develop without a good prototyping method which permits to bridge the time between observatory commissioning and the moment when the special features and…
Statistical analysis is the tool of choice to turn data into information, and then information into empirical knowledge. To be valid, the process that goes from data to knowledge should be supported by detailed, rigorous guidelines, which…
Clinical practice guidelines are long, multimodal documents whose branching recommendations are difficult to convert into executable clinical decision support (CDS), and one-shot parsing often breaks cross-page continuity. Recent LLM/VLM…
Coding standards and good practices are fundamental to a disciplined approach to software projects, whatever programming languages they employ. Prolog programming can benefit from such an approach, perhaps more than programming in other…
Increasingly larger number of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling,…
Systematic reviews, which entail the extraction of data from large numbers of scientific documents, are an ideal avenue for the application of machine learning. They are vital to many fields of science and philanthropy, but are very…
The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this…
Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models…
Process analytics approaches allow organizations to support the practice of Business Process Management and continuous improvement by leveraging all process-related data to extract knowledge, improve process performance and support…
Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and…
Clinical coding is crucial for healthcare billing and data analysis. Manual clinical coding is labour-intensive and error-prone, which has motivated research towards full automation of the process. However, our analysis, based on US English…
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and…
Even though novel imaging techniques have been successful in studying brain structure and function, the measured biological signals are often contaminated by multiple sources of noise, arising due to e.g. head movements of the individual…
Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities…