Related papers: Benchmarking JSON BinPack
In this paper, we present the recent advances that highlight the characteristics of JSON-compatible binary serialization specifications. We motivate the discussion by covering the history and evolution of binary serialization specifications…
We present a comprehensive benchmark of JSON-compatible binary serialization specifications using the SchemaStore open-source test suite collection of over 400 JSON documents matching their respective schemas and representative of their use…
Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during…
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization…
JSON (JavaScript Object Notation) is a data encoding that allows structured data to be used in a standardized and straightforward manner across systems. Schemas for JSON-formatted data can be constructed using the JSON Schema standard,…
JSON Schema is the de facto standard for describing the structure of JSON documents. Reasoning about JSON Schema inclusion -- whether every instance satisfying a schema S1 also satisfies a schema S2 -- is a key building block for a variety…
Document sketching using Jaccard similarity has been a workable effective technique in reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression and learning…
When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array--overhead that scales linearly with row count. This…
Recently presented Token-Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While showing solid accuracy in LLM comprehension, there is…
Sparse matrices and tensors are ubiquitous throughout multiple subfields of computing. The widespread usage of sparse data has inspired many in-memory and on-disk storage formats, but the only widely adopted storage specifications are the…
This paper introduces the Bloscpack file format and the accompanying Python reference implementation. Bloscpack is a lightweight, compressed binary file-format based on the Blosc codec and is designed for lightweight, fast serialization of…
Schema discovery is an important aspect to working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse…
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as…
Programming patterns for sequential file access in the .NET Framework are described and the performance is measured. The default behavior provides excellent performance on a single disk - 50 MBps both reading and writing. Using large…
Recovering high-level type information in binaries is a key task in reverse engineering and binary analysis. Binaries contain very little explicit type information. The structure of binary code is incredibly flexible allowing for ad-hoc…
Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present APack, a simple and effective, lossless, off-chip memory compression…
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and…
In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve…
In managed languages, serialization of objects is typically done in bespoke binary formats such as Protobuf, or markup languages such as XML or JSON. The major limitation of these formats is readability. Human developers cannot read binary…
JSON is an essential file and data format in do-mains that span scientific computing, web APIs or configuration management. Its popularity has motivated significant software development effort to build multiple libraries to process JSON…