Related papers: Random XML sampling the Boltzmann way
A natural approach to software quality assurance consists in writing unit tests securing programmer-declared code invariants. Throughout the literature a great body of work has been devoted to tools and techniques automating this…
The Boltzmann model for the random generation of "decomposable" combinatorial structures is a set of techniques that allows for efficient random sampling algorithms for a large class of families of discrete objects. The usual requirement of…
Randomly generating structured objects is important in testing and optimizing functional programs, whereas generating random $\lambda$-terms is more specifically needed for testing and optimizing compilers. For that a tool called QuickCheck has…
XML is a standard and universal language for representing information. XML processing is supported by two key frameworks: DOM and SAX. SAX is efficient, but leaves the developer to encode much of the processing. This paper introduces a…
The advantages of having an XML schema for XML documents are numerous. However, many XML documents in practice are not accompanied by a schema or by a valid schema. Relax NG is a popular and powerful schema language, which…
This paper addresses the uniform random generation of words from a context-free language (over an alphabet of size $k$), while constraining every letter to a targeted frequency of occurrence. Our approach consists in a multidimensional…
We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs…
This article introduces new algorithms for the uniform random generation of labelled planar graphs. Its principles rely on Boltzmann samplers, as recently developed by Duchon, Flajolet, Louchard, and Schaeffer. It combines the Boltzmann…
Dictionaries are often developed using tools that save to Extensible Markup Language (XML)-based standards. These standards often allow high-level repeating elements to represent lexical entries, and utilize descendants of these repeating…
Generating synthetic variants of a document is often posed as text-to-text transformation. We propose an alternate LLM based method that first decomposes a document into semantic frames and then generates text using this interim sparse…
We report on the current state of development of a document suite and its applications. This collection of tools for the flexible and robust processing of documents in German is based on the use of XML as a unifying formalism for encoding…
The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present…
In the context of software testing, generating complex data inputs is frequently performed using a grammar-based specification. For combinatorial reasons, an exhaustive generation of the data -- of a given size -- is practically impossible,…
Our method proposes the efficient generation of samples from an unnormalized Boltzmann density by solving the underlying continuity equation in the low-rank tensor train (TT) format. It is based on the annealing path commonly used in MCMC…
We survey several methods of generating large random lambda-terms, focusing on their closed and simply-typed variants. We discuss methods of exact- and approximate-size generation, as well as methods of achieving size-uniform and…
Boltzmann samplers and the recursive method are prominent algorithmic frameworks for the approximate-size and exact-size random generation of large combinatorial structures, such as maps, tilings, RNA sequences or various tree-like…
XML documents are described by a document type definition (DTD). An XML-grammar is a formal grammar that captures the syntactic features of a DTD. We investigate properties of this family of grammars. We show that every XML-language…
We present a framework for the automatic compilation of multi-parametric Boltzmann samplers for algebraic data types in Haskell. Our framework uses Template Haskell to synthesise efficient, entropy-optimal samplers generating random…
We propose Blank Language Model (BLM), a model that generates sequences by dynamically creating and filling in blanks. The blanks control which part of the sequence to expand, making BLM ideal for a variety of text editing and rewriting…
This paper reports on the INRIA group's approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that allows taking into account the structure only or both the structure…