Related papers: Triangular clustering in document networks
Many complex networks, from the World Wide Web to biological networks, grow in ways that depend on the heterogeneous features of their nodes. The feature of a node might be a discrete quantity such as a classification of a URL document as…
The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next…
In this paper, we explore a set of novel features for authorship attribution of documents. These features are derived from a word network representation of natural language text. As has been noted in previous studies, natural language tends…
Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what…
There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic…
A fundamental property of complex networks is the tendency for edges to cluster. The extent of the clustering is typically quantified by the clustering coefficient, which is the probability that a length-2 path is closed, i.e., induces a…
This work pertains to the diversified ranking of web resources and interconnected documents that rely on a network-like structure, e.g., web pages. A practical example of this would be a query for the k most relevant web pages that are…
Document clustering is an unsupervised approach in which a large collection of documents (corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Meaningful representation of documents and…
High triangle density -- the graph property stating that a constant fraction of two-hop paths belong to a triangle -- is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove…
Text clustering holds significant value across various domains due to its ability to identify patterns and group related information. Current approaches which rely heavily on a computed similarity measure between documents are often limited…
A text network is a data type in which each vertex is associated with a text document and the relationships between documents are represented by edges. The proliferation of text networks such as hyperlinked webpages and academic citation…
Coherence is an important aspect of text quality, and various approaches have been applied to coherence modeling. However, existing methods solely focus on a single document's coherence patterns, ignoring the underlying correlation between…
Keyword-based information processing has limitations due to its simple treatment of words. In this paper, we introduce named entities as objectives into document clustering, which are the key elements defining document semantics and in many…
This paper explores intellectual and social proximity among scholarly journals by using network fusion techniques. Similarities among journals are initially represented by means of a three-layer network based on co-citations, common authors…
A recurrent neural network that has been trained to separately model the language of several documents by unknown authors is used to measure similarity between the documents. It is able to find clues of common authorship even when the…
We develop a full theoretical approach to clustering in complex networks. A key concept is introduced, the edge multiplicity, that measures the number of triangles passing through an edge. This quantity extends the clustering coefficient in…
We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of…
Text documents are structured on multiple levels of detail: individual words are related by syntax, but larger units of text are related by discourse structure. Existing language models generally fail to account for discourse structure, but…
In the study of citation networks, modeling the high clustering is a challenging problem. Existing studies indicate that a promising way to model the high clustering is a copying strategy, i.e., a paper copies the references of its…
Real-world networks often exhibit strong transitivity with nontrivial local clustering spectra and degree correlations. Such features are not easily modeled in tractable network models, creating an obstacle to the theoretical understanding…
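Two quantities recur in the abstracts above: the clustering coefficient, described as the probability that a length-2 path is closed, and the edge multiplicity, described as the number of triangles passing through an edge. The following is a minimal sketch of both on an undirected simple graph, not taken from any of the listed papers. The function names (`build_adjacency`, `global_clustering`, `edge_multiplicity`) and the toy edge list are assumptions made for illustration only.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_clustering(adj):
    """Fraction of length-2 paths (a - center - b) whose endpoints a and b are adjacent."""
    closed = 0
    paths = 0
    for center, neighbors in adj.items():
        # Each unordered pair of neighbors of `center` is one length-2 path.
        for a, b in combinations(neighbors, 2):
            paths += 1
            if b in adj[a]:
                closed += 1
    return closed / paths if paths else 0.0


def edge_multiplicity(adj, u, v):
    """Number of triangles through edge (u, v), i.e. the common neighbors of u and v."""
    return len(adj[u] & adj[v])


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_adj = build_adjacency(toy_edges)

print(global_clustering(toy_adj))            # 3 of the 5 length-2 paths are closed
print(edge_multiplicity(toy_adj, "a", "b"))  # edge a-b lies in one triangle
print(edge_multiplicity(toy_adj, "c", "d"))  # edge c-d lies in no triangle
```

In the toy graph, nodes a and b each center one closed path, while c centers three paths of which only one is closed, giving 3/5. This is the global (path-based) variant of the coefficient; the per-node average used in some papers can give a different value on the same graph.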