Related papers: Parameterizing Kterm Hashing
Novelty detection in news events has long been a difficult problem. A number of models performed well on specific data streams but certain issues are far from being solved, particularly in large data streams from the WWW where…
We present a new machine learning and text information extraction approach to detection of cyber threat events in Twitter that are novel (previously non-extant) and developing (marked by significance with respect to similarity with a…
Novelty detection in text streams is a challenging task that emerges in quite a few different scenarios, ranging from email thread filtering to RSS news feed recommendation on a smartphone. An efficient novelty detection algorithm can save…
Frequency estimation of elements is an important task for summarizing data streams and machine learning applications. The problem is often addressed by using streaming algorithms with sublinear space data structures. These algorithms allow…
The accelerating pace of scientific publication makes it difficult to identify truly original research among incremental work. We propose a framework for estimating the conceptual novelty of research papers by combining semantic…
Big data streams are possibly one of the most essential underlying notions. However, data streams are often challenging to handle owing to their rapid pace and limited information lifetime. It is difficult to collect and communicate stream…
Twitter is one of the most popular microblogging services in the world. The great amount of information within Twitter makes it an important information channel for people to learn and share news. Twitter hashtag is an popular feature that…
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random…
We propose a new (theoretical) computational model for the study of massive data processing with limited computational resources. Our model measures the complexity of reading the very large data sets in terms of the data size N and analyzes…
Since datasets with annotation for novelty at the document and/or word level are not easily available, we present a simulation framework that allows us to create different textual datasets in which we control the way novelty occurs. We also…
In recent years, people spend a lot of time on social networks. They use social networks as a place to comment on personal or public events. Thus, a large amount of information is generated and shared daily in these networks. Using such a…
Cold-start is a very common and still open problem in the Recommender Systems literature. Since cold start items do not have any interaction, collaborative algorithms are not applicable. One of the main strategies is to use pure or hybrid…
We focus in this paper on dataset reduction techniques for use in k-nearest neighbor classification. In such a context, feature and prototype selections have always been independently treated by the standard storage reduction algorithms.…
Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is…
For applied intelligence, utility-driven pattern discovery algorithms can identify insightful and useful patterns in databases. However, in these techniques for pattern discovery, the number of patterns can be huge, and the user is often…
Point patterns are sets or multi-sets of unordered elements that can be found in numerous data sources. However, in data analysis tasks such as classification and novelty detection, appropriate statistical models for point pattern data have…
This study aims to publish a novel similarity metric to increase the speed of comparison operations. Also the new metric is suitable for distance-based operations among strings. Most of the simple calculation methods, such as string length…
We consider novelty detection in time series with unknown and nonparametric probability structures. A deep learning approach is proposed to causally extract an innovations sequence consisting of novelty samples statistically independent of…
Minwise hashing is a fundamental and one of the most successful hashing algorithm in the literature. Recent advances based on the idea of densification~\cite{Proc:OneHashLSH_ICML14,Proc:Shrivastava_UAI14} have shown that it is possible to…
In this era of Big Data, due to expeditious exchange of information on the web, words are being used to denote newer meanings, causing linguistic shift. With the recent availability of large amounts of digitized texts, an automated analysis…