Jessica Lin
Entities in discourse vary in salience: main participants, objects and locations stay prominent, while others are quickly forgotten, raising questions about how humans signal and infer discourse-level salience. Using a graded…
Previous work examining the Uniform Information Density (UID) hypothesis has shown that while information as measured by surprisal metrics is distributed more or less evenly across documents overall, local discrepancies can arise due to…
Time series chain (TSC) is a recently introduced concept that captures the evolving patterns in large-scale time series. Informally, a time series chain is a temporally ordered set of subsequences, in which consecutive subsequences in the…
We prove quantitative estimates on the parabolic Green function and the stationary invariant measure in the context of stochastic homogenization of elliptic equations in nondivergence form. We consequently obtain a quenched, local CLT…
We present a general framework which can be used to prove that, in an annealed sense, rescaled spatial stochastic population models converge to generalized propagating fronts. Our work is motivated by recent results of Etheridge, Freeman,…
We consider the long-time behaviour of binary branching Brownian motion (BBM) where the branching rate depends on a periodic spatial heterogeneity. We prove that almost surely as $t\to\infty$, the heterogeneous BBM at time $t$, normalized…
Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this…
Multivariate time series classification is a crucial task in data mining, attracting growing research interest due to its broad applications. While many existing methods focus on discovering discriminative patterns in time series,…
Observed pileups of planets with period ratios $\approx 1\%$ wide of strong mean motion resonances (MMRs) pose an important puzzle. Early models showed that they can be created through sustained eccentricity damping driving a slow…
Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a…
Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain,…
Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, implying the adjustment of weights (parameters) within the network based on the input data. In this work, we propose a novel…
As NLP models become increasingly capable of understanding documents in terms of coherent entities rather than strings, obtaining the most salient entities for each document is not only an important end task in itself but also vital for…
We present GENTLE, a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation: dictionary entries, esports commentaries, legal documents, medical notes, poetry,…
Time series classification is an important data mining task that has received a lot of interest in the past two decades. Due to the label scarcity in practice, semi-supervised time series classification with only a few labeled samples has…
We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only…
We study entire solutions to homogeneous reaction-diffusion equations in several dimensions with Fisher-KPP reactions. Any entire solution $0<u<1$ is known to satisfy \[ \lim_{t\to -\infty} \sup_{|x|\le c|t|} u(t,x) = 0 \qquad \text{for…
Recent rapid development of sensor technology has allowed massive fine-grained time series (TS) data to be collected and set the foundation for the development of data-driven services and applications. During the process, data sharing is…
While much attention has been paid to identifying explicit hate speech, implicit hateful expressions that are disguised in coded or indirect language are pervasive and remain a major challenge for existing hate speech detection systems.…
We give a new, self-contained proof of the multidimensional central limit theorem using the technique of "doubling variables," which is traditionally used to prove uniqueness of solutions of partial differential equations (PDEs). Our…