2024 NSF CAREER Awardee Dominik Kempa is Revolutionizing Data Storage

Pictured: Prof. Dominik Kempa, a 2024 NSF CAREER Awardee

Professor Dominik Kempa, from the Department of Computer Science (CS), has been awarded the prestigious National Science Foundation (NSF) CAREER award. The project, Scalable and Flexible Indexing of Compressed Sequences, is supported with $600k in CAREER funding from NSF.

The last two decades have witnessed a dramatic increase in the amount of datasets consisting of text. Sources of such data include massive DNA databases produced by initiatives such as "100,000 Genomes Project," versioned articles (such as Wikipedia), or source code repositories (such as GitHub). Some of these datasets already reach the scale of petabytes and are predicted to grow further. A unique characteristic of this data is that it is often highly repetitive, and hence very compressible. For example, genomic databases are known to be up to 99.9% repetitive. One ray of hope for dealing with the scale of this data is thus data compression. Compression alone, however, is not sufficient in many applications, since the data in compressed form cannot be accessed or searched without first decompressing it.

The field of compressed indexing aims to address the above challenge by designing data structures that can store highly compressible string collections using space close to the size of the data in compressed form, while simultaneously supporting various queries (such as pattern matching) on the underlying uncompressed sequence. Although this research area has already led to improvements in our ability to store and search large string collections, most of the work has concentrated on designing static compressed indexes, with much less attention paid to other aspects, such as efficient index construction, or support for updates.

Kempa’s CAREER research seeks to fully unlock the potential of compressed indexing by:

Designing efficient algorithms for constructing compressed indexes. By utilizing approximate compression and compressed computation, he aims to significantly reduce construction time for compressed indexes. This makes handling massive datasets more feasible.
Developing dynamic compressed indexes. This enhances the flexibility in data storage and retrieval by avoiding re-indexing every time a small change occurs in the dataset.
Exploring lower bounds for compressed data structures and providing valuable insights for future research. By understanding the fundamental limits of compressed data structures, researchers can better optimize algorithms and develop more efficient storage solutions for massive datasets.

Before joining SBU in 2021, Kempa held postdoctoral positions at Johns Hopkins University, UC Berkeley, and the University of Warwick. He earned a PhD in computer science from the University of Helsinki, Finland. According to Samir Das, CS department chair, “In 2022 Dominik was named a “rising star in combinatorial pattern matching” in the Communications of the ACM. This NSF CAREER award further solidifies his position in the field.”

-Sahil Sarna