Université du Québec à Montréal - Computer Science
Xiaodan Zhu
Journal of the Association for Information Science and Technology 66 (2)
The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
Measuring academic influence: Not all citations are equal
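The abstract does not spell out the exact weighting, but the idea lends itself to a short sketch. Below is a minimal, hypothetical C++ illustration of an h-index-style statistic in which each citation contributes its in-text mention count rather than 1; the actual hip-index weighting may differ, and all names here are illustrative.

    // Hypothetical sketch: an h-index-style statistic where each citation is
    // weighted by how many times the reference is mentioned in the citing
    // paper. The weight function (a plain sum of mention counts) is an
    // assumption, not the paper's exact definition.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // mentions[p][c] = number of in-text mentions of paper p by citing paper c.
    int weighted_h_index(const std::vector<std::vector<int>>& mentions) {
        std::vector<long> scores;
        for (const auto& citers : mentions) {
            long score = 0;
            for (int m : citers) score += m;  // assumption: weight = mention count
            scores.push_back(score);
        }
        std::sort(scores.rbegin(), scores.rend());  // sort descending
        int h = 0;  // largest h such that h papers have weighted score >= h
        while (h < (int)scores.size() && scores[h] >= h + 1) ++h;
        return h;
    }

    int main() {
        // Two papers: the first is mentioned 3 and 1 times, the second once.
        std::vector<std::vector<int>> mentions = {{3, 1}, {1}};
        std::printf("weighted h-index: %d\n", weighted_h_index(mentions));
    }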
It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
Extracting, Transforming and Archiving Scientific Data
Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has recently been proposed. Due to its good performance, it has been adopted by several production platforms (e.g., Apache Lucene, Apache Spark, Apache Kylin and Druid).

Yet there are cases where run-length encoded bitmaps are smaller than the original Roaring bitmaps---typically when the data is sorted so that the bitmaps contain long compressible runs. To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments. The result is a new Roaring format that compresses better.

Overall, our new implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.
Consistently faster and smaller compressed bitmaps with Roaring
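As a rough illustration of the hybrid described above, here is a hedged C++ sketch of Roaring's two-level layout and of how a chunk might choose among the three container types by estimated size. The 4096-value array/bitmap threshold matches public descriptions of the format; the conversion rules in production implementations are more dynamic than this static decision.

    // Hedged sketch of Roaring's two-level layout: the high 16 bits of each
    // 32-bit value select a chunk; the low 16 bits go into that chunk's
    // container (sorted array, 8 KiB bitmap, or list of runs).
    #include <cstdint>
    #include <vector>

    constexpr size_t kArrayMax = 4096;  // Roaring's array/bitmap threshold

    enum class ContainerType { Array, Bitmap, Run };

    // Decide how to store one chunk of low-16-bit values, by estimated size.
    ContainerType choose_container(const std::vector<uint16_t>& sorted_values,
                                   size_t run_count) {
        size_t array_bytes  = 2 * sorted_values.size();  // 2 bytes per value
        size_t bitmap_bytes = 65536 / 8;                 // fixed 8 KiB
        size_t run_bytes    = 4 * run_count;             // start + length per run
        if (run_bytes < array_bytes && run_bytes < bitmap_bytes)
            return ContainerType::Run;
        return sorted_values.size() <= kArrayMax ? ContainerType::Array
                                                 : ContainerType::Bitmap;
    }

    inline uint16_t chunk_key(uint32_t x) { return uint16_t(x >> 16); }
    inline uint16_t low_bits(uint32_t x)  { return uint16_t(x & 0xFFFF); }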
In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression rate within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.
Decoding billions of integers per second through vectorization
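The bit-per-integer savings described above come from binary packing. As a point of reference, here is a scalar C++ sketch of the packing step that SIMD-BP128 vectorizes: a block of integers is stored using just enough bits for its largest value. This is only the format's logic, not the vectorized implementation.

    // Scalar sketch of binary packing, the idea that SIMD-BP128 vectorizes:
    // pack a block of integers using just enough bits for the largest one.
    // The real scheme packs with 128-bit SIMD lanes; this version is
    // bit-serial and only meant to show the format's logic.
    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> pack_block(const uint32_t* in, int n /* e.g. 128 */) {
        uint32_t maxval = 0;
        for (int i = 0; i < n; ++i) maxval |= in[i];
        int b = 0;
        while (b < 32 && (maxval >> b)) ++b;          // bits needed per integer
        std::vector<uint32_t> out;
        out.push_back(uint32_t(b));                   // store the bit width
        uint64_t buffer = 0;
        int used = 0;
        for (int i = 0; i < n; ++i) {
            buffer |= uint64_t(in[i]) << used;        // append b-bit value
            used += b;
            while (used >= 32) {                      // flush full 32-bit words
                out.push_back(uint32_t(buffer));
                buffer >>= 32;
                used -= 32;
            }
        }
        if (used > 0) out.push_back(uint32_t(buffer));
        return out;
    }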
Ji-Rong Wen
Hongfei Yan
Jian-Yun Nie
Dongdong Shan
Xudong Zhang
Wayne Xin Zhao
ACM Transactions on Information Systems 33 (3)
Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide an opportunity to achieve better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art non-vectorized algorithms with respect to decoding speeds.
A General SIMD-based Approach to Accelerating Compression Algorithms
Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Reference. In all cases, we describe algorithms that can operate directly on compressed data. Many of our alternatives exploit the single-instruction-multiple-data (SIMD) instructions supported by modern CPUs. We evaluate our techniques in a database environment provided by Upscaledb, a production-quality key-value database. Our best techniques are SIMD accelerated: they simultaneously reduce memory usage while improving single-threaded speeds. In particular, a differentially coded SIMD binary-packing technique (BP128) can offer superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten). For analytic workloads, our fast compression techniques offer compelling benefits. Our software is available as open source.
Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions
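Differential coding, mentioned above as the transformation underlying the compressed B+-tree keys, is simple enough to state in a few lines of C++. This sketch shows only the delta transform; codecs such as BP128 or VarIntGB would then compress the resulting small gaps.

    // Minimal sketch of differential coding over sorted integer keys:
    // store the gap to the previous key instead of the key itself.
    #include <cstdint>
    #include <vector>

    void delta_encode(std::vector<uint32_t>& keys) {      // keys must be sorted
        for (size_t i = keys.size(); i-- > 1;) keys[i] -= keys[i - 1];
    }

    void delta_decode(std::vector<uint32_t>& gaps) {      // prefix sum inverts it
        for (size_t i = 1; i < gaps.size(); ++i) gaps[i] += gaps[i - 1];
    }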
Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND, OR and NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all, criteria. Such threshold queries generalize intersections and unions; they are often used in information-retrieval and data-mining applications. We introduce new algorithms that are sometimes three orders of magnitude faster than a naïve approach. Our work shows that bitmap indexes are more broadly applicable than is commonly believed.
Compressed bitmap indexes: beyond unions and intersections
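To make the query class concrete: a threshold query asks for the positions set in at least T of N bitmaps. The following naive C++ baseline pins down the semantics; the paper's algorithms solve the same problem up to three orders of magnitude faster.

    // Naive baseline for a threshold query over N bitmaps: report positions
    // set in at least T of them (T = N gives an intersection, T = 1 a union).
    // Assumes all bitmaps have the same length.
    #include <cstddef>
    #include <vector>

    std::vector<size_t> threshold_query(
            const std::vector<std::vector<bool>>& bitmaps, size_t T) {
        std::vector<size_t> result;
        if (bitmaps.empty()) return result;
        size_t width = bitmaps[0].size();
        for (size_t pos = 0; pos < width; ++pos) {
            size_t count = 0;
            for (const auto& bm : bitmaps) count += bm[pos] ? 1 : 0;
            if (count >= T) result.push_back(pos);  // meets at least T criteria
        }
        return result;
    }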
We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.
Strongly universal string hashing is fast
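For readers unfamiliar with the construction, here is a textbook strongly universal multilinear family over the prime field with p = 2^61 - 1, in C++. It matches the general shape of the families in the paper (one random key per character position), though the paper's high-speed variants differ in implementation details. The 128-bit product uses a GCC/Clang extension.

    // Textbook strongly universal multilinear hashing of 32-bit characters
    // over Z_p with p = 2^61 - 1. Not the paper's exact variant; a sketch.
    #include <cstdint>
    #include <vector>

    constexpr uint64_t P = (uint64_t(1) << 61) - 1;  // Mersenne prime 2^61 - 1

    uint64_t mulmod(uint64_t a, uint64_t b) {        // (a * b) mod p
        unsigned __int128 r = (unsigned __int128)a * b;
        uint64_t lo = (uint64_t)r & P;               // low 61 bits
        uint64_t hi = (uint64_t)(r >> 61);           // 2^61 ≡ 1 (mod p)
        uint64_t s = lo + hi;
        s = (s & P) + (s >> 61);                     // fold once more
        return s >= P ? s - P : s;
    }

    // keys must hold string.size() + 1 independent uniform values in [0, p).
    uint64_t multilinear(const std::vector<uint32_t>& string,
                         const std::vector<uint64_t>& keys) {
        uint64_t h = keys[0] % P;
        for (size_t i = 0; i < string.size(); ++i) {
            h += mulmod(keys[i + 1] % P, string[i]);
            if (h >= P) h -= P;
        }
        return h;  // h(s) = m_0 + sum_i m_i * s_i  (mod p)
    }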
Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchanged for years. Moreover, none of the major database research assessments identify database design as a strategic research direction.

Should we conclude that database design is a solved problem?

Our thesis is that database design remains a critical unsolved problem. Hence, it should be the subject of more research. Our starting point is the observation that traditional database design is not used in practice - and if it were used it would result in designs that are not well adapted to current environments. In short, database design has failed to keep up with the times. In this paper, we put forth arguments to support our viewpoint, analyze the root causes of this situation and suggest some avenues of research.
A Call to Arms: Revisiting Database Design
Nathan Kurz
Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression techniques published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited for SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.
Stream VByte: Faster byte-oriented integer compression
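The key design point, separating the control stream from the data stream, can be shown in a short scalar C++ encoder. Each control byte packs four 2-bit length codes, and the data stream holds each integer's significant bytes in little-endian order. This follows public descriptions of the Stream VByte layout; SIMD decoders use each control byte to look up a shuffle mask, which is not shown here.

    // Scalar sketch of the Stream VByte layout: a separate control stream
    // holds one 2-bit length code per integer (four codes per control byte,
    // code 0..3 meaning 1..4 bytes), and the data stream holds the integers'
    // significant bytes back to back.
    #include <cstdint>
    #include <vector>

    void streamvbyte_encode(const std::vector<uint32_t>& in,
                            std::vector<uint8_t>& control,
                            std::vector<uint8_t>& data) {
        uint8_t cbyte = 0;
        int shift = 0;
        for (uint32_t v : in) {
            int len = v < (1u << 8) ? 1 : v < (1u << 16) ? 2
                    : v < (1u << 24) ? 3 : 4;
            cbyte |= uint8_t(len - 1) << shift;           // 2-bit length code
            for (int i = 0; i < len; ++i)                 // little-endian bytes
                data.push_back(uint8_t(v >> (8 * i)));
            shift += 2;
            if (shift == 8) { control.push_back(cbyte); cbyte = 0; shift = 0; }
        }
        if (shift != 0) control.push_back(cbyte);         // flush partial byte
    }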
Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.
Better bitmap performance with Roaring bitmaps
Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple-data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded 32-bit integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD galloping algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32-bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.
SIMD compression and the intersection of sorted integers
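Galloping (exponential) search is the scalar skeleton that the paper's SIMD galloping accelerates. As a hedged reference, here is a conventional scalar galloping intersection in C++, useful when one sorted list is much shorter than the other; the vectorized version compares four pairs of 32-bit integers per instruction.

    // Scalar galloping intersection of two sorted lists: for each element of
    // the short list, advance through the long list in exponentially growing
    // steps, then binary-search the bracketed range.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> gallop_intersect(const std::vector<uint32_t>& small,
                                           const std::vector<uint32_t>& large) {
        std::vector<uint32_t> out;
        size_t lo = 0;
        for (uint32_t target : small) {
            size_t step = 1, hi = lo;
            while (hi < large.size() && large[hi] < target) {  // gallop forward
                lo = hi;
                hi += step;
                step *= 2;
            }
            auto it = std::lower_bound(
                large.begin() + lo,
                large.begin() + std::min(hi + 1, large.size()), target);
            lo = size_t(it - large.begin());
            if (lo < large.size() && large[lo] == target) out.push_back(target);
        }
        return out;
    }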
Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability.
The universality of iterated hashing over variable-length strings
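In the notation of the abstract, an iterated hash of a string c_1 ... c_n is the recursion

    h_0 = \text{seed}, \qquad h_i = F(h_{i-1}, c_i), \qquad h(c_1 \cdots c_n) = h_n,

where F maps a (hash value, character) pair to a new hash value. The paper's results bound how independent the values h(s) can be across distinct strings s.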
Daniel Lemire
Université du Québec à Montréal
Acadia University
Tech-Elements Inc.
National Research Council of Canada
TELUQ - Université du Québec
Institute for biomedical engineering (Montreal)
University of New Brunswick
Montreal, Canada Area
I am an adjunct professor at UQAM, where I am a member of the LATECE laboratory. I supervise graduate students.
Adjunct Professor
Université du Québec à Montréal
Montreal, Canada Area
I am a full professor (promoted in 2009) at the Université du Québec. I have been a department chair since June 2017. I teach computer science, primarily at the graduate level. I have held a federal research grant (NSERC) for nearly twenty years; I am a member of the NSERC computer science committee (2017-2020). I have supervised dozens of graduate students and published 70 peer-reviewed papers. During the 2016-2017 NSERC competition, I received a rating of "outstanding" for the excellence of the researcher. I maintain software used by Fortune 100 companies.
Professor and department chair
TELUQ - Université du Québec
Acadia University
Research Officer
This was a full-time research position (government laboratory). I founded and ran the e-Health research group. I worked on recommender technology.
National Research Council of Canada
Entrepreneur
I founded and ran a small consultancy which had 4 full-time associates at one point. We had several clients worldwide. We designed a wavelet-based compression format for radiology images (Waaves). We designed signal processing and modelling software in geophysics.
Tech-Elements Inc.
Adjunct Professor
I am an adjunct professor at the Computer Science Department. I have supervised several graduate students.
University of New Brunswick
Post-doctoral fellow
I did wavelet-based signal processing (ECG and EEG).
Institute for biomedical engineering (Montreal)
French
English
Ph.D.
Supervisors: Gilles Deslauriers and Serge Dubuc.
Engineering Mathematics
M.Sc.
Supervisor: Catherine Sulem.
Mathematics
B.Sc.
I graduated with High Distinction and a nearly perfect GPA. I got the St. Michael's Dean medal. I held several prestigious scholarships, including the C. D. Howe Memorial scholarship ($50k). Instead of the regular course load of 5 classes a term, I took 6 classes a term.

I also completed a minor in French studies and a major in Physics. (Not formally recognized as I did not register in the programs.)
Mathematics
I was a member of St. Michael's college.
University of Toronto - University of St. Michael's College
DEC
Natural Sciences (Sciences de la nature)
Science columnist for the "Mouton Noir" (student newspaper)
CÉGEP de Drummondville
DES
Member of the computer club.
Collège Saint-Bernard
Compressing column-oriented indexes
Decoding billions of integers per second through vectorization
Faster Column-Oriented Indexes
We present recent research results on optimizing column-oriented indexes for faster data warehousing. This talk aims to answer the following question: when is sorting the table a sufficiently good optimization?
Write good papers
How to write good research papers. This talk covers everything from selecting a good title and writing a good abstract to crafting good figures.
Innovation without permission: from Codd to NoSQL
Practitioners often fail to apply textbook database design principles. We observe both a perversion of the relational model and a growth of less formal alternatives. Overall, there is an opposition between the analytic thought that prevailed when many data modeling techniques were initiated, and the pragmatism which now dominates among practitioners. There are at least two recent trends supporting this rejection of traditional models: (1) the rise of the sophisticated user, most notably in social media, is a challenge to the rationalist view, as it blurs the distinction between design and operation; (2) in the new technological landscape where there are billions of interconnected computers worldwide, simple concepts like consistency sometimes become prohibitively expensive. Overall, for a wide range of information systems, design and operation are becoming integrated in the spirit of pragmatism. Thus, we are left with design methodologies which embrace fast and continual iterations and exploratory testing. These methodologies allow innovation without permission in that the right to design new features is no longer so closely guarded.
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliably very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques, including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.
All About Bitmap Indexes... And Sorting Them
A review of bitmap indexes from an academic perspective. Several theoretical results are presented. The talk also discusses technical issues regarding sorting the tables prior to indexing, as a way to improve the indexes.

Much of the talk is based on the following preprint: Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. http://arxiv.org/abs/0901.3751
XML
Machine Learning
Mathematics
Compression
Java
Algorithm Design
Data Warehousing
Python
Analytics
Recommender Systems
Time Series Analysis
Data Visualization
C++
Relational Databases
Algorithms
OLAP
Software Development
MongoDB
Information Retrieval
NoSQL
Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.
Faster 64-bit universal hashing using carry-less multiplications
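The instruction at the heart of this result is easy to demonstrate. The C++ below multiplies two small polynomials over GF(2) with the PCLMULQDQ intrinsic; CLHASH composes such carry-less multiplications into an almost universal family, a composition not shown here. Compile with -mpclmul on x64.

    // Minimal demonstration of carry-less multiplication:
    // _mm_clmulepi64_si128 multiplies two 64-bit operands as polynomials
    // over GF(2) (XOR instead of addition, so no carries), yielding a
    // 128-bit product.
    #include <cstdint>
    #include <cstdio>
    #include <wmmintrin.h>  // PCLMULQDQ intrinsics

    int main() {
        __m128i a = _mm_set_epi64x(0, 0b101);  // polynomial x^2 + 1
        __m128i b = _mm_set_epi64x(0, 0b011);  // polynomial x + 1
        __m128i p = _mm_clmulepi64_si128(a, b, 0x00);  // carry-less product
        // (x^2 + 1)(x + 1) = x^3 + x^2 + x + 1 = 0b1111, printed as "f"
        std::printf("%llx\n", (unsigned long long)_mm_cvtsi128_si64(p));
    }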
Kamel Boukhalfa
In the current Big Data era, systems for collecting, storing and efficiently exploiting huge amounts of data are continually introduced, such as Hadoop, Apache Spark, Dremel, etc. Druid is one of these systems, especially designed to manage such data quantities; it allows detailed real-time analysis of terabytes of data within sub-second latencies. One of Druid's important requirements is fast data filtering. To ensure this, Druid makes extensive use of bitmap indexes. Previously, we introduced a new compressed bitmap index scheme called Roaring bitmap, which has shown interesting results when compared to the bitmap compression scheme adopted by Druid: Concise. Since then, Roaring bitmap has been integrated into Druid as an indexing solution. In this work, we conduct an extensive series of experiments to compare the time-space performance of Roaring bitmap and Concise when used to accelerate Druid's OLAP queries and the other kinds of operations Druid performs on bitmaps, such as retrieving set bits from bitmaps, computing bitmap complements, and aggregating several bitmaps with logical OR and AND operations. Roaring bitmap improves the response times of analytical queries under Druid by up to ≈5× compared to Concise.
Optimizing Druid with Roaring bitmaps
Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.
Reordering rows for better compression: Beyond the lexicographic order
Owen Kaser
Hazel Webb
In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least $100 000, or on shops having sales greater than $400 000. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations.

Because of the interaction between dimensions, the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store).
Diamond Dicing
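The interaction between dimensions mentioned in the abstract suggests a fixpoint computation: deleting a shop can push a product below its threshold and vice versa. The hedged C++ sketch below iterates to a fixpoint over two illustrative dimensions; names (Fact, the two thresholds) are illustrative, and the paper's algorithms compute the same object far more efficiently.

    // Hedged sketch of the fixpoint iteration behind diamond dicing: keep
    // only the products and shops that satisfy both thresholds at the same
    // time, repeating until nothing changes.
    #include <map>
    #include <vector>

    struct Fact { int product; int shop; double sales; };

    void diamond(std::vector<Fact>& facts, double product_min, double shop_min) {
        bool changed = true;
        while (changed) {
            std::map<int, double> by_product, by_shop;
            for (const Fact& f : facts) {
                by_product[f.product] += f.sales;
                by_shop[f.shop] += f.sales;
            }
            std::vector<Fact> kept;
            for (const Fact& f : facts)
                if (by_product[f.product] >= product_min &&
                    by_shop[f.shop] >= shop_min)
                    kept.push_back(f);  // survives both thresholds, for now
            changed = kept.size() != facts.size();
            facts.swap(kept);
        }
    }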
Nathan Kurz
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable-length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck, especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
Vectorized VByte Decoding
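Since the abstract specifies the byte layout precisely, the conventional scalar decoder that Masked VByte is measured against is easy to state. A straightforward C++ version, assuming well-formed input:

    // Conventional scalar VByte decoder, following the format exactly as the
    // abstract describes it: 7 payload bits per byte, high bit set on every
    // byte except the last of each integer. Masked VByte decodes this same
    // format with SIMD instructions.
    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> vbyte_decode(const std::vector<uint8_t>& in) {
        std::vector<uint32_t> out;
        uint32_t value = 0;
        int shift = 0;
        for (uint8_t byte : in) {
            value |= uint32_t(byte & 0x7F) << shift;  // low 7 bits are payload
            if (byte & 0x80) {                        // continuation flag set
                shift += 7;
            } else {                                  // high bit 0: integer done
                out.push_back(value);
                value = 0;
                shift = 0;
            }
        }
        return out;
    }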
Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all of them need to be searched for potential matches. As an example, in a federated cloud environment, each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given element is in any of the sets represented by the Bloom filters, but also which of the existing sets contain the given element. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. Instead, we effectively have a multidimensional Bloom filter problem: given an element, we wish to receive a list of candidate sets where the element might be.

To solve this problem, we consider three alternatives. Firstly, we can naively check many Bloom filters. Secondly, we propose to organize the Bloom filters in a hierarchical index structure akin to a B+ tree, which we call Bloofi. Finally, we propose another data structure that packs the Bloom filters in such a way as to exploit bit-level parallelism, which we call Flat-Bloofi.

Our theoretical and experimental results show that Bloofi and Flat-Bloofi provide scalable and efficient alternatives for searching through a large number of Bloom filters.
Bloofi: Multidimensional Bloom filters
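The bit-level parallelism behind Flat-Bloofi can be sketched with a bit-sliced layout: for each Bloom filter bit position, store the set of filters having that bit set, then AND the slices for a query's hash positions. The C++ below illustrates the idea for up to 64 filters; the paper's data structure handles arbitrary numbers of filters and differs in details.

    // Hedged sketch of a bit-sliced multidimensional Bloom filter: querying
    // ANDs one slice per hash position, probing all filters in parallel at
    // the bit level. Storage layout and hashing follow the paper only loosely.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct FlatBloofi {
        // slices[pos] has bit f set iff Bloom filter f has bit pos set.
        // Supports up to 64 filters here; real implementations use whole
        // arrays of words (or compressed bitmaps) per slice.
        std::vector<uint64_t> slices;

        explicit FlatBloofi(size_t filter_bits) : slices(filter_bits, 0) {}

        void set_bit(size_t filter_id, size_t pos) {
            slices[pos] |= uint64_t(1) << filter_id;
        }

        // hash_positions: the k bit positions the queried element hashes to.
        uint64_t candidates(const std::vector<size_t>& hash_positions) const {
            uint64_t result = ~uint64_t(0);
            for (size_t pos : hash_positions) result &= slices[pos];
            return result;  // bit f set => filter f may contain the element
        }
    };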
To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix often has low rank. To alleviate this problem, we can either use a pseudoinverse, covariance shrinking or limit the matrix to its diagonal. We review these alternatives and benchmark them against competitive methods such as the related Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time Warping (DTW) distance. As we expected, we find that the DTW is superior, but the Mahalanobis distance measures are one to two orders of magnitude faster. To get best results with Mahalanobis distance measures, we recommend learning one distance measure per class using either covariance shrinking or the diagonal approach.
Time series classification by class-specific Mahalanobis distance measures
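For reference, the distance family under discussion: with \Sigma the covariance matrix of the training data,

    d_\Sigma(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}.

When \Sigma has low rank, \Sigma^{-1} does not exist; the abstract's three workarounds replace it with the Moore-Penrose pseudoinverse \Sigma^{+}, with \operatorname{diag}(\Sigma)^{-1}, or with the inverse of a shrunken matrix such as \Sigma_\lambda = (1 - \lambda)\Sigma + \lambda I for \lambda \in (0, 1] (one common shrinkage form; the paper's exact estimator may differ).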