Column-oriented data are well suited to compression. Because values of the same column are stored contiguously on disk, the information entropy is lower than in the physical data organization of conventional databases. Many useful lightweight compression techniques target specific data types and domains, such as integers and small lists of distinct values, respectively. However, compression of textual values formed by skewed, high-cardinality words is usually restricted to variations of the LZ compression algorithm. So far, no empirical evaluation has verified how other sophisticated compression methods handle columnar data that store text. In this paper we shed light on this subject by revisiting the concepts behind those algorithms. We also analyse how they behave in terms of compression and speed when dealing with textual columns whose values appear in adjacent positions.
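The claim that contiguous column storage favours compression can be illustrated with a minimal sketch. It is not the paper's experimental setup: the records, the small vocabularies, the `|` and newline separators, and the helper names (`build_rows`, `row_layout`, `column_layout`, `compressed_size`) are all assumptions made for illustration, and Python's `zlib` (DEFLATE, an LZ77-based method) stands in for the LZ family mentioned above.

```python
import random
import zlib


def build_rows(n, seed=0):
    """Generate hypothetical (id, city, comment) records drawn from small vocabularies."""
    rng = random.Random(seed)
    cities = ["Santa Maria", "Porto Alegre", "Pelotas", "Caxias do Sul"]
    words = ["column", "store", "text", "value", "data", "page", "block", "scan"]
    rows = []
    for i in range(n):
        city = rng.choice(cities)
        comment = " ".join(rng.choice(words) for _ in range(5))
        rows.append((str(i), city, comment))
    return rows


def row_layout(rows):
    """Row-oriented (NSM-like) layout: the fields of each record are stored together."""
    return "\n".join("|".join(record) for record in rows).encode("utf-8")


def column_layout(rows):
    """Column-oriented (DSM-like) layout: the values of each column are stored contiguously."""
    columns = zip(*rows)
    return "\n".join("|".join(column) for column in columns).encode("utf-8")


def compressed_size(data):
    """Size in bytes after zlib compression at the default level."""
    return len(zlib.compress(data))


rows = build_rows(2000)
row_bytes = row_layout(rows)
col_bytes = column_layout(rows)

# Both layouts hold the same values and the same number of separators,
# so any difference in compressed size comes from the ordering alone.
print("row layout:   ", len(row_bytes), "->", compressed_size(row_bytes))
print("column layout:", len(col_bytes), "->", compressed_size(col_bytes))
```

On data of this kind the column layout would typically be expected to compress to fewer bytes, because similar values sit next to each other, but the exact figures depend on the data and on the compressor.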
Published: 2018-07-03
Classification: other areas of Computing and Informatics; Compression, column-oriented databases, LZ, PPM, BWT, entropy encoding, DSM, NSM, PAX; 68P30
@article{cai2018_2_405,
     author = {Vinicius Fulber Garcia and Sergio Luis Sardi Mergen},
     title = {Compression of Textual Column-Oriented Data},
     journal = {Computing and Informatics},
     volume = {36},
     number = {6},
     year = {2018},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2018_2_405}
}
Vinicius Fulber Garcia; Department of Languages and Computer Systems, Universidade Federal de Santa Maria, 1000, Santa Maria; Sergio Luis Sardi Mergen; Department of Languages and Computer Systems, Universidade Federal de Santa Maria, 1000, Santa Maria. Compression of Textual Column-Oriented Data. Computing and Informatics, Volume 36 (2018), no. 6. http://gdmltest.u-ga.fr/item/cai2018_2_405/