DGFIndex for Smart Grid: Enhancing Hive with a Cost-Effective Multidimensional Range Index

Yue Liu, Songlin Hu, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, and Jian Chen.

Proceedings of the VLDB Endowment, 7(13):1496-1507, 2014.


In Smart Grid applications, as the number of deployed electric smart meters increases, massive amounts of valuable meter data are generated and collected every day. To enable reliable data collection and fast business decisions, high-throughput storage and high-performance analysis of massive meter data become crucial for grid companies. Given their high efficiency, fault tolerance, and price-performance, Hadoop and Hive are frequently deployed as the underlying platform for big data processing. However, in real business use cases, these data analysis applications typically involve multidimensional range queries (MDRQ) as well as batch reading and statistics on the meter data. While Hive performs well at complex batch reading and analysis, it lacks efficient indexing techniques for MDRQ.

In this paper, we propose DGFIndex, an index structure for Hive that efficiently supports MDRQ for massive meter data. DGFIndex divides the data space into cubes using the grid file technique. Unlike the existing indexes in Hive, which store all combinations of multiple dimensions, DGFIndex stores only the information of cubes. This leads to a smaller index size and faster query processing. Furthermore, by pre-computing user-defined aggregations for each cube, DGFIndex only needs to access the boundary region for an aggregation query. Our comprehensive experiments show that DGFIndex saves significant disk space in comparison with the existing indexes in Hive, and that query performance with DGFIndex is 2-50 times faster than existing indexes in Hive and HadoopDB for aggregation queries, 2-5 times faster than both for non-aggregation queries, and 2-75 times faster than scanning the whole table across different query selectivities.
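
The idea of dividing the data space into cubes and pre-computing an aggregate per cube can be illustrated with a small in-memory sketch. This is not the authors' implementation: the class name GridAggIndex, its methods, the fixed cell widths, and the use of in-memory row lists (standing in for file locations in Hive) are all assumptions made for illustration. The sketch answers a range SUM by reading the pre-computed value for cubes fully inside the query range and scanning rows only in cubes that overlap the query boundary.

```python
from collections import defaultdict
from itertools import product


class GridAggIndex:
    """Toy grid-style index with a pre-computed SUM per cell (hypothetical names)."""

    def __init__(self, mins, widths):
        self.mins = mins                     # lower bound of the data space per dimension
        self.widths = widths                 # assumed fixed cell width per dimension
        self.cell_sum = defaultdict(float)   # cell key -> pre-computed SUM(value)
        self.cell_rows = defaultdict(list)   # cell key -> rows (stand-in for file offsets)

    def _key(self, point):
        # Map a point to the coordinates of the cell that contains it.
        return tuple(int((p - m) // w)
                     for p, m, w in zip(point, self.mins, self.widths))

    def insert(self, point, value):
        key = self._key(point)
        self.cell_sum[key] += value
        self.cell_rows[key].append((point, value))

    def range_sum(self, lo, hi):
        """SUM(value) over points with lo[d] <= p[d] < hi[d] in every dimension.

        Returns (total, rows_scanned) so the saved work is visible.
        """
        lo_key, hi_key = self._key(lo), self._key(hi)
        total, rows_scanned = 0.0, 0
        ranges = (range(a, b + 1) for a, b in zip(lo_key, hi_key))
        for key in product(*ranges):
            if key not in self.cell_rows:
                continue  # empty cell, nothing stored
            cell_lo = [m + k * w for k, m, w in zip(key, self.mins, self.widths)]
            cell_hi = [c + w for c, w in zip(cell_lo, self.widths)]
            inner = all(l <= cl and ch <= h
                        for l, cl, ch, h in zip(lo, cell_lo, cell_hi, hi))
            if inner:
                # Cell lies fully inside the query: use the pre-computed SUM.
                total += self.cell_sum[key]
            else:
                # Boundary cell: scan its rows and filter by the query range.
                for point, value in self.cell_rows[key]:
                    rows_scanned += 1
                    if all(l <= p < h for l, p, h in zip(lo, point, hi)):
                        total += value
        return total, rows_scanned
```

With a 40 x 40 grid of unit-valued points and 10 x 10 cells, a query over [5, 35) in both dimensions touches 16 cells, of which only the 12 on the boundary are scanned row by row; the 4 inner cells contribute through their stored sums. The index itself holds one entry per non-empty cell rather than one per distinct combination of dimension values, which is the source of the size reduction described above.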


Tags: hadoop, hive, smart grid, nosql, dgfindex

Readers who enjoyed the above work may also like the following:

  • A BigBench Implementation in the Hadoop Ecosystem.
    Badrul Chowdhury, Tilmann Rabl, Pooya Saadatpanah, Jiang Du, and Hans-Arno Jacobsen.
    In Advancing Big Data Benchmarks, 2013. Springer Berlin Heidelberg.
    Tags: bigbench, hadoop, hive, big data benchmarking, nosql
  • DualTable: A Hybrid Storage Model for Update Optimization in Hive.
    Songlin Hu, Wantao Liu, Tilmann Rabl, Shuo Huang, Ying Liang, Zhang Xiao, Hans-Arno Jacobsen, Xubin Pei, and Jiye Wang.
    In Proceedings of the 31st International Conference on Data Engineering, 2015.
    Tags: big data, hadoop, dualtable
  • CaSSanDra: An SSD Boosted Key-Value Store.
    Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, and Hans-Arno Jacobsen.
    In 30th IEEE International Conference on Data Engineering, pages 1162-1167, 2014.
    Tags: cassandra, big data, key-value store, nosql