Just can't get enough - Synthesizing Big Data

Tilmann Rabl, Manuel Danisch, Michael Frank, Sebastian Schindler, and Hans-Arno Jacobsen.

In Proceedings of the ACM SIGMOD Conference, 2015.
Demonstration Track.


With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions were recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever-increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking.

Vendors have few options to showcase the performance of their systems other than to use trivial data sets like TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be utilized, simplistic proofs of concept have to be used, leaving both customers and vendors unclear about the target use-case performance. As a solution, we present an automatic approach to data synthetization from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.


Related Projects

Tags: pdgf, dbsynth, data generation

Readers who enjoyed the above work may also like the following:

  • Rapid Development of Data Generators Using Meta Generators in PDGF.
    Tilmann Rabl, Meikel Poess, Manuel Danisch, and Hans-Arno Jacobsen.
    In 6th International Workshop on Testing Database Systems, 2013.
    Tags: pdgf, meta generator, data generation
  • Big Data Generation.
    Tilmann Rabl and Hans-Arno Jacobsen.
    In Proceedings of the Workshop on Big Data Benchmarking, pages 20-27, 2013.
    Tags: pdgf, big data, benchmarking
  • Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance.
    Tilmann Rabl, Meikel Poess, Hans-Arno Jacobsen, Patrick O'Neil, and Elizabeth O'Neil.
    In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, 2013.
    Tags: star schema benchmark, ssb, parallel data generation framework, pdgf, benchmarking, skew