Thursday 9 November 2017

Thoughts on a paper: Parallel Data Analysis Directly on Scientific File Formats

The main motivation behind the paper is that HDF5 has become the most ubiquitous format in the data processing pipelines that scientists deal with. A plethora of tools support HDF5 for simulation, visualization and other data processing tasks. However, no analysis tool provides the rich query capabilities found in a typical database management system (DBMS) via SQL. Loading data from HDF5 into a database is expensive, and the results of processing in a DBMS would then have to be converted back to HDF5 to proceed further along the data analysis pipeline.
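
To make the cost of that round trip concrete, here is a minimal sketch of the HDF5-to-DBMS hop in Python, assuming a hypothetical file sim.h5 with a 2-D /particles dataset, a psycopg2 connection and an already created particles table (all names are mine, not the paper's); the way back out, from query results to HDF5, would be a symmetric export.

    import io
    import h5py
    import numpy as np
    import psycopg2

    # Illustration only: read the whole array, serialise it to text,
    # and bulk-load it into PostgreSQL with COPY.
    with h5py.File('sim.h5', 'r') as f:
        data = f['/particles'][...]            # materialise the array in memory

    buf = io.StringIO()
    np.savetxt(buf, data, delimiter=',')       # array -> CSV text
    buf.seek(0)

    conn = psycopg2.connect('dbname=science')
    with conn, conn.cursor() as cur:
        cur.copy_from(buf, 'particles', sep=',')   # bulk load via COPY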

The main contribution of the paper is the architecture and implementation of a new system, SDS/Q: a database query engine that works directly over the HDF5 file format. The system used before the tool was introduced was PostgreSQL. It is a bit surprising that the query engine over HDF5 is row-based. HDF5 is array-based and the queries run by scientists are analytical, so either an array-based or even a column-based system would fit better in this case. When you compare benchmark results of OLAP-like queries for PostgreSQL and MonetDB, the performance difference is stunning, with MonetDB being an order of magnitude faster.
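
To see where the friction comes from, here is a minimal sketch (my own, not the paper's SDS/Q operators) of what a row-at-a-time scan over a 2-D HDF5 dataset might look like: the array is read block by block, and each row is handed to the engine as a tuple.

    import h5py

    def hdf5_rows(path, dataset, block_rows=10000):
        """Yield rows of a 2-D HDF5 dataset as tuples (toy sketch, hypothetical names)."""
        with h5py.File(path, 'r') as f:
            dset = f[dataset]
            for start in range(0, dset.shape[0], block_rows):
                block = dset[start:start + block_rows]   # one sequential read per block
                for row in block:                        # then row-at-a-time processing
                    yield tuple(row)

    # e.g. a selection plus aggregation expressed over the row iterator:
    # total = sum(r[2] for r in hdf5_rows('sim.h5', '/particles') if r[0] > 0.5)

An array- or column-oriented engine could instead keep operating on whole blocks at once, which is closer to how the data is laid out in the file.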

The paper targets a very specific use case - scientific data analysis. It seems to be one of the best places where the NoDB idea https://cacm.acm.org/magazines/2015/12/194619-nodb/fulltext can be applied. Regarding the implementation details, the SDS/Q system implements positional maps, which were proposed in NoDB: cached byte positions of previously accessed attributes in the raw file, so that later queries can jump straight to the values instead of re-parsing. Clearly, the outcome is not as expected, because the positional maps perform poorly in this case. Why? The authors mention the inherent cost of performing random point lookups over HDF5 data. It seems to come down to at least three things: the queries run (the workload), the underlying file format, and the experiments themselves that validate or invalidate a given technique. Let me comment on the first two differences. The queries run in the NoDB paper are aggregates that scan the whole input file. Moreover, the tables in NoDB have hundreds of columns. It may be that during such a scan you can jump over parts of the file while still preserving a forward pass through it and, overall, sequential access to disk. SDS/Q, on the other hand, incurs random accesses. Storage in an HPC (High Performance Computing) environment can support massively parallel I/O, where many processes/threads access data simultaneously, but for a single program accessing data randomly, such access is more expensive than a purely sequential one.
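
A toy measurement along these lines contrasts the two access patterns: one sequential pass over an HDF5 dataset versus many point lookups through the HDF5 library. The file and dataset names are hypothetical, and the absolute numbers will depend on chunking, caching and the storage system; this only illustrates the shape of the two patterns.

    import time
    import h5py
    import numpy as np

    with h5py.File('sim.h5', 'r') as f:
        dset = f['/particles/energy']          # assume a 1-D dataset

        t0 = time.perf_counter()
        full = dset[...]                       # one sequential scan
        seq = time.perf_counter() - t0

        idx = np.random.randint(0, dset.shape[0], size=10000)
        t0 = time.perf_counter()
        vals = [dset[i] for i in idx]          # 10,000 random point lookups
        rnd = time.perf_counter() - t0

    print('sequential scan: %.3fs, 10k point lookups: %.3fs' % (seq, rnd))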

Mistakes: the claim in section 4.2 that the checkpointing frequency was increased to speed up the data loading process into PostgreSQL is incorrect. Actually, the reverse statement is correct: checkpointing should happen less frequently, and both the checkpoint interval and the amount of modified data (the threshold) required to trigger a checkpoint should be increased.
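
For completeness, a sketch of postgresql.conf settings that make checkpoints less frequent during a bulk load (the values are illustrative, not the paper's; checkpoint_segments applies to the PostgreSQL versions of that era, max_wal_size to 9.5 and later):

    # Reduce checkpoint frequency while bulk loading (illustrative values)
    checkpoint_timeout = 30min           # checkpoint at most every 30 minutes
    checkpoint_completion_target = 0.9   # spread checkpoint I/O over the interval
    max_wal_size = 8GB                   # 9.5+: allow more WAL between checkpoints
    # checkpoint_segments = 256          # pre-9.5 equivalent of max_wal_size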