Thursday, 9 November 2017

Thoughts on a paper: Parallel Data Analysis Directly on Scientific File Formats

The main motivation behind the paper is: the data processing pipeline that scientists deal with has established HDF5 format as the most ubiquitous one. A plethora of tools support the HDF5 format for simulations, visualizations and other data processing tasks. However, there is no analysis tool that would provide a rich query capabilities that can be found in a typical database management system (DBMS) via SQL. Loading data from HDF5 to a database is expensive and then the results of a processing in a DBMS would have to be converted back to HDF5 to be able to proceed further on in the data analysis pipeline.

The main contribution of the paper is the architecture and implementation of a new system. It is a database query engine that works over the HDF5 file format. The system used before the tool was introduced was PostgreSQL. It is a bit surprising that the query engine over HDF5 is row-based. The HDF5 format is array-based and the queries run by scientist are analytical so either an array-based or even a column-based systems would fit better in this case. When you compare results of benchmarks of OLAP-like queries for PostgreSQL and MonetDB, the performance difference is stunning, with MonetDB being an order of magnitude faster.

The paper targets a very specific use case - scientific data analysis. It seems to be one of the best places where the NoDB idea https://cacm.acm.org/magazines/2015/12/194619-nodb/fulltext can be applied. Regarding the implementation details, SDS/Q system implements the positional maps, which were proposed in NoDB. Clearly, the outcome is not as expected - because the positional maps perform poorly in this case. Why? The authors mention the inherent cost of performing random point lookups over HDF5 data. It seems to be more about at least 3 things: the queries run (workload), the underlying file format, and the experiments themselves to approve/disapprove a given technique.  Let me comment on the two former differences. The queries run in the NoDB paper are aggregates that scan the whole input file. Moreover, the tables in NoDB have hundreds of columns. It can be the case that when you do such a scan, you can just jump over the file but still preserving the forward pass through the file and, overall, a sequential access to disk. On the other hand, SDS/Q indicates random accesses. The data storage from HPC (High Performance Computing) environment is able to support massively parallel I/O where many processes/threads access data from the storage simultaneously, but in case of singe program accessing data randomly, the cost of such access is more expensive than just a sequential one.

Mistakes: a claim in section 4.2 that the checkpointing frequency was increased to speed-up the data loading process to PostgreSQL is incorrect. Actually, a reverse statement is correct: the checkpointing should be less frequent and the intervals as well as the amount of data modified (threshold) to trigger a checkpoint should be increased. 


Tuesday, 28 March 2017

Suggestion for Supper

Pancakes with Greek yogurt and cherry jam, apples with cinnamon and oatmeal cookies with almond butter. Bon App├ętit :)

Monday, 27 February 2017

Snake bites

The programming language that bites me most terribly & keenly is ... no doubts ... Python. This time it was a global variable eps (epsilon) and I did not add the "self." prefix to the one that I really wanted to use in a method of a class. The nightmares of C programming and its common best practices come back! All in all, avoid global variables also in Python, or these snake bites can be deadly at the end of the day.

Monday, 20 February 2017

Bug in pthread_cond_timedwait or we need an improvement?

pthread_cond_timedwait returns ETIMEDOUT instead of e.g. EINVAL when provided with "timespec *restrict abstime" whose tv_nsec part exceeds 999999999 (basically, it's more than 1 second). Furthermore, IMHO pthread_cond_timedwait should return ETIME instead of ETIMEDOUT (if we follow comments in: http://bit.ly/2kgHEhj). Anyway, if you want to avoid a few hours of debugging then make sure that your tv_nsec is correct in each timespec struct!

Friday, 3 February 2017

"sudo checkinstall" instead of "sudo make install"

The pain of removing packages installed from source has reached its ultimate peak and it's the highest time to move to checkinstall. This gives us a very handy ability to remove a package from the system using the system packaging tools: e.g.  

dpkg -r custom-protobuf

So current flow of commands for any installation from source is:

./configure 
make  
sudo checkinstall