Thursday, 9 November 2017

Thoughts on a paper: Parallel Data Analysis Directly on Scientific File Formats

The main motivation behind the paper: in the data processing pipelines that scientists deal with, HDF5 has become the most ubiquitous file format. A plethora of tools support HDF5 for simulations, visualizations and other data processing tasks. However, there is no analysis tool that provides the rich query capabilities found in a typical database management system (DBMS) via SQL. Loading data from HDF5 into a database is expensive, and the results of processing in a DBMS would then have to be converted back to HDF5 to proceed further along the data analysis pipeline.

The main contribution of the paper is the architecture and implementation of a new system: a database query engine that works directly over the HDF5 file format. The system used before the tool was introduced was PostgreSQL. It is a bit surprising that the query engine over HDF5 is row-based. The HDF5 format is array-based and the queries run by scientists are analytical, so either an array-based or a column-based system would fit better in this case. When you compare benchmark results of OLAP-like queries for PostgreSQL and MonetDB, the performance difference is stunning, with MonetDB being an order of magnitude faster.

The paper targets a very specific use case - scientific data analysis. It seems to be one of the best places where the NoDB idea https://cacm.acm.org/magazines/2015/12/194619-nodb/fulltext can be applied. Regarding the implementation details, the SDS/Q system implements the positional maps that were proposed in NoDB. Clearly, the outcome is not as expected, because the positional maps perform poorly in this case. Why? The authors mention the inherent cost of performing random point lookups over HDF5 data. It seems to come down to at least three things: the queries run (the workload), the underlying file format, and the experiments themselves that confirm or refute a given technique. Let me comment on the former two. The queries run in the NoDB paper are aggregates that scan the whole input file. Moreover, the tables in NoDB have hundreds of columns. It can be the case that during such a scan you can jump over parts of the file while still preserving a forward pass through it and, overall, sequential access to disk. SDS/Q, on the other hand, incurs random accesses. The data storage in an HPC (High Performance Computing) environment can support massively parallel I/O, where many processes/threads access data from the storage simultaneously, but when a single program accesses data randomly, the cost of such access is higher than that of a purely sequential one. The sketch below contrasts the two access patterns over an HDF5 dataset.
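A minimal sketch of the difference (not SDS/Q itself; the file name, dataset name and sizes are invented), assuming the h5py library:

import h5py
import numpy as np

with h5py.File("simulation.h5", "r") as f:
    dset = f["temperature"]  # assume a large 1-D dataset of doubles

    # Sequential scan: one forward pass over the file in large contiguous slices.
    total = 0.0
    for start in range(0, dset.shape[0], 1_000_000):
        total += dset[start:start + 1_000_000].sum()

    # Random point lookups: every access can land in a different HDF5 chunk,
    # which is the pattern that makes positional-map-style lookups expensive here.
    rng = np.random.default_rng(0)
    for pos in rng.integers(0, dset.shape[0], size=10_000):
        total += dset[int(pos)]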

Mistakes: the claim in section 4.2 that the checkpointing frequency was increased to speed up the data loading process into PostgreSQL is incorrect. Actually, the reverse statement is correct: checkpointing should be less frequent, i.e. the checkpoint interval as well as the amount of modified data (the threshold) that triggers a checkpoint should be increased.
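For illustration, a hedged sketch of the relevant postgresql.conf knobs for a bulk load (exact names and sensible values depend on the PostgreSQL version):

checkpoint_timeout = 30min           # longer interval between timed checkpoints (default 5min)
checkpoint_segments = 64             # pre-9.5: allow much more WAL before a forced checkpoint
# max_wal_size = 4GB                 # 9.5+: replaces checkpoint_segments
checkpoint_completion_target = 0.9   # spread checkpoint I/O over most of the interval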


Tuesday, 28 March 2017

Suggestion for Supper

Pancakes with Greek yogurt and cherry jam, apples with cinnamon and oatmeal cookies with almond butter. Bon Appétit :)

Monday, 27 February 2017

Snake bites

The programming language that bites me most terribly & keenly is ... no doubt ... Python. This time it was a global variable eps (epsilon): I did not add the "self." prefix to the one I really wanted to use in a method of a class, so the method silently read the global instead. The nightmares of C programming and its common best practices come back! All in all, avoid global variables in Python too, or these snake bites can be deadly at the end of the day.
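A minimal sketch of the bite (names and values invented, not my actual code):

eps = 1e-3  # module-level "global"

class Solver:
    def __init__(self):
        self.eps = 1e-9  # the tolerance I actually wanted

    def converged(self, delta):
        # Bug: the bare "eps" silently refers to the global 1e-3, not self.eps.
        return abs(delta) < eps
        # Fix: return abs(delta) < self.eps

print(Solver().converged(1e-6))  # True with the bug, False once self.eps is used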

Monday, 20 February 2017

A bug in pthread_cond_timedwait, or do we need an improvement?

pthread_cond_timedwait returns ETIMEDOUT instead of e.g. EINVAL when provided with a "timespec *restrict abstime" whose tv_nsec part exceeds 999999999 (i.e., the nanosecond field alone amounts to one second or more). Furthermore, IMHO pthread_cond_timedwait should return ETIME instead of ETIMEDOUT (if we follow the comments in: http://bit.ly/2kgHEhj). Anyway, if you want to avoid a few hours of debugging, make sure that tv_nsec stays within the valid [0, 999999999] range in every timespec struct!

Friday, 3 February 2017

"sudo checkinstall" instead of "sudo make install"

The pain of removing packages installed from source has reached its ultimate peak and it's high time to move to checkinstall. It gives us a very handy ability to remove a package from the system using the system packaging tools, e.g.:

dpkg -r custom-protobuf

So the current flow of commands for any installation from source is:

./configure 
make  
sudo checkinstall

Wednesday, 21 December 2016

Laws in Physics vs. in Computer Science

The laws of physics and nature are unchanging and we just have to discover them. It requires a lot of imagination and thorough study to fully explain and prove a law. A beautiful and rewarding effect is that later on we can predict certain phenomena, or simply what will happen next, based on the laws. An example is Fermat's principle, which says that light always traverses the path that takes the least time. You can, for example, infer that because the atmosphere is denser than the vacuum, the light that reaches your eye from the Sun is bent, and in reality, when you watch a sunset, the Sun is already below the horizon even though it is still visible.

On the other hand, there are no true and unchanging laws in computer science. The design of systems changes, and the way it does can be surprising, as every designer/human being has different preferences and can even overturn commonly accepted "good design choices". An example is the mismatch between the OS page size and the DBMS page size, 4 KB and 8 KB respectively. The database can fetch 8 KB pages randomly, for example because of an index-based scan. The OS, however, notices that the application (the DBMS in this case) starts by fetching 2 OS pages in a row (2 x 4 KB - which gives the single 8 KB page fetched by the DBMS), assumes it must be a sequential scan of pages, and additionally prefetches another 2 x 4 KB pages (16 KB in total) - the second 8 KB in vain, since those two pages won't be read subsequently. Afterwards, the DBMS goes to a totally different location to fetch another 8 KB page.
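Incidentally, a DBMS can already opt out of this readahead heuristic by declaring its access pattern to the OS. A minimal sketch (Linux-specific; the file name is invented):

import os

fd = os.open("table.dat", os.O_RDONLY)
# Tell the kernel that accesses will be random, so seeing two adjacent
# 4 KB pages should not trigger readahead of further pages.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)  # length 0 = the whole file

PAGE = 8 * 1024  # the DBMS-style 8 KB page
page_42 = os.pread(fd, PAGE, 42 * PAGE)  # read one page without prefetching its neighbours
os.close(fd)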

Systems could have a built-in level of adjustment (probably built-in Machine Learning components) that simulates nature and automatically improves as time progresses, for example raising the threshold from 2 to 3 consecutive pages before the OS assumes a sequential scan. The computer environment should imitate the natural environment better.

Thursday, 27 October 2016

$ fortune

$ fortune
Real software engineers work from 9 to 5, because that is the way the job is
described in the formal spec.  Working late would feel like using an
undocumented external procedure.