Some notes on Data management

There is a new book I came across during a great talk by Mathew Graham (Center for Advanced Computing Research, California Institute of Technology, USA), which I heard at a summer school for astrostatistics and data mining on La Palma that I just attended:

“The Fourth Paradigm”.

I strongly recommend taking a look at it!

Comment: The “fourth” paradigm refers to the idea that there are several paradigms in science (if you don't know what a paradigm is, take a look at

1. Experiments
2. Theory
3. Numerical simulations
4. Data-driven science

It seems that theory and simulations relate to each other in the same way as experiments and data-driven science do.

Some notes:

  • Hadoop: reads from HDFS (the Hadoop Distributed File System)
  • NoSQL: manages petabytes of data in non-relational databases
  • SciDB -> works with arrays instead of tables; query languages: AQL, AFL
  • MapReduce -> used in Astronomy in the Cloud ( It allows the map and the reduce steps to be parallelized across different nodes (see It has been used to completely regenerate Google’s index. It therefore uses atomic database transactions (
  • Hive: organizes data into tables; queries are converted into MapReduce jobs; allows bucketing in multiple dimensions. You can still use SQL syntax!
  • Pregel: graph processing
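To make the MapReduce idea from the list a bit more concrete, here is a minimal single-machine sketch in plain Python. It is only a toy word count, not how Hadoop or Google actually implement it: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the sample documents are made up for illustration. The point is that every call to map_phase and every call to reduce_phase is independent of the others, which is what lets a real framework spread them over different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce sequentially (a real cluster runs them in parallel)."""
    mapped = []
    for doc in documents:  # each document could be mapped on a different node
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    # each key could be reduced on a different node
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, just for illustration
counts = map_reduce(["star galaxy star", "galaxy quasar"])
print(counts)
```

Because the map calls only look at their own document and the reduce calls only look at their own key, the two loops could be swapped for a process pool or a cluster scheduler without changing the map and reduce functions themselves.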

