I am looking for students to work with me on the use cases and projects described below. If you have strong coding experience in C/C++/Python/Java and have taken a first course in database management systems, please send a cover letter and a CV/resume to scidataspace at gmail dot com.
As scientists struggle with increasingly large amounts of data, it is natural to host the data within a data management system that simplifies operations on the data, provides guarantees on performance and correctness, and enables analyses. Relational database management systems have long been applied to scientific computing, but are increasingly found to be inefficient for it. Our goal is to invent new data management principles that are better suited to scientific computing, and to apply existing principles (the basis for relational DBMSs) to scientific computing in innovative ways.
Based on the science use case and performance needs, we determine how to efficiently adapt existing data models for scientific computing, such as the following (a small sketch contrasting the relational and array models appears after the list):
- relational data model,
- array data model,
- hierarchical data model (file systems),
- graph data model, and
- key-value data model.
Please see our publications for details.
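As an illustration of how the choice of data model shapes scientific operations, the sketch below expresses the same small temperature grid under the array and relational models. It uses synthetic data and plain NumPy and is not tied to any particular system we build.

```python
# Illustrative sketch: one 2x3 temperature grid under two data models.
import numpy as np

# Array model: cell position encodes the (lat, lon) index implicitly,
# so neighborhood operations (slicing, stencils) are cheap.
grid = np.array([[21.0, 21.5, 22.1],
                 [20.4, 20.9, 21.7]])

# Relational model: each cell becomes an explicit (lat_idx, lon_idx, temp) row,
# so coordinates must be stored and selected on rather than implied by position.
rows = [(i, j, float(grid[i, j]))
        for i in range(grid.shape[0])
        for j in range(grid.shape[1])]

# A "subset" is a slice in the array model ...
subgrid = grid[:, 1:3]
# ... but a selection predicate in the relational model.
subset_rows = [r for r in rows if 1 <= r[1] <= 2]
print(subgrid)
print(subset_rows)
```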
Use Cases
PLENARIO: AN URBAN DATA CYBERINFRASTRUCTURE
The cyberinfrastructure, technologies, and tools that make the rich set of urban open data available were designed primarily to support analysis of individual data sets rather than exploration of relationships among many data sets. Consequently, urban scientists lack the tools and infrastructure to fully harness urban data for their research. Plenario is a new platform for accessing, combining, downloading, and visualizing datasets released by city, county, state, and federal governments. Currently in alpha, the Plenario prototype exploits the fact that the bulk of published urban data sets share the attributes of location and time. By integrating data across multiple data sources for specific time periods and geographic areas, Plenario enables scientists to apply the tools of mathematics and computation to better understand urban challenges, ranging from youth violence and crime to graduation rates, employment, and economic decline and revitalization.
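The sketch below illustrates the space-time integration idea in isolation; it is not the Plenario API, and the datasets, column names, and bounding box are synthetic stand-ins.

```python
import pandas as pd

# Two independently published "datasets" that share the attributes of
# location and time (synthetic stand-ins for real open-data feeds).
crimes = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-02-01", "2014-03-15", "2015-01-10"]),
    "latitude": [41.88, 41.90, 41.70],
    "longitude": [-87.63, -87.65, -87.55],
})
permits = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-02-20", "2014-06-01"]),
    "latitude": [41.89, 41.91],
    "longitude": [-87.64, -87.66],
})

def clip(df, t0, t1, bbox):
    """Restrict a dataset to a time window and a lat/lon bounding box."""
    lat0, lat1, lon0, lon1 = bbox
    return df[(df.timestamp >= t0) & (df.timestamp < t1) &
              df.latitude.between(lat0, lat1) &
              df.longitude.between(lon0, lon1)]

bbox = (41.85, 41.95, -87.70, -87.60)          # a small patch of a city
crimes_w = clip(crimes, "2014-01-01", "2014-07-01", bbox)
permits_w = clip(permits, "2014-01-01", "2014-07-01", bbox)

# Once both datasets sit in the same space-time frame they can be compared,
# e.g., monthly counts of each phenomenon side by side.
monthly = pd.concat({
    "crimes": crimes_w.resample("MS", on="timestamp").size(),
    "permits": permits_w.resample("MS", on="timestamp").size(),
}, axis=1).fillna(0)
print(monthly)
```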
EARTH DATA ENGINE: A CYBERINFRASTRUCTURE FOR COMPUTING AND ENVIRONMENT
Several science domains treat natural and physical phenomena as continuous, but for computational purposes must represent them as discrete datasets. As these datasets grow, analysis and visualization demand that they be stored in a data management system with a powerful engine for data analysis. The Earth Data Engine is a new platform that allows geoscientists, agriculturalists, and physical scientists to access, subset, interpolate, and visualize gridded datasets.
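The sketch below shows the flavor of subsetting and interpolation on a synthetic global grid using xarray; it is an illustration of the operations, not the Earth Data Engine's actual interface.

```python
import numpy as np
import xarray as xr

# Synthetic global surface-temperature grid at 0.5-degree resolution.
lat = np.arange(-90, 90.5, 0.5)
lon = np.arange(0, 360, 0.5)
temp = xr.DataArray(
    20 + 10 * np.random.rand(lat.size, lon.size),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="surface_temperature",
)

# Subsetting: pull out a regional window of the global grid.
region = temp.sel(lat=slice(35, 45), lon=slice(250, 270))

# Interpolation: resample the region onto a finer 0.1-degree grid.
fine = region.interp(lat=np.arange(35, 45.1, 0.1),
                     lon=np.arange(250, 270.1, 0.1))

print(region.shape, fine.shape)
```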
COSMOLOGICAL SIMULATIONS AND CRUNCHING "BIG DATA"
Cosmology has entered the era of precision science, moving from order-of-magnitude estimates to measurements at a few-percent accuracy of the mass content and geometry of the Universe and of the spectral index and normalization of primordial fluctuations. Cosmologists must find independent ways to measure cosmological parameters and to observe the Universe and its dynamics, and must obtain independent cross-checks for all results. Simulations must scale to these demands while still allowing for data analysis.
PDACS is a high-throughput computing (HTC) platform that runs jobs in a loosely coupled fashion. We examine how databases can be interfaced with an HTC workflow to allow scalable processing and data analysis.
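As a rough illustration of the loosely coupled pattern (not the PDACS implementation), the sketch below has independent jobs append per-snapshot summaries to a shared SQLite database that later queries aggregate over; the table, file, and function names are hypothetical.

```python
import sqlite3

def analyze_snapshot(snapshot_id):
    """Placeholder per-job analysis; a real job would read simulation output."""
    return {"snapshot": snapshot_id, "halo_count": 1000 + 7 * snapshot_id}

def run_job(snapshot_id, db_path="results.db"):
    # Each call stands in for one independently scheduled HTC job.
    result = analyze_snapshot(snapshot_id)
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS halos "
                     "(snapshot INTEGER, halo_count INTEGER)")
        conn.execute("INSERT INTO halos VALUES (?, ?)",
                     (result["snapshot"], result["halo_count"]))

for sid in range(4):
    run_job(sid)

# Downstream analysis aggregates across all jobs with a single query.
with sqlite3.connect("results.db") as conn:
    print(conn.execute("SELECT COUNT(*), AVG(halo_count) FROM halos").fetchall())
```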
Research
Extending RDBMSs for scientific analyses: Through analyses of scientific workloads, we have extended RDBMSs to efficiently support scientific analysis. For instance, a fundamental operator in an RDBMS is the deterministic join, which combines information from different relational tables or databases that ‘belong’ to the same object (key). The physical sciences, however, often need a probabilistic join, in which the key may be slightly different in different relations. In astronomy, for instance, the key is the physical location of objects in the sky, which varies across surveys because of instrument errors in recording objects. The work in [] and several other papers describes extensions to RDBMSs.
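A minimal sketch of the probabilistic-join idea follows, assuming synthetic survey rows and a flat-sky distance with a fixed tolerance; a real RDBMS extension would use proper angular separation and a spatial index inside the engine rather than a nested loop.

```python
import math

# Two surveys record the same sky objects with slightly different coordinates.
survey_a = [(1, 10.001, -5.002), (2, 42.500, 13.310)]   # (id, ra, dec), synthetic
survey_b = [(7, 10.003, -5.001), (8, 80.123, 22.900)]

def prob_join(a, b, tol_deg=0.01):
    """Match rows whose positions agree within a tolerance, not exactly."""
    matches = []
    for id_a, ra_a, dec_a in a:
        for id_b, ra_b, dec_b in b:
            if math.hypot(ra_a - ra_b, dec_a - dec_b) <= tol_deg:
                matches.append((id_a, id_b))
    return matches

print(prob_join(survey_a, survey_b))   # -> [(1, 7)]
```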
Databases in HTC workflows: We are investigating the relation between HTC systems and database systems by coupling database systems within HTC workflows. The current investigation integrates them with the Galaxy framework, which can run MPI-based jobs. The objective is to use databases to enable iterative and recursive workflow patterns.
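The sketch below captures the iterative pattern we are after, independent of Galaxy: each iteration reads the current state from a database, a step expands it, and the loop stops at a fixed point. The graph data and table names are purely illustrative.

```python
import sqlite3

edges = [(1, 2), (2, 3), (3, 4)]          # synthetic input graph
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.execute("CREATE TABLE reach (node INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
conn.execute("INSERT INTO reach VALUES (1)")          # seed node

while True:
    # One "workflow step": nodes reachable in one more hop, not yet recorded.
    new = conn.execute("""SELECT DISTINCT e.dst FROM edge e
                          JOIN reach r ON e.src = r.node
                          WHERE e.dst NOT IN (SELECT node FROM reach)""").fetchall()
    if not new:
        break                                          # fixed point reached
    conn.executemany("INSERT INTO reach VALUES (?)", new)

print(sorted(n for (n,) in conn.execute("SELECT node FROM reach")))  # [1, 2, 3, 4]
```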
Geospatial databases: We are investigating how to extend big data systems such as HBase, SciDB, and MongoDB to efficiently support geospatial analyses.
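As one concrete example of the geospatial primitives involved, the sketch below uses MongoDB's 2dsphere index through pymongo; a running server is assumed and the database, collection, and field names are illustrative. HBase and SciDB call for different strategies, such as geohash row keys or chunked array coordinates, which are not shown here.

```python
from pymongo import MongoClient, GEOSPHERE

coll = MongoClient()["urban"]["sensors"]       # hypothetical database/collection
coll.create_index([("loc", GEOSPHERE)])        # 2dsphere index over GeoJSON points

coll.insert_one({"name": "station-12",
                 "loc": {"type": "Point",
                         "coordinates": [-87.63, 41.88]}})   # GeoJSON: [lon, lat]

# Find sensors within 5 km of a query point.
nearby = coll.find({"loc": {"$near": {
    "$geometry": {"type": "Point", "coordinates": [-87.62, 41.87]},
    "$maxDistance": 5000}}})
for doc in nearby:
    print(doc["name"])
```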