I am looking for students to work with me on the GeoDataspace and LDV projects described below. If you have strong coding skills in C/C++, please send a cover letter and a CV/resume to tanum at uchicago dot edu.

Science is disseminated primarily through published articles. But how can one verify and validate an experiment whose scientific method is purely computational and data-driven? Adopting reproducible tools and practices as part of the scientific method can help in subsequent validation. From a scientist's perspective, two fundamental challenges stand in the way of such adoption: (i) it is extremely tedious to identify dependencies and record how data is used within a computational analysis; and (ii) heterogeneous software stacks and constantly changing computing environments make it hard to reproduce the same environment.

I (along with my collaborators) have established three systems/frameworks that help improve this situation:

  • A Cyberinfrastructure for Conducting Reproducible Science,
  • A PROV-enabled High Performance Provenance Service,
  • A Virtualization Mechanism for Reproducible Databases
Please see our publications for details.

Use Cases

Most of our use cases are from geoscience and urban science domains.

EarthCube Building Blocks: CINERGI
The CINERGI project focuses on constructing a community inventory and knowledge base of geoscience information resources to meet the challenge of finding resources across disciplines, assessing their fitness for use in specific research scenarios, and providing tools for integrating and re-using data from multiple domains. Provenance is integral to CINERGI and enables metadata enhancers, which make geoscience information resources more complete. Whenever an enhancer changes the content of a record, a corresponding provenance record is created and made accessible via the CINERGI search interface.

Geoscience: Hydrology and Space Science
The Integrated Geoscience Observatory is a pilot project that creates an online platform for integrating data and associated software tools, contributed by separate geoscience research communities, into a unified toolset. The vision is to expand the individual domains of geoscience research toward study of the whole Sun-Earth system, and in so doing to uncover the system-level effects critical to the habitability of planet Earth. The Observatory uses GeoDataspace to attribute credit to contributors through publication of processed data. The toolset can be accessed and used either through a web-based computing environment or through download packages for local installation, with a nearly seamless transition between the two.

Plenario: An Urban Data CyberInfrastructure
Plenario is a new platform for accessing, combining, downloading, and visualizing datasets released by city, county, state, and federal governments. Currently in alpha, the Plenario prototype exploits the fact that the bulk of published urban datasets share the attributes of location and time. At present, Plenario is an upload-and-download platform and provides no mechanism for including derived datasets.

Research

Reproducible and Shareable Frameworks: Sharing and repeating applications is crucial for verifying claims, reproducing model/simulation runs, and promoting reuse of complex model applications. However, scientists lack effective mechanisms that enable easy sharing and efficient repeatability. It is not unusual for scientists to spend vast amounts of time and effort to capture, manage and organize the various data elements that a typical science application or model requires to operate: the input files, processing and manipulation scripts, manifests, or databases that must be assembled and organized appropriately for the model to function correctly. We are establishing a reproducible and shareable framework for geoscientists, termed the GeoDataspace, to make it easy for geoscientists to share and repeat their geoscience model applications in an efficient and effective way. The GeoDataspace system captures models and data in an integrated way, encapsulates them as a single shareable package, and allows the user to share/publish the package for wider community use or self-preserve it for further analysis.
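As a rough illustration of what "encapsulating models and data as a single shareable package" can involve, the sketch below records every file of a model directory in a manifest with its SHA-256 digest, so that a recipient can check that the package is complete and unchanged. This is not the GeoDataspace implementation: the function names `build_manifest` and `verify`, the file names, and the manifest layout are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(root):
    """Map each file under `root` (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify(root, manifest):
    """Return True if the files under `root` match the recorded manifest exactly."""
    return build_manifest(root) == manifest


# Hypothetical model application: one input file and one run script.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "inputs").mkdir()
    (model_dir / "inputs" / "rainfall.csv").write_text("day,mm\n1,4.2\n")
    (model_dir / "run_model.sh").write_text("#!/bin/sh\necho running\n")

    manifest = build_manifest(model_dir)
    print(json.dumps(manifest, indent=2))

    # The untouched directory verifies against its own manifest.
    intact = verify(model_dir, manifest)

    # Editing an input file is detected on the next verification.
    (model_dir / "inputs" / "rainfall.csv").write_text("day,mm\n1,9.9\n")
    changed = not verify(model_dir, manifest)
```

A content-addressed manifest of this kind is one simple way to make a package self-describing; a real system would also need to capture metadata about the model and its runtime environment.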

PROVaaS: Provenance-as-a-Service: PROVaaS is a RESTful service for storing and accessing provenance documents that follow the W3C PROV standard. Using the REST API, clients can upload multiple documents described in PROV. The service maintains the provenance, if any, among the documents. Clients can query the provenance of an entity or activity using the PROV description model. The basic motivation for building PROVaaS is to provide a fast-ingestion provenance database service that can ingest disconnected graphs and compute the provenance between ingested graphs as a batch operation. PROVaaS is thus suitable for science projects that need a host for preserving provenance but can accept some delay in provenance queries.
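The idea of ingesting disconnected graphs first and linking them later can be sketched in a few lines. The example below is not the PROVaaS API or its storage model: it uses plain dictionaries in a simplified PROV-JSON-like shape, and the helper names `link_documents` and `provenance_of` as well as the `ex:` identifiers are hypothetical. Two documents are uploaded independently; a batch step then connects them through the entity they share, after which a lineage query can cross document boundaries.

```python
from collections import defaultdict

# Two documents uploaded independently (simplified PROV-JSON-like dictionaries).
doc_a = {
    "entity": {"ex:raw.csv": {}, "ex:clean.csv": {}},
    "activity": {"ex:clean": {}},
    "used": {"_:u1": {"prov:activity": "ex:clean", "prov:entity": "ex:raw.csv"}},
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:clean.csv", "prov:activity": "ex:clean"}
    },
}
doc_b = {
    "entity": {"ex:clean.csv": {}, "ex:plot.png": {}},
    "activity": {"ex:plot": {}},
    "used": {"_:u1": {"prov:activity": "ex:plot", "prov:entity": "ex:clean.csv"}},
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:plot.png", "prov:activity": "ex:plot"}
    },
}


def link_documents(documents):
    """Batch step: merge many documents into one dependency map.

    Each key maps to the set of nodes it directly depends on. Documents
    become connected wherever they mention the same identifier.
    """
    depends_on = defaultdict(set)
    for doc in documents:
        for rel in doc.get("used", {}).values():
            depends_on[rel["prov:activity"]].add(rel["prov:entity"])
        for rel in doc.get("wasGeneratedBy", {}).values():
            depends_on[rel["prov:entity"]].add(rel["prov:activity"])
    return depends_on


def provenance_of(node, depends_on):
    """Return every entity or activity that `node` transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for parent in depends_on.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = link_documents([doc_a, doc_b])
print(sorted(provenance_of("ex:plot.png", graph)))
# prints ['ex:clean', 'ex:clean.csv', 'ex:plot', 'ex:raw.csv']
```

Deferring the linking step keeps ingestion cheap, at the cost that lineage spanning several documents is only available once the batch has run, which matches the trade-off described above.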

Reproducible databases with Light-weight Database Virtualization: Light-weight database virtualization (LDV) is a novel approach for sharing and repeating applications that use a relational database. LDV monitors a DB application to create a repeatability package that encapsulates the application and its dependencies (input files, binaries, and libraries) as well as the data from the database that is necessary and relevant for successful repetition. LDV relies on data provenance to determine which database tuples and input files are relevant. While monitoring an application to create a package, we incrementally construct an execution trace (provenance graph) that records dependencies across OS and DB boundaries. Such a package can be shared and re-executed on any compatible machine without installing dependencies (e.g., a database system or libraries) and without manually creating and setting up a database.
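To make the role of the execution trace concrete, here is a minimal sketch of selecting package contents by reachability in a provenance graph. It does not reproduce LDV's trace format or monitoring mechanism: the edge list, the `proc:`/`file:`/`query:`/`tuple:` node prefixes, and the function `package_contents` are assumptions made for illustration. Starting from the monitored process, everything it transitively depends on is included, whether the dependency is an OS-level file or a database tuple read by one of its queries.

```python
# Hypothetical execution trace: each edge is (consumer, dependency).
trace = [
    ("proc:analyze", "file:/app/analyze.bin"),
    ("proc:analyze", "file:/data/params.cfg"),
    ("proc:analyze", "query:q1"),
    ("query:q1", "tuple:sales/17"),
    ("query:q1", "tuple:sales/42"),
    ("proc:backup", "tuple:sales/99"),  # belongs to an unrelated process
]


def package_contents(trace, root):
    """Return (files, tuples) that `root` transitively depends on."""
    deps = {}
    for consumer, dependency in trace:
        deps.setdefault(consumer, []).append(dependency)

    # Depth-first traversal over the dependency edges.
    reached, frontier = set(), [root]
    while frontier:
        for dep in deps.get(frontier.pop(), []):
            if dep not in reached:
                reached.add(dep)
                frontier.append(dep)

    files = sorted(n[len("file:"):] for n in reached if n.startswith("file:"))
    tuples = sorted(n[len("tuple:"):] for n in reached if n.startswith("tuple:"))
    return files, tuples


files, tuples = package_contents(trace, "proc:analyze")
print(files)   # prints ['/app/analyze.bin', '/data/params.cfg']
print(tuples)  # prints ['sales/17', 'sales/42']
```

The tuple `sales/99` is left out because no path leads to it from `proc:analyze`; this is the sense in which provenance keeps the package limited to the relevant slice of the database.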

Products

Most Recent

1. PROVaaS: Provenance-as-a-Service. In Theory and Practice of Provenance (TaPP) (poster), 2015.
Abstract. PDF.

2. H. Meng, R. Kommineni, Q. Pham, R. Gardner, T. Malik, and D. Thain. An Invariant Framework for Conducting Reproducible Computational Science. In Journal of Computational Science, Elsevier, 2015. Invited.
Abstract. PDF. Software.

3. Q. Pham, S. Thaler, T. Malik, B. Glavic, I. Foster. Light-weight Database Virtualization. In IEEE International Conference on Data Engineering (ICDE), 2015.
Abstract. PDF. PPT. Software.

4. Q. Pham, T. Malik, I. Foster. SOLE: Towards Descriptive and Interactive Publications. In Implementing Reproducible Research, Editor: Victoria Stodden, et al., CRC Press, 2014.
Abstract. PDF. Software.

5. Q. Pham, T. Malik, I. Foster. Auditing and Maintaining Provenance in Software Packages. In International Provenance and Annotation Workshop (IPAW), 2014.
Abstract. PDF. PPT. Software.

6. D. Zhao, C. Shou, T. Malik, I. Raicu. Distributed Data Provenance for Large-Scale Data-Intensive Computing. In IEEE Cluster, 2013.
Abstract. PDF.

7. Q. Pham, T. Malik, I. Foster. Using Provenance for Repeatability. In USENIX NSDI Workshop on Theory and Practice of Provenance (TaPP), 2013.
Abstract. PDF. PPT. Software.

8. T. Malik, A. Gehani, D. Tariq, F. Zaffar. Managing and Querying Distributed Data Provenance in SPADE. In Data Provenance and Data Management for eScience, Springer, 2012.
Abstract. PDF. Software.

9. A. Gehani, D. Tariq, B. Baig, T. Malik. Policy-Based Integration of Provenance Metadata. In IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY), 2011.
Abstract. PDF.

10. T. Malik, L. Nistor, A. Gehani. Tracking and Sketching Distributed Data Provenance. In IEEE eScience, 2010.
Abstract. PDF. Software.

Presentations

1. GeoDataspace: Better Tools for Metadata Management. The EarthCube All Hands Meeting, May 2015.
2. A Reproducible Framework Powered By Globus. The Globus World, Apr. 2015. (Presented by Kyle Chard)
3. GeoDataspace: Simplifying Data Management Tasks with Globus. The American Geophysical Union (AGU), Dec. 2014.
4. Reproducibility is hard. Not NP-hard. The Notre Dame DASPOS Workshop, Sept. 2014.
5. Towards Verifiable Publications. The SIAM Annual Meeting, Jul. 2014.
6. Active Publications. IEEE eScience, 2013.

People

  • Tanu Malik
  • Cristian Vlaescu, programmer
  • Max Khon, programmer
  • Xiang Li, Master's student, current
  • Quan Pham, PhD, alumni
  • Rupa Kommineni, Master's student, alumni