GriDMan: Data Management for Scientific Applications in a Grid Environment (Finished)

Current data analysis tools for scientific applications face terabytes or even petabytes (PB) of data. For example, the size of the data in Earth Observation applications will reach 9 PB by the year 2010 and 14 PB by the year 2014. Such volumes make traditional approaches to data management unworkable. First, the data analysis algorithms used by scientists have a complexity of O(N²) or higher and cannot process such massive amounts of data in reasonable time. Second, data management tools do not keep up with the volume of data and its distribution. Third, many scientific projects are carried out by large groups of scientists distributed across several geographic locations.

Grid computing attempts to address all these problems jointly by providing the tools needed to share large quantities of resources within Virtual Organizations. In particular, Computational Grids focus on the sharing of CPU cycles, which allows inherently complex data analysis algorithms to be parallelized. Service Grids are dedicated to application services that are deployed on Grid nodes. Finally, in Data Grids, possibly heterogeneous nodes contribute local storage capacity to support the management of large volumes of data in a Virtual Organization. The Data Grid integrates distributed data sources to create a single virtual resource that provides its users with potentially unlimited storage capacity. Each data source may be a database, files or web pages, semi-structured and unstructured data, data streams, raw data from sensors, or multimedia data. But the Grid goes beyond sharing and distributing data and computing resources. The Data Grid has become a prevalent computing environment for data analysis in scientific applications.

eScience applications can greatly benefit from Grid environments that adopt state-of-the-art database technology. Currently, however, the primary data unit used in the Grid is the file. Consequently, querying, importing, analyzing and updating data requires writing programs that operate on these files. This is labor-intensive and requires re-programming whenever the format of a file changes, new sources are added, or a query is slightly changed. In addition, since there is no global control in a Grid environment and providers can withdraw nodes at their sole discretion, guaranteeing a certain level of data availability requires replicating data across several nodes. Current Data Grids delegate replication management to the users and provide no built-in support for maintaining a certain degree of replication, for supporting updates of replicated data, or for providing data in different versions and levels of freshness.

Thus, Grid applications are faced with two basic problems:

Data Gathering & Integration: A typical eScience application both processes and generates extremely large amounts of data. The scalability of data gathering is therefore a very serious problem. In addition, data sets are heterogeneous and span multiple sites and granularities. Depending on the particular application, different data models and views might be required.

Data Management: In contrast to the first Data Grid applications, which mostly dealt with read-only data, novel eScience applications need to handle both read-only and updateable data. As a result, new eScience applications require tools that guarantee consistent data replication without a substantial increase in data processing costs. The data freshness that users demand for their applications should also be honored. Finally, the Data Grid should be able to distribute the application load uniformly among many Grid nodes.

The GriDMan project focuses on the second aspect: novel solutions for data management in a Grid environment, including dynamic replication and different levels of consistency.
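The data management requirements named above (a maintained degree of replication, freshness demands, and load distribution) can be illustrated with a small Python sketch. It is not taken from the GriDMan project: the `Replica` record, the version-lag notion of freshness, and both function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node: str      # hypothetical Grid node identifier
    version: int   # number of the last update applied at this replica
    load: float    # assumed node load, 0.0 (idle) to 1.0 (saturated)


def choose_replica(replicas, latest_version, max_lag):
    """Return the least-loaded replica that is at most max_lag versions behind.

    Returns None when no replica satisfies the freshness demand.
    """
    fresh_enough = [r for r in replicas if latest_version - r.version <= max_lag]
    if not fresh_enough:
        return None
    return min(fresh_enough, key=lambda r: r.load)


def nodes_to_replicate_to(replicas, live_nodes, target_degree):
    """Pick live nodes that should receive a copy to restore the target degree.

    Replicas on withdrawn nodes (not in live_nodes) no longer count.
    """
    holders = {r.node for r in replicas if r.node in live_nodes}
    missing = max(0, target_degree - len(holders))
    candidates = sorted(n for n in live_nodes if n not in holders)
    return candidates[:missing]
```

In this sketch, a reader that tolerates stale data can be routed to a lightly loaded but outdated replica, while a reader that demands the latest version is restricted to up-to-date replicas. When a provider withdraws a node, `nodes_to_replicate_to` reports where new copies would be needed to restore the assumed target degree.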

Start / End Dates

01.05.2011 - 31.10.2013

Partners

Associated partners: Prof. Ruoming Jin, Kent State University, Ohio, USA

Funding Agencies

Swiss National Science Foundation (SNF)

Funding

122’845.- CHF

Staff

Research Topics

Publications

2013
2012
2010