ABSTRACT: Finding Tropical Cyclones On Clouds
Extensive computing power has been used to tackle issues such as climate change, fusion energy, and other pressing scientific challenges. In this work, we bring the power of cloud computing to bear on the task of analyzing trends of tropical cyclones in climate simulation data. The cloud computing platform is attractive here because it can provide an environment familiar to climatologists and their analysis tools. We created virtual machines (VMs) and ran them on the Magellan Scientific Cloud at Argonne National Laboratory. Each VM communicates with other instances of itself to split up and analyze large datasets in parallel. In a preliminary test, we used this virtual climate analysis platform to analyze ~500GB of climate data. Using 34 VMs, the total analysis time was reduced by a factor of ~40 relative to the traditional analysis method. The main advantages of our method are that the level of parallelism is easily configurable and that software dependency resolution is simple. This initial work demonstrates that a cloud computing system is a viable platform for distributed scientific data analysis traditionally conducted on dedicated supercomputing systems.
Climatologists generate many terabytes of data using climate simulation techniques. This data requires computationally intensive analysis. The code to perform this analysis often contains numerous software dependencies, and climatologists want to use variable amounts of compute resources to control analysis speed. Hence, a computational platform that provides both configurable parallelism and simple dependency resolution is desirable.
We used parallel virtualization to create such a platform. Virtualization is a burgeoning technology that has yet to be applied in the climatology community. Parallel virtualization on clouds is suitable for climatologists because it provides both easy dependency resolution (climatologists can make their VMs look just like their personal workstations) and configurable parallelism (climatologists can launch as many instances of VMs as necessary).
We designed a virtual machine to perform climate data analysis on cloud computing clusters. Our virtual machine contains a climate analysis program (not written by us) which analyzes climate data in NetCDF format and outputs lists of possible tropical storms. This code contains numerous dependencies on NetCDF library functions, all of which were easy to resolve inside the virtualized environment. The virtual machine also contains GridFTP software and credentials, and a control script implemented in Python. Upon launch, our virtual machine uses its rc.local file to launch the control script. The control script runs a leader election algorithm to find other climate analysis VMs on the subnet and elect a single leader. The elected leader initializes a queue of remote URLs to the files we wish to analyze, as well as a small RPC server and library to wrap around the queue. We needed a way for multiple processes on multiple nodes to access a single synchronized queue, and RPC is a convenient abstraction for this. The workers pull URLs from the queue using the leader's RPC library. Each time a worker receives a URL, it uses GridFTP to stage in the file referred to by the URL, runs the climate analysis code on it, and stages all the results out to a specified remote directory. When there are no more URLs left in the queue to analyze, the VMs shut themselves down.
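The leader/worker queue described above can be illustrated with a minimal Python sketch. This is not our control script: it assumes XML-RPC from the Python standard library as the RPC mechanism, and the names start_leader, run_worker, next_url, and process_file are hypothetical. Leader election is omitted, and process_file stands in for the GridFTP stage-in, the climate analysis run, and the stage-out of results.

```python
import queue
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_leader(urls, host="127.0.0.1", port=0):
    """Serve a synchronized queue of URLs over XML-RPC.

    Returns (server, port). Port 0 lets the OS pick a free port.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    def next_url():
        # Return "" when the queue is empty, because XML-RPC does not
        # marshal None unless allow_none is enabled.
        try:
            return pending.get_nowait()
        except queue.Empty:
            return ""

    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(next_url, "next_url")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def run_worker(leader_host, leader_port, process_file):
    """Pull URLs from the leader until none remain; return the URLs handled."""
    leader = ServerProxy(f"http://{leader_host}:{leader_port}")
    handled = []
    while True:
        url = leader.next_url()
        if not url:
            break  # queue drained; a real VM would shut itself down here
        process_file(url)  # placeholder: stage in, analyze, stage out
        handled.append(url)
    return handled
```

In this sketch every worker, whether a process on the leader's node or on another VM, only needs the leader's address; the queue itself never leaves the leader, so each URL is handed out exactly once.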
With this method, we have created an environment in which code previously limited to single-processor execution can be massively parallelized.
Our most encouraging preliminary result is an analysis of ~500GB of climate simulation data in under 3 hours using 34 VM instances. Analysis of the same dataset using the same code has previously taken 5-7 days running on climatologists' workstations. We have been able to control analysis speed by using different instance types and numbers of instances. Resolving software dependencies was simple; we plan to try to run the same code using traditional batch-based supercomputing to compare both ease of environment setup and computing speed. Some reliability issues were found with the cloud platform we used; it was often difficult to get VM instances to run. We believe that these issues will be resolved in time, and even with the issues we were able to successfully complete analysis tasks using many VM instances.
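The speedup implied by these timings can be checked with a few lines of Python. The sketch below treats "under 3 hours" as exactly 3 hours, which is an assumption, so the figures it prints are lower bounds on the speedup.

```python
def speedup(baseline_hours, parallel_hours):
    """Ratio of the baseline run time to the parallel run time."""
    return baseline_hours / parallel_hours


baseline_days = (5, 7)   # workstation analysis time reported above
cloud_hours = 3.0        # assumed upper bound for the 34-VM run

low, high = (speedup(days * 24, cloud_hours) for days in baseline_days)
print(f"speedup of at least {low:.0f}x to {high:.0f}x")
# prints "speedup of at least 40x to 56x"
```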
Presentation: We will discuss our motivation, methodology, and findings, and review the techniques we learned and used while creating the VM. By presentation time, we will have extensive timing data and will use this to draw conclusions about how best to utilize cloud resources. We will also present a comparative analysis of the differences between using parallel virtualization and traditional supercomputing for climate analysis, focusing on both platforms' ability (or lack thereof) to meet climatologists' needs.
Daren J. Hasenkamp - Lawrence Berkeley National Laboratory