Optimizing End-to-End Performance of Scientific Workflows in Distributed Environments
SESSION: Student Poster Reception
EVENT TYPE: Poster, ACM Student Poster
TIME: 5:15PM - 7:00PM
ABSTRACT: Advances in supercomputing technology are expediting the transition in various basic and applied sciences from traditional laboratory-controlled experimental methodologies to modern computational paradigms involving complex numerical model analyses and extreme-scale simulations. These computation-based simulations and analyses have become an essential research and discovery tool in next-generation scientific applications and are producing colossal amounts of data. Scientific data of similar scales generated across broad science communities include real-world environmental data, such as satellite climate data and multimodal sensor data, and high-throughput experimental data from facilities such as the Spallation Neutron Source and the Large Hadron Collider. Regardless of the type of data, one crucial task is to develop an effective and efficient end-to-end solution that enables geographically distributed users to transfer, process, visualize, analyze, and synthesize the data in heterogeneous network environments for collaborative research. The computing tasks of these scientific applications feature complex workflows comprised of many computing modules with intricate inter-module execution dependencies, and the execution of computing modules typically involves the invocation of a large number and wide variety of distributed computing tools for collaborative data analysis and knowledge discovery. Supporting such scientific workflows in heterogeneous network environments and optimizing their end-to-end performance are crucial to ensuring the success of mission-critical e-sciences and maximizing the utilization of massively distributed computing and networking resources.
The computing workflows of scientific applications are modeled as a Directed Acyclic Graph (DAG), where each vertex represents a computing module and each directed edge represents the execution dependency between two adjacent modules. We consider the network environment as an overlay computer network consisting of a number of heterogeneous computer nodes interconnected by disparate network links, represented by a directed weighted graph with an arbitrary topology. We assume that the bandwidths of overlay links are not limited by the nodes' upload or download capacities and are independent of each other. The DAG-structured workflow mapping problem has been well studied in the literature and is known to be NP-complete even on two processors without any topology or connectivity restrictions. We consider a general workflow mapping problem: select an appropriate set of computer nodes in the network and assign each computing module in the workflow to one of those selected nodes to achieve (i) Minimum End-to-end Delay (MED) for fast system response in single-input applications, and (ii) Maximum Frame Rate (MFR) for smooth data flow in streaming applications with multiple-input (time-serial) datasets. Note that if multiple modules are mapped onto the same node, the node's computing resource is shared fairly by the modules running concurrently on that node. Similarly, the bandwidth of a network link is shared equally by multiple data transfers taking place concurrently over the same link. The difficulty of this mapping problem essentially arises from the topological matching nature in the spatial domain, which is further compounded by the resource sharing complexity in the temporal dimension when multiple modules are deployed on the same node.
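The models above can be sketched in a few lines of code. This is a minimal illustration, not the SWAMP implementation: all class and function names are hypothetical, and only the fair CPU-sharing rule for co-located modules is shown.

```python
from collections import defaultdict

class Workflow:
    """DAG of computing modules: vertices carry workloads, edges carry data sizes."""
    def __init__(self):
        self.succ = defaultdict(dict)   # module -> {successor module: data size}
        self.workload = {}              # module -> computing workload

    def add_module(self, m, workload):
        self.workload[m] = workload

    def add_dependency(self, u, v, data_size):
        self.succ[u][v] = data_size

class Network:
    """Overlay network: heterogeneous nodes joined by directed weighted links."""
    def __init__(self):
        self.power = {}                     # node -> processing power
        self.bandwidth = defaultdict(dict)  # node -> {node: link bandwidth}

def module_exec_time(wf, net, mapping, m):
    """Execution time of module m under fair sharing: the node's power is
    divided equally among all modules concurrently mapped to that node."""
    node = mapping[m]
    colocated = sum(1 for x in mapping if mapping[x] == node)
    return wf.workload[m] / (net.power[node] / colocated)
```

For example, two modules of workloads 10 and 20 mapped to one node of power 5 each receive an effective power of 2.5, giving execution times of 4 and 8 time units, respectively.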
We construct analytical cost models and formulate workflow mapping as optimization problems. For MED in interactive applications, we propose an efficient algorithm to compute the exact end-to-end delay of a mapped workflow with arbitrary node reuse and develop a distributed workflow mapping scheme based on a recursive critical path optimization procedure; for MFR in streaming applications, we conduct a rigorous workflow stability analysis and develop a distributed layer-oriented dynamic programming solution based on topological sorting to identify and minimize the global bottleneck. The accuracy of the proposed exact delay calculation algorithm is verified against an approximate solution, a dynamic distributed system simulation program, and a real network deployment; the performance superiority of the proposed mapping schemes is illustrated by extensive simulation-based comparisons with existing algorithms and verified by large-scale experiments on Spallation Neutron Source workflows through the implementation and deployment of the SWAMP system in real networks. This work is part of a three-year project, "Towards a Scalable and Adaptive Application Support Platform for Large-Scale Distributed e-Sciences in High-Performance Network Environments", funded by the Office of Science of the U.S. Department of Energy and carried out by a team of two principal investigators and five Ph.D. students, in which I am a key participant. In this project, we proposed the Scientific Workflow Automation and Management Platform (SWAMP), which contains a set of integrated computing and networking toolkits for application scientists to conveniently assemble, execute, monitor, and control complex computing workflows in heterogeneous high-performance network environments.
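The bottleneck idea behind the MFR objective can be sketched as follows. Under pipelined streaming, the achievable frame rate is bounded by the slowest stage (a module execution or an inter-node transfer) of the mapped workflow, so the frame rate is the reciprocal of the global bottleneck. This is a simplified, hypothetical sketch: the actual layer-oriented dynamic program explores candidate mappings layer by layer in topological order, whereas here the mapping is given and only the bottleneck is evaluated, with fair CPU sharing among co-located modules and no bandwidth-sharing model.

```python
from collections import defaultdict

def topological_layers(succ, modules):
    """Group modules of a DAG into layers by topological order (Kahn's algorithm)."""
    indeg = {m: 0 for m in modules}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    layer = [m for m in modules if indeg[m] == 0]
    layers = []
    while layer:
        layers.append(layer)
        nxt = []
        for u in layer:
            for v in succ.get(u, {}):
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        layer = nxt
    return layers

def global_bottleneck(succ, workload, mapping, power, bandwidth):
    """Largest per-frame stage time over all module executions and data
    transfers; the frame rate of the mapped workflow is its reciprocal."""
    share = defaultdict(int)            # fair CPU sharing among co-located modules
    for m in mapping:
        share[mapping[m]] += 1
    worst = 0.0
    for m, w in workload.items():
        node = mapping[m]
        worst = max(worst, w / (power[node] / share[node]))
    for u in succ:
        for v, size in succ[u].items():
            if mapping[u] != mapping[v]:    # no transfer cost on the same node
                worst = max(worst, size / bandwidth[mapping[u]][mapping[v]])
    return worst
```

For a two-module chain with execution times 2 and 5 and a transfer time of 4, the bottleneck is 5, so at most one frame completes every 5 time units; minimizing this quantity over candidate mappings is what maximizes the frame rate.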
Mapping the component modules of workflows to appropriate computer nodes in the high-performance environment is a critical task in SWAMP to ensure the end-to-end performance of workflows and the efficient utilization of distributed system and network resources. My main responsibility in this project is to develop a workflow mapper for SWAMP that optimizes workflow performance by incorporating my workflow mapping solutions based on rigorous algorithm design and performance analysis.