AmLight-ExP and AtlanticWave-SDX offer the academic community 630 Gbps of upstream bandwidth, network auto-recovery and dynamic provisioning, network programmability, network telemetry, integration with the SENSE project's distributed orchestrator, and 100G DTNs.
The AtlanticWave-SDX (AW-SDX) (NSF Award #1451024) is a distributed, multi-domain, wide-area SDX platform that controls many network switches across the U.S. and South America. Southern Crossroads (SoX) in Atlanta, AMPATH in Miami, South America eXchange (SAX) in Fortaleza, SouthernLight in Sao Paulo, and AndesLight in Santiago are exchange points participating in the AtlanticWave-SDX project.
Transient faults, temporary failures in processors or memories, are a growing concern for emerging extreme-scale HPC systems. New work at Georgia Tech focuses on automatic on-the-fly recovery from some faults in HPC applications. The key observation is that many HPC codes, e.g. stencil-based codes, require multiple integer operations to calculate array index values, and that a fault in this series of operations 1) is relatively easy to detect because it likely causes a segmentation fault, and 2) can be recovered from by simply replaying those address operations. To test this hypothesis, we created CARE, a lightweight compiler-assisted technique for on-the-fly repair of processes crashed by transient faults in the address path. The goal of CARE is to repair faulting processes so that they simply continue executing instead of being terminated and restarted. CARE becomes active only when a corrupted address is dereferenced, and so imposes no run-time overhead in the case of fault-free execution.
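The replay idea can be illustrated with a toy sketch. This is not CARE itself, which is compiler-assisted and repairs native processes after a segmentation fault; here a Python `IndexError` stands in for the segfault, and the names `flat_index`, `load_with_replay`, and `fault_mask` are hypothetical, introduced only for illustration.

```python
def flat_index(i, j, ncols, fault_mask=0):
    """Row-major index computation; fault_mask simulates a transient bit flip
    in the integer address arithmetic (0 means fault-free)."""
    return (i * ncols + j) ^ fault_mask


def load_with_replay(grid, i, j, ncols, fault_mask=0):
    """Load one stencil operand. If the access faults, replay the address
    computation from its inputs and retry. Returns (value, repaired)."""
    idx = flat_index(i, j, ncols, fault_mask)
    try:
        if idx < 0:
            # Python would silently wrap negative indices; treat them as faults.
            raise IndexError(idx)
        return grid[idx], False
    except IndexError:
        # The fault was transient, so recomputing the index without the
        # injected error yields the intended address.
        idx = flat_index(i, j, ncols)
        return grid[idx], True


grid = list(range(16))  # a 4x4 grid stored row-major
clean = load_with_replay(grid, 2, 3, 4)
repaired = load_with_replay(grid, 2, 3, 4, fault_mask=1 << 20)
```

The recovery path runs only when an access actually fails, which mirrors the claim that fault-free execution pays no extra cost. A bit flip that still produces an in-bounds index would go undetected in this sketch, consistent with the work targeting only some faults.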
Find out more about this work from 4 - 4:30 PM on Wednesday, Nov. 20 in Room 401-402-403-404.
Kiran Ravikumar, David Appelhans, P.K. Yeung
A Georgia Tech and IBM team has developed a batched asynchronous algorithm using GPUs to perform large pseudo-spectral turbulence simulations out of CPU memory on dense node architectures like Summit. The code uses optimized strided copy kernels and MPI+OpenMP parallelism to scale to extreme problem sizes.
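As a rough illustration of the batching pattern only, the sketch below processes a row-major array a few columns at a time, fetching the next batch while the current one is being computed. It is not the team's code: plain Python lists stand in for CPU and GPU memory, a worker thread stands in for an asynchronous copy stream, the "compute" step is a placeholder, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_in(flat, ncols, cols):
    """Strided copy: gather the given columns of a row-major array."""
    return [flat[c::ncols] for c in cols]


def compute(batch):
    """Placeholder for the per-batch GPU work (here: double every value)."""
    return [[2 * x for x in col] for col in batch]


def copy_out(flat, ncols, cols, batch):
    """Strided copy back: scatter processed columns into the array."""
    for c, col in zip(cols, batch):
        flat[c::ncols] = col


def process_in_batches(flat, ncols, batch_cols):
    """Process the whole array in column batches of at most batch_cols."""
    batches = [list(range(s, min(s + batch_cols, ncols)))
               for s in range(0, ncols, batch_cols)]
    if not batches:
        return flat
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(copy_in, flat, ncols, batches[0])
        for k, cols in enumerate(batches):
            batch = pending.result()
            if k + 1 < len(batches):
                # Start fetching the next batch before working on this one.
                pending = pool.submit(copy_in, flat, ncols, batches[k + 1])
            copy_out(flat, ncols, cols, compute(batch))
    return flat


data = process_in_batches(list(range(12)), ncols=4, batch_cols=3)
```

Because each batch touches a disjoint set of columns, the prefetch of the next batch never reads data that the current batch is writing back.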
Find out more about this work from 11:30 AM - 12 PM on Tuesday, Nov. 19 in Room 405-406-407.