Gateways 2019: Full Schedule

9:00am PDT

Portable, Reproducible High Performance Computing In the Cloud (All-day Tutorial)

This tutorial will focus on providing attendees exposure to state-of-the-art techniques for portable, reproducible research computing, enabling them to easily transport analyses from cloud to HPC resources. We will introduce open source technologies such as Jupyter, Docker and Singularity, the emerging "serverless" computing paradigm, and how to utilize these tools within two NSF-funded cyberinfrastucture platforms, Tapis API (formerly Agave API) and Abaco API. The approaches introduced not only increase application portability and reproducibility but also reduce or eliminate the need for investigators to maintain physical infrastructure so that more time can be spent on analysis. For the tutorial, attendees will have access to allocations on XSEDE JetStream and one or more HPC resources such as TACC’s Stampede2 or Frontera.

Target Audience: This tutorial is targeted to CI professionals and researchers that are interested in learning to use container technologies for research computing, and leveraging national cyberinfrastructure platforms to execute containerized compute jobs on cloud and HPC resources.

Content Level: Beginner 70%, Intermediate 30%

Prerequisites: Requirements before the Workshop: Basic familiarity with Linux, SSH and the command line will be assumed. A valid, active TACC account will be needed to complete the exercises (attendees can register for a TACC account for free on the TACC User Portal: https://portal.tacc.utexas.edu/account-request ). Some familiarity with Python will be helpful but not required. Attendees must bring their own laptops.

Presenters

Julia Looney

Joe Stubbs

Anagha Jamthe

Sean Cleveland

Monday September 23, 2019 9:00am - 5:00pm PDT
Toucan Room, Catamaran Resort

Tutorial, All-Day

10:30am PDT

Gateway design features for undergraduate education communities

Like science gateways, an education gateway should provide research and management support for a community of practitioners working collaboratively to solve a set of challenging problems. While the technical aspects of the cyberinfrastructure play an important role in the utility of a gateway, they are not sufficient to attract users who are new to collaborative, online scholarship. Over the course of the development of the Quantitative Undergraduate Biology Education and Synthesis (QUBES) gateway we have learned to adapt our services and messaging to reach out to our target audience and recruit their participation. Part of this process has involved aligning our services with common project management challenges and being aware of the opportunities and constraints faced by teaching faculty. Adopting a client-centered approach has made it possible not only to build our user base, but to foster important conversations among users around promoting a shared culture that supports scholarly approaches to teaching.

Presenters

Sam Donovan

Michael LaMar

Associate Professor, College of William and Mary

Tuesday September 24, 2019 10:30am - 10:50am PDT
Toucan Room, Catamaran Resort

Concurrents B, Education

10:50am PDT

External Communication to Diffuse Science Gateways and Cyberinfrastructure for Research with Big Data

In the era of big data, for science gateways (SG) and cyberinfrastructure (CI) projects to have the greatest impacts, the tools need to be widely adopted in the scientific community. However, diffusion activities are often an afterthought in SG/CI projects. We warn against the fallacy of ‘If You Build It, They Will Come’. Projects could be intentional in promoting tool adoption. We identified five external communication practices based on an analysis of 20 interviews with administrators, developers, users, and outreach educators working in CI across the US. The practices include raising awareness of the innovations, engaging in educational outreach, building relationships with trust, networking with the community, and keeping a track record of reliability. While exploratory in nature, the findings can be used as a guideline for project to promote SG/CI diffusion. The paper serves as evidence to justify a bigger budget from funder for diffusion activities to increase adoption and broader impacts.

Presenters

Bethanie Le

Faith Escalera

Kulsawasd Jitkajornwanich

Kerk Kee

Tuesday September 24, 2019 10:50am - 11:10am PDT
Toucan Room, Catamaran Resort

Concurrents B, Education

11:10am PDT

TAMU HPRC Portal: Leveraging Open OnDemand for Research and Education

The Texas A&M University High Performance Research Computing (TAMU HPRC) Portal is a local installation and adaptation of Open OnDemand (OOD) on the HPRC clusters. The portal provides an advanced cyberinfrastructure that enables HPRC users with various backgrounds to utilize the High Performance Computing (HPC) resources for their research. It also serves as an educational platform for HPRC staff to train their users with cluster technologies and HPC applications.
Using OOD for the HPRC portal has three benefits. First, it provides a single point of access to all the HPC resources via a web browser and can greatly simplify HPC workflows. Second, it provides an intuitive user interface that significantly reduces the barrier between users and HPC working environments. Third, the extensible and scalable design makes it easy to accommodate a growing number of users and applications.
In addition to the out-of-the-box features, we have extensively customized the Matlab interface for our local needs. We have also developed a dynamic form generation scheme that makes the portal app deployment and management more efficient. We have used the portal in multiple training programs and have received positive feedback from the instructors and the users.
To understand the impact of the portal on our users, we analyzed portal access data and conducted a survey among HPRC portal users. We received 148 survey responses out of 554 users who have accessed the portal between March 22, 2018 and April 24, 2019. The responses demonstrate that most users think the apps are useful and they would recommend the portal to other users. Additionally, we provide two use cases from our portal users, one for research and one for training, to demonstrate the usefulness of the portal.
Our paper is the first that describes the experience with OOD from an HPC site outside of OSC. Overall, the TAMU HPRC Portal based on OOD provides a robust and simple solution for both novice and experienced users at TAMU HPRC to access HPC resources. It is a valuable addition to the traditional command line based approach.

Presenters

Ping Luo

Donald Mcmullen

Mark Huang

Shaowen Mao

Michael Dickens

Yang Liu

Marinus Pennings

Tuesday September 24, 2019 11:10am - 11:20am PDT
Toucan Room, Catamaran Resort

Concurrents B, Education

11:20am PDT

Using a Scientific Gateway to Build STEM Education Capacity Around the World

With a broad focus on STEM education, STEMEdhub.org brings together university researchers in STEM disciplines with other researchers, k-12 teachers/practitioners, students and the public through the various groups and projects on the site. Built on Purdue’s HUBzero architecture, STEMEdhub.org is a fully functional Gateway that facilitates the hosting of interactive scientific tools, online presentations, wikis, or documents such as assessment plans and courses for downloading or interactive editing, complemented by document tagging to enable searching and a rating tool for commenting on shared resources. STEMEdhub has been used for over 8 years to build capacity in many areas of STEM Education in the United States and throughout the world. It currently hosts over 6000 users in 160 user groups with over 1300 published resources. More importantly, STEMEdhub.org allows users themselves to create and manage their own groups, resources and communities of practice enabling it to exist on very little overhead with a small staff. While other Scientific Gateways focus on high performance computing capabilities, STEMEdhub is focused on using a Scientific Gateway platform to make connections, build partnerships and engage students. This Demo will show how STEMEdhub.org is used as a Scientific Gateway to build STEM Education Capacity throughout the world.

Presenters

Ann Bessenbacher

Data Scientist, ELRC/Purdue University

Wilella Burgess

Loran Carleton Parker

Tuesday September 24, 2019 11:20am - 11:40am PDT
Toucan Room, Catamaran Resort

Concurrents B, Education

Company Demo

11:40am PDT

Open OnDemand: State of the Platform and the Project

High performance computing (HPC) has led to remarkable advances in science and engineering and has become an indispensable tool for research. Unfortunately, HPC use and adoption by many researchers is often hindered by the complex way in which these resources are accessed. Indeed, while the web has become the dominant access mechanism for remote computing services in virtually every computing area, it has not for HPC. Open OnDemand is an open source project to provide web based access to HPC resources (https://openondemand.org). This paper describes the challenges to adoption and other lessons learned over the three year project that may be relevant to other science gateway projects, and describes future plans in the Open OnDemand 2.0 project.

Presenters

Alan Chalker

David Hudak

Eric Franz

Morgan Rodgers

Trey Dockendorf

Doug Johnson

Tuesday September 24, 2019 11:40am - 12:00pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Education

1:50pm PDT

DDX-Interface: An interface to and a factory of interoperable scientific gateways.

Data access and distribution is an ongoing problem in science, affecting many research fields, from genomic information to microscopic images of rocks. Issues such as differing database schema and file formats, inability to customize or enforce laboratory terms of use, infrastructure failure and financial limitations have reduced public access to scientific data. Centralized solutions have been funded by government agencies in an attempt to expand access, but often, valuable resources cannot be published in repositories without being part of a peer-reviewed publication. Previously, we proposed to answer the demand for public access to raw scientific data using Open Index Protocol, a specification to publish metadata into a public blockchain-based ledger and host files in peer-to-peer file systems. With this method, there are 30TB of cryo-electron tomography datasets publicly available today. Now, we generalized this idea to let researchers publish any kind of scientific datasets using a distributed public ledger as a common index between interoperable databases. Here we describe a customizable gateway capable of exploring these distributed databases and publishing new records. The basic gateway design is built to be intuitively operable by academic and non-academic public alike, expanding the reach of the data distribution. As Open Index Protocol becomes a popular choice for data distribution by laboratories, focus on the user experience of the interface for data consumption will be key to achieve its full impact on society.

In the demo part of this presentation, we will demonstrate how to build a distributed database to share scientific data using Open Index Protocol, a specification to publish metadata on FLO blockchain and use a peer-to-peer file system for file storage and distribution. To begin, we will launch an instance of DDX and publish the metadata schema to the blockchain. Next, we will publish a few datasets to the database using the schema. Then, we will configure the explorer template and customize it to create a static webpage capable of exploring, searching & downloading the datasets published. A remote colleague will run another instance of DDX, configured to be compatible with the database we just created, and will use it to publish some records. We will be able to visualize their records in our own instance of DDX. Finally, we will show how to deploy and build a static website that serves as a gateway to visualize the records in the newly created database. In this brief demonstration, we will show the flexibility and power of this distributed resource for increasing access to raw datasets. Our main goal is to make it easy for researchers to participate in the effort to host and share their own data.

Presenters

Amy James

Ryan Chacon

Davi Ortega

Prudence Rees-Lee

Jeremiah Buddenhagen

Devon Read James

Grant J. Jensen

Tuesday September 24, 2019 1:50pm - 2:20pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Federation & Interoperability

Company Paper plus Demo

2:20pm PDT

Enabling Data Streaming-based Science Gateway through Federated Cyberinfrastructure

Large scientific facilities are unique and complex infrastructures that have become fundamental instruments for enabling high quality, world-leading research tackling scientific problems at unprecedented scales. Cyberinfrastructure (CI) is an essential component of these facilities to provide the user community with access to data, data products, and services with the potential to transform data into knowledge. However, the timely evolution of the CI available at the large facilities is challenging and can result in science communities requirements not being fully satisfied. Furthermore, integrating CI across multiple facilities as part of a scientific workflow is hard, resulting in data silos.

In this paper, we explore how science gateways can provide improved user experience and services that may not be offered at the large facilities datacenter. Using a science gateway supported by the Science Gateway Community Institute that provides subscription-based delivery of streamed data and data products from the NSF Ocean Observatories Initiative (OOI), we propose a system that enables streaming-based capabilities and workflows using data from large facilities such as OOI in a scalable manner. We leverage data infrastructure building blocks such as the Virtual Data Collaboratory that provides data and computing capabilities in the continuum to efficiently and collaboratively integrate multiple data-centric CI, build data-driven workflows and connect large facilities data sources with NSF-funded CI such as XSEDE. We also introduce architectural solutions for running these workflows using dynamically provisioned federated CI.

Tuesday September 24, 2019 2:20pm - 2:30pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Federation & Interoperability

3:00pm PDT

nanoHUB@home: Expanding nanoHUB through Volunteer Computing

Volunteer computing (VC) uses consumer digital electronics products, such as PCs, mobile devices, and game consoles, for high-throughput scientific computing. Device owners participate in VC by installing a program which, in the background, downloads and executes jobs from servers operated by science projects. Most VC projects use BOINC, an open-source middleware system for VC. BOINC allows scientists create and operate VC projects and enables volunteers to participate in these projects. Volunteers install a single application (the BOINC client) and then choose projects to support. We have developed a BOINC project, nanoHUB@home, to make use of VC in support of the nanoHUB science gateway. VC has greatly expanded the computational resources available for nanoHUB simulations.

We are using VC to support “speculative exploration”, a model of computing that explores the input parameters of online simulation tools published through the nanoHUB gateway, pre-computing results that have not been requested by users. These results are stored in a cache, and when a user launches an interactive simulation our system first checks the cache. If the result is already available it is returned to the user immediately, leaving the computational resources free and not re-computing existing results. The cache is also useful for machine learning (ML) studies, building surrogate models for nanoHUB simulation tools that allow us to quickly estimate results before running an expensive simulation.

VC resources also allow us to support uncertainty quantification (UQ) in nanoHUB simulation tools, to go beyond simulations and deliver real-world predictions. Models are typically simulated with precise input values, but real-world experiments involve imprecise values for device measurements, material properties, and stimuli. The imprecise values can be expressed as a probability distribution of values, such as a Gaussian distribution with a mean and standard deviation, or an actual distribution measured from experiments. Stochastic collocation methods can be used to predict the resulting outputs given a series of probability distributions for inputs. These computations require hundreds or thousands of simulation runs for each prediction. This workload is well-suited to VC, since the runs are completely separate, but the results of all runs are combined in a statistical analysis.

Presenters

Benjamin Haley

Steven Clark

Nathan Denny

Saaketh Desai

Martin Hunt

David Anderson

Tuesday September 24, 2019 3:00pm - 3:20pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Cloud & HPC

3:20pm PDT

Cloud bursting to AWS from the CIPRES Science Gateway

The role of commercial cloud computing as a source of scalable compute power for science gateways is an area of ongoing investigation. As part of this effort, we are exploring the practicality of cloud bursting to a commercial provider for the CIPRES Science Gateway (CIPRES), a highly accessed gateway that delivers compute resources to users across all fields of biology. CIPRES provides browser and RESTful access to popular phylogenetics codes run on large computational clusters. Historically, CIPRES has submitted compute-intensive jobs to clusters provided through the NSF-funded XSEDE project. An ongoing issue for CIPRES is whether compute time available on XSEDE resources will be adequate to meet the needs of a large and growing user base. Here we describe a partnership with Internet2 to create infrastructure that supports CIPRES submissions to compute resources available through a commercial cloud provider, Amazon Web Services (AWS). This paper describes the design and implementation of the infrastructure created, which allows users to submit a specific subset of CIPRES jobs to V100 GPU nodes at AWS. This new infrastructure allows us to refine and tune job submissions to commercial clouds as a production service at CIPRES. In the short term, the results will speed the discovery process by allowing users greater discretionary access to GPU resources at AWS. In the long term, this infrastructure can be expanded and improved to submit all CIPRES jobs to one or more commercial providers on a fee-for-service basis.

Presenters

Mark Miller

Wayne Pfeiffer

Trevor Cooper

Dmitry Mishin

Tony Chen

Shawn Strande

Tuesday September 24, 2019 3:20pm - 3:30pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Cloud & HPC

3:30pm PDT

Supporting Characterisation Communities with Interactive HPC (Characterisation Virtual Laboratory)

The Characterisation VL is an Australian nationally funded virtual laboratory focused on bringing together the national community around their research data and software. Principally this means imaging techniques including optical microscopes. CT, MRI, Cryo Electron microscopy and other non-traditional techniques. As it turns out Characterisation is very general, but does have two principal commonalities:
- Data sets are getting larger every day (CryoEM ~2-5TB per dataset, LLSM ~1-10TB per dataset). They are becoming too large for the average workstation and difficult for enterprise IT providers within universities.
- Many data processing tools take the form of desktop applications, requiring interactivity and input from domain experts.
Rather than building a dedicated web interface to a single workflow, the CVL has chosen to provide access to a virtual laboratory with all of the techniques needed by the range of characterisation communities. In this demonstration we will show how easy access to virtual laboratories (science gateway) has impacted the Australian characterisation community as well as explaining the first and second generations of architecture, used and how it can be reused by other computing facilities to benefit their users.

Presenters

Lance Wilson

Chris Hines

Jafar Lie

Tuesday September 24, 2019 3:30pm - 3:50pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Cloud & HPC

Company Demo

3:50pm PDT

SciServer: Bringing Analysis to Petabytes of Scientific Data

SciServer is a free science gateway that offers access to more than five Petabytes of data across multiple science domains, along with free online tools to analyze, share, and publish results.
SciServer’s online services are entirely browser-based, with no software to install and configure, and are designed to be easy to learn and use. They include familiar user interface components for managing and sharing files, creating groups, and running computational analysis in Python, R, or Matlab by means of Jupyter Notebooks or RStudio.

The SciServer project grew out an existing system designed to support astronomy research, featuring several research and education tools that made access to hundreds of Terabytes of astronomical data easy and intuitive for researchers, students, and the public. One component of the previous system was Galaxy Zoo, a citizen science project that resulted in reliable classifications of hundreds of thousands of galaxy images, and led to more than 40 peer-reviewed scientific publications.

The current SciServer system has scaled out these tools for multi-science-domain support, applicable to any form of data. SciServer has been used in a variety of fields, from oceanography to mechanical engineering to social sciences and finance.

SciServer features a learning environment that is being used in K-12 and university education in a variety of contexts, both formal and informal. We have continued to develop the educational tools into a new component called Courseware which allows a classroom or course project to be defined giving teachers and students direct access to hosted scientific data sets.
SciServer has sufficiently impressed some of our collaborators that three of them have taken the system and deployed it for themselves for use in varied environments. To facilitate this, over the past year we redeveloped the packaging and deployment model to support deployment in Kubernetes clusters. This work then led us to a new deployment of the system in the Amazon Cloud on their EKS platform. This latter installation is allowing us to experiment with the issues around data hosting and data transfer in a hybrid-cloud environment, and how best to support integration of user data between local and cloud hosted data sets.

SciServer is being developed by the Institute for Data-Intensive Engineering and Science (IDIES) at Johns Hopkins University, with funding from a five-year award from the National Science Foundation.
Submitted:

Presenters

Joseph Booker

Camy Chhetri

Jai Won Kim

Lemson Gerard

Dmitry Medvedev

M. Jordan Raddick

Michael Rippin

Manuchehr Taghizadeh-Popp

Aniruddha Thakar

Tuesday September 24, 2019 3:50pm - 4:10pm PDT
Toucan Room, Catamaran Resort

Concurrents B, Cloud & HPC

Company Demo

10:30am PDT

Tapis-CHORDS Integration: Time-Series DataSupport in Science Gateway Infrastructure

The explosion of IoT devices and sensors in recentyears has led to a demand for efficiently storing, processing andanalyzing time-series data. Geoscience researchers use time-seriesdata stores such as Hydroserver, VOEIS and CHORDS. Many ofthese tools require a great deal of infrastructure to deploy andexpertise to manage and scale. Tapis’s (formerly known as Agave)platform as a service provides a way to support researchers ina way that they are not responsible for the infrastructure andcan focus on the science. The University of Hawaii (UH) andTexas Advanced Computing Center (TACC) have collaboratedto develop a new API integration that combines Tapis withthe CHORDS time series data service to support projects atboth institutions for storing, annotating and querying time-seriesdata. This new Streams API leverages the strengths of both theTapis platform and CHORDS service to enable capabilities forsupporting time-series data streams not available in either toolalone. These new capabilities may be leveraged by Tapis poweredscience gateways with needs for handling spatially indexed time-series data-sets for their researchers as they have been at UHand TACC.

Presenters

Sean Cleveland

Anagha Jamthe

Smruti Padhy

Joe Stubbs

Jeaime Powell

Michael Daniels

Gwen A. Jacobs

Suzanne A. Pierce

Wednesday September 25, 2019 10:30am - 10:50am PDT
Toucan Room, Catamaran Resort

Concurrents A, Time Series

10:50am PDT

Streamed Data via Cloud-Hosted Real-Time Data Services for the Geosciences as an Ingestion Interface into the Planet Texas Science Gateway and Integrated Modeling Platform

By the year 2050, the state of Texas is forecast to increase in population from 28 million to nearly 55 million residents. As a result, the effects of present utilization in the sustainability of natural resources (water, energy, and land-use) must be modeled and made available to policymakers. The Planet Texas 2050 (PT2050) project is designed to address knowledge and information needed to inform and support resilient responses in the face of identified vulnerabilities.

The DataX Science Gateway is in development as part of the PT2050 initiative, to provide a platform through which scientists, data analysts, and policymakers collaborate to generate cross-disciplinary environmental models. The scientists and analysts creating the hybridized models will have unique access to both datasets, workflow generation tools, and collaborators historically partitioned across disciplines. The DataX Gateway enables the ingestion, data transformations and composition of integrated models. Core capabilities within the data portal include tools for assimilating disparate datasets, pre-processing data sources for inclusion in integrated models, and sharing through the community with access to large scale resources including storage, and computational capabilities at the Texas Advanced Computing Center.

Generally, integrated models use static datasets. The purpose of this research was to explore a method by which real-time in-situ environmental edge monitoring systems could stream data into backend models for processing. The real-time data serves as a ground truth source of information for models and expands the spectrum of possible use cases the DataX Gateway could support. The Cloud-Hosted Real-time Data Services for the Geosciences project, funded by the EarthCube program at NSF, was implemented within the DataX platform from an edge sensor point of view. Non-standard utilization of the application programming interface (API) for the ingestion of prior/non-streamed datasets was also addressed as a possible use case. Future work aims to create a data streaming to data frame workflow as an approach for connecting real-time or near real-time data with integrated models at scale. Challenges include addressing authentication and data confidentiality for potential users, as well as data collection at scale limitations.

Early implementation and testing of data streaming in the gateway has demonstrated that the capabilities of the API exceed standard data streaming. When viewed as a core service, CHORDS becomes a method by which datasets can be added to the DataX platform while providing both standardized geoscience naming schemes as well as direct pipelines into integrated model workflows.

Presenters

Jeaime Powell

Sean Cleveland

Joe Stubbs

Suzanne A. Pierce

Michael Daniels

Wednesday September 25, 2019 10:50am - 11:10am PDT
Toucan Room, Catamaran Resort

Concurrents A, Time Series

11:15am PDT

Featured Interactive Presentation: Engaging Presentation Skills

You’ve got solid data, but are people really listening to you?

For everything from seeking funding to motivating stakeholder action, communication matters. You’ve got the substance, but if it’s not presented with the proper form, structure, and style, it’s not going to be understood. I’m here to help you with this.

This brief, interactive talk will introduce you to simple techniques designed to engage audiences and make you a more effective communicator. I’ve worked with a wide range of scientists from the Jet Propulsion Labs in Pasadena to doctors at major hospitals. I’m eager to share this training with you.

Brian Palermo is a career professional actor who’s been training scientists and science communicators for a decade. https://www.palermoscienceimprov.com/

Presenters

Brian Palermo

Palermo Improv Training

Brian Palermo is an engaging actor with an impressive resume of performances in television, film and top comedy venues. He graduated from the University of New Orleans with a degree in Drama and Communications. He has been a performer and teacher with The Groundlings Theatre, Los... Read More →

Wednesday September 25, 2019 11:15am - 12:00pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Featured Presentation

1:00pm PDT

ROTDIF-web and ALTENS: GenApp-based Science Gateways for Biomolecular Nuclear Magnetic Resonance (NMR) Data Analysis and Structure Modeling

Proteins and nucleic acids participate in essentially every biochemical process in living organisms, and the elucidation of their structure and motions is essential for our understanding how these molecular machines perform their function. Nuclear Magnetic Resonance (NMR) spectroscopy is a powerful versatile technique that provides critical information on the molecular structure and dynamics. Spin-relaxation data are used to determine the overall rotational diffusion and local motions of biological macromolecules, while residual dipolar couplings (RDCs) reveal local and long-range structural architecture of these molecules and their complexes. This information allows researchers to refine structures of proteins and nucleic acids and provides restraints for molecular docking. Several software packages have been developed by NMR researchers in order to tackle the complicated experimental data analysis and structure modeling. However, many of them are offline packages or command-line applications that require users to set up the run time environment and also to possess certain programming skills, which inevitably limits accessibility of this software to a broad scientific community. Here we present new science gateways designed for NMR/structural biology community that address these current limitations in NMR data analysis. Using the GenApp technology for scientific gateways (https://genapp.rocks), we successfully transformed ROTDIF and ALTENS, two offline packages for bio-NMR data analysis, into science gateways that provide advanced computational functionalities, cloud-based data management, and interactive 2D and 3D plotting and visualizations. Furthermore, these gateways are integrated with molecular structure visualization tools (Jmol) and with gateways/engines (SASSIE-web) capable of generating huge computer-simulated structural ensembles of proteins and nucleic acids. This enables researchers to seamlessly incorporate conformational ensembles into the analysis in order to adequately take into account structural heterogeneity and dynamic nature of biological macromolecules. ROTDIF-web offers a versatile set of integrated modules/tools for determining and predicting molecular rotational diffusion tensors and model-free characterization of bond dynamics in biomacromolecules and for docking of molecular complexes driven by the information extracted from NMR relaxation data. ALTENS allows characterization of the molecular alignment under anisotropic conditions, which enables researchers to obtain accurate local and long-range bond-vector restraints for refining 3-D structures of macromolecules and their complexes. We will describe our experience bringing our programs into GenApp and illustrate the use of these gateways for specific examples of protein systems of high biological significance. We expect these gateways to be useful to structural biologists and biophysicists as well as NMR community and to stimulate other researchers to share their scientific software in a similar way.

Presenters

Yuexi Chen

cheol jeong

Alexey Savelyev

Susan Krueger

Joseph Curtis

Emre Brookes

David Fushman

Wednesday September 25, 2019 1:00pm - 1:20pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Frameworks

1:20pm PDT

EarthCube Data Discovery Studio, an integration of a semantically enhanced cross-disciplinary catalog with JupyterHub to enable an analytical workbench

EarthCube Data Discovery Studio (DDStudio) works to integrate resources described by metadata with analytical platforms. DDStudio has harvested over 1.6 million metadata records from over 40 sources, enhanced them via an augmentation pipeline, created a catalog, and provided an interface that allows users to explore the data via Jupyter Notebooks. DDStudio utilizes a scalable metadata augmentation pipeline designed to improve and re-index metadata content using text analytics and an integrated geoscience ontology. Metadata enhancers automatically add keywords and related ontology references that describe science domains, geospatial features, measured variables, equipment, geoscience processes, and other characteristics, thus search and discovery of semantically indexed datasets. In the pipeline, we enhance spatial and temporal extents, and organization identifiers, enabling faceted browsing by these parameters. The pipeline also generates provenance for each enhanced metadata document, publishes the metadata using schema.org markup, lets users validate or invalidate metadata enhancements, and enables faceted search. Users are permitted to upload metadata descriptions for resources not already in the catalog and have them immediately available within the search interface. DDStudio and the Jupyter Hubs are loosely coupled and communicate via a simple interface we call a dispatcher. Users can search for datasets in DDStudio by utilizing text, search facets, and geospatial and temporal filters. Researchers can collect records of interests into collections, save the collections for further use, and share collections of resources with collaborators. From DDStudio, users can launch Jupyter notebooks residing on several JupyterHubs for any metadata record, or a built collection of metadata records. The dispatcher seeks to identify appropriate resources to utilize in visualization, analysis or modeling, thus bridging resource discovery with more in-depth data exploration. Users can contribute their own notebooks to process additional types of data indexed in DDStudio. DDStudio demonstrates how linking search results from the catalog directly to software tools and environments reduces time to science in a series of examples from coral reef and river geochemistry studies. DDStudio has worked with SGCI to enhance its process and utility with centralized authentication, security analysis, and outreach to user communities. URL: datadiscoverystudio.org

Presenters

David Valentine

Ilya Zaslavsky

Meier Ouida

Stephen Richard

Karen Stocks

Jeffrey Grethe

Amarnath Gupta

Burak Ozyurt

Bernhard Peucker-Ehrenbrink

Wednesday September 25, 2019 1:20pm - 1:40pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Frameworks

1:40pm PDT

Rebasing HUBzero tool middleware from OpenVZ to Docker

HUBzero middleware has been based on OpenVZ container technology for a decade. This provided very powerful control and customization options with light resource utilization, ahead of alternatives at the time. However, OpenVZ 6 will reach end of life in November 2019. The next version, OpenVZ 7, is substantially different than its predecessors. Architecturally, OpenVZ 7 is becoming its own Linux distribution with limited support for the previous container management with "simfs". Adapting the HUBzero middleware to simfs under OpenVZ 7 resulted in a loss of quota management. HUBzero tool development under OpenVZ, as well as testing the entire HUBzero software stack, has been problematic because it required people to install a different kernel than the one provided by their distribution; under OpenVZ 7, having to install a specific distribution would make the problem even worse. The HUBzero middleware also required that all tools use the same tool template, so upgrades to the tool template necessitated synchronized upgrades and retesting of all tools.

Meanwhile, Docker emerged as a popular choice for creating, sharing and deploying containers. Docker isn't tied to a specific Linux distribution, and is easier to install and use than OpenVZ. Having the entire HUBzero software stack, not just the middleware, redeployed as Docker containers would ease testing, development, adoption and deployment. However, there were several challenges to doing so. One is that by default Docker heavily manages the host firewall, conflicting with the management performed by the HUBzero middleware, which also interacts extensively with the host firewall. That Docker functionality is optional, but enabled by default and normally expected to be functional. We didn't want to disable the Docker firewall functionality, as that may be surprising and cause compatibility issues. The second challenge was trying to separate the X11 server and related services, from the tools themselves, which used to all be located in the same OpenVZ container. Doing so creates flexibility and makes sense as newer tools tend to emit HTML directly and do not require an X11 server. It also makes tool containers smaller and more easily shared and managed. The third challenge is an ongoing one, which is to evaluate the security implications of using Docker instead of OpenVZ and develop better assurances based on gathered experience and evidence.

Presenters

David Benham

Pascal Meunier

Wednesday September 25, 2019 1:40pm - 1:50pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Frameworks

1:50pm PDT

vDef-Web: A Case-Study on Building a Science Gateway Around a Research Code

Many research codes assume a user’s proficiency with high-performance computing tools, which often hinders their adoption by a community of users.
Our goal is to create a user-friendly gateway to allow such users to leverage new capabilities brought forward to the fracture mechanics community by the phase-field approach to fracture, implemented in the open source code vDef.

We leveraged popular existing tools for building such frameworks: Agave, Django, and Docker, to build a Science Gateway that allows a user to submit a large number of jobs at once.
We use the Agave framework to run jobs and handle all communications with the high-performance computers, as well as data sharing and tracking of provenance.
Django was used to create a web application.
Docker provided an easily deployable image of the system, simplifying setup by the user.

The result is a system that masks all interactions with the high-performance computing environment and provides a graphical interface that makes sense for scientists.
In the common situation of parameter sweeps our gateway also helps the scientists comparing outputs of various computations using a matrix view that links to individual computations.

Presenters

Alex

Blaise Bourdin

Steven Brandt

Wednesday September 25, 2019 1:50pm - 2:00pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Frameworks

2:00pm PDT

Simplifying Natural Hazards Engineering Research Data with an Interactive Curation Process

The organization of complex datasets, experiments, and simulations into readable and reusable information is a challenging task in the realm of Natural Hazards Engineering Research. The varying types and formats of data collected in the field and in laboratory research is difficult to present due to the scale and complexity of these procedures. Procedures performed in experimental facilities use a variety of unique tools to simulate natural disasters. These tools have differing setups as well as various configurations of sensors and sensor types. In addition, each sensor type may have its own output. When reproducing or referencing these procedures, it is difficult to interpret how each piece of information relates to the whole. Finally, there is a need to store data while in the field, collaborate with other researchers, visualize their results, and curate their findings in a place that can be easily referenced and reused by other members in the community.
DesignSafe is a Science Gateway that aims to enable natural hazards researchers by addressing these issues in an easy-to-use web interface. It provides researchers the ability to collaborate on projects in a shared workspace and publish their data through an interactive curation process. Allowing researchers and engineers the ability to accurately portray relationships between their predictions, procedures, and results greatly improves the readability and reusability of their findings. To do this, we’ve collaborated with several research groups and universities to develop a series of standardized yet flexible models that researchers use to structure their projects. In doing so, researchers can publish large and complex procedures in a way that is simple to interpret, cite, and reuse. This capability known as the DesignSafe curation process.
This paper will focus more specifically on how this curation process was developed and implemented. It will expand on challenges implementing this process and future work being planned to further improve the pipeline.

Presenters

Keith Strmiska

Josue Coronel

Joseph Meiring

Sal Tijerina

Harika Gurram

Craig Jansen

Maria Esteva

Ellen Rathje

Dan Stanzione

Wednesday September 25, 2019 2:00pm - 2:20pm PDT
Toucan Room, Catamaran Resort

Concurrents A, Data Management

2:20pm PDT

Protecting integrity and provenance of research data with the Open Science Chain

Facilitating the future reuse of data is critical to the advancement of research. Researchers need the ability to independently validate the authenticity of scientific datasets and track the provenance information to extend or build upon prior research. The National Science Foundation funded Open Science Chain project is building a cyberinfrastructure solution to enable a broad set of researchers to efficiently verify and validate the authenticity of scientific datasets and share metadata including detailed provenance information in a secure manner. In this demonstration, we will show how science gateway users can benefit from utilizing the Open Science Chain cyberinfrastructure to enhance the trustworthiness of their data.

Presenters

Company Demo