SDSC Summer Institute 2014

Program/Schedule

August 4 – 8, 2014
SDSC Auditorium at UC San Diego

Required Materials
Summer Institutes are designed to be hands-on, so participants are expected to bring a laptop computer to follow along with demos and hands-on instruction throughout the program.

Piazza

We will be using Piazza to help with asynchronous communications during and following the workshop.

Open the Piazza 'HPC Meets Big Data' site.

AGENDA

Day 1 - Monday
Day 2 - Tuesday
Day 3 - Wednesday
Day 4 - Thursday
Day 5 - Friday

MONDAY, August 4
Morning
8:00 – 8:30	Registration, Coffee
8:30 – 8:45	Welcome Mike Norman, SDSC Director
8:45 – 9:30	Introduction, Orientation Bob Sinkovits, Interim Director for Scientific Computing Applications, SDSC
9:30 – 10:15	How do I launch and manage jobs on the system? Mahidhar Tatineni, User Services Manager, SDSC View Slides View Recording
10:15 – 10:45	Break
10:45 – 12:15	Launching and Managing Jobs Mahidhar Tatineni, User Services Manager, SDSC View Slides View Recording - Part 1 View Recording - Part 2
12:15 – 1:30	Lunch at Café Ventanas
Afternoon
1:30 – 3:00	How do I manage my data on the file system? Rick Wagner, HPC Systems Manager, SDSC View Slides View Recording
3:00 – 3:30	Break
3:30 – 5:00	How do I know I’m making effective use of the machine? Bob Sinkovits, Interim Director for Scientific Computing Applications, SDSC View Slides View Recording
5:30 – 8:30	Reception at Wayne Pfeiffer’s home overlooking the Pacific. Sweater or jacket recommended. Shuttle provided from SDSC driveway.

TUESDAY, August 5
Morning	Coffee
8:30 – 10:00	How do I ensure reproducibility? Shweta Purawat, New User Applications Specialist, SDSC View Slides
10:00 – 10:15	Break
10:15 – 12:15	How do I manage my software? Andrea Zonca, HPC Applications Support Specialist, SDSC View Slides View Notes for Github Hands-On
12:15 – 1:30	Lunch at Café Ventanas
Afternoon
1:30 – 3:30	How do I handle my “big data” problem? Amarnath Gupta, Director, Advanced Query Processing Lab, SDSC View Slides Demos/Hands-on: Exercises with graph data, e.g. DB2 RDF, GraphLab Chris Condit, Sr. Scientific Information Engineer View Slides (TBD)
3:30 – 3:45	Break
3:45 – 5:00	Hands-on practice continues with mentors available for questions

WEDNESDAY, August 6
Morning	Coffee
8:30 – 10:30	How do I mine and get insight from my data? Natasha Balac, Director, Predictive Analytics Center of Excellence, SDSC Amit Chourasia, Senior Visualization Scientist, SDSC View Natasha's Slides View Amit's Slides
10:30 – 10:45	Break
10:45 – 11:45	Data Center Tour
11:45 – 1:00	Lunch at Café Ventanas

WEDNESDAY Afternoon – Parallel Sessions

Track 1: Auditorium
Track 2: Synthesis Center E-B143

Session One: 1:00 – 5:00 with break

Parallel Computing using MPI & OpenMP
Mahidhar Tatineni, User Services Manager, SDSC

This session is targeted at attendees who are looking for a hands-on introduction to parallel computing using MPI and OpenMP programming. The session will start with an introduction and basic information for getting started with MPI. It will provide an overview of the common MPI routines that are useful for beginner MPI programmers, including MPI environment setup, point-to-point communications, and collective communications routines. Simple examples illustrating distributed-memory computing with common MPI routines will be covered. The OpenMP section will provide an overview of constructs and directives for specifying parallel regions, work sharing, synchronization, and data scope. Simple examples will be used to illustrate the OpenMP shared-memory programming model and important runtime environment variables. Hands-on exercises for both MPI and OpenMP will be done in C and FORTRAN.

View Slides

Download Sample Files

Predictive Analytics
Natasha Balac, Director, Predictive Analytics Center of Excellence, SDSC

Nicole Wolter, PACE, SDSC

This session is designed as an introduction for attendees seeking to extract meaningful predictive information from within massive volumes of data. The session will provide an introduction to the field of predictive analytics and a variety of data analysis tools to discover patterns and relationships in data that can contribute to building valid predictions.

View Nicole's Slides

View Paul's Slide Set 1

View Paul's Slide Set 2

THURSDAY, August 7 Morning – Parallel Sessions
Morning	Coffee
	Track 1 Auditorium	Track 2 Synthesis Center E-B143
Session Two 8:30 – 12:15 with break	Performance Optimization Bob Sinkovits, Interim Director for Scientific Computing Applications, SDSC This session is targeted at attendees who both do their own code development and need their calculations to finish as quickly as possible. We'll cover the effective use of cache, loop-level optimizations, force reductions, optimizing compilers and their limitations, short-circuiting, time-space tradeoffs, and more. Exercises will be done mostly in C, but emphasis will be on general techniques that can be applied in any language. View Slides	Scalable Data Management Amarnath Gupta, Director, Advanced Query Processing Lab, SDSC This session will take an in-depth tour of large and complex data science problems that need one or more data management software systems. It will take a case-study-based approach and present three real-world problems from three different application domains. Each case will have a short oral introduction followed by a longer hands-on session using state-of-the-art scalable data management software. GraphLab Slides
12:15 – 1:30	Lunch at Café Ventanas
Afternoon – Parallel Sessions
Session Three 1:30 – 5:00 with break	Hadoop for Scientific Computing Mahidhar Tatineni, User Services Manager, SDSC This session will begin by providing a hands-on overview of Hadoop, its application ecosystem, and how the map/reduce paradigm can be applied to solve problems in scientific computing. Starting with a conceptual overview of Hadoop and HDFS, attendees will write simple but powerful map/reduce applications in Python, R, or Perl and learn how to adapt their existing analysis codes to work within the Hadoop framework. Some of the tools built upon Hadoop, such as Pig and Mahout, will be discussed in the context of expanding the capabilities of high-performance computing, and attendees will gain hands-on experience using these tools to manipulate and analyze large scientific data sets. View Slides	Visualization Amit Chourasia, Senior Visualization Scientist Visualization is largely understood and used as an excellent communication tool by researchers. This narrow view often keeps scientists from fully using and developing their visualization skillset. This tutorial will provide a “from the ground up” understanding of visualization and its utility in error diagnostics and exploration of data for scientific insight. When used effectively, visualization can provide a complementary and effective toolset for data analysis, which is one of the most challenging problems in computational domains. In this tutorial we plan to bridge these gaps by providing end users with fundamental visualization concepts, execution tools, customization, and usage examples. Finally, a short introduction to SeedMe.org will be provided, where users will learn how to share their visualization results ubiquitously. View Slides

THURSDAY, August 7 - Continued
5:30 – 8:30	Beach BBQ at La Jolla Shores Hotel. Sweater or jacket recommended. Shuttle provided from SDSC driveway.

FRIDAY, August 8
Morning	Coffee
9:00 – 10:00	HPC Meets Big Data Rick Wagner, HPC Systems Manager, SDSC
10:00 – 11:00	Lightning Rounds Download PPT Template
11:00 – 11:30	Wrap up
11:30	Adjourn. Thank you for attending; we hope you enjoyed the week! (To-go box lunches will be available.)

Parallel Sessions

HPC Meets Big Data: Things Everyone Should Know

How do I launch and manage jobs on the system?
How do I manage my data on the file system?
How do I handle my “big data” problem?
How do I mine and get insight from my data?
How do I ensure reproducibility?
How do I know that I'm making effective use of the machine?
How do I manage my software?

HPC Meets Big Data Parallel Sessions: Deep Dive

Hadoop for Scientific Computing
Parallel Computing using MPI & OpenMP
Performance Optimization
Predictive Analytics
Scalable Data Management
Visualization
Workflow Management
One-on-One Consulting

Hadoop for Scientific Computing: This session will begin by providing a hands-on overview of Hadoop, its application ecosystem, and how the map/reduce paradigm can be applied to solve problems in scientific computing. Starting with a conceptual overview of Hadoop and HDFS, attendees will write simple but powerful map/reduce applications in Python, R, or Perl and learn how to adapt their existing analysis codes to work within the Hadoop framework. Some of the tools built upon Hadoop, such as Pig and Mahout, will be discussed in the context of expanding the capabilities of high-performance computing, and attendees will gain hands-on experience using these tools to manipulate and analyze large scientific data sets.
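
As a rough illustration of the map/reduce paradigm mentioned above, the following is a minimal sketch that runs entirely in local Python, with no Hadoop cluster involved. It is not taken from the session materials; the function names (mapper, reducer, run_wordcount) and the word-count task are assumptions chosen for illustration. The sort step stands in for the shuffle that a framework such as Hadoop would perform between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: combine all counts emitted for a single word."""
    return word, sum(counts)


def run_wordcount(lines):
    """Run map, shuffle/sort, and reduce locally on an iterable of lines.

    Hypothetical local driver; a real framework would distribute each phase.
    """
    # Map: apply the mapper to every line and collect the emitted pairs.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    sample = ["big data meets hpc", "hpc meets big data again"]
    print(run_wordcount(sample))
```

Keeping the mapper and reducer free of shared state is what would allow a framework to run many copies of each in parallel on different parts of the input.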

Parallel Computing using MPI & OpenMP: This session is targeted at attendees who are looking for a hands-on introduction to parallel computing using MPI and OpenMP programming. The session will start with an introduction and basic information for getting started with MPI. It will provide an overview of the common MPI routines that are useful for beginner MPI programmers, including MPI environment setup, point-to-point communications, and collective communications routines. Simple examples illustrating distributed-memory computing with common MPI routines will be covered. The OpenMP section will provide an overview of constructs and directives for specifying parallel regions, work sharing, synchronization, and data scope. Simple examples will be used to illustrate the OpenMP shared-memory programming model and important runtime environment variables. Hands-on exercises for both MPI and OpenMP will be done in C and FORTRAN.

Performance Optimization: This session is targeted at attendees who both do their own code development and need their calculations to finish as quickly as possible. We'll cover the effective use of cache, loop-level optimizations, force reductions, optimizing compilers and their limitations, short-circuiting, time-space tradeoffs and more. Exercises will be done mostly in C, but emphasis will be on general techniques that can be applied in any language.
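
Two of the techniques named above, time-space tradeoffs and short-circuiting, can be sketched in a few lines. This example is an assumption made for illustration and does not come from the session exercises (which the abstract says are mostly in C); the bit-counting task and all function names are hypothetical. Python is used here only because the abstract stresses that the techniques apply in any language.

```python
# Time-space tradeoff: spend 256 table entries once so that later calls
# can count bits a whole byte at a time instead of one bit at a time.
BITS_IN_BYTE = [bin(value).count("1") for value in range(256)]


def popcount_loop(n):
    """Count set bits of a non-negative integer one bit at a time."""
    count = 0
    while n:
        count += n & 1
        n >>= 1
    return count


def popcount_table(n):
    """Count set bits of a non-negative integer using the lookup table."""
    count = 0
    while n:
        count += BITS_IN_BYTE[n & 0xFF]
        n >>= 8
    return count


def has_even_parity(n):
    """Hypothetical 'expensive' check, standing in for any slow predicate."""
    return popcount_table(n) % 2 == 0


def accept(n):
    """Short-circuiting: the cheap range test is evaluated first, so the
    costlier parity check is skipped whenever the range test fails."""
    return 0 <= n < 1_000_000 and has_even_parity(n)


if __name__ == "__main__":
    print(popcount_loop(1023), popcount_table(1023))
    print(accept(3), accept(7), accept(-5))
```

Ordering the conditions in accept from cheapest to most expensive matters: the range test also guards the table-based routine, which assumes a non-negative argument.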

Predictive Analytics: This session is designed as an introduction for attendees seeking to extract meaningful predictive information from within massive volumes of data. The session will provide an introduction to the field of predictive analytics and a variety of data analysis tools to discover patterns and relationships in data that can contribute to building valid predictions.

Scalable Data Management: This session will take an in-depth tour of large and complex data science problems that need one or more data management software systems. It will take a case-study-based approach and present three real-world problems from three different application domains. Each case will have a short oral introduction followed by a longer hands-on session using state-of-the-art scalable data management software.

Visualization: Visualization is largely understood and used as an excellent communication tool by researchers. This narrow view often keeps scientists from fully using and developing their visualization skillset. This tutorial will provide a “from the ground up” understanding of visualization and its utility in error diagnostics and exploration of data for scientific insight. When used effectively, visualization can provide a complementary and effective toolset for data analysis, which is one of the most challenging problems in computational domains. In this tutorial we plan to bridge these gaps by providing end users with fundamental visualization concepts, execution tools, customization, and usage examples. Finally, a short introduction to SeedMe.org will be provided, where users will learn how to share their visualization results ubiquitously.

Workflow Management: This session will start with a crash course on workflow management basics. We will then explore common computing platforms including Sun Grid Engine, NSF XSEDE high performance computing resources, the Amazon Cloud and Hadoop with an emphasis on how workflow systems can help with rapid development of distributed and parallel applications on top of any combination of these platforms. We will then discuss how to track data flow and process executions within these workflows (i.e. provenance tracking) including the intermediate results as a way to make workflow results reproducible. We will end with a lab session on using Kepler to build, package and share workflows interacting with various computing systems.

One-on-One Consulting: Attendees will have the opportunity to work individually or in small groups directly with SDSC staff. The goal is to help participants overcome the computational challenges and bottlenecks that are limiting the progress of their research projects. We will be available to assist participants with data management, software parallelization, workflow development, and other topics covered in the Summer Institute.
