HPC Operational Management System 2017-10-19T16:07:46+00:00

HPC Operational Management System (HOMS)

The HPC Operational Management System, commonly referred to as HOMS, is a collection of tools and methodologies, organized into focus areas, that supports the HPC ecosystem for both the HPC management team and the research community. A High Performance Computing environment is a set of interdependent components that interact to form a single complex research ecosystem. HOMS is a multi-module service that assumes operational management of the HPC environment while enhancing key strategic areas: day-to-day cluster operations and maintenance; domain-specific user community support; collaboration and accountability within the user base; and increased cluster usage within the user community through education and tuning.

Why have many of the top Life Science research institutions started using DST resources and the HOMS process?

  • Senior level skills are expensive and hard to maintain – DST delivers consistent focused expertise that the institution no longer has to manage on a day-to-day basis.
  • Depth of skillsets and personnel – DST affords our clients the ability to never miss a step when a key employee leaves the business, goes to training, or takes vacation. HOMS offers a deep bench of HPC skills spread across a diverse team of individuals who are not just HPC experts but have worked in Life Science and understand its unique requirements and workloads.
  • Accountability – HOMS provides on-boarding with complete documentation of the physical and logical environment as well as the critical workflows. A customized support ticketing system is employed to provide an audit trail on every interaction with the user community, the institution’s HPC team, and the vendor community. HOMS delivers reports, real-time dashboards and weekly operational reviews with your team to keep communication and expectations in full view. Finally, DST closes every ticket with a “How are we doing?” customer satisfaction survey so that you have data on your user community’s experience with cluster services and DST support.
  • Support advocacy – Often our clients’ internal HPC teams are multi-tasking, and operating the cluster is one among several responsibilities. HOMS focuses on interaction with the user community, addressing issues as they arise. DST works with vendors and partners to open tickets, manages them to resolution through the provider’s system, and raises a flag when vendors are not responsive or comprehensive in their solution, which keeps vendors accountable. Because we manage many environments, the team of professionals at DST is highly regarded by vendors and able to work closely with them to provide speedy and complete resolution with significant input in the process.
  • Collaboration and Education – With HOMS, DST creates a customized wiki to help educate and empower the user community. Topics including cluster operations, policies, workflows, component functions, FAQs and how-tos are published during the on-boarding process and enhanced on a weekly basis. Further, we offer wiki FAQs customized to the site and community forums for sharing.
  • Quarterly Onsite Reviews – HOMS delivers personalized quarterly onsite reviews to look at: next steps in the end user educational process, cluster enhancements, aggregate cluster performance reports, and ticket response reports. User educational forums are available on-site during the on-boarding kick-off and, upon request, at the quarterly meeting as an adjoining event.

Data in Science Technologies offers vendor-agnostic hardware and software services for research compute clusters. We support all parallel file systems but specialize in GPFS and Lustre. We support all major compute providers, all network providers and all storage providers used in research computing. Further, we have experience supporting all major schedulers and all major cluster management tools as well as OpenStack. DST can provide a team of senior level engineers to help your HPC management team, or that team can assume management of the cluster. Our remote sys admin service is agnostic to the components of the cluster, and in most cases we provide an entire support team at a lower cost than a single full-time employee (FTE).

DST works with you to define the parameters of the existing environment that you would like to have us manage. Our services can offer operational management for compute, scripting, cluster and pipeline management, file systems and operational hardware support. The customized support plan is defined by your individual needs.


DST developed a systematic methodology for integrating as an extension of the existing HPC management team. Our methodology is designed to on-board HOMS, offer daily operational support, improve accountability, and increase collaboration. No aspect of the client’s environment is unaccounted for.

  • Step 1: We define the parameters of the existing environment for HOMS in an agreement that outlines all expectations and services for operational management of compute, scripting, cluster and pipeline management, file systems and operational hardware support. The customized support plan is outlined in the initial statement of work (SOW).
  • Step 2: Our onboarding methodology for bringing your cluster into HOMS encompasses the following components:
    • Documentation: physical and logical environment, workflows, and contact information for key stakeholders in the community and the key vendors making up the HPC environment.
    • Credentials: internal credential requirements and external vendor credentials so that DST can open tickets with the vendors when necessary.
    • Customized wiki: through a series of Webex sessions, the DST team will assemble the information necessary to start the wiki. Our detailed deployment of the wiki includes cluster policy statements, documentation, end user forums for education, informative papers, and FAQs.
    • Customized support ticketing system: While the HOMS team will respond within its SLAs, we identify individuals within the organization who also need the ability to monitor activity and occasionally assist on critical events.
    • Cluster Improvement Report: Delivered at the end of the onboarding process, this report outlines immediate efficiencies based on the findings of the documentation phase. Its recommendations, if implemented, should improve performance, space management or operations.
  • Step 3: Assume operational management of the cluster. DST will respond to all end user requests, provide daily, weekly and monthly management functions for the cluster (software updates, patches, etc.), and monitor performance, space utilization and more on a daily basis. Operational management includes:
    • Real time dashboards for systems under management
    • Weekly Reports and review meetings
    • Weekly Wiki enrichment – content adds
    • “How are we doing?” customer satisfaction survey sent after every ticket, with customer satisfaction data aggregated into a report for review on the weekly calls
    • Operational management of system – daily, weekly, monthly by DST
    • Troubleshooting and problem resolution by DST with the HPC vendors. We open tickets and manage them to completion, holding vendors accountable and reporting back to you on an ongoing basis.
  • Step 4: Quarterly onsite review meetings. This interactive engagement delivers an in-depth review of current cluster performance, ticket resolution, vendor issues, and operational efficiencies. We also create a quarterly plan to educate newly added HPC end users scheduled for implementation over the next quarter.


HOMS Focus Areas

The goal of HOMS is to assume the operational management of the HPC cluster as defined in the table below. DST takes responsibility for HPC management and can customize the service to fit your individual requirements. We can assume accountability for operational management in compute, script support, cluster management, file system, and hardware.

Data in Science Technologies (DST): Predictable, Consistent, Reliable HPC Environment

  • Architect HPC WIKI Collaboration – Health, FAQ, Rules, Tutorials, Escalation, SLA Posting, Policies (see Collaboration & Communication). Frequency: One-time
  • Define & Create an HPC ticket process – Logging and escalating (see Collaboration & Communication). Frequency: One-time, Quarterly
  • Define/Refine cluster rules. Frequency: One-time, Quarterly
  • Publish/Republish cluster rules. Frequency: One-time, Quarterly
  • Publish/Republish Service Levels and Response Guidelines. Frequency: One-time, Quarterly
  • Establish remote connectivity. Frequency: One-time
  • Configure GPFS, Lustre, XtreemFS management tools suite. Frequency: One-time
  • Refine configuration (quotas, storage pools and file sets). Frequency: One-time
  • Review performance expectations by department. Frequency: Quarterly
  • Documentation Packet – Fully encompassing documentation of hardware, software and licensing. Frequency: One-time
    • Defining Connectivity, Functionality and Workflow
    • Management alerting and workflow
    • Escalation and resources
    • End user WIKI / FAQ
  • File System (GPFS, Lustre, XtreemFS & ZFS) Optimization to eliminate items like the following. Frequency: Ongoing – Alerts
    • Disk NSD missing or in need of checking
    • Sustained high waiter over threshold (varies per waiter type)
    • NSD waiter over threshold (i.e., reads and writes to disk are taking longer), which raises a different set of messages or triggers
    • Managed disks throwing excessive errors
    • Managed disks exhibiting excessive IO wait or service times, or queue depth too high for too long
    • Underlying storage array (serving GPFS NSDs) events such as failed disk, disk errors, controller errors, infrastructure errors
    • Storage node interconnect errors (SAS, Fibre Channel)
    • Multipath errors on GPFS storage nodes
  • Filesystem Optimization to eliminate items like the following. Frequency: Ongoing – Prescriptive Maintenance & Alerts
    • File system corruption
    • Unable to mount file system on a given node
    • File set reaching inode or capacity threshold (warning)
    • File system reaching inode or capacity threshold (warning)
    • File system pool reaching capacity threshold
    • Local partition on server reaching inode or capacity threshold
    • File system or storage pool reaching or reached throughput threshold
    • User, group or file set quota threshold alerts
  • Networking and Storage Health. Frequency: Ongoing – Alerts
    • Throughput
    • Latency
    • IOPS
  • File System (GPFS, Lustre, XtreemFS or ZFS) Housekeeping. Frequency: Daily
  • Proactive Performance Problem Mitigation Setup. Frequency: Daily
  • General Hardware Support. Frequency: Daily
    • Address any outstanding hardware repair tasks
    • Manage existing support cases with Hardware vendor
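
Several of the alert conditions listed above, such as a waiter staying over threshold or a queue depth that is too high for too long, come down to detecting a sustained breach rather than a single spike. The following is a minimal Python sketch of that idea, not DST's actual tooling. The function name, the sample values, the threshold and the number of consecutive samples are all hypothetical and would need tuning per waiter type or device.

```python
from typing import Iterable


def sustained_breach(samples: Iterable[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a healthy sample resets the streak
    return False


# Hypothetical queue-depth readings taken at a fixed polling interval.
queue_depth = [4, 35, 40, 38, 6, 33, 36]

print(sustained_breach(queue_depth, threshold=32, min_consecutive=3))  # True
print(sustained_breach(queue_depth, threshold=32, min_consecutive=4))  # False
```

Requiring several consecutive samples is one simple way to avoid alerting on short bursts; a time-window average would be an equally reasonable choice.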

DST works with you to design Cluster Workshop sessions and online Q&A platforms to benefit HPC users. Onsite open forums between the HPC team and scientists are encouraged to foster support and cooperation between the two worlds. Common discussion topics raised by scientists in recent sessions include:

  • Could we see real-time utilization of the cluster?
  • Do we get a warning when a login node has dropped a user because limits were exceeded?
  • Do modules support multiple versions? Can we see the contents of a module? Could one module reference another?
  • Can you help me with pipeline code?
  • Can we write documentation to allow users to download and manage data? What if someone leaves?
  • Are there limits on jobs per user?
  • How many jobs are allowed per queue?
  • Could you show us the history of node choices/queue usage over time?
  • How many queues do we currently have? What is the strategy for the future?
  • Can normal jobs be killed by higher-priority jobs? Can we stop/suspend the job instead of killing it?
  • Who is the priority queue intended for?
  • Could a limit be set on the number of priority jobs?

To help accelerate the overall educational experience, DST creates unique documentation for your environment.


Our service philosophy, that we are an extension of your team, demands a collaborative approach to every effort. DST recognizes the different needs of varied stakeholders in each work effort and has established a collaboration methodology and tool set designed to communicate effectively.

Collaboration with the HPC staff – No matter in what capacity the DST team is engaged (HOMS, On Demand Services or Consulting), our team is responsible to the primary HPC stakeholders. Our collaboration tool permits real-time viewing of each task as it is assigned, tracked and, once complete, captured for later reference. We reinforce our efforts with weekly collaboration calls reviewing the work at hand outlined in the collaboration tool. Our process is designed to ensure the DST team always moves in a predictable and visible way: we jointly review our ticketing and project management system with you to make sure we march to the right beat.

Collaboration with the scientists – High Performance Computing is an important tool in the hands of researchers. Researchers and scientists are focused on the exploration of the unknown, and DST sees its job as empowering them on that journey with the design of an educational portal. The educational portal helps scientists get quick answers, training guides, scripts and shortcuts that accelerate their journey. Our process of collaboration begins with an onsite kick-off meeting serving the HPC science community, introducing DST as a team member available to serve the HPC community in any way that furthers the science as it relates to the cluster. DST also makes itself available for one-on-one educational sessions and ongoing seminars for research communities who would like to learn more powerful or efficient ways to use the cluster.

Collaboration with Executives – Every organization has leaders who are responsible for steering the boat. Executives need data for making the right decisions, and DST strives to create a collaborative dialogue with the executives we serve. Collaboration with the executive stakeholders is accomplished by understanding the leader’s strategic objectives and providing ongoing feedback as to how we are contributing to those objectives. To that end DST conducts quarterly review meetings, which are face-to-face sessions where we look back, look up and look forward. We look back by providing reports on all major metrics for the cluster over the past quarter: utilization, uptime, ticket number and response time, user satisfaction, performance and more. We then look up at the current state of the cluster, which involves a review of the ongoing projects, current cluster needs and any outstanding issues. Finally, we look forward by reviewing the executive’s objectives for the upcoming quarter and collaboratively building the plan to accomplish them.

DST works with the HPC team and scientists to agree on important internal policies for protecting the integrity of the HPC cluster and its data contents. Key policy considerations include topics surrounding:

  • Usage
  • Sensitivity categorization
  • Publishing
  • Queue Usage
  • Resource Reservation
  • Software load/compiling
  • Support
  • Documentation

The Wunderdial is a browser-based tool for the visualization of the distribution of filesystem capacity utilization. The age-based heat map offers users the ability to identify asset reclamation opportunities.
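
To make the idea of an age-based view of capacity concrete, the sketch below groups the bytes under a directory tree into age buckets based on each file's last-modified time. It is only an illustration and does not describe how the Wunderdial itself is implemented. The bucket boundaries and function names are hypothetical, and walking a large parallel file system this way would be slow, so a production tool would more likely rely on the file system's own metadata scans.

```python
import os
import time

# Hypothetical age buckets: (label, exclusive upper bound in days).
AGE_BUCKETS = [
    ("<30d", 30),
    ("30-180d", 180),
    ("180-365d", 365),
    (">365d", float("inf")),
]


def bucket_for_age(age_days):
    """Return the label of the first bucket whose upper bound exceeds the age."""
    for label, upper in AGE_BUCKETS:
        if age_days < upper:
            return label
    return AGE_BUCKETS[-1][0]


def capacity_by_age(root, now=None):
    """Sum file sizes (bytes) under `root`, grouped by time since last modification."""
    now = time.time() if now is None else now
    totals = {label: 0 for label, _ in AGE_BUCKETS}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = max(0.0, (now - st.st_mtime) / 86400)
            totals[bucket_for_age(age_days)] += st.st_size
    return totals
```

Calling `capacity_by_age` on a project directory returns a dictionary of bytes per age bucket; large totals in the oldest bucket are the kind of reclamation candidates a heat map would highlight.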

Some of the HOMS tools are customized for your unique environment to report on the key attributes of the cluster, complete with a ticketing system. DST actively optimizes the heart of your HPC environment, including:

  • Shared storage health
  • Parallel file system health
  • Compute node – Operating system updates
  • Proactive problem mitigation
  • Capacity planning/reporting
  • Cluster networking
  • Scheduler (optimizations)

HOMS delivers a Cluster dashboard customized for your environment to include some of the items below.

Jobs running and CPU Use


Metrics for GPFS

  • GPFS Cluster Health (availability) per node
  • Bandwidth per NSD server
  • Long waiters per NSD server
  • Bandwidth per Data Client
  • Long waiters per Data Client
  • Filesystem capacity
  • Pool Capacity
  • Fileset Capacity
  • User and Group Capacity
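
Capacity metrics like these typically feed threshold alerts for file systems, pools, filesets and quotas. As a rough sketch only, the Python below classifies a usage figure against warning and critical thresholds. The `Usage` type, the fileset names and the 80%/95% thresholds are assumptions made for illustration, not values taken from HOMS, and the same check could be applied to inode counts as well as bytes.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    name: str   # hypothetical fileset, pool or user name
    used: int   # bytes (or inodes) consumed
    limit: int  # capacity or quota in the same unit


def classify(usage, warn=0.80, crit=0.95):
    """Return 'ok', 'warning' or 'critical' for a usage figure."""
    if usage.limit <= 0:
        return "unknown"  # no quota or capacity defined
    fraction = usage.used / usage.limit
    if fraction >= crit:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


# Hypothetical filesets with usage and quota in arbitrary units.
filesets = [
    Usage("genomics", used=850, limit=1000),
    Usage("imaging", used=990, limit=1000),
    Usage("scratch", used=100, limit=1000),
]

for fs in filesets:
    print(fs.name, classify(fs))
```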

SFA Metrics

  • SFA12K Pool Health
  • SFA12K Controller Health
  • SFA12K Physical Disk Health
  • SFA12K IOM and DEM Health

LSF Metrics

  • Failed jobs
  • Preempted jobs
  • Queued jobs
  • Running jobs
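
The four job counts above can be derived from scheduler job states. The sketch below tallies a list of LSF job status strings (the values shown in the STAT column of `bjobs`) into those dashboard categories. The mapping is an assumption made for illustration: in particular, treating `SSUSP` (suspended by the system) as "preempted" is an approximation, since LSF can suspend jobs for other reasons, and the sketch works on already-collected status strings rather than calling the scheduler.

```python
from collections import Counter

# Assumed mapping from LSF job status to dashboard metric.
STATE_TO_METRIC = {
    "RUN": "running",
    "PEND": "queued",
    "EXIT": "failed",
    "SSUSP": "preempted",  # approximation: system suspension is not always preemption
}


def summarize(states):
    """Count jobs per dashboard metric; states without a mapping are ignored."""
    counts = Counter({metric: 0 for metric in STATE_TO_METRIC.values()})
    for state in states:
        metric = STATE_TO_METRIC.get(state)
        if metric is not None:
            counts[metric] += 1
    return dict(counts)


# Hypothetical snapshot of job states.
snapshot = ["RUN", "RUN", "PEND", "EXIT", "DONE", "SSUSP"]
print(summarize(snapshot))
```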

Accountability is the cornerstone of DST’s methodology. Our philosophy is for the DST team to become an extension of you and your team and as such we are accountable to you for everything we do. Further, we believe that accountability is fundamental to the success of any endeavor so our tools and disciplined practices provide as close to complete visibility as is possible. Our process brings transparency into the work engagement, the systems we are working on and the time we are allocating to assist you.

The image to the left is a typical customer satisfaction email. A simple “How would you rate the support you received?” with the response options “Good, I’m Satisfied” or “Bad, I’m not satisfied” promotes feedback and eliminates lengthy surveys.