Amazon SageMaker HyperPod is designed to speed up foundation model (FM) training by removing the burden of managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for extended periods without interruption. A HyperPod cluster is typically shared by several kinds of users, such as ML researchers, software engineers, data scientists, and cluster administrators. Each user needs to edit their own files and run their own jobs without interfering with anyone else's work.
One way to build a multi-user environment is to use the Linux user and group mechanism and statically create the same users on every instance through lifecycle scripts. The drawback is that user and group settings are duplicated on each instance, so they are hard to keep consistent, especially when new team members join.
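As a rough illustration of the static approach, the sketch below builds the `useradd` commands a lifecycle script might run on each instance. The user list, the UID values, and the helper name `useradd_commands` are hypothetical and are not taken from the HyperPod lifecycle scripts; the point is only that every instance has to apply the same list with the same UIDs.

```python
import shlex

# Hypothetical user list: (username, uid). The same list has to be
# applied on every instance so that UIDs match across the cluster.
SHARED_USERS = [("alice", 2001), ("bob", 2002)]


def useradd_commands(users):
    """Build one useradd command per user, pinning the UID explicitly."""
    commands = []
    for name, uid in users:
        commands.append(
            "useradd --create-home --uid {} --shell /bin/bash {}".format(
                int(uid), shlex.quote(name)
            )
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; a real lifecycle
    # script would execute them as root on each instance.
    for command in useradd_commands(SHARED_USERS):
        print(command)
```

Adding a team member means editing this list and reapplying it on every instance, which is the maintenance burden a central directory removes.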
To address this issue, you can use Lightweight Directory Access Protocol (LDAP) or LDAP over TLS/SSL (LDAPS) to integrate the cluster with a directory service such as AWS Managed Microsoft Active Directory (AWS Managed Microsoft AD). Users, groups, and their permissions are then maintained in one central place.
In this post, we present a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, enabling a seamless multi-user login environment with a centrally maintained directory. The solution utilizes AWS services and resources, including AWS CloudFormation to deploy prerequisites for the HyperPod cluster.
In the solution architecture, HyperPod cluster instances connect to AWS Managed Microsoft AD over LDAPS through a Network Load Balancer (NLB). The NLB terminates TLS using a certificate installed on it. A lifecycle script configures System Security Services Daemon (SSSD) on the HyperPod cluster instances to use LDAPS.
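To make the SSSD step more concrete, here is a minimal sketch that renders an `sssd.conf` for an LDAPS connection. The option names are standard `sssd.conf` and `sssd-ldap` settings, but the exact set used by the lifecycle script is not shown in this post, so the selection here is an assumption. The function name `render_sssd_conf`, the domain name `default`, and all example values are hypothetical.

```python
import configparser
import io


def render_sssd_conf(ldap_uri, search_base, bind_dn, obfuscated_password, ca_cert_path):
    """Return sssd.conf text for an LDAP domain reached over LDAPS."""
    # interpolation=None keeps any '%' in the values from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # preserve option names exactly as written

    config["sssd"] = {
        "config_file_version": "2",
        "services": "nss, pam",
        "domains": "default",
    }
    config["domain/default"] = {
        "id_provider": "ldap",
        "auth_provider": "ldap",
        "ldap_schema": "AD",
        "ldap_id_mapping": "True",
        # Assumed to be the NLB endpoint, e.g. ldaps://<nlb-dns-name>
        "ldap_uri": ldap_uri,
        "ldap_search_base": search_base,
        # Read-only AD user that SSSD binds as
        "ldap_default_bind_dn": bind_dn,
        "ldap_default_authtok_type": "obfuscated_password",
        "ldap_default_authtok": obfuscated_password,
        # Certificate used to verify the LDAPS endpoint
        "ldap_tls_cacert": ca_cert_path,
        "ldap_tls_reqcert": "hard",
    }

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

SSSD expects its configuration file to be owned by root with restrictive permissions (typically mode 0600), so a script that writes this output to `/etc/sssd/sssd.conf` would need to set those as well.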
This post assumes that you already know how to create a HyperPod cluster without SSSD. If not, refer to the HyperPod workshop for guidance. You also need a Linux machine to generate a self-signed certificate and to obtain an obfuscated password for the AD reader user.
The setup starts with the prerequisites for the HyperPod cluster: a VPC, subnets, and a security group. Next, AWS Managed Microsoft AD is set up by creating a directory and placing an NLB in front of it. For LDAPS, a self-signed certificate is created and imported into AWS Certificate Manager (ACM).
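The certificate import can be sketched with the ACM `import_certificate` API. This is a minimal example under assumptions, not the post's own procedure: the helper name `import_self_signed_certificate` and the file names are hypothetical, and the client is passed in so the function does not depend on how credentials or the Region are configured.

```python
def import_self_signed_certificate(acm_client, cert_path, key_path):
    """Import a PEM certificate and private key into ACM; return the ARN."""
    with open(cert_path, "rb") as cert_file:
        certificate = cert_file.read()
    with open(key_path, "rb") as key_file:
        private_key = key_file.read()

    # A self-signed certificate has no separate chain, so only the
    # certificate body and its private key are supplied.
    response = acm_client.import_certificate(
        Certificate=certificate,
        PrivateKey=private_key,
    )
    return response["CertificateArn"]


# Example usage (assumes boto3 is installed and AWS credentials are set up;
# the file names are placeholders):
#
#   import boto3
#   arn = import_self_signed_certificate(
#       boto3.client("acm"), "ldaps.crt", "ldaps.key"
#   )
#   print(arn)
```

The returned ARN is what a TLS listener on the NLB would reference as its certificate.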
Finally, an EC2 Windows instance is created so that you can administer the users and groups in the AD.