AWS Says Project Rainier Ushers In New Era of AI Training Superclusters
Amazon Web Services (AWS) announced Project Rainier, a large-scale distributed AI compute cluster designed to deliver record-setting training performance for next-generation models. The system is powered by AWS's new Trainium2 chips and a novel architecture that spans multiple data centers across the U.S. instead of relying on a single massive facility.
At the core of Project Rainier are UltraServers, nodes that combine multiple Trainium2 chips connected by AWS's high-bandwidth NeuronLink fabric and Elastic Fabric Adapter networking, the company said. According to AWS, Rainier will provide up to five times more computing power than Anthropic's current largest training cluster. Anthropic is already using Rainier to train future versions of its Claude model family, underscoring AWS's ambitions to expand its leadership in AI infrastructure.
Trainium2 chips are designed specifically for large-scale model training and include high-bandwidth memory (HBM) and custom compute units optimized for tensor and linear algebra operations. UltraServers can scale to clusters with thousands of chips, and the distributed architecture allows AWS to balance power, cooling, and data center capacity across multiple locations. This design avoids the constraints of single-site megaclusters while maintaining the efficiency and throughput needed for trillion-parameter model workloads.
Project Rainier illustrates AWS's strategy of vertically integrating its AI stack, from chip design through cluster operation, to reduce reliance on third-party GPU vendors and improve cost, scale, and energy efficiency. The system's modular deployment also lets AWS distribute energy sourcing and cooling across sites, which the company positions as improving sustainability across the infrastructure footprint.
For customers, Project Rainier signals the next phase of AWS's AI compute evolution. While the full-scale cluster is reserved for frontier model training partnerships, the same Trainium2 technology and supporting software (the Neuron SDK and UltraServer infrastructure) are available today through EC2. These building blocks allow customers to begin scaling distributed training workloads ahead of broader Rainier availability.
About the Author
David Ramel is an editor and writer at Converge 360.