NETWORK-OPS-SRE-01

Site reliability engineer running AI cluster networks day-to-day.

Audience

  • 5-12
  • Current: Network SRE / NetOps Lead
  • Pain: Debugging hung allreduce at 10k+ GPU scale
  • Pain: Topology change rollout safety (link drains)

Product Needs

(none)

Channels

(none)

Competitor Lens

(none)