How Microsoft’s GitHub just told everyone that AI agents are 'behind' recent outages

GitHub CTO Vlad Fedorov has published a public apology after two large incidents left thousands of repositories and pull requests in broken states. The platform's April uptime has dropped below 85 percent, far below its 99.9% SLA, driven by a sharp surge in AI agent workflows demanding 30 times current infrastructure capacity. Ghostty developer Mitchell Hashimoto, GitHub's 1,299th user, has already announced he's leaving.

GitHub's CTO has published a rare public apology, confirming that an explosion in AI-driven development workflows is the main culprit behind the platform's worsening reliability problems, and that the company severely underestimated just how much capacity it would need to keep things running.

The post, written by Chief Technology Officer Vlad Fedorov, admits that two recent incidents were "not acceptable." It details a platform under serious strain: uptime in April has slipped below 85 percent, far short of the 99.9% the service level agreement promises. It closes with a blunt two-word admission: "We're sorry."

The timing is pointed. The apology dropped just hours after Ghostty developer Mitchell Hashimoto (GitHub's 1,299th ever user, who joined in February 2008) publicly announced he was pulling his popular terminal emulator off the platform.

Hashimoto had been keeping a daily diary marking every day a GitHub outage disrupted his work, and almost every day had an X against it. "This is no longer a place for serious work," he wrote.
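
To put the reported uptime figures in perspective, the short sketch below converts an uptime percentage into hours of downtime. It assumes uptime is measured as a simple fraction of wall-clock time across April's 30 days; the article does not say how GitHub computes the number, so treat the result as a rough illustration only. The function name `downtime_hours` is made up for this example.

```python
# Rough conversion from uptime percentage to downtime.
# Assumption (not stated in the article): uptime is a plain fraction
# of wall-clock hours over the 30 days of April.
HOURS_IN_APRIL = 30 * 24  # 720 hours


def downtime_hours(uptime_percent: float, total_hours: float = HOURS_IN_APRIL) -> float:
    """Return the hours of downtime implied by an uptime percentage."""
    return total_hours * (100.0 - uptime_percent) / 100.0


sla_budget = downtime_hours(99.9)  # what a 99.9% target would allow
reported = downtime_hours(85.0)    # what 85% uptime would imply

print(f"99.9% allows about {sla_budget * 60:.0f} minutes of downtime")
print(f"85% implies about {reported:.0f} hours of downtime")
print(f"That is roughly {reported / sla_budget:.0f} times the 99.9% budget")
```

Under that assumption, a 99.9% target leaves well under an hour of downtime for the month, while 85 percent corresponds to several days.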

The AI agent surge nobody planned for

So what actually happened? GitHub originally saw the traffic wave coming, just not at this scale. The company began executing a 10x capacity expansion plan in October 2025, but by February 2026, it became clear the platform needed to be designed for 30 times today's scale.

The main driver, GitHub says, is a sharp acceleration in agentic development workflows since late December 2025, with repository creation, pull request activity, API usage, and large-repository workloads all growing fast, simultaneously. That last part is the killer. It's not one system buckling; it's everything at once. A single pull request, Fedorov explains, can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases.

At scale, small inefficiencies compound fast.
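
One of the compounding effects Fedorov's post names is retries amplifying traffic. The sketch below is a generic worst-case model, not a description of GitHub's systems: it assumes a chain of services in which every layer retries a failed call a fixed number of times and every attempt fails. The function name and parameters are hypothetical.

```python
def worst_case_requests(layers: int, retries_per_layer: int) -> int:
    """Requests hitting the deepest dependency for ONE user request.

    Assumes every layer makes its first attempt plus `retries_per_layer`
    retries, and that every attempt fails (the worst case).
    """
    attempts_per_layer = 1 + retries_per_layer
    return attempts_per_layer ** layers


# With 2 retries per layer, load on the deepest service grows
# geometrically with the depth of the call chain.
for depth in (1, 2, 3, 4):
    print(depth, worst_case_requests(depth, retries_per_layer=2))
```

With two retries per layer, a call chain four layers deep can turn one user request into 81 requests against the slowest dependency, which is why a small slowdown in one place can spread.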

Two bad weeks that made the apology inevitable

Two specific incidents pushed things to a breaking point. On April 23, a merge queue bug caused incorrect commits when a merge group contained more than one pull request, with changes from previously merged pull requests inadvertently reverted. A total of 658 repositories and 2,092 pull requests were affected. No data was lost, but default branches were left in incorrect states that GitHub couldn't safely repair automatically.

Then came April 27. GitHub's Elasticsearch cluster became overloaded, likely from a botnet attack, and stopped returning search results, breaking large parts of the UI for pull requests, issues, and projects. Elasticsearch, Fedorov notes, was one of the systems not yet fully isolated, because other higher-risk areas had taken priority. That calculus clearly didn't hold.

On the fix side, GitHub has been moving webhooks out of MySQL, redesigning session caches, overhauling authentication flows to reduce database load, and accelerating a migration of performance-sensitive code from Ruby into Go.
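
A common, generic way to keep an overloaded dependency such as a search cluster from dragging down the pages that use it is a circuit breaker. The sketch below is an illustration only; nothing in the article says GitHub uses this pattern, and the class, its parameters, and the `overloaded_search` stand-in are all hypothetical. After a few consecutive failures it stops calling the dependency for a cool-down period and returns `None`, so the caller can show "search unavailable" instead of an empty result list.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: reset and let a trial call through.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, fn: Callable[[], List[str]]) -> Optional[List[str]]:
        """Return fn()'s result, or None if the dependency is unavailable."""
        if self.is_open():
            return None  # fail fast without adding load
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return None
        self.failures = 0
        return result


def overloaded_search() -> List[str]:
    # Hypothetical stand-in for a search backend that is timing out.
    raise TimeoutError("search cluster overloaded")


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(4):
    hits = breaker.call(overloaded_search)
    print("search unavailable" if hits is None else hits)
```

The design choice worth noting is the return type: `None` means "could not ask", which is different from an empty list meaning "asked and found nothing". Keeping the two apart lets a page tell users that search is down rather than showing zero results.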

The Azure migration, despite being blamed by some, has actually helped, allowing GitHub to spin up compute faster. A multi-cloud architecture is now also in the works for longer-term resilience.

GitHub's stated priority order going forward is availability first, then capacity, then new features. The platform has also updated its status page to include live availability numbers and committed to flagging both large and small incidents, so developers no longer have to guess whether the problem is on their end or GitHub's.

Whether those promises translate into a platform developers can actually rely on again is now the only question that matters.

Here is the full blog post from GitHub CTO Vlad Fedorov, published April 28, 2026:

I wanted to give an update on GitHub's availability in light of two recent incidents. Both of those incidents are not acceptable, and we are sorry for the impact they had on you. I wanted to share some details on them, as well as explain what we've done and what we're doing to improve our reliability.

We started executing our plan to increase GitHub's capacity by 10X in October 2025 with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale.

The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply. By almost every measure, the direction is already clear: repository creation, pull request activity, API usage, automation, and large-repository workloads are all growing quickly.

This exponential growth does not stress one system at a time. A pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.

Our priorities are clear: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. We're making progress quickly, but these incidents are examples of where there's still work to do.

Short term, we had to resolve a variety of bottlenecks that appeared faster than expected, from moving webhooks to a different backend (out of MySQL), redesigning user session cache to redoing authentication and authorization flows to substantially reduce database load. We also leveraged our migration to Azure to stand up a lot more compute.

Next we focused on isolating critical services like git and GitHub Actions from other workloads and minimizing the blast radius by minimizing single points of failure. This work started with careful analysis of dependencies and different tiers of traffic to understand what needs to be pulled apart and how we can minimize impact on legitimate traffic from various attacks. Then we addressed those in order of risk. Similarly, we accelerated parts of migrating performance or scale sensitive code out of Ruby monolith into Go.

While we were already in progress of migrating out of our smaller custom data centers into public cloud, we started working on path to multi cloud. This longer-term measure is necessary to achieve the level of resilience, low latency, and flexibility that will be needed in the future.

The number of repositories on GitHub is growing faster than ever, but a much harder scaling challenge is the rise of large monorepos. For the last three months, we've been investing heavily in response to this trend both within git system and in the pull request experience.

On April 23, pull requests experienced a regression affecting merge queue operations. Pull requests merged through merge queue using the squash merge method produced incorrect merge commits when a merge group contained more than one pull request. In affected cases, changes from previously merged pull requests and prior commits were inadvertently reverted by subsequent merges. During the impact window, 658 repositories and 2,092 pull requests were affected.

On April 27, an incident affected our Elasticsearch subsystem, which powers several search-backed experiences across GitHub, including parts of pull requests, issues, and projects. What we know now is that the cluster became overloaded (likely due to a botnet attack) and stopped returning search results. There was no data loss, and Git operations and APIs were not impacted. However, parts of the UI that depended on search showed no results, which caused a significant disruption.

We have also heard clear feedback that customers need greater transparency during incidents. We recently updated the GitHub status page to include availability numbers. We have also committed to statusing incidents both large and small, so you do not have to guess whether an issue is on your side or ours.

The team at GitHub is incredibly passionate about our work. We hear the pain you're experiencing. We read every email, social post, support ticket, and we take it all to heart. We're sorry.

We are committed to improving availability, increasing resilience, scaling for the future of software development, and communicating more transparently along the way.

Vlad Fedorov, GitHub CTO
