迁移到云节奏Moving to Cloud Cadence

作者: Lori LamkinBy: Lori Lamkin

在 Microsoft 的 VSTS 中,我们一直在 DevOps 旅程。At VSTS in Microsoft, we’ve been on a DevOps journey. 我们并不完美,我们不会在六个月内完成此操作,我们还没有完成。We’re not perfect, we didn’t do this in six months and we’re not done. 在此过程中,我们有重大里程碑,接下来的模式与我看到的客户遵循的模式非常类似。We had major milestones along the way and the pattern that we followed is very similar to the pattern I see customers following. 从 "唉" 开始,我想要缩短循环时间。They start with, “Gosh, I want to reduce my cycle times. 我想这样做的 DevOps。I want to do this DevOps thing. 我想迁移到云节奏。I want to move to that cloud cadence. 我的业务没有跟上 "。My business isn’t keeping up.” 是的,这也是我们更改的原因。Yes, that’s why we changed too.

首先更改节奏First Change the Cadence

首先要做什么?What do you do first? 更改你的计划Change your schedule. 我想要更快地交付。I want to ship faster. 这是旅程的第一步。That is the first step on the journey. 对于我来说,这是最大的 ah ha 时间,这是我们第一次尝试更改计划的时候,我们创建了一个稳定的时间,因为我们恐惧不会有稳定的产品。To me, the big ah-ha moments for this was when we tried at first to change the schedule and we created a stabilization period because we were scared that we weren’t going to have a stable product. 这就提供了错误的行为。That delivered the wrong behavior. 某些团队执行了正确的操作并向下分解了债务,而其他团队仍在培养自己的债务,因为这种情况下,工程师需要提供这些功能,使其能够 cram 尽可能多的功能。Some teams did the right thing and burned down their debt, other teams kept building their debt because of course engineers want to ship features so they’re going to cram as many in as they can. 然后,团队 A 必须修复团队 B 的 bug。Then team A had to fix team B’s bugs. 是的,可以保证错误的行为。Yeah that was rewarding the wrong behavior. 我们的关键是只是跳过 cliff,从而消除了稳定性里程碑。The key to us was just jumping off the cliff, getting rid of the stabilization milestone. 那么,系统突然就会奖励正确的行为,因为如果你建立了债务,则必须将冲刺(sprint)分解为你。Then all the sudden the system rewards the right behavior, because if you build up debt, then you have to spend sprints burning it down, it’s just you. 在执行此操作之前,不会交付功能。You don’t get to ship your features until you do it. 这是一种关键的 ah ha 东西。That was kind of the key ah-ha thing.

团队自治和组织对齐Team Autonomy and Org Alignment

下来要做的 就是。The next thing is letting go. 作为管理团队,在前面的规划中,有很多安全,而现在我们将获得这些日期,但是这只是一种不安全的安全。As a management team there’s a lot of security in planning up front, and feeling like oh we’re going to get these dates, but it’s just false security. 你可以让你的团队使用自己的积压工作(backlog)、自己的计划来运行,并以某种方式查找它们。You’ve got to let your teams be able to run with their own backlogs, their own plans, and somehow find a way to align them.

我经常与管理人员交谈,尤其是我们如何运行功能聊天和冲刺(sprint)演示。I often talk to executives especially it’s about how we run our feature chats and our sprint demos. 因为他们会问,"您能只为我提供 KPI,以便您能衡量工作效率,所以它们会成为一项巨大的 ah ha 吧?Those come across as big ah-ha things because they ask, “Can you just give me the KPI’s so you can measure productivity? 您可以只为我提供 KPI,以便我知道哪些团队正在做什么,哪些团队需要帮助或我的功能在跟踪上? "Can you just give me the KPI’s so I know which teams are doing well or which ones need help or my features are on track?” 同样,您已经在该产品中了。Again, you’ve got to be in the product. 您必须与大家交谈。You’ve got to be talking to the people. 这些工具可帮助实现这一点,但在这种透明的节奏中,对话更好。The tools facilitate that, but conversation is way better in that transparent rhythm.

连续装运Ship Continuously

另一种是 始终提供功能The other is to always deliver features. 这种计划并不喜欢,是我们执行了三个冲刺(sprint)计划,我们尝试坚持使用。That schedule, it’s not like, yeah we do a three sprint plan and we try to stick to it. 我们正在倾听我们的知识。We’re listening and we’re learning.

在 Master 中工作Work in Master

我们用来让工程师在不同的分支中工作。We used to have engineers working in separate branches. 在您尝试执行该集成后,合并负债将隐藏,并且您拥有的团队越多,集成就越大。The merge debt is hidden until you try to do that integration, and the more teams that you have, the bigger that integration becomes. 如何实现该集成,使其在小块区中发生,并且更快、更持续?How can you get that integration to happen in small chunks and faster, more continuously? 关键是, 在 master 中工作The key is, work in master. 迁移到 Git 的原因之一是轻型分支。One of the reasons why we moved to Git is the lightweight branching. 我们内部工程的真正大优势是,消除了这一深层分支层次结构以及它所引入的垃圾。The really big benefit to our internal engineering was getting rid of that deep branch hierarchy and the waste it introduced.

遍历遍历Walk the Walk

由于我们 使用我们构建的工具,我们的一项投资会提高工作效率并改善我们的产品。Because we use the tools that we build, our one investment yields benefit both in our productivity and improving our product. 例如,我们提供了一个发布管理系统,并将其提供给其他人,并且我们使用自己,而不是某个辅助副本,然后从团队 siphons 的速度。For example, it’s really important that we have a release management system that we ship to everybody else and that we use ourselves, instead of something secondary that then siphons off a bunch of velocity from the team.

持续部署Continuous Deployment

部署频率越低,部署起来就越困难。The less frequently you deploy, the harder it is to deploy. 部署之间的时间越多,成堆就越多。The more you have time between deployments, the more stuff piles up. 这种情况突然出现,任何人都不会在这种情况下进行编码,因此会产生部署债务。Suddenly, the code isn’t fresh in anyone’s mind, so you get deployment debt. 在较小的块区中 可以更多地工作,实际部署会更容易。The more you can work in the smaller chunks, the easier the actual deployment will be. 我说,我们在我们的旅程里,因为这是非常困难的,而这只是很难。I say to people, on our journey, when we avoided deploying because it was so hard and that just made it harder. 是的。Yeah. 如果保留这种情况,看起来好像是有意义的,但当你处于中间时,这会是一个不够直观。Reserving that, it seems like it makes sense but it’s a little counterintuitive when you’re in the middle. 部署更为频繁的是,我们会优先考虑我们用于部署更可靠的工具。Deploying more frequently made us prioritize making the tools we use to deploy more solid.

检测所有内容Instrument Everything

当然,调试需要进行更改。Of course, debugging needs to change. 不能在生产计算机上附加调试器。You can’t attach a debugger on a production machine. 这意味着我们是一种飞行。That means we’re flying blind. 那么我们的做法就是我们需要 检测所有内容。Then our practice became that we needed to instrument everything.

立即测试Now Test Continuously

此时,仍在慢慢地移动,但为什么要这样做呢?At this point, we were still moving slowly, but why? "我编写了我的所有代码,但这种情况却不能及时测试。"“I wrote all my code but that guy over there can’t seem to test it in time.” 很容易通过围栏引发东西。It’s really easy to throw things over the fence. 若要实际进行更改,必须 更改责任To actually make the change, we had to change the accountability. 我不再知道如何在开发人员不负责测试的环境中生活。I no longer know how to live in a world where devs aren’t accountable for tests. 在进行此更改之前,我们处于一种非常麻烦的情况,即测试人员运行的测试不同于开发人员运行的测试,因此您甚至不知道要测试的内容。Before this change, we were in a nightmare situation where tests that testers run were different than tests that developers ran, so you didn’t even know what was going to be tested. 经过测试后,开发人员会说: "我在很久之前编写了代码,我甚至不记得如何修复它。"By the time they got around to testing it, the dev says, “Well I wrote that code so long ago I can’t even remember how to fix it.” 这项重要的是,使开发人员对整个事物负责,使一切都发生变化。That key pivot, making the dev accountable for the whole thing, made everything change. 它使测试工具更好。测试结果更好。It made the test tools get better. The test results got better.

并可靠地测试And Test Reliably

当你尝试执行拉取请求时获取的快速、可靠的信号,当你获取 CI 运行并 知道红色表示红色 时,这是必不可少的。The quick, reliable signal that you get when you’re trying to do a pull request, when you get the CI runs and when you know red means red, is indispensable. 这种情况仅在我们做出开发人员负责测试的情况下发生,而不是在开发人员的情况下进行开发,我们尝试过此操作。That only happened when we made dev accountable for the test, and not dev with some testers working for dev, we tried that too. 确实开发人员了测试。It was really devs doing the testing. 其他企业可能会做出不同的选择,但一旦你开始引发一切,并拥有所有这些移交,你就会在系统中创建延迟和废物,并使业务的速度过快。Other businesses, they might make a different choice, but as soon as you start throwing things around and you have all these handoffs, you’re creating delay and waste in the system and the business gets too far ahead. 我们的交付速度不够快。We can’t ship fast enough.

相同的测试在任何位置运行Same Tests Run Everywhere

我们来确保获得极佳的信噪比。Let’s make sure that we get great signal-to-noise ratio. 不可靠测试有0个容错能力We have zero tolerance for flaky tests. 测试可以快速运行,在六分钟内运行55000测试,是的,这是我们在这里要讨论的内容,因此,每次提交拉取请求时,你都知道它是否良好。Tests can be run quickly, they’re run close to the code, 55,000 tests in six minutes, yeah that’s what we’re talking about here, so that every time you’re submitting a pull request you know if it’s good.

迁移到云并在飞行中发展Move to The Cloud and Evolve in Flight

这里的关键是我们必须不断发展,这就是我们每次遇到一个问题的原因。The key thing here was we had to evolve in flight That’s why we took issues one at a time. 我们无法只改进体系结构。We can’t just do architecture improvements. 我们始终向所有客户交付产品和发运功能。We were shipping the product and shipping features to all of our customers all the time. 我看到的是 功能标志节省了生活What I see is that feature flag saves lives. 这是一件非常大的问题,因为一旦我们开始部署,我们就会有一组人说: "我的功能不能在三周内完成,因此我不会进行合并,也不打算部署。"That was a huge thing for us because once we started deploying all the time, we had a bunch of people on the team saying, “My feature can’t be done in three weeks so I’m not going to merge and I’m not going to deploy.”

这是受控制公开的功能标志的答案。The answer to that was feature flags, which was controlled exposure. 只需将其关闭,你就可以合并和部署,而不会产生负债。Just turn it off and you get to merge and deploy, and you don’t accrue debt. 这就是计算机速度的所有方面。This is all about that speed of the machine.

使用复原模式Use Resiliency Patterns

当然,我们每个都在那里都有本地的体系结构,并且我们不得不 阻止级联故障问题,我们在单点故障中发现了这些问题。Then of course we each had our on-prem architecture up there, and we had to prevent the cascading failure problem, which we found in single points of failure. 我们要消除这部分内容。We’re eliminating those piece by piece. 我们仍在执行此操作,但我们想要了解如何防止邻居干扰,并确保人们不会滥用资源。We’re still in the process of doing that now, but we’re learning how to prevent noisy neighbors and make sure people aren’t misusing resources.

安全性Security

这里的主要内容是 使漏洞成为真实的和个人 的,然后您将会关心。The big thing here is to make vulnerabilities real and personal and then you’re going to care. 当红色的团队显示时,我已进入代码,并将整个对话框倒置,就像这样,稍等片刻,你无法对我的代码执行此操作。When the red team shows, I got into your code and I turned your whole dialogue upside down, and it’s like, wait a minute, you can’t do that to my code. 这比关于 XSS 的警告更为真实。That’s much more real than a warning about XSS. 我们创建了这种类型的区域性,并通过该红色团队进行了动态操作,因此,在这种情况下,人们可以在很自豪地利用他人的代码或阻止尝试。We created that kind of culture and dynamic through that red team, blue team exercises that we do, where people take pride in hacking into each other’s code or being able to block the attempts. 这 instilled 了一种安全的代码区域性。That instilled a secure code culture.

我们不能规划每个攻击向量,但我们能做的就是假设存在一种违规行为,并计划 我们的响应速度We can’t plan for every attack vector, but what we can do is assume that there’s going to be a breach, and plan how fast we can react to that breach. 我们的团队已经围绕了许多安全工作。A lot of the security work has been around that for our team.

最后,人犯错误。Then lastly, humans make mistakes. 它们将密码存储在文件共享上,我们可以告诉他们不是,我们可以将其发送到安全培训,并且我们可以执行所有这些操作。They store passwords on file shares and we can tell them not to and we can send them to security training and we can do all these sorts of things. 许多人都知道,但没有人会总是知道。Many people learn but nobody will ever always learn. 您可以使用所有类型的最佳做法,但除非您真正这样做,否则您必须 假定人们会犯错误,您必须进行检查,然后再检查是否正在执行这些操作。You can have all the sorts of lists of best practices but unless you’re making that real, you have to assume that people are going to make mistakes, you have to go and check that they’re being followed. 你可以通过进行测试。and that you have ways to test that through. 这与测试问题的知识相同。This is the same learnings as the testing problem.

实时网站区域性Live Site Culture

使现场现场成为工程团队的问题。Make live site the engineering team’s problem. 这对我们来说非常巨大,因为在过去,我可以部署一些东西,睡眠所有夜晚,让客户支持团队处理的问题更好900,而 ops 团队仍在努力解决这类问题,并不会支付价格。That was huge for us because in the past, I could go deploy something, sleep all night, have a nice weekend, come back Monday to 900 customer issues that the customer support team was dealing with and the ops team was trying to solve and all that sort of stuff, and I didn’t pay the price.

如果不支付价格,那么我并不激励在所有系统中都要再次开始。If I don’t pay the price, then I’m not incented to build in all the systems to never get to that point again. 当我在上午2:00 调用时,我注意到。When I’m called at 2:00 in the morning, I notice.

我们必须构建这一点,说: "无停止一切。We had to build on that and say, “No stop everything. 实时站点是我们最重要的事情。 "Live site is the most important thing that we do.” 这是目前为止的客户体验,并不只是一种税款或类似的内容。It’s the customer experience they have right now and it’s not just a tax or something like that. 实际上,人们对我们进行了统计,并在中取得了自豪。It’s actually something people count on for us and we take pride in. 这是我们产品的一项功能。It’s a feature of our product. 这是一种区分我们的内容。It’s something that differentiates us.

生产遥测Production Telemetry

为了经受这一世界,我们需要出色的警报系统。In order to survive in that world, we need great alerting systems. 如果具有 unactionable 警报、冗余警报或大量警报,则会忽略所有警报。Having unactionable alerts, redundant alerts, or overwhelming alert volumes makes you ignore all the alerts. 同样,我们说,"我们更好地清理此完整警报,因为它是黄金的。Again, we said, “We better clean up this whole alert thing because it’s gold. 警报需要具有可操作性。Alerts need to be actionable. 这是确保我们跳转到正确的客户问题并能够尽快完成此操作的关键。This is the key to making sure we’re jumping on the right customer issues and being able to do it as quickly as we can. 然后,我们将发展为创建可进行故障转移和自行修复的系统,这样就不会收到警报,但我们会在早上,而不是所有晚上处理它们。Then we evolved to creating systems that can fail over and self-heal, so that all the sudden not only are we getting the alerts, but we can deal with them in the morning, not all night. 这一切都不会发生,因为我们在一侧就为 ops 团队引发了某些事情。All of that wouldn’t have happened if we threw things over the wall to an ops team over on the side. 这是因为开发团队坐在那里。It was because the dev team is sitting there. 这些改进不仅是功能速度,而且是工程改善速度。They’re fighting to balance these improvements as a part of not just feature velocity, but engineering improvement velocity.

在指标上运行业务Running the Business on Metrics

大约三年后,公司会说: "我们希望拥有数据驱动的区域性"。About three years ago the company said, “We want to have a data driven culture.” 我们都不希望数据驱动?Don’t we all want to be data driven? 我们已经开始,只是在云中,了解我们需要大量遥测和分析,而不是在生产中进行调试。We had already started, just being in the cloud, learning that we needed a lot of telemetry and analytics instead of debugging in production. 最重要的是我们的服务的运行状况。What was most important to us first was the health of our service. 但要改变文化,说 "我们需要在我们的决策过程中执行此操作。"But then to change the culture and say, “We need to do this in all aspects of our decision making.” 我们已经获得了一种定性反馈循环,但并不是在量化反馈循环。We were already good at the kind of qualitative feedback loops, but not very good at the quantitative feedback loops.

正在获取指标权限Getting Metrics Right

我通过这一旅程学到了一些东西。I learned a couple of things through that journey. 第一种情况是, 设计度量值与设计功能非常困难The first one was, designing metrics is as hard as designing features. 我更好说,"哦,这是一个指标。I better not just say, “Oh here’s a metric. 我们希望有用户。 "We want engaged users.” 我们在第一年中重新定义了服务的接洽用户、三次或四次,并将其衡量为一半。We redefined what engaged users on the service meant, three or four times in the first year and a half of measuring it. 我不需要衡量虚的度量值,而是始终继续。I don’t need to measure vanity metrics, always going up. 我需要衡量速度斜坡,是速度上升还是保持不变?I need to measure the velocity ramp instead, is the velocity going up or is it just staying the same? 而且,我们已了解到哇,不同的月份有不同的天数,需要进行规范化。And we learned that wow, different months have different days and need to be normalized.

现在,每次定义指标时,都有一个类似于 spec 审查的对话。Even now, every time we define a metric we have a conversation like a spec review. 我们要做些什么,需要注意一些子指标,以确保它是正确的?What are we trying to drive and are there sub-metrics we should be watching to make sure it’s right? 我对成为数据驱动的一部分的开销不是很多。I didn’t expect that much overhead as a part of becoming data driven. 最后,我们必须通过每月服务审查将它制作到文化中。Lastly, we had to bake it into our culture with a monthly service review.

遥测确认完成后才会完成You’re Not Done Until Telemetry Confirms You’re Done

我们将制作的指标纳入到 Scott Guthrie,我们的执行副总裁。We bake metrics into our reviews up to Scott Guthrie, our executive vice president. 我们会告诉他每六周,我们的运行状况、我们的业务、方案、客户遥测。We tell him every six weeks how we’re doing on health, our business, our scenarios, our customer telemetry. 我们会将它与管理人员一起讨论,然后将其放到团队中。We discuss it with the executives and then bring it down to the teams. 我们来看看这些相同的用户指标,并问: "这对您的功能有什么意义呢?"We look at those same engaged user metrics and ask, “What does that mean for your feature?” 公司的所有人都可以说,"哦,我并不只是提供了该功能,但现在我看看看看,是使用它的吗?People all through the org can say, “Oh I don’t just ship the feature, but now I go and look and see, Are people using it? 或者,是否需要调整积压工作(backlog)来更好地处理此功能以使其达到其目标? "Or do I need to adjust the backlog to work on this feature more to make it achieve its goals?”

这是一种旅程It’s a Journey

它不是从 A 到 B 的直线,也不是最终的。It wasn’t a straight line to get from A to B, nor is B the end. 它不是 rainbows 和独角兽。It wasn’t rainbows and unicorns. 我们需要更快地移动,而不只是在更短的冲刺(sprint)中执行相同的操作。We needed to move faster, not just do the same thing as before in shorter sprints. 这意味着以不同的方式工作。It means working differently. 这会改变开发、测试和 ops 的规则和责任。This is changing your rules and responsibilities for dev, test and ops. 这是一种持续发展、错误和知识的旅程,我们仍在使用。It’s a journey of continuous evolution and mistakes and learnings and we’re still on the way.

Lori Lamkin Lori Lamkin 是 Visual Studio 云服务的程序管理主管。Lori Lamkin is the Director of Program Management for Visual Studio Cloud Services. 她领导一组计划经理来推动在线开发人员体验,以创建云连接的应用,与团队成员协作,并与社区进行合作,同时采用新式开发实践。She leads a team of program managers to drive the online developer experience to create cloud connected apps, collaborate with their team members, connect with community, all while adopting modern development practices. 她加入了 Microsoft 在1990中提供的支持 C/c + + 语言产品的客户,成为了 C/c + + 技术客户支持和1993的单元经理,并成为了版 Visual Studio Team 2003 System 的1.0 版本。She joined Microsoft in 1990 supporting customers with the C/C++ language product, became the Unit Manager of C/C++ technical customer support and in 1993, and became Group Program Manager on version 1.0 of Visual Studio Team System in 2003. 她持有 B.S。She holds a B.S. 在数学上,重点介绍了华盛顿大学的计算机科学。in Mathematics with an emphasis in Computer Science from the University of Washington. 通过她的丈夫和克隆男孩子,Lori 喜欢的读物、烹饪、热瑜伽、dance、网球和花费时间。Lori enjoys reading, cooking, hot yoga, dance, tennis and spending time with her husband and twin boys.