The DevOps Dojo
Johan Abildskov
7 episodes
7 months ago
The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!
Technology
Education
Episodes (7/7)
The DevOps Dojo
Digital Transformation with the Three Ways of DevOps
The three ways of DevOps come from The Phoenix Project, a famous book in DevOps circles. This episode covers how to use the three ways to progress in your digital transformation initiatives.
Sources:
https://www.businessinsider.com/how-changing-one-habit-quintupled-alcoas-income-2014-4?r=US&IR=T
https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262592
https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations-ebook/dp/B01M9ASFQ3/ref=sr_1_1?crid=316RJMM06NH59&dchild=1&keywords=the+devops+handbook&qid=1600774333&s=books&sprefix=The+devops+h%2Cstripbooks-intl-ship%2C235&sr=1-1
Transcript:
My first introduction to the principles behind DevOps came from reading The Phoenix Project by Gene Kim, Kevin Behr and George Spafford. In this seminal book, which blew my mind, we follow Bill as he transforms Parts Unlimited by salvaging The Phoenix Project, an IT project that went so wrong it could almost have been a project in the public sector. Through Bill's journey to DevOps, we discover and experience the Three Ways of DevOps. In this episode, I cover the three ways of DevOps and how they can be applied in a transformation. This is the DevOps Dojo #6, I am Johan Abildskov, join me in the dojo to learn.
In the DevOps world, few books have had the impact of The Phoenix Project. If you have not read it yet, it has my wholehearted recommendation. It is tragically comic in its recognizability and frustratingly true. In it, we experience the three ways of DevOps: the principles of flow, the principles of feedback and the principles of continuous learning. While each of these areas supports the others and there is some overlap, we can also use them as a rough roadmap towards DevOps capabilities. The First Way, Flow, addresses our ability to execute. The Second Way, Feedback, concerns our ability to build quality in and notice defects early. The Third Way, Continuous Learning, focuses on pushing our organizations to ever higher peaks through experimentation.
The First Way of DevOps is called the principles of flow. The foundational realization of the First Way is that we need to consider the full flow from ideation until we provide value to the customer. This is also a clash with the chronic conflict of DevOps: siloed Dev and Ops teams. It doesn't matter whether you feel you did your part or not, as long as we, the collective, are not providing value to end users. If you feel you are waiting a lot, try to pick up adjacent skills so you can help where needed. We also focus on not passing defects on, and on automating the delivery mechanisms so that we have a quick delivery pipeline. Using Kanban boards or similar to visualize how work flows through our organization can help make the intangible work we do visible. A small action with high leverage is WIP limits: simply limiting the number of concurrent tasks that can move through the system at any point in time can have a massive impact (see the sketch below). Another valuable exercise is a Value Stream Map, where you look at the flow from the aha moment to the ka-ching moment. This can be a learning situation for all involved members as well as the organization around them. Having looked at the full end-to-end flow and optimized it, we can move on to the second way of DevOps.
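To make the WIP-limit idea concrete, here is a minimal sketch in Python of a Kanban column that refuses new work once its limit is reached. The class, column name and limit value are invented for illustration; they are not from the episode.

```python
class KanbanColumn:
    """A minimal Kanban column that enforces a work-in-progress (WIP) limit."""

    def __init__(self, name: str, wip_limit: int):
        self.name = name
        self.wip_limit = wip_limit
        self.items: list[str] = []

    def pull(self, item: str) -> bool:
        """Pull a work item into the column; refuse it if the WIP limit is reached."""
        if len(self.items) >= self.wip_limit:
            return False  # signal: finish something before starting more work
        self.items.append(item)
        return True

    def finish(self, item: str) -> None:
        """Remove a completed item, freeing capacity for new work."""
        self.items.remove(item)


# With a WIP limit of 2, the third task is refused until something finishes.
doing = KanbanColumn("Doing", wip_limit=2)
assert doing.pull("feature A")
assert doing.pull("bugfix B")
assert not doing.pull("feature C")  # blocked: limit reached
doing.finish("feature A")
assert doing.pull("feature C")      # capacity freed, work can flow again
```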
The Second Way of DevOps is the principles of feedback. The First Way enables us to act on information, so the Second Way focuses on generating that information through feedback loops, and on shortening those feedback loops so we can act on learning while it is cheapest and has the highest impact. Activities in the Second Way can be shifting left on security by adding vulnerability scans to our pipelines (see the sketch below). It can be decomposing our test suites so that we get the most valuable feedback as soon as possible. We can also invite QA, InfoSec and other specialist competences into our cycles early to help
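As a hedged illustration of shifting left on security, here is a sketch of a pipeline gate that reads a vulnerability-scan report and fails the build when findings reach a severity threshold. The report file name and its JSON shape are simplified assumptions, not the format of any particular scanner.

```python
import json
import sys

# Severity ranking for a simplified, hypothetical scan-report format.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def gate(report_path: str, fail_at: str = "high") -> int:
    """Return a non-zero exit code if any finding is at or above the threshold severity."""
    with open(report_path) as fp:
        findings = json.load(fp)  # assumed shape: [{"id": "...", "severity": "high"}, ...]
    threshold = SEVERITY_ORDER[fail_at]
    blocking = [f for f in findings if SEVERITY_ORDER.get(f["severity"], 0) >= threshold]
    for finding in blocking:
        print(f"BLOCKING: {finding['id']} ({finding['severity']})")
    return 1 if blocking else 0  # a non-zero exit breaks the build early, while fixes are cheap


if __name__ == "__main__":
    sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "scan-report.json"))
```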
5 years ago
7 minutes 8 seconds

The DevOps Dojo
Site Reliability Engineering
Site Reliability Engineering, or SRE, is the next big thing after DevOps. In this episode I cover what SRE is, some of its principles and some of the challenges along the way.
Sources:
https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/
https://landing.google.com/sre/books/
Transcript:
Site Reliability Engineering. Is it just traditional siloed Ops in disguise, or is it a DevOps organizational structure that actually works? SRE comes from the trenches of Google's production environments, so many organizations look to SRE after having established DevOps teams. Site Reliability Engineering has some powerful concepts and tools, but it all comes at a price. If an organization has the scale, the willingness to invest, and the willingness to change its ways of working, SRE can help you go planet-scale. But there are many obstacles on the road to Site Reliability Engineering paradise. This is the DevOps Dojo #5, I am Johan Abildskov, join me in the dojo to learn.
Site Reliability Engineering is the fabled world of planet-scale operations with autonomous DevOps teams. This seems like an oxymoron, but it is a functional DevOps organizational structure. Ben Treynor Sloss, who coined the term Site Reliability Engineering, states: "Site Reliability Engineering is what happens when you ask a software engineer to design an operations team." SRE is how Google is able to run its production environments at large scale: in terms of how many applications and services they run, in terms of how many engineers are required to develop and maintain those products, and in terms of the global scale of their applications.
Site Reliability Engineering is about scaling services to world-class availability, but it is also about scaling the engineering organization so that teams can continue to be productive as the business grows. For me, the core tenets of Site Reliability Engineering are minimizing toil, shared ownership, and the ability to say no.
Engineering is a key part of SRE, and toil is its opposite. Toil is the work that does not, in the long run, add value, or that does not require engineering. This means that if we do too much toil, we will not be able to scale superlinearly with the number of engineers. Examples of toil are manual deployment procedures and following complex processes. Google claims to be able to have their teams work with thirty to fifty per cent toil, with fifty per cent being a hard limit. A key point here is that if a team violates this limit, it will either shed responsibilities or have resources added. This requires hard buy-in from management, and it will likely be impossible in a project-funded organization. So this way of thinking, eliminating toil and giving teams the space to do real engineering work, is the foundation of Site Reliability Engineering.
Shared ownership is a weird construct when the point is to run the applications that someone else is building, but it is a very important pillar of SRE. This shared ownership comes from SRE teams providing valuable metrics from production environments to product teams. It also comes from a shared agreement that if the application itself is too unstable, developers will join the SREs' on-call rotation until the service has been restored to a quality the SREs can operate. SREs also help product teams with production readiness reviews and checklists.
All in all, SRE enables product teams to run applications at planet scale without having to maintain all the competencies this necessarily requires inside the team. This again requires commitment from management to enforce the requirements from the SRE teams, so we do not revert to throwing applications over the wall of confusion. A powerful realisation is that we should consider downtime a resource, and spend it deliberately and respectfully. With this realisation comes the underlying acceptance that absolute 100% availability is not the right
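The "downtime as a resource" idea is usually formalized as an error budget derived from a service level objective (SLO). Here is a minimal sketch of the arithmetic; the SLO, request counts and failure numbers are made-up values for illustration.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Report how much of the error budget implied by an availability SLO has been spent."""
    allowed_failures = total_requests * (1.0 - slo)  # the budget: failures the SLO tolerates
    return {
        "allowed_failures": allowed_failures,
        "failed_requests": failed_requests,
        "budget_remaining": allowed_failures - failed_requests,
        "budget_spent_ratio": failed_requests / allowed_failures if allowed_failures else float("inf"),
    }


# A 99.9% availability SLO over 10 million requests tolerates roughly 10,000 failures.
status = error_budget(slo=0.999, total_requests=10_000_000, failed_requests=4_200)
print(status)  # about 42% of the budget spent: downtime can still be spent deliberately
```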
5 years ago
7 minutes 1 second

The DevOps Dojo
Containers
Containers are all the rage, and they contribute to all sorts of positive outcomes. In this episode, I cover the basics of containerization.
Sources:
Containers will not fix your broken Culture
docker.io
Transcript:
Containers. If one single technology could represent the union of Dev and Ops, it would be containers. In 1995, Sun Microsystems told us that using Java we could write once and run anywhere. Containers are the modern, and arguably in this respect more successful, way to go about this portability. Brought to the mainstream by Docker, containers promise us the blessed land of immutability, portability and ease of use. Containers can serve as a breaker of silos or as the handoff mechanism between traditional Dev and Ops. This is the DevOps Dojo Episode #4, I'm Johan Abildskov, join me in the dojo to learn.
As with anything, containers came to solve problems in software development. The problems containers solve are around the deployment and operability of applications or services in traditional siloed Dev and Ops organizations.
On the development side of things, deployment was, and is, most commonly postponed to the final stages of a project. Software is perhaps only run on the developer's own computer. This can lead to all sorts of problems. The architecture might not be compatible with the environments that we deploy the software into. We might not have covered security and operability issues, because we are still working in a sandbox environment. We have not gotten feedback from those who operate applications on how we can enable monitoring and lifecycle management of our applications. And thus, we might have created a lot of value, but we are completely unable to deliver it.
On the operations side of things, we struggle with things such as implicit dependencies. The application runs perfectly fine on staging servers, or on the developer's PC, but when we receive it, it is broken. This could be because the version of the operating system doesn't match, there are different versions of tooling, or even something as simple as an environment variable or file being present in one environment but not the other. Different applications can also have different dependencies on operating systems and libraries. This makes it difficult to utilize hardware in a cost-efficient way. Operations commonly serve many teams, and there might be many different frameworks, languages, and delivery mechanisms. Some teams might come with a jar file and no instructions, while others bring thousands of lines of Bash. In both camps, there can be problems with testing happening on something other than the thing we end up deploying.
Containers can remedy most of these pains. As with physical containers, it does not matter what we stick into them; we will still be able to stack them high and ship them across the oceans. In the case of Docker, we create a so-called Dockerfile that describes what goes into our container. This typically starts at the operating system level or from some framework like Node.js. Then we can add additional configurations and dependencies, install our application and define how it is run and what it exposes. This means that we can update our infrastructure and applications independently. It also means that we can update our applications independently from each other. If we want to move to a new PHP version, it doesn't have to be everyone at the same time, but rather product by product, fitting it into their respective timelines.
This can of course lead to a diverse landscape of diverging versions, which is not a good thing. With great power comes great responsibility. The Dockerfile can be treated like source code and versioned together with our application source. The Dockerfile is then built into a container image that can be run locally or distributed for deployment. This image can be shared through private or public registries. Because many people and organizations create and publish these container images, it has become easy to ru
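To ground the Dockerfile workflow described above, here is a sketch in Python that writes a minimal Dockerfile and drives the docker CLI to build and run the image. The base image, file names, port and tag are placeholders for illustration, not a prescription.

```python
"""Sketch of the Dockerfile workflow: describe the container, build an image, run it."""
import pathlib
import subprocess

DOCKERFILE = """\
# Start from a framework-level base image.
FROM node:20-alpine
WORKDIR /app
# Bake the application's dependencies into the image.
COPY package*.json ./
RUN npm install
COPY . .
# Document the port the service listens on and define how it runs.
EXPOSE 8080
CMD ["node", "server.js"]
"""


def build_and_run(tag: str = "my-service:1.0") -> None:
    # The Dockerfile is source code: keep it versioned next to the application.
    pathlib.Path("Dockerfile").write_text(DOCKERFILE)
    subprocess.run(["docker", "build", "-t", tag, "."], check=True)
    # The same image runs locally, in CI and in production, unchanged.
    subprocess.run(["docker", "run", "--rm", "-d", "-p", "8080:8080", tag], check=True)


if __name__ == "__main__":
    build_and_run()
```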
5 years ago
7 minutes 24 seconds

The DevOps Dojo
Measuring your Culture with the Westrum Typology of Organizational Culture
The Westrum model has been shown to predict software delivery performance. It also helps us get a quantifiable handle on the intangible concept of culture. With concrete focus points, it is a fantastic way to start improving your culture. This episode covers the Westrum model.
Sources:
https://cloud.google.com/solutions/devops/devops-culture-westrum-organizational-culture
https://inthecloud.withgoogle.com/state-of-devops-18/dl-cd.html
https://qualitysafety.bmj.com/content/13/suppl_2/ii22
https://www.amazon.com/Accelerate-Software-Performing-Technology-Organizations/dp/1942788339
Transcript:
Culture. We know it is the foundation upon which we build high performing teams. Yet it is a difficult topic to address. We struggle to quantify culture, and what does good culture even mean? How can we approach improving our culture without resorting to jamborees and dancing around bonfires? Team building can feel very disconnected from our everyday lives. The Westrum typology of organizational cultures is a model that helps us quantify our culture, with a focus on information flow in the organization. It has even been shown to drive software delivery performance. The Westrum model gives us an actionable approach to good culture. I'm Johan Abildskov, join me in the dojo to learn.
In any conversation about transformations, whether digital, agile or DevOps, you can be certain that before much time has passed, someone clever will state "Culture eats strategy for breakfast". This quote is attributed to Peter Drucker and implies that no matter how much effort we put into getting the strategy perfect, execution will fail if we do not also improve the culture. Culture is to organizations what personality is to people. We can make all the New Year's resolutions, fancy diets and exercise plans we want, but if we do not change our habits, our patterns, our personality, even the best-laid plans will fail. While a human lapse in strategy might involve eating an extra cupcake and result in weight gain we had not planned for, an organizational lapse in culture might be accidentally scolding someone for bringing bad news to light, which results in an organization where problems and challenges are hidden. So it is important that we focus on establishing a healthy culture on top of which we can execute our clever strategies. Our culture is the behaviour that defines how we react as an organization.
Ron Westrum is an American sociologist who has done research on the influence of culture on, for instance, patient outcomes in the health sector. He has built a model of organizational culture based on information flow through the organization. Being at the good end of this scale has been shown by the DevOps Research and Assessment team to be predictive of software delivery performance and organizational performance. There are three categories of organizations in the Westrum model: pathological, or power-oriented; bureaucratic, or rule-oriented; and generative, or performance-oriented.
In pathological organizations, information is wielded as a weapon, used to fortify one's position, or withheld as leverage to be injected at the right moment to sabotage others or cover one's own mistakes. Cooperation is discouraged, as that can bring instability into the power balance, and the only accountability present is scapegoating and the blame game. Obviously this is a toxic environment, and the least performing organization type.
In bureaucratic organizations, the overarching theme is that it doesn't matter whether we did something wrong or did it badly, as long as we did it by the book. Responsibilities are accepted, but the priority is not sensemaking; the priority is that no one can claim we did something wrong. Bad news is typically ignored, on the logic that the process is right and the process is working. Generative organizations focus on outcome and performance. It doesn't matter who gets credit as long as the organization wins. Failures are treated as learning
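Since the episode stresses that the Westrum model makes culture quantifiable, here is a purely illustrative sketch of scoring a team's answers and mapping the average to one of the three types. The example statements, the 1-7 scale and the thresholds are invented placeholders, not the validated survey instrument used by the DevOps Research and Assessment team.

```python
from statistics import mean


def classify_culture(likert_scores: list[int]) -> str:
    """Map the average of 1-7 agreement scores about information flow to a Westrum type.

    The cutoffs below are illustrative assumptions, not published thresholds.
    """
    avg = mean(likert_scores)
    if avg < 3.0:
        return "pathological (power-oriented)"
    if avg < 5.0:
        return "bureaucratic (rule-oriented)"
    return "generative (performance-oriented)"


# Example: a team rates statements such as "information is actively sought" and
# "failures are treated as opportunities to learn" from 1 (disagree) to 7 (agree).
team_responses = [6, 5, 6, 7, 5, 6]
print(classify_culture(team_responses))  # -> generative (performance-oriented)
```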
5 years ago
8 minutes 7 seconds

The DevOps Dojo
Chaos Engineering
Chaos Engineering is a set of techniques for building a more proactive and resilient production environment and organization. In this episode of the DevOps Dojo, I introduce you to the basics of Chaos Engineering.
Sources:
https://principlesofchaos.org/
https://github.com/Netflix/chaosmonkey
https://chaostoolkit.org/
Transcript:
Chaos Engineering. The art and practice of randomly introducing failures into our production environment. It seems so counterintuitive to intentionally break our infrastructure, but Chaos Engineering is a structured and responsible way of approaching exactly this. Modern distributed systems are so complex that it is difficult to say anything about their behaviour under normal circumstances. This means it is nigh impossible to predict how they will behave when pushed to their limits in a hostile environment. Chaos Engineering allows us to probe our running systems in order to build confidence in their performance and resilience. I am Johan Abildskov. Join me in the dojo to learn.
The first time I heard about Chaos Engineering was when I learned about the tool Chaos Monkey. Chaos Monkey is a tool created by Netflix that they run in production, randomly restarting systems and servers. My first reaction was awe. How was this a sane thing to do? It just sounds so wrong. However, I have come to see that this is the only way you can keep improving your systems in a structured way. Rather than having the unforeseen happen because of user behaviour, you inject tension into the system to discover weaknesses before they cause user-facing problems. This is the natural continuation of the lean practice of artificially introducing tension into your production lines to continuously optimize productivity. There are two types of companies: those that are only able to react, whether to the market or to the competition, and those that are able to disrupt themselves and be proactive in the market. My money is on the success of the proactive organizations.
But back to Chaos Engineering. To me, the simplest way of explaining Chaos Engineering is that it is applying the scientific method to our production systems. We formulate hypotheses and conduct chaos experiments to investigate them. A chaos experiment has four parts. First, we define a steady state: some measurable condition of the system that means it is working as normal. This could be the distribution of failure rates or the response time at some percentile of requests. It is key that we have something measurable. Second, we hypothesize that our system will maintain this steady state under some hostile condition, such as server restarts, disk failures, or degraded performance in the network or in services that we depend on. Third, we have the execution of the experiment, done with tools like Chaos Monkey or Chaos Toolkit. Here we run the experiment, trying to disprove our hypothesis. Fourth, we collect data and analyse the results. The harder it is to disprove our hypothesis, the more confidence we have in the behaviour of our complex distributed system. So the four components of a chaos experiment are the steady state, our hypothesis, the execution, and the analysis.
But wait, you'll say. You will tell me I am crazy, that you can't just do such horrible things to your production environment. And you'll likely be right. Just as we need to build confidence in our production systems, we need to build confidence in our ability to conduct chaos experiments.
If you are completely new to Chaos Engineering, and perhaps are not even practising disaster recovery, this would be a really good place to start. This helps build our muscle and gets us used to manipulating environments. These trainings are commonly called game days. In some circles they practice the "wheel of misfortune", where you randomly select an incident and walk through it. My best suggestion for a starting disaster recovery training is to test whether you
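Below is a hedged sketch of the four-part experiment structure the transcript describes (steady state, hypothesis, execution, analysis). The measurements are simulated and the injected fault is only printed; a real experiment would use a tool such as Chaos Monkey or Chaos Toolkit against a system you are allowed to disturb.

```python
import random


def measure_steady_state(samples: int = 200) -> float:
    """1. Steady state: a measurable condition, here a simulated request success rate."""
    return sum(random.random() > 0.01 for _ in range(samples)) / samples


def inject_fault() -> None:
    """3. Execution: introduce the hostile condition (e.g. restart an instance)."""
    print("injecting fault: restarting one instance behind the load balancer")


def run_experiment(threshold: float = 0.98) -> None:
    baseline = measure_steady_state()
    print(f"baseline success rate: {baseline:.2%}")
    # 2. Hypothesis: the success rate stays at or above the threshold under the fault.
    inject_fault()
    under_fault = measure_steady_state()
    # 4. Analysis: did the observation disprove the hypothesis?
    if under_fault >= threshold:
        print(f"hypothesis holds ({under_fault:.2%} >= {threshold:.0%}): confidence increased")
    else:
        print(f"hypothesis disproved ({under_fault:.2%} < {threshold:.0%}): weakness found")


if __name__ == "__main__":
    run_experiment()
```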
5 years ago
6 minutes 20 seconds

The DevOps Dojo
The Five Cloud Characteristics
The five cloud characteristics come from the US National Institute of Standards and Technology (NIST) and have been shown by the Accelerate State of DevOps report to drive DevOps performance. In this episode I cover the characteristics and why they matter.
Sources:
State of DevOps 2018: https://services.google.com/fh/files/misc/state-of-devops-2018.pdf
Cloud Characteristics: https://csrc.nist.gov/publications/detail/sp/800-145/final
Transcript:
The Cloud. It is the promised land of IT infrastructure. The magical realm where seemingly infinite resources are available at the click of a button. Where computers appear out of thin air to do our bidding. It is a billion-dollar business, but even those who have invested heavily in using public clouds struggle to reap the benefits. They are still stuck pushing tickets and waiting days, weeks or months for virtual machines. In this episode, I cover the five cloud characteristics and why they matter for our DevOps performance. I'm Johan Abildskov, join me in the dojo to learn.
In this episode, I am going to talk about software infrastructure: in short, computers and the connectivity between them. With a few more words, it is about getting the compute, memory, storage and network resources that we need to run our applications. There are three categories of infrastructure at this level of abstraction. On-premises, or on-prem as it is called, where everything is hosted inside the organizational perimeter. Public cloud, where everything is hosted externally at a provider such as AWS, Azure, Alibaba or Google Cloud Platform. And finally, hybrid cloud, where some workloads are hosted in a public cloud while others are hosted on-prem. Each deployment pattern is valid and has its uses. In the DevOps community, we have a common narrative stating that cloud is superior to on-prem, and sometimes we fall into the trap of forgetting the tradeoffs we are making. My opinion is that while it is difficult to become as high performing on-premises as in the cloud, it is trivial to screw the cloud up just as badly as on-prem. So let's look at the cloud characteristics that drive DevOps performance.
I learned about the five cloud characteristics from the Accelerate State of DevOps 2018 report. They found that organizations that agreed with all five characteristics were 23 times more likely to be high performers. In 2019 that number had increased to 24 times as likely. The characteristics come from NIST. Disregarding the cloud deployment model, they cover the characteristics our infrastructure should have in order for it to be called cloud. This is very valuable in terms of aligning our vocabulary. Without further ado, and there has been much, let's move to the characteristics themselves.
The first is on-demand self-service. That is, consumers can provision the resources they need, as they need them, without going through an approval process or ticketing system. This is the first trap of cloud migrations: if we simply lift and shift our infrastructure but leave the processes in place, we are not going to maximize our gain. Cloud is a powerful tool to shorten feedback loops, build autonomy and allow engineers to make the economic tradeoffs that impact their products. But that is only the case if on-demand self-service is present. The second characteristic is broad network access. This is, to me, the least interesting characteristic, but that might be because I have not been in organizations where this has been a big pain.
This refers to the capabilities of our cloud being generally available through various platforms, so that the cloud capabilities are not hidden from the engineers. The third characteristic is resource pooling. This means that there is a pool of resources, and we as consumers do not control exactly where our workloads go. We can declare properties that we desire for our workloads, such as SSD disks, GPUs or a specific geographic region, but not particular hosts. For on
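To make on-demand self-service concrete, here is a hedged sketch of provisioning a virtual machine with a single API call rather than a ticket, using boto3 (the AWS SDK for Python). The region, AMI ID, instance type and tag are placeholders for illustration.

```python
import boto3


def provision_vm() -> str:
    """Provision one VM on demand; no approval queue or ticketing system involved."""
    ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder image ID
        InstanceType="t3.micro",          # placeholder instance type
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "devops-dojo"}],
        }],
    )
    instance_id = response["Instances"][0]["InstanceId"]
    print(f"provisioned {instance_id} without waiting days for a ticket")
    return instance_id


if __name__ == "__main__":
    provision_vm()
```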
5 years ago
5 minutes 37 seconds

The DevOps Dojo
Trailer - Introducing the DevOps Dojo
Welcome to the DevOps Dojo. This is where we learn. This episode introduces the DevOps Dojo and its host, Johan Abildskov.
5 years ago
39 seconds
