Deliver Infrastructure and Software running on it Rapidly and Reliably at Scale
Notes:
1. A certain level of organizational maturity is needed to use these Principles, Patterns, and Practices. This article does not focus on the cultural side of things, but culture is very important for adopting them successfully.
2. The examples in this article use Terraform and AWS, but these Principles, Patterns, and Practices are generic and can mostly be applied to other IaC tools like Pulumi and CloudFormation, to other Cloud Providers like GCP and Azure, or even On-Premise.
What is Infrastructure as Code?
Infrastructure as Code (IaC) is an approach that takes proven coding techniques used in software development and extends them to infrastructure. It is one of the key DevOps practices that enable teams to deliver infrastructure, and the software running on it, rapidly and reliably, at scale.
If you want to achieve Continuous Delivery for your applications, having a rapid and reliable provisioning mechanism for your infrastructure is important.
In this article, we will go through various Principles, Patterns, and Practices that have helped me, and the organizations I have worked with, over the years.
Key Principles
Before we start talking about Patterns and Practices, let’s look at the key principles for effective IaC.
Idempotency
Idempotency means that no matter how many times you run your IaC, and whatever your starting state is, you will end up with the same end state. This simplifies the provisioning of infrastructure and reduces the chances of inconsistent results.
Idempotency can be achieved by using a stateful tool with a declarative language, like Terraform, where you define the desired end state of infrastructure you want, and then it is Terraform’s job to get you to that end state. If it can’t get to the desired state it will fail.
As diagram #1 below shows, if you run non-idempotent IaC twice, it provisions 6 VMs instead of the desired 3. Idempotent IaC provisions only the 3 VMs even if you run it multiple times, which makes it more reliable and consistent.
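The difference can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Terraform works internally; the function names and the `vm-N` naming scheme are made up for the example.

```python
def provision_non_idempotent(existing_vms, count):
    """Imperative style: always creates `count` new VMs, whatever already exists."""
    start = len(existing_vms)
    return existing_vms + [f"vm-{start + i + 1}" for i in range(count)]


def provision_idempotent(existing_vms, desired_count):
    """Declarative style: converge on `desired_count` VMs from any starting state."""
    if len(existing_vms) >= desired_count:
        # Enough (or too many) VMs already exist: keep only the desired number.
        return existing_vms[:desired_count]
    missing = desired_count - len(existing_vms)
    start = len(existing_vms)
    return existing_vms + [f"vm-{start + i + 1}" for i in range(missing)]


if __name__ == "__main__":
    state = provision_non_idempotent([], 3)
    state = provision_non_idempotent(state, 3)
    print(len(state))  # 6

    state = provision_idempotent([], 3)
    state = provision_idempotent(state, 3)
    print(len(state))  # 3
```

The idempotent version takes the desired end state as input and works out the difference itself, which is why running it again changes nothing.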
Immutability
Configuration drift is a huge problem with infrastructure. It occurs when, over time, changes are made to infrastructure that are not recorded, and your various environments drift from each other in ways that are not easily reproducible. This usually happens if you have mutable infrastructure that lives for a long time. Long-lived infrastructure is also more brittle in general, since issues like a slow memory leak or a disk running out of space due to log accumulation might build up over time. It also means that you won’t be provisioning the infrastructure as frequently as your applications or configuration and, as a result, won’t be confident in your ability to do so. These issues can be resolved by using immutable infrastructure.
Immutable infrastructure means that instead of changing existing infrastructure, you replace it with new infrastructure. By provisioning new infrastructure every time, you make sure it is reproducible and leave no room for configuration drift over time.
Immutable infrastructure also enables scalability when provisioning infrastructure in cloud environments. You can see in diagram #2 below that with mutable infrastructure, v2 of the application is deployed on the same servers as v1, whereas with immutable infrastructure, new VMs are provisioned with v2 of the application.
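As a rough sketch of the two approaches, the Python below models a fleet of VMs and deploys a new version both ways. The `Vm` class, the ID counter, and the function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # hypothetical source of unique VM IDs


@dataclass
class Vm:
    vm_id: int
    app_version: str


def new_vm(app_version):
    """Provision a fresh VM with the given application version baked in."""
    return Vm(next(_next_id), app_version)


def deploy_mutable(fleet, app_version):
    """Mutable: change the existing VMs in place, so their history accumulates."""
    for vm in fleet:
        vm.app_version = app_version
    return fleet


def deploy_immutable(fleet, app_version):
    """Immutable: build a replacement fleet; the old VMs are simply discarded."""
    return [new_vm(app_version) for _ in fleet]


if __name__ == "__main__":
    fleet = [new_vm("v1") for _ in range(3)]
    same_vms = deploy_mutable(fleet, "v2")
    fresh_vms = deploy_immutable(same_vms, "v3")
    print([vm.vm_id for vm in same_vms] == [vm.vm_id for vm in fleet])  # True
    print({vm.vm_id for vm in fresh_vms} & {vm.vm_id for vm in fleet})  # set()
```

In the immutable case, no VM outlives a deployment, so there is nothing for undocumented changes to accumulate on.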
Patterns and Practices
Everything in Source Control
It goes without saying that everything should be in source control. This includes even a script that you run occasionally to fix an issue, as well as the pipeline used to provision your infrastructure and deploy your software (pipeline as code). I have been at places where no one knows where a script that runs in production lives, who created it, or how it has changed over time. This is a situation you don’t want to be in.
Make sure the code is accessible to everyone in the company, even to developers who don’t make changes to the IaC code base, unless you have a very strong reason not to, such as a legal one. This gives visibility and a better understanding to those who run their applications on your infrastructure. When your consumers want to troubleshoot an issue, they should be able to look at the code, understand how the infrastructure was provisioned, and even contribute if they choose to do so. I have worked with teams where the IaC repositories were not only inaccessible to the rest of the organization but were also stored in separate source control tools that kept them hidden. This is an anti-pattern.
Modularize and Version
Modularizing IaC like software code helps with maintenance, readability, and ownership. It also helps in keeping changes small and independently deployable. Refactoring IaC is relatively difficult compared to software, especially for critical pieces like DNS records, CDN, Network, and Databases, so being biased towards over-abstraction up-front works better, even though that means dealing with slightly more complex IaC than needed.
In organizations that have different teams like Networking, Security, and Platform Engineering¹, it might make sense to separate the various layers of your infrastructure and give ownership to the appropriate teams to allow better control. I have also separated the layers in cases where a cross-functional team manages both software and infrastructure, for the other reasons mentioned above. Diagram #3 below shows an example for deployments to Amazon Elastic Kubernetes Service (EKS), with the various modules within each infrastructure layer and their ownership. These modules/layers might be different for you based on the setup you have.
Versioning modules is important to make sure you are not breaking things in production. The exception is a monorepo, in which case you are always using the latest version from the same repo.
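To illustrate why pinning matters, here is a small Python sketch of how a consumer might resolve a module version. It is not Terraform’s resolution logic; `parse` and `resolve` are hypothetical helpers, and versions are assumed to be plain dotted numbers.

```python
from typing import List, Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(available: List[str], pin: Optional[str]) -> str:
    """Return the pinned version, or the latest published one if there is no pin."""
    if pin is None:
        return max(available, key=parse)
    if pin not in available:
        raise ValueError(f"pinned version {pin} has not been published")
    return pin


if __name__ == "__main__":
    published = ["1.0.0", "1.2.0"]
    print(resolve(published, "1.2.0"))  # 1.2.0
    print(resolve(published, None))     # 1.2.0

    published.append("2.0.0")           # a breaking release is published
    print(resolve(published, "1.2.0"))  # 1.2.0
    print(resolve(published, None))     # 2.0.0
```

The pinned consumer is unaffected by the breaking release and can upgrade deliberately, while the unpinned consumer picks it up on its next run.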
Documentation
With IaC, you should not need extensive documentation since everything is codified, but some documentation is still essential. Better quality documentation not only helps the team that is maintaining IaC but the consumers of the infrastructure as well.
Documentation is hard. Just like with code, making sure that you have just enough documentation to convey the message you want is important. More documentation doesn’t mean it’s better. Out-of-date documentation is even worse.
Making the documentation easily available when needed is important. For example, if you display an error message, it’s a good idea to include a link to the documentation for troubleshooting that kind of issue. Also, having runbooks for typical scenarios helps with troubleshooting during production issues.
The documentation should live close to the code. You are more likely to keep it updated if it is with the code or close to it. For example, adding a README in the same repository as the IaC is better than using some external place like Confluence or a wiki outside the repository. This way, you are more likely to remember to update the documentation in the same commit as the code changes, and it can also serve as a reminder during the Pull Request process. If you can generate documentation from your code or use tests as documentation, that’s ideal.
Testing
Like in software development, you need to think about testing your IaC at various levels. If you are not aware of the Test Pyramid, here is an article on Martin Fowler’s website. Below is my attempt at a test pyramid for IaC.
The idea here is that as you go up the test pyramid, the tests are more costly, more brittle, take longer to run, and require higher maintenance. So for these reasons and to get faster feedback, you should run tests at the bottom of the pyramid as often as possible, and the tests at the top, less frequently.
Static Analysis: As this is the quickest way to get feedback, you should run it as often as possible, even on your machine. There are integrations to do this automatically when you save a file in your text editor or IDE. You can do static analysis using tools like terraform validate or TFLint.
Unit Testing: Since most of the tools (like Terraform and Ansible) are declarative, unit testing is usually not needed for IaC. In some cases, though, unit tests might be helpful, such as when you have conditionals or loops. If you are writing bash scripts, you can do unit testing using bats, or if you are using Pulumi, which supports languages like TypeScript, Python, Go, or C#, you can use the language’s test framework.
Integration Testing: This is when you provision your resources in an environment and verify whether you have met certain requirements. Remember not to write tests for things that your tool is responsible for, especially if you are writing declarative code. For example, you should write automated tests to make sure that none of your S3 buckets are public, rather than verifying whether the policies specified in IaC were applied. Another example would be to test that only certain ports are open across all of your EC2 instances. You can also provision an Ephemeral² environment (that you can tear down later) to run these tests. Depending on how long these tests take, you might want to run them after every commit or as nightly builds. Tools like Chef InSpec and goss are helpful for these types of tests.
Smoke Test with Dummy Application: Last but not least, you can test by provisioning an environment, deploying a dummy application, and running quick smoke tests to verify that the application was deployed correctly. Use the dummy application to exercise the scenarios your real application would encounter, without it being configured for production. For example, if your apps connect to a database that’s externally hosted, you should try connecting to it from your dummy application. This gives you confidence that the infrastructure you are provisioning allows you to run the applications you intend to run on it. Since these are slow tests, you can run them after a new environment is provisioned and then periodically.
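The two integration-test examples above (no public buckets, only certain ports open) can be sketched as plain Python checks. This is a minimal illustration, assuming the resource descriptions have already been fetched from the cloud provider into dictionaries; the field names (`name`, `public`, `id`, `open_ports`) and the allowed-port policy are hypothetical.

```python
ALLOWED_PORTS = {22, 443}  # hypothetical policy for this example


def public_buckets(buckets):
    """Return the names of buckets that are publicly accessible."""
    return [bucket["name"] for bucket in buckets if bucket.get("public", False)]


def unexpected_open_ports(instances, allowed=ALLOWED_PORTS):
    """Map each offending instance ID to its open ports that are not allowed."""
    findings = {}
    for instance in instances:
        extra = sorted(set(instance["open_ports"]) - set(allowed))
        if extra:
            findings[instance["id"]] = extra
    return findings


if __name__ == "__main__":
    buckets = [
        {"name": "app-logs", "public": False},
        {"name": "static-assets", "public": True},
    ]
    instances = [
        {"id": "i-web", "open_ports": [443]},
        {"id": "i-debug", "open_ports": [22, 443, 8080]},
    ]
    print(public_buckets(buckets))           # ['static-assets']
    print(unexpected_open_ports(instances))  # {'i-debug': [8080]}
```

Note that both checks assert an outcome across everything that was provisioned, rather than re-checking that a specific policy from the IaC was applied.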
Security and Compliance
Making sure your infrastructure, and the applications running on it, are secure and compliant is an important but often overlooked aspect of IaC. Traditionally, a lot of organizations have manual checks and gates for this that are time-consuming and usually happen at a later stage in the deployment cycle, but with IaC you can automate them to provide better security/compliance and run them more frequently and sooner in the cycle.
Here are some of the aspects to consider:
Identity and Access Management: Make sure you have robust Identity and Access Management in place for your IaC and the infrastructure it provisions. Using Role-Based Access Control (RBAC) for the IaC that provisions the infrastructure helps in reducing the overall attack surface. With RBAC, you grant just enough permission to your IaC to perform the operations it’s supposed to perform.
Secrets Management: IaC usually needs secrets to provision any infrastructure. For example, if you are provisioning resources in AWS, you will need AWS credentials to connect to it. Make sure you use a reliable secrets management tool like HashiCorp Vault or AWS Secrets Manager.
If you need to output or store any secrets in the state file (though you should try to avoid this), make sure they are encrypted so that if someone gets hold of the state file, they can’t extract the secrets from it.
Security Scanning: Running security scans after provisioning or changing infrastructure in a Lower or Ephemeral environment helps in avoiding security issues in Production. Using CIS Benchmarks and tools like Amazon Inspector helps with finding common vulnerabilities/exposures and also makes sure that security best practices are followed.
Compliance: Many companies have compliance requirements, and they are stricter if you work in the Healthcare or Financial domain. I’m sure you are aware of some, if not all, of these: HIPAA, PCI, GDPR, and SOX. As mentioned above, compliance teams traditionally did all the checks and paperwork manually. Using tools like Chef InSpec or HashiCorp Sentinel to automate these compliance requirements helps in running the checks more frequently and finding issues much faster. For example, you can run these compliance tests every time you change your IaC by provisioning an Ephemeral² environment, so you find out whether there are any issues with the new code before going to production.
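As one example of automating a check like the "just enough permission" rule above, the Python sketch below flags wildcard actions in a policy written in the AWS IAM JSON policy format. It is only an illustration of the idea, not a replacement for tools like Chef InSpec or Sentinel; the function name and the rule itself (flag `*` and `service:*` in Allow statements) are assumptions made for this example.

```python
def wildcard_actions(policy):
    """Return the wildcard actions granted by Allow statements in a policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may contain a single statement instead of a list.
        statements = [statements]

    found = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        found.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return found


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"},
            {"Effect": "Allow", "Action": ["ec2:*", "iam:PassRole"], "Resource": "*"},
        ],
    }
    print(wildcard_actions(policy))  # ['ec2:*']
```

A check like this can run on every change to the IaC and fail the build when the list is not empty.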
Automate Execution from a Shared Environment
All the steps mentioned above should be brought together, with the IaC executed in a defined sequence and with appropriate checks, so that you can provision infrastructure in various environments with confidence. For this, I’m going to talk about two options below.
Infrastructure as Code Pipeline
Below is an example that demonstrates a typical sequence of steps in an IaC pipeline. I have used CircleCI in the example, but you can use any pipeline tool to execute this. The pipeline provides visibility to everyone who depends on the infrastructure that gets provisioned and notifies the appropriate teams when there is a failure.
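Independent of the pipeline tool, the core behavior is a fixed sequence of stages that stops at the first failure. The Python sketch below shows that behavior only; it is not CircleCI configuration, and the stage names are assumptions drawn from the practices described earlier rather than the exact steps of the pipeline mentioned above.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Each step is a callable returning True on success. Returns a log of outcomes.
    """
    log = []
    for name, step in stages:
        passed = step()
        log.append(f"{name}: {'passed' if passed else 'FAILED'}")
        if not passed:
            break  # later stages, such as the production apply, never run
    return log


if __name__ == "__main__":
    stages = [
        ("static analysis", lambda: True),
        ("provision ephemeral environment", lambda: True),
        ("integration and compliance tests", lambda: False),
        ("apply to production", lambda: True),
    ]
    for line in run_pipeline(stages):
        print(line)
    # static analysis: passed
    # provision ephemeral environment: passed
    # integration and compliance tests: FAILED
```

Ordering the cheap checks first gives fast feedback, and the production step is reached only when every earlier check has passed.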
GitOps
Another way of executing IaC is by using GitOps, which extends IaC and adds a workflow (the Pull Request process) for applying a change to Production, or to any other environment for that matter. It can also have a control loop that periodically verifies that the actual state of the infrastructure is the same as the desired state. For example, if any changes were made directly to the infrastructure, it reverts them to the desired state as per the source control. GitOps can be used instead of the IaC pipeline described above. For more on GitOps, you can read the documentation on the Weaveworks website here.
GitOps = IaC + (Workflow + Control Loop)
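A single pass of such a control loop can be sketched in Python as follows. This is a simplified illustration, assuming both the desired state (from source control) and the actual state are flat dictionaries; the function names and the example keys are hypothetical, and real GitOps tools work on much richer resource models.

```python
def detect_drift(desired, actual):
    """Return {key: (actual_value, desired_value)} for every key that differs."""
    keys = desired.keys() | actual.keys()
    return {
        key: (actual.get(key), desired.get(key))
        for key in keys
        if actual.get(key) != desired.get(key)
    }


def reconcile(desired, actual):
    """One pass of the control loop: revert any drift to the desired state."""
    new_state = dict(actual)
    for key in detect_drift(desired, actual):
        if key in desired:
            new_state[key] = desired[key]  # restore the value from source control
        else:
            del new_state[key]             # remove what source control doesn't declare
    return new_state


if __name__ == "__main__":
    desired = {"instance_type": "t3.micro", "port": 443}
    actual = {"instance_type": "t3.large", "port": 443, "debug_port": 8080}
    print(reconcile(desired, actual) == desired)  # True
```

In practice this pass would run on a schedule, and the workflow half of GitOps ensures that `desired` only changes through reviewed Pull Requests.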
Conclusion
Thanks for reading the article; I hope you find it useful. If you know of other IaC Principles, Patterns, or Practices that could be added to this article, or have any questions, let me know in the comments below or reach out to me on Twitter and I will look into it.
Acknowledgments: Matt Kuritz, Michael Wytock and Arielle Sullivan read the draft version of this article and provided feedback to improve it.
[1]: The Platform Engineering Team mentioned in diagram #3 is responsible for operating a platform that enables delivery teams to self-service deploy and operate systems with reduced lead time and complexity. For more, read the article here, watch the talk Priyanka Rao and I gave at DevOpsDays Edinburgh here, or look at the deck here.
[2]: An Ephemeral environment is one that is provisioned on demand when needed and then destroyed afterward. It is a useful technique for testing your IaC and the applications running on it without needing to keep the environment running all the time, thereby saving costs.
Also published on Medium