The Simian Army is a collection of open source testing tools for AWS public cloud implementations, created by Netflix. The core concept of the Simian Army is Chaos Engineering, made famous through ‘Chaos Monkey’, which uses random failure injection to help ensure resilient and recoverable architectures. The suite of tools is broader than that, however, and extends to cost saving and conformity.
This is an approach we recommend for financial services firms hosting on public cloud. Below, we share some experiences from implementing similar services on Azure for large-scale banks, with an emphasis on resilience, compliance, security and cost control.
When it comes to public cloud, adoption by the financial services industry has lagged behind other sectors by several years. Financial services firms are heavily regulated and scrutinized, and their regulators and customers demand a higher standard of data privacy and security. Applicable regulations include, to name a few, Dodd-Frank, FFIEC, PCI DSS, GLBA, SOX, the USA PATRIOT Act, and GDPR. In recent years, major public Cloud Service Providers (CSPs) such as Microsoft Azure and Amazon Web Services (AWS) have worked closely with financial services firms to ensure that CSP agreements and functionality meet financial security and privacy requirements. Since then, firms have been adopting public cloud services alongside their on-premises datacenters to build hybrid cloud solutions. According to Gartner analysts’ research, about 36% of financial enterprises will use the cloud to support most of their transactional systems of record in 2020. Worldwide public cloud service revenue will reach $214.3B in 2019 and $331.2B by 2022.
With the growing usage of public cloud in the financial services industry, the importance of a system that continually checks and tests those resources has also grown, hence the recommendation to use tools like the Simian Army.
Simian Army for the Financial Services Industry
Typically, Simian Army tools can execute actions at either the CSP level or the resource level. “CSP level” activities are those that an administrator could perform from a CSP’s web console or via a CSP’s API, and include things like resource creation and destruction. “Resource level” actions are those that an administrator could perform from within a resource context (e.g. commands from within a secure shell session to a virtual machine or network device). We advise that monkeys only be allowed to execute actions at the same levels and through the same mechanisms as those permitted to administrators. At the CSP level, some of our clients implement a cloud abstraction layer and force all cloud actions through that layer, while preventing anyone from using either the CSP’s console or its native APIs. At the resource level, we recommend that our most security-conscious clients lock down resources at the moment of deployment, following an “immutable infrastructure” model, and not allow any direct user interaction with cloud resources. This means no one can SSH or RDP into servers or network devices. When we implement these types of restrictions for individuals, we hold the Simian Army’s monkeys to the same restrictions.
While there are strong beliefs both for and against the use of an abstraction layer for public cloud, many financial companies have chosen to implement such a layer. Cloud abstraction layers often incorporate well-known Infrastructure as Code (IaC) tools like Terraform, Cloudify, or Pulumi to minimize overall coding complexity. The Simian Army should execute its operations through the same pathway as other cloud applications. If an API abstraction layer has been implemented, Simian Army monkeys should leverage that layer.
Simian Army Integrated with the Pipeline
To ensure that applications are “cloud ready” before they are deployed for production use in the cloud, it is essential to integrate Simian Army into every environment (e.g. dev, QA, prod) across your application delivery pipeline. Surviving and complying with the Simian Army are part of the criteria for promotion to the next environment. This full pipeline integration aligns with the shift-left approach recommended for agile development. It allows developers to iterate and correct quickly during the earliest phases, rather than learning of architectural weaknesses just before or just after production release. We recommend implementing a global minimum and a de facto standard for the rules associated with each monkey. Encourage application teams to submit rules for each application prior to cloud development, with the understanding that those values cannot fall below the minimums. Applications without specific rules are held to the de facto standard.
Introducing… The Monkeys
With the exception of Chaos Monkey, all the monkeys follow a similar general pattern. They pull resource information from the cloud and evaluate it against a predefined set of rules. Then, each monkey takes action based on the results of its evaluation. Actions generally fall into one of the following types: notify, sequester, shut down, or destroy. The monkeys are NOT used to correct the configuration of resources that violate one or more rules. Prioritization rules are used to prevent conflicts between monkeys.
Chaos Monkey functions differently from the other members of the Simian Army. It is a tool that randomly disables or disrupts resources to make sure an application can survive common types of failure without customer impact. When resource-level actions are allowed, this could include running processes with memory leaks, erratic CPU consumption, and intermittent network disruption. More commonly for finance industry use cases, actions are limited to the CSP level. APIs, either directly from the CSP or from an abstraction layer, are used to disable resources/instances and test the resiliency of the application’s architecture. For example, using Azure’s APIs, we can test an application’s network resilience by detaching and reattaching a virtual network (VNet). Cloud-ready applications are usually deployed across several distributed groups (aka clusters) rather than on a single machine, so we need to be careful when choosing which resources/instances to disable. Before targeting machines, we query their metadata to make sure we don’t disable all nodes in the same cluster; otherwise the application will go down without the test achieving its goal. There are many ways to test resilient architecture in a public cloud environment on different layers; we welcome your comments and ideas.
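The cluster-aware selection described above can be sketched as follows. This is a minimal illustration, not Chaos Monkey's actual implementation: the `id` and `cluster` keys are hypothetical stand-ins for whatever metadata the CSP or the abstraction layer returns, and the function only chooses targets without calling any cloud API.

```python
import random
from collections import defaultdict

def pick_chaos_targets(instances, fraction, rng=None):
    """Pick instances to disable while leaving at least one node per cluster.

    `instances` is a list of dicts with hypothetical "id" and "cluster" keys.
    `fraction` is the share of all instances we would like to disable.
    """
    if rng is None:
        rng = random.Random()

    # Group instance IDs by the cluster they belong to.
    by_cluster = defaultdict(list)
    for inst in instances:
        by_cluster[inst["cluster"]].append(inst["id"])

    # Every node except one randomly chosen survivor per cluster is a candidate.
    candidates = []
    for ids in by_cluster.values():
        shuffled = ids[:]
        rng.shuffle(shuffled)
        candidates.extend(shuffled[1:])  # shuffled[0] is protected

    # Aim for the requested share (at least one), capped by what is safe.
    wanted = max(1, int(len(instances) * fraction))
    return rng.sample(candidates, min(wanted, len(candidates)))
```

A single-node cluster therefore never loses its only node, and the function returns an empty list when no instance can be disabled safely.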
Conformity Monkey finds resources that don’t adhere to best practices and takes action. This component focuses on enforcing a set of compliance rules defined by a compliance team. For example:
- Virtual machines need to belong to an auto-scaling group
- At least two machines in each group
- All resources/instances need to be tagged properly
- Certain data cannot be in the public cloud for more than N days
The cloud provider’s policies can enforce some rules for us, so we verify that those policies are in place. We can define all those rules in a semi-structured data format, such as JSON or YAML. I’ll cover more about rules design later in this blog. Using a scheduler, we frequently query resources/instances in the public cloud and compare the retrieved information with the appropriate rules.
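As a rough sketch of the compare step, the function below checks virtual machine metadata against three of the example rules above: group membership, minimum group size, and tagging. The dictionary keys (`scaling_group`, `tags`) and the required tag names are assumptions for illustration; a real monkey would take them from the rule files and from the CSP's resource queries.

```python
REQUIRED_TAGS = {"owner", "application_id", "environment"}  # hypothetical tag set
MIN_GROUP_SIZE = 2  # "at least two machines in each group"

def find_violations(vms):
    """Return {vm_id: [violation, ...]} for a list of VM metadata dicts."""
    # Count how many machines each auto-scaling group contains.
    group_sizes = {}
    for vm in vms:
        group = vm.get("scaling_group")
        if group:
            group_sizes[group] = group_sizes.get(group, 0) + 1

    violations = {}
    for vm in vms:
        found = []
        group = vm.get("scaling_group")
        if not group:
            found.append("not in an auto-scaling group")
        elif group_sizes[group] < MIN_GROUP_SIZE:
            found.append("group has fewer than %d machines" % MIN_GROUP_SIZE)
        missing = REQUIRED_TAGS - set(vm.get("tags", {}))
        if missing:
            found.append("missing tags: " + ", ".join(sorted(missing)))
        if found:
            violations[vm["id"]] = found
    return violations
```

The function only reports violations. Deciding whether to notify, sequester, shut down or destroy is left to the action attached to each rule.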
Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused or expired resources and disposes of them. For this component, we can define expiration or lifespan rules and apply them to resources. For example, if Janitor Monkey finds a virtual machine which is eight days old, and the maximum lifespan defined for virtual machines running that application is seven days, we delete it. Exceeding lifespan is easy to detect. While identifying idle resources is more challenging, it is doable. For Azure resources, we can query various logs to determine and verify inactivity. Keeping a cloud environment clean is the goal for Janitor Monkey.
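The lifespan check is simple enough to sketch in a few lines. The example assumes each resource record carries a timezone-aware `created_at` timestamp and an `application` name, and that lifespans are configured per application with a fallback default; all of these names and values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lifespan rules in days, keyed by application name.
MAX_LIFESPAN_DAYS = {"default": 30, "batch-reports": 7}

def expired_resources(resources, now=None, lifespans=MAX_LIFESPAN_DAYS):
    """Return the IDs of resources older than their application's lifespan."""
    if now is None:
        now = datetime.now(timezone.utc)
    expired = []
    for res in resources:
        # Fall back to the default lifespan when the application has no rule.
        days = lifespans.get(res.get("application"), lifespans["default"])
        if now - res["created_at"] > timedelta(days=days):
            expired.append(res["id"])
    return expired
```

With a seven-day limit, a virtual machine created eight days ago is returned as expired, while one created six days ago is not.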
Doctor Monkey taps into health checks that run by default on many resources, and monitors other signs of health (e.g. CPU load), to detect unhealthy instances. In Doctor Monkey’s rules we establish thresholds, that is, upper limits for CPU, hard drive, memory, and bandwidth usage. Any instance that exceeds a limit for a specified duration is flagged for more in-depth diagnosis. We get those metric values by querying the monitoring logs. Constantly checking resource health is key to improving the availability and reliability of an application.
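One way to express "exceeds the limit for a specified duration" is to require a run of consecutive samples above the threshold. The sketch below assumes metrics arrive as evenly spaced samples per instance; the metric names and limits are placeholders, not recommended values.

```python
# Hypothetical upper limits, in percent.
THRESHOLDS = {"cpu_percent": 90, "memory_percent": 85}

def sustained_breach(samples, limit, min_consecutive):
    """True if `limit` is exceeded by at least `min_consecutive` samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= min_consecutive:
            return True
    return False

def unhealthy_instances(metrics, thresholds=THRESHOLDS, min_consecutive=3):
    """Map each flagged instance ID to the sorted list of metrics it breached."""
    flagged = {}
    for instance_id, series in metrics.items():
        breached = sorted(
            name for name, limit in thresholds.items()
            if sustained_breach(series.get(name, []), limit, min_consecutive)
        )
        if breached:
            flagged[instance_id] = breached
    return flagged
```

Requiring a sustained run rather than a single sample keeps short spikes from flagging an otherwise healthy instance.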
Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities. For the finance industry, security plays perhaps the most important role when building an application. Small security flaws can be expensive to fix, and the reputational damage may be irreparable. It is critical to continually check that public cloud security features are in place. Examples follow:
- Virtual machines may open only ports approved by the security team
- All virtual networks should be connected to the company’s on-premises network using a VPN gateway
- No instance/resource should have a public IP assigned by the cloud provider
- All data at rest in storage needs to be encrypted with the company’s key
- Use a key vault for encryption key management
Each type of resource should have its own security controls in place, and we also need to make sure security has been implemented in every layer of the cloud service. Some cloud providers, such as Microsoft Azure, also provide a built-in advanced threat protection tool, which can detect a variety of suspicious user activities, such as abnormal login locations, brute force attacks, and suspicious authentication failures. Once we enable it, we can use Security Monkey to regularly retrieve those alerts and send them to stakeholders.
We can always create new monkeys by adding more components to the Simian Army. In some cases, one monkey may be sufficient. For example, Chaos Monkey can be used on its own to ensure resilient design and implementation. In other cases, additional, more sophisticated monkeys may be appropriate. An AI Monkey, for example, could use machine learning algorithms to analyze log data and find performance patterns in cloud instances/resources. There is no limit to the monkeys that can be created. Simian Army is a concept that groups different components (monkeys) together to test applications and cloud environments, ensuring that non-functional requirements remain satisfied.
The Monkey Rules
For management and auditing purposes, the design of the rules used by each monkey is tailored to that monkey. We can then group rules by monkey type and put them into one repository. By doing so, security, auditing, management, and operations teams can easily design and review the rules that are in place, and developers can take advantage of a rules engine to parse rules quickly. The YAML and JSON file formats are a good fit for rule design since both have human-readable structures. Here is an example of a Chaos Monkey rule using YAML:
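A rule of this kind can be sketched as follows. It is shown here as a Python dictionary, the structure a YAML or JSON parser would return for the equivalent file. The field names, the application ID and the status URL are assumptions made for illustration and are not taken from an actual rule file. The helper shows how a failure percentage with a floor of one instance translates into a concrete count.

```python
# Hypothetical Chaos Monkey rule; every key name is illustrative only.
CHAOS_RULE = {
    "monkey": "chaos",
    "environment": "prod",
    "application_id": "app-1234",                 # placeholder ID
    "failure_percentage": 5,                      # share of instances to take down
    "minimum_instances": 1,                       # floor when the share rounds to 0
    "health_check_url": "https://example.com/status",  # placeholder endpoint
    "on_failure": "notify",
}

def instances_to_disable(rule, total_instances):
    """Number of instances to take down for a group of `total_instances`."""
    # Integer arithmetic avoids floating-point rounding surprises.
    by_percentage = (total_instances * rule["failure_percentage"]) // 100
    wanted = max(rule["minimum_instances"], by_percentage)
    # Never ask for more instances than the group contains.
    return min(total_instances, wanted)
```

For a group of 10 instances the percentage rounds down to zero, so the minimum of one applies; for a group of 100 the result is five.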
Looking at the rule definition above, the purpose of this rule is as follows: if 5% of the deployed instances in a group go down at random (with a minimum of 1 instance if totalNumberOfInstances * 5% is less than 1), the application that depends on those instances should stay alive with no impact on users. We use the application’s status endpoint, provided by the application team, to verify its health. If the application fails to respond or responds with an error, we send notice to all appropriate parties. This particular rule applies to the “prod” environment. More aggressive chaos rules may be used in development and QA environments. Operations teams can easily invoke this rule to test application resiliency, or the rule can be invoked automatically on a schedule. Chaos Monkey will pick it up and apply it to the specific virtual machine instances by querying application ID metadata. You can apply a similar design to other monkeys. Here’s an example for Security Monkey:
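A Security Monkey rule in the same style might look like the sketch below, again written as the Python dictionary a YAML parser would produce. The key names and the matching logic are assumptions for illustration only. The rule carries no application ID, so the matcher applies it to every virtual machine.

```python
# Hypothetical Security Monkey rule; key names are illustrative only.
SECURITY_RULE = {
    "monkey": "security",
    "resource_type": "virtual_machine",
    "condition": {"has_public_ip": True},
    "action": "destroy",
}

def matching_resources(rule, resources):
    """Return (resource_id, action) pairs for resources that violate `rule`."""
    matches = []
    for res in resources:
        if res.get("type") != rule["resource_type"]:
            continue
        # A rule without an application ID applies to every application.
        app_id = rule.get("application_id")
        if app_id is not None and res.get("application_id") != app_id:
            continue
        # Every key in the condition must match the resource's metadata.
        if all(res.get(key) == value for key, value in rule["condition"].items()):
            matches.append((res["id"], rule["action"]))
    return matches
```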
The above rule tells us to destroy any virtual machine that has a public IP assigned to it. Notice how its parts differ from those of the previous Chaos Monkey rule. This rule doesn’t have an application ID, which means it will be applied to all virtual machines regardless of their metadata. We can use a hierarchical design in rule files. For instance, we can set up some entries as Master Rules and others as Project Rules. Master Rules are in place by default when we test a new deployment. Then, we can define when, if ever, a Project Rule is allowed to override a Master Rule.
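The Master Rule and Project Rule hierarchy can be sketched as a merge in which each Master Rule declares whether it may be overridden. The `overridable` flag and the rule names below are hypothetical; they show one possible way to encode "when, if ever" an override is allowed.

```python
# Hypothetical master rules; "overridable" controls whether projects may change them.
MASTER_RULES = {
    "max_vm_lifespan_days": {"value": 7, "overridable": True},
    "allow_public_ip": {"value": False, "overridable": False},
}

def effective_rules(master, project):
    """Merge project rules over master rules, honoring the override policy.

    Returns (rules, rejected): the effective values, plus the names of
    project entries that tried to override a locked master rule.
    """
    rules = {name: entry["value"] for name, entry in master.items()}
    rejected = []
    for name, value in project.items():
        if name in master and not master[name]["overridable"]:
            rejected.append(name)  # locked: keep the master value
        else:
            rules[name] = value
    return rules, rejected
```

Returning the rejected names, rather than ignoring them silently, gives reviewers and auditors a record of attempted overrides.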
YAML is a neat fit for defining rules. It has broad language support, including Java, Python, Golang, C#, PHP, and many other programming languages, so developers can easily parse YAML files and convert them to native data structures.
A scheduler helps teams run specific Simian Army tasks at predefined times. For example, we may want Chaos Monkey tasks to run only during working hours. That way, if any failure happens, all stakeholders can respond quickly. We can run different monkeys at different intervals based on their level of importance. We may want Security Monkey to run every 6 hours to make sure all security features are in place. Chaos Monkey, by contrast, need not run every day to test resiliency; it may be enough to run it once when a new rule takes effect. A scheduler can be used to set up one-time jobs, repeated jobs, or even tasks invoked after a triggering event.
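The two scheduling checks mentioned above, a working-hours window and a repeat interval, can be sketched with the standard library alone. The 09:00 to 17:00 weekday window is an assumption; a production setup would more likely rely on an existing job scheduler than on hand-written checks like these.

```python
from datetime import datetime, time, timedelta

def within_working_hours(moment, start=time(9, 0), end=time(17, 0)):
    """True on Monday to Friday between `start` (inclusive) and `end` (exclusive)."""
    return moment.weekday() < 5 and start <= moment.time() < end

def is_due(last_run, interval, now):
    """True if the task has never run or `interval` has elapsed since `last_run`."""
    return last_run is None or now - last_run >= interval

# Example: Security Monkey every 6 hours, Chaos Monkey only in working hours.
SECURITY_INTERVAL = timedelta(hours=6)
```

A Chaos Monkey task would be started only when `within_working_hours` returns true, while Security Monkey would run whenever `is_due` reports that its six-hour interval has elapsed.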
A Clean, Compliant, Secure and Resilient Cloud Environment
Innovation and agility will lead more and more financial firms to adopt public cloud. In addition, new on-demand PaaS/SaaS services added by cloud service providers can help financial firms use the latest technologies at a relatively low cost. How to use the public cloud effectively while meeting compliance and user requirements will always be a top concern, and the ultimate goal of using Simian Army is to address that concern. We will follow up with separate posts comparing the use of Simian Army to other techniques, such as continuous compliance tools like Redlock and CSP native tooling. I’d like to hear more or different ideas on this topic.