Istio: Canary Deployments, Dynamic Routing & Tracing
Earlier posts in this series introduced Istio and gave an overview of its security features. This post completes the series with a look at how we can leverage Istio’s traffic control features to provide increased observability and control over the operation and deployment of our applications.
Whilst our focus to date has been on security, the transparency Istio provides is its killer adoption-enabling feature. It also unlocks a plethora of operational and networking features that we’ll be exploring in this post. We love continuous delivery (CD), and Istio is the most exciting enabler of CD we’ve seen to date.
In the first article in this series, we contrasted Istio’s approach to network reliability and traffic control with that of Netflix’s Hystrix: Istio makes these features possible through a programmable, policy-driven service mesh that is transparent to the application developer, whilst Hystrix implements them in a networking code library.
Applications running on Istio have all their traffic routed through Istio’s service mesh, which provides policy and observability that are configurable from the Istio control plane. The service mesh is implemented as an Envoy proxy sidecar in every pod. The sidecar receives all inbound traffic and forwards it to the application, and intercepts all outbound traffic. Envoy is dynamically reconfigured with policy from the control plane. The result is a network that is continually enforcing centrally-managed policy.
Figure 1 (Source: https://istio.io/docs/concepts/what-is-istio/)
Istio’s Traffic Control Features
Istio’s ability to enforce policy at any point in the network enables a number of very useful traffic control and observability features, including rate limiting, circuit breaking and programmable rollouts such as canary deployments. The real key to Istio’s power is not simply that it supports these features, but that it allows their configuration to be expressed at Layer 7, the Application Layer. In technical terms this means we can use data in HTTP headers such as host headers, cookies and JWTs (JSON Web Token) to make routing decisions. More helpfully, it means we can express traffic control policies in application and business terms, for example users, microservices and features.
Decoupling traffic flow from infrastructure in this way makes it an application concern, and brings its configuration into the purview of development and product teams, and ultimately the business. These are the people who best understand the needs of their users, and the business context of features and releases. In addition, developers can configure traffic flow rules at an application level, without having to understand the internals of Kubernetes.
Let’s explore some use cases enabled by these features.
Istio is able to handle network failures in the service mesh automatically and transparently, by retrying failed requests within configurable bounds. Configurable parameters include timeout budgets for retries, and jitter to restrict the impact of the increased traffic caused by retries on upstream services. Of course, even the cleverest of automatic retries can’t fully insure us against network failures, so applications will still need to handle 503s returned from Envoy when it stops retrying. The fallacies of distributed systems still apply!
This provides a layer of insulation from non-persistent network outage and congestion, and can improve the user experience of your application. It is disabled by default.
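As a rough illustration, retry behaviour of this kind can be expressed on a VirtualService. The sketch below assumes a service called ratings with a v1 subset (names borrowed from the BookInfo sample purely for illustration), and the retry count and timeouts are arbitrary values rather than recommendations.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1        # assumes a DestinationRule defines this subset
    timeout: 10s          # overall budget for the request, retries included
    retries:
      attempts: 3         # illustrative value
      perTryTimeout: 2s   # budget for each individual attempt
      retryOn: gateway-error,connect-failure,refused-stream
```

Once the attempts or the overall timeout are exhausted, the caller still receives an error, so the application-level handling described above remains necessary.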
A sudden and intense spike in traffic can overload a network and its impact can cascade throughout a microservice system as it struggles to process a backlog of requests, even well after a return to normal traffic levels. Whether malicious or incidental, this has the effect of a denial of service attack.
A service instance that is generating errors (e.g. HTTP 503, “Service unavailable”) above a pre-configured threshold will be removed from the load balancer pool, reducing the chance of requests being routed to unhealthy instances. This is called circuit breaking. It is configured by defining a connection pool with limits on concurrent TCP connections and pending HTTP requests, and is tuned similarly to Kubernetes liveness and readiness checks, in that you can define thresholds for load balancer ejection and readmission.
Related to circuit breaking is rate limiting, the ability for Istio to enforce limits on the rate of requests that match certain criteria. It can be used to ensure that certain kinds of request are not overused, much as a public API service ensures that clients cannot abuse it by exceeding a published rate of requests.
Defining rate limits involves specifying which parameters to count, their maximums, and the window of time in which to enforce the limit. These counts need to be tracked centrally in the cluster to ensure they aren’t exceeded, and therefore rate limiting checks happen in the Mixer on the data path, rather than in Envoy. The Mixer can store these counts in memory (not recommended in production), or in Redis.
Careful configuration of rate limits will ensure fair use of the system by all users.
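To make the three ingredients (what to count, the maximum, and the time window) more concrete, here is a partial sketch in the style of the Mixer memquota configuration. The resource names, limits and the reviews override are assumptions for illustration. The shape of these resources changed between Istio releases, so the field names should be checked against the documentation for your version. A working setup also needs a QuotaSpec, a QuotaSpecBinding and a rule to tie the pieces together, which are omitted here.

```yaml
# Handler: where counts are stored and what the limits are.
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: quotahandler               # hypothetical name
  namespace: istio-system
spec:
  compiledAdapter: memquota        # in-memory counts; not for production
  params:
    quotas:
    - name: requestcountquota.instance.istio-system
      maxAmount: 500               # default: 500 requests ...
      validDuration: 1s            # ... per one-second window
      overrides:
      - dimensions:
          destination: reviews
        maxAmount: 1               # reviews: 1 request ...
        validDuration: 5s          # ... per five-second window
---
# Instance: which request attributes are counted.
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcountquota          # hypothetical name
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      source: request.headers["x-forwarded-for"] | "unknown"
      destination: destination.labels["app"] | destination.service.name | "unknown"
```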
Istio’s traffic routing can be used for A/B testing, the testing of new features by sending a subset of customer traffic to instances with the new feature and observing telemetry and user feedback. Although commonly used for user interfaces, A/B testing can also be employed for microservices. Istio can be configured to direct traffic based on a percentage weight, cookie value, query parameter and HTTP headers, to name a few.
Use cases for microservice A/B testing might include trying out new features for a subset of users or geographical regions, or testing an update on a reduced scale before complete rollout: if a significant number of errors are detected on the new version then the rollout can be reverted and all traffic sent to the incumbent service. Testing new features via A/B testing can still expose a sizeable group of users to a faulty release; canary releases are a safer option.
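A header-based split of this kind might look like the following sketch. It assumes a reviews service with v1 and v2 subsets and an end-user request header, as in the BookInfo sample; the header name and value are illustrative only.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason      # illustrative user; only this user sees v2
    route:
    - destination:
        host: reviews
        subset: v2
  - route:                  # everyone else stays on the incumbent version
    - destination:
        host: reviews
        subset: v1
```

Rules are evaluated in order, so the unconditional route at the end acts as the default for requests that do not match.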
Figure 2 (Source: https://istio.io/docs/concepts/traffic-management/)
Canary releases could be considered a special case of A/B testing, in which the rollout happens much more gradually. The name alludes to the canary in the coal mine. A canary release begins with a “dark” deployment of the new service version, which receives no traffic. If the service is observed to start healthily, a small percentage of traffic (e.g. 1%) is directed to it. Errors are continually monitored and continued health is rewarded with increased traffic, until the new service is receiving 100% of the traffic and the old instances can be shut down. The ability to safely perform canary releases rests upon reliable and accurate application health checks.
Istio traffic routing configuration can be used to perform canary releases by programmatically adjusting the relative weighting of traffic between service versions. Writing a control loop to observe service health and adjust weighting as part of a canary deployment is left to the user, although Flagger and Theseus are designed for this purpose.
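One step of such a rollout could be sketched as below, assuming a reviews service whose pods carry version: v1 and version: v2 labels (an assumption for illustration). A control loop would then re-apply the VirtualService with progressively larger weights for v2.

```yaml
# Subsets map version labels on pods to named destinations.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# First canary step: 1% of traffic to the new version.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 99
    - destination:
        host: reviews
        subset: v2
      weight: 1       # weights across the routes must total 100
```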
By default, Istio does not permit connections to services outside the mesh. However, it provides two in-mesh ways to define outbound connections to a permitted URL. Controlling egress using URLs provides an advantage over legacy firewalling techniques using static IP ranges, as modern web services are increasingly hosted using IPs that change regularly.
An example use case would be a microservice that performs read-write actions on a database. The database is hosted outside the mesh by the underlying cloud provider and the IP for the database is not static. Thus, egress needs to be configured such that the microservice can form a connection to the service outside the mesh. This configuration needs to be URL based so that the outbound connection is limited to the specific service, instead of having to whitelist the entire published IP range of the cloud provider, which presents security concerns.
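A hostname-based egress rule for this scenario might be sketched with a ServiceEntry. The hostname below is hypothetical, and the sketch assumes the external service is reached over TLS on port 443; a real database endpoint will likely use a different port and protocol.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-db               # hypothetical name
spec:
  hosts:
  - orders-db.example.com         # hypothetical hostname of the managed service
  location: MESH_EXTERNAL         # the service lives outside the mesh
  ports:
  - number: 443                   # assumed port
    name: tls
    protocol: TLS
  resolution: DNS                 # resolve the hostname rather than pin IPs
```

Because the rule names a host rather than an address range, it continues to apply when the provider changes the underlying IPs.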
Figure 3 (Source: https://istio.io/docs/concepts/traffic-management/)
An Egress Gateway (see Figure 3) is a dedicated Istio proxy through which all egress traffic passes - a single exit point from the mesh. The use of a gateway enables supplementary controls, such as using Kubernetes network policy, which can be configured to restrict all egress from the cluster except for traffic originating from the Egress Gateway. Thus, configuring an Egress Gateway with supplementary controls is seen as the most secure way of managing egress from the Istio service mesh.
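The supplementary Kubernetes control mentioned above could take roughly the following form. This is a sketch under several assumptions: the application runs in the default namespace, the istio-system and kube-system namespaces carry the labels shown (they would need to be added), and the cluster's network plugin enforces NetworkPolicy. It allows the whole istio-system namespace rather than only the gateway, since sidecars also need to reach the control plane there.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-via-gateway-only     # hypothetical name
  namespace: default                # assumed application namespace
spec:
  podSelector: {}                   # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector: {}               # pods in the same namespace
  - to:
    - namespaceSelector:
        matchLabels:
          istio: system             # assumed label on the istio-system namespace
  - to:
    - namespaceSelector:
        matchLabels:
          kube-system: "true"       # assumed label on the kube-system namespace
    ports:
    - protocol: UDP
      port: 53                      # DNS lookups
```

With a policy like this in place, application pods cannot reach external addresses directly, so outbound traffic has to leave through the gateway in istio-system.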
How is Policy Configured in Istio?
As can be seen in Figure 1, above, policy is configured via the control plane’s API, and is then disseminated to Envoy sidecars by the Pilot. Just like other Kubernetes resources, Istio config and policy are expressed in YAML files, as instances of Custom Resource Definitions (CRDs), and sent to the API using kubectl.
There are four kinds of policy used to manage traffic using Istio: VirtualServices, DestinationRules, ServiceEntries, and Gateways. A VirtualService holds the traffic routing rules that define how requests to a service propagate through the service mesh.
Routing can be configured based upon request source and destination, HTTP paths and headers, and defined weighting for destination services. To create a simple Virtual Service for the BookInfo sample app we can specify some route rules within a YAML file:
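The listing below is a sketch of what such a file might contain, modelled on the BookInfo sample’s virtual-service-all-v1.yaml and shortened to two of the sample’s services. It assumes DestinationRules already define a v1 subset for each service.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: productpage
spec:
  hosts:
  - productpage
  http:
  - route:
    - destination:
        host: productpage
        subset: v1      # send all productpage traffic to v1
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1      # send all reviews traffic to v1
```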
We can submit these to Istio using kubectl because Istio has registered the VirtualService as a CRD (note the apiVersion networking.istio.io/v1alpha3, which shows that this is not a core Kubernetes resource):
$ kubectl create -f virtual-service-all-v1.yaml
Configuring policy such as traffic control rules and circuit breaking is also done with YAML and kubectl. Here is a circuit breaking example (taken from the Istio docs):
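A sketch of such a rule, reconstructed along the lines of the circuit breaking task in the Istio documentation for the v1alpha3 API, is shown below. The values are deliberately tiny so that the breaker is easy to trip in a demo. The consecutiveErrors field name belongs to that API version, and later releases use differently named fields, so check the reference for your version.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1             # one concurrent TCP connection
      http:
        http1MaxPendingRequests: 1    # one queued HTTP request
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1            # eject an instance after a single error
      interval: 1s                    # how often instances are evaluated
      baseEjectionTime: 3m            # minimum time an instance stays ejected
      maxEjectionPercent: 100         # allow every instance to be ejected
```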
This will limit the connection pool for the “httpbin” service to one connection at a time, which will trip quite quickly under any load. See the Istio documentation for more examples.
How is Policy Enforced in Istio?
Updates to Istio configuration in the control plane are propagated throughout the service mesh when the Pilot pushes out changes to the Envoy proxies. These changes are eventually consistent, so there will be some delay before the changes take effect across the cluster.
Policy checks are made against Mixer configuration on each request. Naively implemented, this could cause issues at scale with the Mixer becoming a central point of failure. Istio avoids this through sophisticated caching, batching and prefetching of policy configuration in Envoy.
Benefits: Operating Secure Microservice Applications
According to Sam Newman (Building Microservices, O’Reilly), microservice systems should have the following properties (among others):
- Adopt a culture of automation
- Isolate failure
- Highly observable
Istio’s approach of providing a service mesh that is transparent to the developer and the application allows us to adopt its features automatically, at scale, for a new or an existing system. We are effectively automating the adoption of these service mesh features.
The circuit breaking and rate limiting features of Istio enable us to better isolate failure in our microservices, preventing potentially crippling knock-on effects to the rest of the system. This contributes to the stability of the system in general.
Just as Envoy calls out to Mixer for preconditions on each request (although these checks are cached), it also calls out to Mixer to post telemetry after each request. Istio will also inject various headers to enable distributed tracing, which applications must forward on with each request. Failing to forward tracing headers from the incoming request will result in disconnected tracing spans that only include two microservices. To benefit from this tracing data we need to install a tracing backend like Jaeger (a Mixer adapter), which will enable us to visualise the flow of requests through our distributed system.
Distributed tracing increases the observability of your distributed system, making diagnosis and debugging significantly easier. This is crucial in microservice systems; the increased operational complexity needs to be tamed by automation and observability.
The Istio service mesh design facilitates a number of traffic control and observability features that help us operate distributed systems more easily. These are made possible by Envoy’s position on the data path of all requests and its high configurability from a central control plane. Operators of distributed systems now have the ability to tune an application’s network traffic flow, and strike a balance between an application’s reliability requirements and its impact on the system as a whole. Releases can be rolled out gently and selectively, encouraging innovation and agility.
Istio tames the operational impact of sprawling microservice applications, and brings development and operations together to define the bounding parameters of production performance.
Istio is an ambitious project that reaches into many aspects of application security and operations. Its domain begins at the end of the pipeline and extends into runtime operation. Application pipelines with rigorous testing, strong security controls, good secrets management, and a robust approach to supply-chain security are unfortunately not yet the norm; nor are they the whole story. Istio’s ability to defend against malicious internal actors and misbehaving microservices adds further security coverage for cloud native applications.