Harnessing wild microservices

Andrei Zhozhin

| 20 minutes

Whenever you break your monolith into microservices or bootstrap a green-field project using the microservices approach, there are problems that need to be addressed to prevent operations hell. These problems arise from the fact that multiple parts of your system become independent, and every microservice needs to implement the same SDLC process: instead of one package, you now need to care about N packages. Moreover, microservices need to interact with each other, and the idea of static configuration (as you might have in a monolith) does not work with 5+ services, so you need some discovery mechanism.

Three main areas should be covered:

  • Service discovery - in a monolithic application all components can know about each other (via dependency injection, for example), but in the distributed world different parts of the system (read: microservices) can be deployed to different machines, so there must be some place where a service can find out who provides the information it needs. In other words, there should be a registry where services register themselves so that other services can find them.
  • Configuration - in a monolithic application, configuration is usually consolidated (or it is a bunch of files included into each other); in the distributed world every application has its own piece of config, which can become a configuration nightmare if not addressed properly. Configuration should be consolidated in a single place to have a single source of truth, and it should be versioned so that all changes can be tracked.
  • Segmentation - in a monolith, all dependencies can be traced via source code, and if one component has more dependencies than required you can quickly trace and cut them. In the distributed world, communication between services can be “almost random”, as microservices may be developed by different teams using different stacks, so it is useful to define which relationships between microservices are allowed (read: whitelisted) and block all other calls to prevent unsolicited load. This only becomes an issue after a certain number of microservices, so it can be postponed for some time, or even ignored completely if your system is no longer growing and simpler approaches solve the problem.

All three areas can be covered by a single product - HashiCorp Consul.

I would like to demonstrate how to implement the first two concepts in your project to speed up development, simplify configuration, increase observability, and help you sleep better at night. Segmentation deserves a separate post, so I will leave it for later.

We will look at a famous example from the Spring framework community - PetClinic; more specifically, at the version of PetClinic implemented as microservices. The default implementation uses Spring Cloud Config (to provide a central place for configuration) and the Netflix Eureka discovery service. I will replace both of them with HashiCorp Consul and explain the benefits of such a change.

PetClinic microservices architecture with Consul:

PetClinic microservices consul architecture

What is happening here: all services (except OpenZipkin and Prometheus) are registered and discovered through Consul, and service configuration is also stored centrally in Consul KV. During startup, services connect to their local Consul agent and fetch their configuration; after bootstrap, they register themselves in the service registry and become discoverable by other services.

Application services:

  • UI + API Gateway - serves the static site (an AngularJS JavaScript app) and implements a facade for the internal services (API Gateway)
  • Customers Service - serves information about customers (pet owners and their pets)
  • Vets Service - serves information about vets
  • Visits Service - provides functionality to book visits for pet owners

Infrastructural services:

  • Consul - implements two roles: central configuration store and service discovery server
  • OpenZipkin - stores and serves distributed traces received from multiple microservices (when a client queries the UI, multiple sub-queries are executed underneath)
  • Prometheus - collects metrics from services
  • Spring Boot Admin server - monitors all Spring Boot applications (this component is optional and was part of the original implementation)

Why Consul?

In the microservices world pretty much anything can go wrong and fail, yet everything should be highly available and resilient to accommodate unexpected traffic spikes and ever-changing business requirements. Since every service should be highly available, no service can exist as a single instance: either several instances run behind a load balancer so the system survives the failure of one of them, or multiple instances run with only one considered the leader at any point in time, while the others step in if the current leader fails for some reason. As we are talking about service discovery and distributed key-value storage, Consul covers both: it is designed as a distributed, highly available solution that supports multi-datacenter setups.

The core of the distributed architecture is a set of servers responsible for maintaining the state of the whole Consul cluster. One of the servers becomes the leader of the group; if the leader fails, the remaining follower servers elect a new leader among themselves. The key assumption of a single data centre (DC) setup is high bandwidth and low latency within the DC network, as the communication protocols are designed for such conditions. If you want to cover a multi-datacenter scenario, you spin up a cluster in every DC and connect the clusters via the WAN.

Reference architecture for multi-DC setup:

Consul multi data center setup

Another nice feature of Consul is the ability to connect Kubernetes services with standalone services deployed to VMs. In this case, Consul clients are deployed to the Kubernetes nodes to work with the “local” pods, while the server cluster is deployed outside Kubernetes and is reachable by the “outer” services.

Consul and Kubernetes

Consensus protocol

The consensus protocol in Consul is based on Raft: In Search of an Understandable Consensus Algorithm. Only servers participate in consensus, and there can be only one leader in the cluster.

Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start as a follower. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to a leader. The leader must accept new log entries and replicate them to all the other followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.

More information about consensus can be found here.

It is important to keep in mind how many Consul servers are required for a given level of fault tolerance: the quorum size is floor(N/2) + 1, and the failure tolerance is N minus the quorum size. The following table shows the required number of servers:

Servers  Quorum Size  Failure Tolerance  Comment
1        1            0                  Development only
2        2            0
3        2            1                  1st recommended setup
4        3            1
5        3            2                  2nd recommended setup
6        4            2
7        4            3                  Maybe too expensive

Gossip protocol

Consul uses a gossip protocol to broadcast messages and maintain cluster membership.

The concept of gossip communication can be illustrated by the analogy of office workers spreading rumours. Let’s say each hour the office workers congregate around the water cooler. Each employee pairs off with another, chosen at random, and shares the latest gossip. At the start of the day, Dave starts a new rumour: he comments to Bob that he believes that Charlie dyes his moustache. At the next meeting, Bob tells Alice, while Dave repeats the idea to Eve. After each water cooler rendezvous, the number of individuals who have heard the rumour roughly doubles (though this doesn’t account for gossiping twice to the same person; perhaps Dave tries to tell the story to Frank, only to find that Frank already heard it from Alice). Computer systems typically implement this type of protocol with a form of random “peer selection”: with a given frequency, each machine picks another machine at random and shares any rumours.

Consul uses the Serf library, which implements a modified version of the SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) protocol.

There are two gossip pools:

  • LAN gossip pool - an intra-datacenter pool that contains all members belonging to the same DC (clients and servers). Clients in this pool can automatically discover servers, and the pool allows for quick and reliable event broadcasts.
  • WAN gossip pool - this pool is globally unique, and all servers participate in it. The WAN pool provides the information required for cross-datacenter requests. Its failure detection allows Consul to survive loss of connectivity, whether that is the loss of an entire DC or of a single server.

Consul basics

Installation

Depending on your platform, you can install Consul using the instructions here.

I will be using Consul for Windows for the simple cases and the official Docker image together with other containers for the more complicated setup.

Operation Modes

The Consul agent can operate in two different modes: server mode and client mode.

  • Server mode
    • Responsibilities
      • Maintain data center state
      • Respond to RPC queries (reads)
      • Process all write operations (KV, catalogue)
    • For a data center setup, 3-5 servers are recommended
  • Client mode
    • Responsibilities
      • Register services
      • Run health checks
      • Forward queries to servers
    • One instance per server/VM is needed if you are running non-containerized apps
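
For reference, the agent is started from the same binary in both modes; here is a minimal sketch (the addresses and data directory below are made up for illustration):

# server node, expecting a 3-server cluster
consul agent -server -bootstrap-expect=3 -data-dir=/opt/consul \
  -bind=10.0.0.11 -retry-join=10.0.0.12 -retry-join=10.0.0.13 -ui

# client node on an application VM, joining the same cluster
consul agent -data-dir=/opt/consul -bind=10.0.0.21 -retry-join=10.0.0.11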

Starting dev consul

The dev agent does not persist any data (after a restart you get a fresh instance):

consul agent -dev

Console output (I’ve reformatted it to make it fit):

==> Starting Consul agent...
           Version: '1.10.3'
           Node ID: '5591a2d4-fb25-9154-e15d-5a0cc56e24e3'
         Node name: '3191ebe33f0f'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
      Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, 
             Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

2021-10-25T10:02:58.894+0100 [INFO]  
  agent.server.raft: initial configuration: index=1 
    servers="[{Suffrage:Voter ID:5591a2d4-fb25-9154-e15d-5a0cc56e24e3 Address:127.0.0.1:8300}]"
2021-10-25T10:02:58.895+0100 [INFO]  
  agent.server.raft: entering follower state: 
    follower="Node at 127.0.0.1:8300 [Follower]" leader=
2021-10-25T10:02:58.895+0100 [INFO]  
  agent.server.serf.wan: serf: EventMemberJoin: 3191ebe33f0f.dc1 127.0.0.1
2021-10-25T10:02:58.897+0100 [INFO]  
  agent.server.serf.lan: serf: EventMemberJoin: 3191ebe33f0f 127.0.0.1
2021-10-25T10:02:58.897+0100 [INFO]  
  agent.router: Initializing LAN area manager
2021-10-25T10:02:58.898+0100 [INFO]  
  agent.server: Adding LAN server: server="3191ebe33f0f (Addr: tcp/127.0.0.1:8300) (DC: dc1)"

Once you’ve started it, you can navigate to the Web UI at http://localhost:8500/ to check the status of your single-node cluster.

Consul UI

You can see a single node registered - Consul itself. Now we have a single-node cluster to play with.
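
The status endpoints report the same thing from the command line; for a dev agent the single node is also the leader, so the output should look roughly like this:

$ curl localhost:8500/v1/status/leader
"127.0.0.1:8300"

$ curl localhost:8500/v1/status/peers
["127.0.0.1:8300"]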

Getting cluster members

If I run consul members from a separate console, I get the cluster members:

$ consul members

Node          Address         Status  Type    Build   Protocol  DC   Segment
3191ebe33f0f  127.0.0.1:8301  alive   server  1.10.3  2         dc1  <all> 

In a real cluster, the information obtained via consul commands is taken from the local Consul agent’s state, so it might be stale; on the other hand, such requests are instantaneous, as no network round trip is performed. To get the real state of the world, we need to use the HTTP API (the query is forwarded to the servers), which takes more time but returns the current state of the cluster, with no stale data.

$ curl localhost:8500/v1/catalog/nodes

[
    {
        "ID": "6b1fbb1c-6f12-e3b9-b693-c1b406e4675b",
        "Node": "3191ebe33f0f",
        "Address": "127.0.0.1",
        "Datacenter": "dc1",
        "TaggedAddresses": {
            "lan": "127.0.0.1",
            "lan_ipv4": "127.0.0.1",
            "wan": "127.0.0.1",
            "wan_ipv4": "127.0.0.1"
        },
        "Meta": {
            "consul-network-segment": ""
        },
        "CreateIndex": 11,
        "ModifyIndex": 13
    }
]
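
The same endpoint also accepts explicit consistency modes, which makes the trade-off described above visible; a quick illustration:

# default: the request is forwarded to the leader
curl localhost:8500/v1/catalog/nodes

# stale: any server may answer - lower latency, possibly outdated data
curl "localhost:8500/v1/catalog/nodes?stale"

# consistent: strongly consistent read at the cost of an extra round trip
curl "localhost:8500/v1/catalog/nodes?consistent"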

Important note about members and the services associated with them:

  • if a member leaves the cluster, all services and service checks associated with it are deregistered
  • if a member crashes, the cluster will keep trying to restore the connection to the failed member
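
Two agent commands are handy when experimenting with membership (the node name is taken from the output above):

# gracefully leave the cluster and deregister local services
consul leave

# force a failed member into the "left" state so the cluster stops retrying it
consul force-leave 3191ebe33f0f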

Register service in the registry

There are two ways to register a service:

  • manual (prepare a JSON file with the service definition and invoke the Consul HTTP API)
  • automatic (your service/app registers itself during startup by implicitly invoking the Consul HTTP API)

For the manual registration example, let’s create a simple service definition web.json:

{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}

And register it:

curl --request PUT --data @web.json localhost:8500/v1/agent/service/register

Now we can query this service via the HTTP API:

curl http://localhost:8500/v1/catalog/service/web

The result is an array of services with the name web:

[
  {
    "ID": "ca1a84c0-e5d6-4d0f-8481-e2ca2091a996",
    "Node": "3191ebe33f0f",
    "Address": "127.0.0.1",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "127.0.0.1",
      "wan": "127.0.0.1"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "web",
    "ServiceName": "web",
    "ServiceTags": ["rails"],
    "ServiceAddress": "",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {},
    "ServicePort": 80,
    "ServiceEnableTagOverride": false,
    "ServiceProxyDestination": "",
    "ServiceProxy": {},
    "ServiceConnect": {},
    "CreateIndex": 10,
    "ModifyIndex": 10
  }
]

You don’t need to register all your services manually: there are many libraries you can use to implement automatic registration in Consul during service startup. I will provide an example for Spring Boot applications below.

Health checks

A health check in Consul is defined at the application level, not at the node level (as full-blown monitoring systems usually do).

There are several types of health checks available in Consul:

  • Script + Interval - the classic check that runs an external command at a given interval and uses its exit code as the result.
  • Docker + Interval - the same as the previous one, but the external command runs inside a Docker container via the Docker Exec API. The check result is still based on the exit code.
  • HTTP + Interval - the check issues an HTTP request (GET by default, other methods can be configured) and treats the resulting status code as the result: 2xx - passing, 429 Too Many Requests - warning, anything else - failure.
  • TCP + Interval - the check tries to establish a TCP connection to a host and port; if the connection succeeds - passing, if not - failure.
  • gRPC + Interval - the check follows the gRPC health checking protocol.
  • H2ping + Interval - the check connects via HTTP/2 with TLS and sends a ping frame.
  • Time to Live (TTL) - the check fails when its Time To Live expires; the state must be updated periodically via the HTTP interface. This type of check is useful as a heartbeat showing that a process is still alive and actively reporting its health.
  • Alias - the check tracks the state of another registered node or service. It can be useful for expressing dependencies on a database or an upstream service.

In my practice, I’ve found HTTP, TCP, and TTL the most useful, as they cover 95% of health check monitoring needs.

HTTP check example (using the POST method with a body):

{
  "check": {
    "id": "api",
    "name": "HTTP API on port 5000",
    "http": "https://localhost:5000/health",
    "tls_server_name": "",
    "tls_skip_verify": false,
    "method": "POST",
    "header": {"Content-Type": ["application/json"]},
    "body": "{\"method\":\"health\"}",
    "interval": "10s",
    "timeout": "1s"
  }
}

TCP check example (checking the availability of SSH on localhost):

{
  "check": {
    "id": "ssh",
    "name": "SSH TCP on port 22",
    "tcp": "localhost:22",
    "interval": "10s",
    "timeout": "1s"
  }
}

TTL check example (an external process needs to update this check via the HTTP API):

{
  "check": {
    "id": "web-app",
    "name": "Web App Status",
    "notes": "Web app does a curl internally every 10 seconds",
    "ttl": "30s"
  }
}
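
The process behind a TTL check reports its own status through the agent HTTP API; for the check above that could look like this:

# mark the check as passing (resets the 30s timer)
curl --request PUT localhost:8500/v1/agent/check/pass/web-app

# report a warning or a failure, optionally with a note shown in the UI
curl --request PUT "localhost:8500/v1/agent/check/warn/web-app?note=high+latency"
curl --request PUT "localhost:8500/v1/agent/check/fail/web-app?note=app+is+down"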

All check types and their technical details can be found here.

Monitoring external services

There is also a very cool feature: you can register an external service (one not under your control) in Consul to see whether your upstream/downstream system is available.

In the context of Consul, external services run on nodes where you cannot run a local Consul agent. These nodes might be inside your infrastructure (e.g. a mainframe, virtual appliance, or unsupported platform) or outside of it (e.g. a SaaS platform).

There are two ways to achieve this:

  • using Script + Interval checks
  • using Consul External Service Monitor

Script + Interval

Let’s start with Script + Interval. We need to run Consul with the special flag -enable-script-checks to be able to do that:

consul agent -dev -enable-script-checks

Security Warning: Because -enable-script-checks allows script checks to be registered via the HTTP API, it may introduce a remote execution vulnerability known to be targeted by malware. For production environments, we strongly recommend using -enable-local-script-checks instead, which removes that vulnerability by allowing script checks to be defined only in the Consul agent’s local configuration files, not via the HTTP API.

Now we can create a web.json file and register it in the catalogue:

{
  "id": "web1",
  "name": "web",
  "port": 80,
  "check": {
    "name": "ping check",
    "args": ["ping", "-c1", "google.com"],
    "interval": "30s",
    "status": "passing"
  }
}

And register it:

curl --request PUT --data @web.json localhost:8500/v1/agent/service/register

External Service Monitor (ESM)

Because external services by definition don’t have a local Consul agent, you can’t register them with that agent or use it for health checking. Instead, you must register them directly with the catalog using the /catalog/register endpoint. In contrast to service registration where the object context for the endpoint is a service, the object context for the catalog endpoint is the node. In other words, using the /catalog/register endpoint registers an entire node, while the /agent/service/register endpoint registers individual services in the context of a node.

{
  "Node": "google",
  "Address": "google.com",
  "NodeMeta": {
    "external-node": "true",
    "external-probe": "true"
  },
  "Service": {
    "ID": "search1",
    "Service": "search",
    "Port": 80
  },
  "Checks": [
    {
      "Name": "http-check",
      "status": "passing",
      "Definition": {
        "http": "https://google.com",
        "interval": "30s"
      }
    }
  ]
}

Please note the NodeMeta property: the external-node and external-probe flags are important, as they will be picked up by Consul ESM.

Register the new node in the catalog:

curl --request PUT --data @external.json localhost:8500/v1/catalog/register

Then you need to start consul-esm (External Service Monitor), which connects to Consul, picks up all external service registrations, and starts checking their availability on a regular basis.
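
The daemon itself is a single binary; in the simplest case it only needs to reach a local Consul agent. A minimal sketch (the config file name is made up):

# run the external service monitor against the local agent;
# the config file can override the Consul address, ACL token, node metadata filters, etc.
consul-esm -config-file=esm-config.hcl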

The diagram below shows the difference between “local” Consul agent checks and external checks performed by the Consul ESM daemon.

Consul local checks and Consul External checks with ConsulESM

Key value storage

Consul has an embedded key-value store, so we can use it to store configuration and metadata for a distributed application. There are multiple conventions for storing configuration in a key-value store:

  • one config item per key (like registry)
  • one blob per key (like a file system)

Please note that it is a simple KV store and is not intended to be a full-featured datastore (such as DynamoDB), although it has some similarities to one.

There is also a restriction on object size: values are limited to 512 KB.

There is an example below of storing an existing application configuration in the Consul KV store.
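
To get a feel for the KV API before wiring it into an application, values can be written and read directly (the key below is made up):

# store a value under a key
curl --request PUT --data 'hello' localhost:8500/v1/kv/config/demo

# read it back: the value is returned base64-encoded inside a JSON envelope
curl localhost:8500/v1/kv/config/demo

# or fetch just the raw value
curl "localhost:8500/v1/kv/config/demo?raw"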

Advanced use cases of Consul KV include the following:

  • Templates - you can generate configuration files for other services based on values in the KV store with Consul Template, so technically you can make Consul the source of truth for pretty much all configuration in your system (a small sketch follows below). More about Consul Template can be found here.
  • Watches - there is a way to monitor particular data changes (list of nodes, KV pairs, health checks) and execute an external handler. More about watches can be found here.
  • Sessions - allow building distributed locks on top of the KV store. The KV API supports acquire and release operations, which can be used for leader election. More about leader election can be found here.

As you can see, these capabilities are pretty powerful and allow you to delegate a lot of functionality to Consul.
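
As a taste of the first two items, both tools are driven by the same KV data; a minimal sketch with made-up key, file, and handler names:

# consul-template renders KV values into a file and can run a command on change;
# config.tpl might contain something like: {{ key "config/demo" }}
consul-template -template "config.tpl:rendered.conf:echo reloaded"

# consul watch invokes an external handler every time the key changes
consul watch -type=key -key=config/demo ./handler.sh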

Show me the code

At this point, we should understand enough details to start working with Consul.

Service registration

To give your app the ability to register itself in Consul during startup, add a dependency to pom.xml:

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-consul-discovery</artifactId>
</dependency>

Service health is also important for service discovery, as only healthy services are returned to clients, so your service should expose a health check endpoint; hence we add the actuator:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Add the @EnableDiscoveryClient annotation to your application class:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

@EnableDiscoveryClient
@SpringBootApplication
public class YourServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(YourServiceApplication.class, args);
    }
}
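
Registration can be tuned through spring.cloud.consul.discovery properties; a few commonly used ones are shown below (a sketch - these values are illustrative and not taken from the PetClinic setup):

spring:
  cloud:
    consul:
      discovery:
        # register the host IP instead of the hostname
        prefer-ip-address: true
        # actuator endpoint that Consul polls for the HTTP health check
        health-check-path: /actuator/health
        health-check-interval: 10s
        # keep instance ids unique when several instances run on the same host
        instance-id: ${spring.application.name}:${random.value}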

Centralized configuration

We want to keep the configuration files in their original form (YAML), so we will use the one-blob-per-key approach to store config files in the Consul KV store.

We need to add the following dependency to pom.xml:

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-consul-config</artifactId>
</dependency>

Change src/main/resources/bootstrap.yml for the default and docker profiles.

The value of spring.application.name is used to build the key under which the configuration YAML is looked up in the KV store (config/<spring.application.name>/data):

spring:
  cloud:
    consul:
      host: localhost
      port: 8500
      config:
        format: yaml
  application:
    name: <serviceId>
---
spring:
  config:
    activate:
      on-profile: docker
  cloud:
    consul:
      host: consul
      port: 8500

Please note that we use different spring.cloud.consul.host values for the default and docker profiles: when you run your application in standalone mode (or during development) you need a local Consul agent running on localhost, while in a dockerized environment the Consul agent is resolved via the service name consul, as it runs in a separate container whose IP address is not known upfront.

Prepare and populate KV store

By default, Spring Cloud Consul Config expects the following structure (for spring.cloud.consul.config.format=yaml):

config (key)
  - service name 1 (nested key)
    - data (blob with yaml content)
  - service name 2 (nested key)
    - data (blob with yaml content)

There is a way to set a profile and split the configuration using different “folders” (nested keys):

config (key)
  - service name 1 (nested key, default profile) 
    - data
  - service name 1, profile X (nested key, profile X)
    - data

During startup, the application requests the configuration for all active profiles and merges it.
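
One way to populate these keys is the consul kv CLI, which can read a value from a file. A sketch with made-up file names, assuming the default comma profile separator:

# shared defaults and a per-service config
consul kv put config/application/data @application.yml
consul kv put config/customers-service/data @customers-service.yml

# profile-specific overrides, merged on top when the docker profile is active
consul kv put config/customers-service,docker/data @customers-service-docker.yml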

The real folder structure looks like this in the Consul UI:

Consul KV folder structure

And the value of the key config/application/data:

Consul KV value structure

After you’ve done all these steps, your app will fetch its configuration from Consul KV during startup, register itself in the service registry, and become discoverable by other services.
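
On the consuming side nothing Consul-specific is needed; here is a minimal sketch (welcome.message is a made-up property assumed to live in the YAML blob stored under config/<serviceId>/data):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RefreshScope
@RestController
public class WelcomeController {

    // resolved from Consul KV via spring-cloud-starter-consul-config
    @Value("${welcome.message:Hello}")
    private String welcomeMessage;

    @GetMapping("/welcome")
    public String welcome() {
        // with @RefreshScope the bean is rebound when the Consul config watch reports a change
        return welcomeMessage;
    }
}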

The full source code can be found here: https://github.com/azhozhin/spring-petclinic-microservices-consul/

Auto injecting configuration from git repo

It is not very convenient to populate the Consul KV store manually, so there is a way to automate it: git2consul.

git2consul is a JavaScript application, so we need npm to install it:

npm install -g git2consul

Let’s create a simple git2consul config for our application:

{
  "version": "1.0",
  "repos" : [{
    "name" : "config",
    "url" : "https://github.com/azhozhin/spring-petclinic-microservices-consul-config",
    "include_branch_name" : false,
    "branches" : ["master"],
    "hooks": [{
      "type" : "polling",
      "interval" : "1"
    }]
  }]
}

And run git2consul to start monitoring the git repo and updating Consul KV at regular intervals (every minute):

git2consul --config-file git2consul.json

Using a load balancer to select a random service instance

If multiple instances of a service are registered in the catalog, Consul returns all healthy entries to the client, and the client has to decide which one to use for the request. This is called client-side load balancing.

First, we need to declare a load-balanced RestTemplate:

...
@LoadBalanced
@Bean
public RestTemplate loadBalancedRestTemplate() {
     return new RestTemplate();
}
...

and use it like this (assuming we are going to query the STORE service):

...
@Autowired
RestTemplate restTemplate;

public String getFirstProduct() {
   return this.restTemplate.getForObject("https://STORE/products/1", String.class);
}
...

Please note: As Spring Cloud Ribbon is now under maintenance, it is suggested to set spring.cloud.loadbalancer.ribbon.enabled to false, so that BlockingLoadBalancerClient is used instead of RibbonLoadBalancerClient.

Using Discovery Client

It is also possible to work with the discovery client directly to get service details, using org.springframework.cloud.client.discovery.DiscoveryClient:

@Autowired
private DiscoveryClient discoveryClient;

public String getServiceUrl() {
    List<ServiceInstance> list = discoveryClient.getInstances("STORE");
    if (list != null && list.size() > 0 ) {
        // we get first element from the list every time
        // it would be better to select random element to spread the load
        return list.get(0).getUri().toString();
    }
    return null;
}
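
As the comment above hints, picking a random instance spreads the load; a small variation of the same method (sketch):

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.springframework.cloud.client.ServiceInstance;

public String getRandomServiceUrl() {
    List<ServiceInstance> list = discoveryClient.getInstances("STORE");
    if (list != null && !list.isEmpty()) {
        // pick a random healthy instance instead of always taking the first one
        ServiceInstance instance = list.get(ThreadLocalRandom.current().nextInt(list.size()));
        return instance.getUri().toString();
    }
    return null;
}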

Final result

I’ve updated the original docker-compose.yml file to set up all the required services and their dependencies.

The full source code and configuration are available here: https://github.com/azhozhin/spring-petclinic-microservices-consul/

Consul dashboard showing all registered services and their health checks:

Consul dashboard with registered services

Application UI interface:

Application UI

Spring Boot Admin application (it uses Consul service discovery to find the microservices and then monitors the Spring Boot apps):

Spring boot admin UI

Summary

HashiCorp Consul provides a simple and convenient way to build distributed and highly available applications in the modern cloud era. It does not force you to use Kubernetes, so you can start with several microservices deployed to a couple of servers in the cloud and still get all the benefits of service discovery, health checks, and centralized configuration. If you do want to go with Kubernetes, Consul supports that scenario as well and provides even more capabilities, such as a service mesh.

We have gone through the steps needed to adopt service discovery and centralized configuration in a microservices application. It takes just a few lines of configuration and code regardless of the stack you are using; there are implementations for all major programming languages.
