Real-time monitoring at BricoPrivé

Success story

Use case overview

BricoPrivé is an online home improvement retailer with a presence in multiple countries. The availability of their services is crucial to providing a good user experience and to making sure no order is missed. To keep full control over their infrastructure, BricoPrivé needs extensive tooling that exposes both technical and business metrics, so that the team can operate the platform with confidence.

Technofy helps BricoPrivé's team with multiple aspects of AWS and DevOps; this use case focuses on the implementation of DataDog on their cloud-based infrastructure. DataDog is an observability and monitoring platform that provides metrics, log ingestion and powerful data analytics.

Business pain & challenges

Lack of observability
Pattern analysis in logs and metrics
Troubleshooting
Lack of visibility into resource usage

Tech stack

AWS

  • Systems Manager - Used to deploy syslog-ng and the DataDog agent on the instances.
  • Kinesis Data Streams - Real-time data streaming service for access logs coming from CloudFront.
  • Kinesis Data Firehose - Real-time streaming service to dispatch data to the DataDog ingestion endpoint.

Technologies

  • DataDog - Monitoring platform of choice for BricoPrivé.
  • Syslog-ng - Used to gather and dispatch application logs to DataDog.
  • Ansible - Configuration management tool used to define the desired state of the compute instances.

Solution

High-level overview of streams between AWS and DataDog

Overview

The solution provides monitoring at three levels:

  • At the CDN level, real-time logging on CloudFront captures all the requests coming from the internet.
  • At the instance level, the DataDog agent reports system metrics.
  • At the application level, syslog-ng filters the logs and dispatches them to the DataDog syslog ingestion endpoint.

All of these log and metric delivery channels are configured to use encryption in transit to comply with the security requirements.

Monitoring CloudFront

CloudFront is BricoPrivé's CDN of choice, delivering several hundred terabytes to their users each year. The flexibility of the service allows HTTP requests to be routed to different backends based on various parameters. This makes CloudFront similar to a classic reverse proxy, which is very valuable for businesses that wish to incrementally split their monolithic applications into smaller, more agile microservices.

CloudFront provides two ways of delivering access logs: standard and real-time. Standard logs are periodically delivered to S3, where other systems can process them. The time-to-delivery of these logs can range from a few minutes to 24 hours. Real-time logs, on the other hand, are delivered within a few seconds to Kinesis Data Streams, which in turn allows Kinesis Data Firehose to dispatch them to various backends.

BricoPrivé requires the latter, as it gives them a better understanding of the scale at which they are operating. These real-time logs provide valuable information about performance and customer experience, which can be analyzed and turned into optimization actions.

As mentioned above, Technofy has deployed Kinesis Data Streams and Kinesis Data Firehose to dispatch the logs to DataDog. This procedure is well described in the documentation (see "Send AWS services logs with the Datadog Kinesis Firehose Destination").
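
To give a feel for the kind of analysis these logs enable, here is a minimal Python sketch that parses real-time log records and counts the most requested paths. It assumes tab-delimited records; the field subset and order in FIELDS, the sample records and the function names are assumptions made for illustration, since the actual fields depend on what is selected in the real-time log configuration.

```python
from collections import Counter

# Assumed subset and order of fields selected in the real-time log
# configuration (illustrative, not BricoPrivé's actual configuration).
FIELDS = ["timestamp", "c-ip", "sc-status", "cs-method", "cs-uri-stem"]


def parse_record(line):
    """Turn one tab-delimited log record into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def top_paths(lines, n=3):
    """Return the n most requested paths with their request counts."""
    counts = Counter(parse_record(line)["cs-uri-stem"] for line in lines)
    return counts.most_common(n)


def error_rate(lines):
    """Return the share of records whose HTTP status is 4xx or 5xx."""
    records = [parse_record(line) for line in lines]
    if not records:
        return 0.0
    errors = sum(1 for r in records if r["sc-status"].startswith(("4", "5")))
    return errors / len(records)


# Made-up sample records, for illustration only.
SAMPLE = [
    "1700000000.123\t203.0.113.10\t200\tGET\t/products/drill\n",
    "1700000001.456\t203.0.113.11\t200\tGET\t/products/drill\n",
    "1700000002.789\t203.0.113.12\t404\tGET\t/products/saw\n",
    "1700000003.012\t203.0.113.10\t200\tPOST\t/cart\n",
]

if __name__ == "__main__":
    print(top_paths(SAMPLE, n=2))
    print(error_rate(SAMPLE))
```

In the deployed solution this kind of aggregation is done in DataDog itself; the sketch only shows what information a single record carries.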

Monitoring instances

BricoPrivé runs their application on EC2 compute instances and follows the best practices regarding auto scaling, load balancing, and availability. As their activity grows, their computing costs increase as well. Keeping an eye on the resource usage of each instance allows them to fine-tune their auto scaling policies, maximizing resource usage without impacting the end user experience and keeping costs under control during scale-out.

This monitoring is done by the DataDog agent, which reports system metrics and running processes in real time. The agent can also stream log files, but in this case the application requirements do not call for this feature.

Monitoring applications

The endeavour of providing observability in the different layers wouldn't be complete without peeking into the application layer. A widely known technology such as Syslog lets multiple log producers use a standard protocol, which reduces operational complexity. In this case, we settled on syslog-ng because its configuration is easier for the engineering teams to read and understand.

Here too, DataDog has a matching service: a Syslog ingestion endpoint. Two endpoints are available, one with TLS and the other without. Given that application logs could contain sensitive data, the natural choice was to configure syslog-ng to use the encrypted endpoint.
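
In this setup syslog-ng takes care of the delivery, so no custom code is needed. Purely to illustrate what sending a log line to a TLS endpoint involves, here is a minimal Python sketch. The line layout (an API key followed by an RFC 5424-style message) is an assumption about the expected format, and the function names, host and port arguments are hypothetical; the actual endpoint and template should be taken from the DataDog documentation.

```python
import socket
import ssl
from datetime import datetime, timezone


def format_syslog_line(api_key, hostname, app, message,
                       facility=1, severity=6, now=None):
    """Build one RFC 5424-style line prefixed with the API key (assumed layout)."""
    pri = facility * 8 + severity  # PRI value as defined by the syslog protocol
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    # "-" marks the unused PROCID, MSGID and STRUCTURED-DATA fields.
    return f"{api_key} <{pri}>1 {timestamp} {hostname} {app} - - - {message}\n"


def send_over_tls(host, port, line, timeout=5.0):
    """Send one line over TCP wrapped in TLS, verifying the server certificate."""
    # The default context checks both the certificate chain and the hostname.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder values only; nothing is sent over the network here.
    print(format_syslog_line("EXAMPLE_KEY", "web-1", "shop", "order created"), end="")
```

The important part is the certificate verification: encryption alone does not protect sensitive log data if the client would accept any server.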

Deploying the agents

As described in the section "Monitoring instances", the DataDog agent provides a lot of insights on the system it is running on. We have also seen that syslog-ng needs to be deployed on the instances.

At Technofy, we commonly use Ansible for configuration management on our customers' systems. Luckily for us, DataDog already provides a role on Ansible Galaxy, which makes the setup even easier. All we have to do is fill in a few configuration details and the API key.

In a lot of cases, Ansible deployments are made remotely via SSH, but BricoPrivé relies heavily on AWS Systems Manager and uses it to apply patches and open remote sessions on their fleet of machines. The service also provides a way to run Ansible playbooks through one of its managed SSM documents, namely "AWS-ApplyAnsiblePlaybooks". This document allows us to specify variables as well as an S3 bucket (or a GitHub repository) where it will look for the playbook. Once executed, the document takes care of automatically installing Ansible on the target machines if it is not already present.

Apply an Ansible Playbook with AWS Systems Manager
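
As an illustration, running the managed document can be sketched in Python as follows. The helper below only builds the arguments for the Systems Manager SendCommand call. The parameter names reflect the "AWS-ApplyAnsiblePlaybooks" document as we understand it and should be checked against the version available in your account; the function name, bucket URL, playbook file name, tag and extra variable are hypothetical.

```python
import json


def build_send_command_args(source_url, playbook_file, tag_key, tag_value,
                            extra_vars=None):
    """Build keyword arguments for an SSM SendCommand call (assumed parameters)."""
    # Extra variables are passed to the document as space-separated key=value pairs.
    variables = {"SSM": "True", **(extra_vars or {})}
    return {
        "DocumentName": "AWS-ApplyAnsiblePlaybooks",
        # Target instances by tag rather than by explicit instance IDs.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        # SSM expects every parameter value as a list of strings.
        "Parameters": {
            "SourceType": ["S3"],
            "SourceInfo": [json.dumps({"path": source_url})],
            "InstallDependencies": ["True"],  # install Ansible if it is missing
            "PlaybookFile": [playbook_file],
            "ExtraVariables": [" ".join(f"{k}={v}" for k, v in variables.items())],
            "Check": ["False"],
            "Verbose": ["-v"],
        },
    }


if __name__ == "__main__":
    args = build_send_command_args(
        "https://s3.amazonaws.com/example-bucket/playbooks.zip",  # hypothetical
        "site.yml",                                               # hypothetical
        "Role", "web",                                            # hypothetical tag
        extra_vars={"env": "prod"},
    )
    print(json.dumps(args, indent=2))
    # With boto3 installed and AWS credentials configured, the command could
    # then be sent with:
    #   import boto3
    #   boto3.client("ssm").send_command(**args)
```

Targeting by tag keeps the call independent of the auto scaling group's current instance IDs. Secrets such as the API key are better kept out of the extra variables and fetched from a secure store by the playbook itself.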

Results & highlights

Observability on multiple layers is centralized and configured in a single tool.
Multiple erroneous links encountered during customer navigation were fixed thanks to the aggregation of real-time logs from different sources.
The most accessed resources are now identified and can benefit from further optimization.
Configuration changes to the DataDog agent and to the syslog service can now be completely automated across dozens of instances.
The team has a better understanding of resource usage and of what can be leveraged on AWS to optimize costs.