BricoPrivé is an online home improvement retailer that operates in multiple countries. The availability of their services is crucial to providing a good user experience and to making sure no order is missed. To keep full control over their infrastructure, BricoPrivé needs extensive tooling that exposes both technical and business metrics, so that they can operate their platform with confidence.
Technofy helps BricoPrivé's team with several aspects of AWS and DevOps, but this use case focuses on the implementation of DataDog on their cloud-based infrastructure. DataDog is an observability and monitoring platform that provides metrics, log ingestion and powerful data analytics.
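The solution provides monitoring on three different levels: the CDN, through CloudFront access logs; the instances, through the DataDog agent; and the application, through Syslog.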
All of these log and metric delivery channels are configured to use encryption in transit to comply with the security requirements.
CloudFront is BricoPrivé's CDN of choice, as several hundred terabytes are delivered to their users each year. The service is flexible enough to route HTTP requests to different backends based on various request parameters. This makes CloudFront akin to a classic reverse proxy, which is very valuable for businesses that wish to incrementally split a monolithic application into smaller, more agile microservices.
CloudFront provides two ways of delivering access logs: standard and real-time. Standard logs are periodically delivered to S3, where other systems can process them. Their time-to-delivery can range from a few minutes to up to 24 hours. Real-time logs, on the other hand, are delivered within a few seconds to Kinesis Data Streams, which in turn allows Kinesis Firehose to dispatch them to various backends.
BricoPrivé requires the latter, as it gives them a better understanding of the scale at which they are operating. These real-time logs provide valuable information about performance and customer experience, which can then be analyzed and turned into optimization actions.
As mentioned above, Technofy has deployed Kinesis Data Streams and Kinesis Firehose to dispatch the logs to DataDog. This procedure is well described in the documentation (see "Send AWS services logs with the Datadog Kinesis Firehose Destination").
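To illustrate the kind of analysis these real-time logs make possible, here is a minimal Python sketch that parses a few log records and derives an error rate and a mean response time. It assumes tab-delimited records and a hypothetical selection of five fields (`ASSUMED_FIELDS`). The actual fields and their order depend on the real-time log configuration, and the sample lines are made up for the example.

```python
# Assumed field selection; the real list depends on the real-time log configuration.
ASSUMED_FIELDS = ("timestamp", "c-ip", "sc-status", "cs-uri-stem", "time-taken")


def parse_record(line, fields=ASSUMED_FIELDS):
    """Split one tab-delimited log record into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


def summarize(lines):
    """Return the request count, the share of 5xx responses and the mean time-taken."""
    records = [parse_record(line) for line in lines]
    total = len(records)
    if total == 0:
        return {"requests": 0, "error_rate": 0.0, "mean_time_taken": 0.0}
    errors = sum(1 for r in records if r["sc-status"].startswith("5"))
    mean_time = sum(float(r["time-taken"]) for r in records) / total
    return {"requests": total, "error_rate": errors / total, "mean_time_taken": mean_time}


# Made-up sample records (documentation IP range, illustrative values only).
sample = [
    "1700000000.100\t203.0.113.10\t200\t/index.html\t0.5",
    "1700000000.200\t203.0.113.11\t200\t/cart\t1.5",
    "1700000000.300\t203.0.113.12\t502\t/checkout\t1.0",
    "1700000000.400\t203.0.113.13\t304\t/static/app.js\t1.0",
]

summary = summarize(sample)
print(summary)
```

In the deployed setup this aggregation is done by DataDog itself; the sketch only shows what information each record carries.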
BricoPrivé runs their application on EC2 instances and follows best practices regarding auto scaling, load balancing, and availability. As their activity grows, so do their computing costs. Keeping an eye on the resource usage of each instance allows them to fine-tune their auto scaling policies, maximizing resource usage without impacting the end-user experience and keeping costs under control during scale-out.
This monitoring is done by the DataDog agent, which reports system metrics and running processes in real time. The agent can also stream log files, but in this case the application requirements do not call for this feature.
The endeavour of providing observability across the different layers wouldn't be complete without peeking into the application layer. A widely known technology such as Syslog lets multiple log producers use a standard protocol, which reduces operational complexity. In this case, we have settled on syslog-ng, as its configuration is more readable and understandable for the engineering teams.
Once again, DataDog provides a Syslog ingestion service. Two endpoints are available, one with TLS and the other without. Given that application logs could contain sensitive data, the natural choice was to configure syslog-ng to use the encrypted endpoint.
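The actual setup relies on syslog-ng, but the mechanics of sending a log line over the encrypted endpoint can be sketched in a few lines of Python. This is an illustration under assumptions, not the deployed configuration: it assumes the intake expects each RFC 5424 line to be prefixed with the API key, and the function names, host, port and key are all placeholders to be replaced with the values from DataDog's documentation.

```python
import socket
import ssl
from datetime import datetime, timezone


def format_syslog_line(api_key, hostname, app_name, message,
                       facility=1, severity=6, timestamp=None):
    """Build one RFC 5424 line, prefixed with the API key (assumed intake format)."""
    pri = facility * 8 + severity  # PRI = facility * 8 + severity
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    # PROCID, MSGID and STRUCTURED-DATA are left as the nil value "-".
    return f"{api_key} <{pri}>1 {ts} {hostname} {app_name} - - - {message}\n"


def send_over_tls(host, port, line, timeout=5.0):
    """Open a certificate-verified TLS connection and send a single line."""
    context = ssl.create_default_context()  # verifies the server certificate
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(line.encode("utf-8"))


# Placeholder values for illustration only.
line = format_syslog_line(
    "EXAMPLE_API_KEY", "web-1", "shop", "order created",
    timestamp=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(line, end="")
```

`send_over_tls` is deliberately not called here, since it needs a real endpoint and key. Using `ssl.create_default_context()` keeps certificate and hostname verification enabled, which is the point of choosing the TLS endpoint.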
As described in the section "Monitoring instances", the DataDog agent provides a lot of insight into the system it is running on. We have also seen that syslog-ng needs to be deployed on the instances.
At Technofy, we commonly use Ansible for configuration management on the systems of our customers. Luckily for us, DataDog already provides a role on Ansible Galaxy which makes the setup even easier. All we have to do is fill in a few configuration details and the API key.
In many cases, Ansible deployments are run remotely over SSH, but BricoPrivé relies heavily on AWS Systems Manager to apply patches and open remote sessions on their fleet of machines. The service also provides a way to run Ansible playbooks through one of its managed SSM documents, namely "AWS-ApplyAnsiblePlaybooks". This document allows us to specify variables as well as an S3 bucket (or a GitHub repository) where it will look for the playbook. When executed, the document can also take care of installing Ansible on the target machines if it is not already present.
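As a rough sketch of how such a run could be triggered from Python with boto3, the example below builds the document parameters and sends the command to instances selected by tag. The parameter names are the ones we recall from the "AWS-ApplyAnsiblePlaybooks" document and should be checked against its current schema. The bucket URL, playbook name, tag and extra variable are hypothetical, as are the two helper functions.

```python
import json


def build_playbook_parameters(source_url, playbook_file, extra_vars=None):
    """Build the Parameters mapping for the AWS-ApplyAnsiblePlaybooks document."""
    variables = {"SSM": "True", **(extra_vars or {})}
    return {
        "SourceType": ["S3"],
        "SourceInfo": [json.dumps({"path": source_url})],
        "InstallDependencies": ["True"],  # let the document install Ansible if missing
        "PlaybookFile": [playbook_file],
        "ExtraVariables": [" ".join(f"{k}={v}" for k, v in variables.items())],
        "Check": ["False"],
        "Verbose": ["-v"],
    }


def run_playbook(tag_key, tag_value, parameters):
    """Send the command to all instances carrying the given tag; returns the command ID."""
    import boto3  # third-party; imported lazily so the builder works without it

    ssm = boto3.client("ssm")
    response = ssm.send_command(
        Targets=[{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        DocumentName="AWS-ApplyAnsiblePlaybooks",
        Parameters=parameters,
    )
    return response["Command"]["CommandId"]


# Hypothetical bucket, playbook and variable, for illustration only.
params = build_playbook_parameters(
    "https://s3.amazonaws.com/example-bucket/ansible/",
    "playbook.yml",
    extra_vars={"env": "staging"},
)
print(json.dumps(params, indent=2))
```

Calling `run_playbook("Role", "web", params)` would then start the run, provided AWS credentials and the target instances are in place. Secrets such as the DataDog API key are better kept out of `ExtraVariables`, since command parameters are visible in the Systems Manager command history.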