
How Our Core Data Pipeline Was Completely Rebuilt by the Engineering Team

Posted by the Clarity Engineering Team · April 2024

Over roughly eighteen months, the engineering team redesigned, rebuilt, and redeployed our entire core data ingestion and processing pipeline. The decision followed a comprehensive infrastructure audit that senior architects completed in the autumn of the preceding year. The review committee identified a large number of critical bottlenecks and reliability issues, documented them in an internal report, and circulated it among the leadership team, which then approved the full rewrite.

The new pipeline architecture, developed over the course of the project by a cross-functional team of twelve engineers and three data infrastructure specialists, processes data through a series of distributed microservices designed to operate independently of one another, so the failure of any single component cannot propagate through the entire system as it could under the old monolithic architecture. Operations staff had criticised that monolith for over three years before senior management finally decided to rebuild, prompted by a significant outage caused by a cascading failure traced back to a single misconfigured database connection pool setting that the team had overlooked during a routine maintenance window.
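The failure-isolation idea above can be illustrated with a minimal sketch. This is not our production code: the stage names, handlers, and in-memory queues are hypothetical stand-ins for the real message-backed services. The point is that a bad record is quarantined in a dead-letter queue at the stage where it fails, instead of crashing the stage and cascading downstream.

```python
import json
import queue

def run_stage(name, handler, inbox, outbox, dead_letters):
    """Drain one stage's inbox; a failing record is quarantined, not propagated."""
    while True:
        try:
            record = inbox.get_nowait()
        except queue.Empty:
            break  # inbox drained; stage finishes cleanly
        try:
            outbox.put(handler(record))
        except Exception as exc:
            # Isolate the failure: park the record and keep processing.
            dead_letters.put((name, record, str(exc)))

# Hypothetical two-stage pipeline: parse raw JSON, then enrich each record.
inbox, parsed, enriched, dlq = (queue.Queue() for _ in range(4))
for raw in ['{"id": 1}', "not json", '{"id": 2}']:
    inbox.put(raw)

run_stage("parse", json.loads, inbox, parsed, dlq)
run_stage("enrich", lambda r: {**r, "ok": True}, parsed, enriched, dlq)
```

Here the malformed record ends up in `dlq` with the stage name and error attached, while the two valid records flow through to `enriched` untouched.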

Data quality validation was previously handled by a set of brittle scripts written by engineers who had since left the organisation, and whose intent the current team no longer fully understood. We replaced them with a declarative validation framework, selected after the data engineering guild ran an extensive evaluation of eight open-source and commercial solutions against a scoring rubric developed collaboratively by representatives from engineering, product, and data science. The rubric ensured the chosen solution would meet not only our current technical requirements but also the demands we anticipate as the organisation continues to scale.
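To show what "declarative" buys you over the old imperative scripts, here is a toy sketch of the pattern (not the framework we actually adopted, and the field names are invented): the rules live in a data structure that anyone can read and extend, and a single generic function applies them.

```python
# Rules are data, not code: each field declares the checks it must pass.
SCHEMA = {
    "user_id": {"type": int, "required": True},
    "email":   {"type": str, "required": True, "check": lambda v: "@" in v},
    "age":     {"type": int, "required": False, "check": lambda v: 0 <= v < 150},
}

def validate(record, schema):
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
        elif "check" in rule and not rule["check"](value):
            errors.append(f"{field}: failed constraint")
    return errors
```

Because the schema is plain data, changing a rule is a one-line diff that reviewers can reason about without reading any control flow, which is exactly the property the old scripts lacked.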

We deployed the new pipeline to production incrementally over a four-week period, with the platform team gradually shifting traffic from the old infrastructure to the new using a feature-flag-based routing mechanism built specifically for this migration by two engineers embedded in the platform team for the duration of the rollout. No significant customer-facing incidents were recorded during the migration, a result the team attributes to the extensive integration testing performed in a dedicated staging environment over the six weeks preceding the production release.
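The core of a gradual traffic shift like this is deterministic bucketing: a given customer must land on the same pipeline every time, and raising the rollout percentage should only move customers from old to new, never bounce them back and forth. A minimal sketch of that routing decision (our real mechanism was more involved; the function and parameter names here are illustrative):

```python
import hashlib

def route_to_new_pipeline(customer_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a customer into [0, 100) by hashing its ID.

    The same customer always gets the same bucket, so as rollout_percent
    ramps from 0 to 100, customers move to the new pipeline exactly once.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

During a rollout the flag value would typically be read from a config service on each request, so raising it from, say, 5 to 25 shifts more buckets to the new pipeline without any redeploy.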

We are documenting the lessons learned throughout this project in a series of internal retrospective sessions facilitated by the engineering manager, and we anticipate that the communications team will publish a public post-mortem later this quarter. That write-up will explain the key architectural decisions made during the project and discuss in detail the trade-offs we encountered along the way, for the benefit of the broader engineering community.