← back to writing

Infrastructure · · first published on LinkedIn

CTO and CDO Infrastructure Challenges

Modular infrastructure, resilient pipelines, future-proofing. The architectural moves that handle AI workloads without overwhelming the budget or the team.

In a previous article, I talked about the “three legs of the stool” for data and AI success: Data, Infrastructure Foundation, and Implementation. I got a huge response from that piece, mostly around the foundation infrastructure supporting AI capabilities and legacy software development.

Encouraged by your feedback, today I want to double-click on that second leg, coming from my data-centric point of view of developing AI and data products that not only generate insights and analytics but also are integrated into main software products, guiding customer journeys, offering promotions, and hyper-personalisation, while keeping costs and latency down.

This is not just about bolting on new hardware or spinning up a few more cloud instances; it is about a strategic architectural approach that has to be tailored to take our AI ambitions into account.

Based on my experiences leading numerous data and AI projects, successful technology stacks share key characteristics: modular infrastructure, resilient processing and pipelines, and a focus on future-proofing. Topics such as cybersecurity, metadata monitoring, data lifecycle monitoring, and application performance are intentionally excluded here for brevity as they build upon this foundational stack.

Building out the right infrastructure for AI is not a one-time project; it is an ongoing journey of evaluation, adaptation, and optimisation. As CTOs and CDOs, our role is to ensure this foundation is not only robust and efficient today but also agile enough to embrace the innovations of tomorrow.

Modular Infrastructure Architectures

When we talk about AI, especially training complex models or running inference at scale, the demand for compute and data processing can be immense and spiky. One minute you might be ticking along, the next, a new model training cycle kicks in, or a successful AI-powered feature sees a surge in user adoption, and your resource needs can go through the roof. This is where scalability is not just a nice-to-have; it is the absolute bedrock.

Platforms like AWS, Azure, and Google Cloud offer incredible elasticity. They allow us to scale resources up or down almost instantaneously in response to fluctuating AI workloads. They also carry some level of risk as you might end up in an oversize contract that is expensive for the moment you are in your data journey, especially if your company talks a bigger game about data and AI compared to the actual level of automation and data products adoption.

The obvious benefits are avoiding massive upfront capital expenditures on hardware that might sit idle much of the time, and the pay-as-you-go scheme that closely follows the actual demand. Think about a retail company using AI for personalised recommendations. During a holiday sale, traffic might spike tenfold. A cloud-native infrastructure can automatically scale the AI inference services to handle this peak load and then scale back down when things quieten, optimising both performance and cost.

Then there is the processing power itself. Large-scale data pipelines are the lifeblood of AI. We need to ingest, transform, and feed vast amounts of data to our models. This is where distributed processing frameworks like Apache Spark or Ray come into their own. These frameworks are designed to parallelise tasks across clusters of machines, dramatically speeding up data processing and model training. I have seen reports suggesting that effective use of these frameworks can cut infrastructure costs associated with these heavy workloads by as much as 30%.

Cloud platforms are not the only game in town. In my experience, most large organisations are not starting with a blank slate. I have been fortunate enough to have worked for companies which have their own on-prem data infrastructure, meaning legacy systems, existing data warehouses, and applications that are critical to the business. This poses the question of developing hybrid strategies and consistent integration work.

A hybrid cloud strategy often makes a lot of sense. It allows for a gradual migration of workloads and data, minimising disruption to those crucial legacy systems. We can use the public cloud for its agility and scalability to rapidly prototype and deploy new AI agents, while sensitive data or deeply integrated systems might remain on-premise for a period.

Future-Proofing an Infrastructure That Evolves With AI

Any infrastructure that is cutting-edge today could be outdated tomorrow if it is not designed for flexibility. This is particularly true with the AI landscape evolving at breakneck speed. Future-proofing is not about predicting the future; it is about building an architecture that can adapt to it.

I am a big proponent of Lakehouse architecture. Most organisations are dealing with a mix of structured data (from databases, ERPs) and unstructured data (text, images, videos, sensor data). Traditional data warehouses excel at structured data, while data lakes are great for storing vast amounts of raw, unstructured data. A Lakehouse architecture combines the benefits of both, providing a unified platform that can support both batch processing and real-time AI agent training using all types of data.

Containerisation technologies like Docker and Kubernetes are also fundamental to architectural flexibility. Containers package up an application and its dependencies, ensuring it runs consistently across different environments. Kubernetes then orchestrates these containers, managing deployment, scaling, and operations. This approach not only ensures portability and avoids vendor lock-in but also streamlines the deployment pipeline for AI models.

An API-first design philosophy is paramount. By exposing data and functionalities through well-defined APIs, we simplify integration with new and emerging AI tools and frameworks. As the AI ecosystem continues to churn out innovations, an API-first approach ensures that our core infrastructure can easily connect to and leverage these new capabilities.

Finally, distributed technologies designed from the ground up to address modern AI infrastructure challenges, like CockroachDB, are worth a serious look. These distributed architectures give seamless horizontal scaling across regions, automatic data rebalancing, and low-latency global access, all while preserving strong consistency. That is a major win for real-time AI systems. Coupling this with elastic compute and tiered storage means you are not just scaling, you are optimising resource usage dynamically, which is exactly what volatile AI workloads demand.

The Journey Continues

These foundational discussions around core requirements of scalability, resilience, data agility, and native integration with AI workflows are just the beginning. We must consider cybersecurity and compliance concerns as we become any company’s strategic enablers.

The possibilities that AI unlocks are immense, but realising that potential hinges on the strength and intelligence of the infrastructure that supports it. By prioritising modularity, flexibility, efficiency, and strategic alignment, and by leveraging modern data technologies designed for these challenges, we can empower our organisations to deploy AI agents that truly transform our businesses, adapting to evolving needs without overwhelming our budgets or our teams.