The AI-Native Telco Network III
As we have seen in previous posts, to accommodate and make use of AI at scale, a network must be architected and tuned for that purpose. While any telco network can deploy AI in discrete environments or throughout its fabric, the difference between a Data strategy and an AI strategy comes down to two things: speed and the feedback loop.
Most Data collected in a telco network has historically been used for very limited purposes: archiving for forensics to determine the root cause of an anomaly or outage, charging and customer management functions, or lawful interception and regulatory requirements. For these use cases, Data needs to be properly formatted and laid to rest until analytics engines can provide a representation of the state of the network or of an account. Speed is not an issue here; the system can tolerate delays of minutes or hours before a coherent picture is formed and presented.
AI, by contrast, can provide better insight than classical analytics by working over much larger datasets. It is better at correlating events and at predicting the evolution of the network state. It can also propose optimizations, enhancements and mitigation recommendations, but to be truly effective it needs a feedback loop to the network functions, so that these recommendations can be turned into actions and automated.
Herein lies the trick. If you want to run AI in your network so that the network can reactively or proactively auto-scale, self-heal and optimize its performance, power consumption, cost and so on, none of this can be done manually at scale. Automation is necessary throughout, and the speed from detection (of an event, anomaly, pattern or insight) to action becomes key.
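To make the idea concrete, here is a minimal sketch of such a closed loop in Python. All three stage functions are hypothetical stubs standing in for real telemetry, inference and orchestration interfaces; only the loop structure is the point.

```python
import random
import time

def detect_anomaly():
    """Stub: a real network would consume a streaming telemetry bus here."""
    if random.random() < 0.1:
        return {"type": "congestion", "cell": "gNB-42"}
    return None

def recommend_action(event):
    """Stub: a real network would run ML inference on the event here."""
    return {"target": event["cell"], "op": "scale_out", "replicas": 2}

def apply_action(action):
    """Stub: a real network would call an orchestrator API here."""
    print(f"applying {action['op']} on {action['target']}")

# The loop itself: detection -> recommendation -> action, continuously.
# The faster each stage completes, the tighter (and more useful) the loop.
for _ in range(50):
    event = detect_anomaly()
    if event:
        apply_action(recommend_action(event))
    time.sleep(0.1)
```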
As we have seen, speed is the product of high-performance, low-latency production, extraction, storage and processing of data to create actionable insights that can be automated. At the fabric layer, compute, connectivity and storage are the elements that must be properly designed to deliver that speed.
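As a worked example, a simple latency budget makes the point. The per-stage figures below are illustrative assumptions, not measurements; the exercise is checking that the sum of the pipeline stages fits inside the loop's time-to-action target.

```python
# Illustrative event-to-action budget; all figures are assumed, not measured.
stages_ms = {
    "produce (export telemetry)": 5,
    "extract (collect and transport)": 20,
    "store (write to hot tier)": 10,
    "process (features + inference)": 15,
    "act (push the automation)": 10,
}

total_ms = sum(stages_ms.values())
print(f"event-to-action: {total_ms} ms")  # 60 ms under these assumptions

budget_ms = 100  # e.g., a sub-100 ms closed-loop target
assert total_ms <= budget_ms, "pipeline too slow for the loop target"
```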
In this post, we will look at the compute function. Processing, analyzing and manipulating Data requires computing capability, and there are different architectures of computing units for different purposes (a selection sketch follows the list below):
- CPUs (Central Processing Units) are general-purpose processors, suitable for serial tasks; multiple CPU cores can work in parallel to enhance performance. They fit most telecoms functions except hard real-time processing. Generic CPUs are used in most telco data centers and clouds, from OSS and BSS to Core and transport. At the edge and in the RAN, CPUs run the Centralized Unit functions.
- ASICs (Application Specific Integrated Circuits) are chips designed for a specific task or application. They are not as versatile as other processing units but deliver the highest performance in the smallest footprint for their target application. They can be found in first-generation Open RAN servers running Distributed Unit functions, as well as in specialized packet routing and switching silicon (more on that in the connectivity post).
- FPGAs (Field Programmable Gate Arrays) are chips whose logic can be reprogrammed to adapt to specific workloads without a complete hardware redesign. They offer a good balance between adaptability and performance and are well suited to cryptographic and rapid data-processing tasks. In telco networks they appear in security gateways, as well as in advanced routing and packet-processing functions.
- GPUs (Graphics Processing Units) feature large numbers of smaller cores coupled with high memory bandwidth, making them suitable for graphics processing and massively parallel matrix calculations. In telco networks, GPUs are starting to be introduced for AI / ML workloads in data centers and clouds (neural networks and model training), as well as in the RAN for the Distributed Unit and the RAN Intelligent Controller.
- TPUs (Tensor Processing Units) are Google's specialized processing units, optimized for the tensor operations at the heart of ML and deep learning training and inference. They are not yet used in telco environments, but can be reached on Google Cloud in a hybrid scenario.
- NPUs (Neural Processing Units) are designed for neural network and deep learning processing. They are well suited to inference tasks, as their power consumption and footprint are very small. They are starting to appear in telco networks at the edge, and in devices.
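To summarize the trade-offs above, here is a toy selection heuristic in Python. The decision thresholds, field names and wattage cut-offs are invented for illustration; real procurement decisions involve many more variables.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    parallel: bool        # dominated by matrix / parallel math?
    fixed_function: bool  # logic frozen for the product's lifetime?
    training: bool        # model training rather than inference?
    site_power_w: float   # power envelope at the deployment site

def pick_compute(w: Workload) -> str:
    if w.fixed_function:
        return "ASIC"                     # best perf/watt for a frozen function
    if w.parallel and w.training:
        return "GPU or TPU"               # large-scale parallel training
    if w.parallel:
        return "NPU" if w.site_power_w < 30 else "GPU"  # inference: power decides
    return "FPGA" if w.site_power_w < 75 else "CPU"     # serial or reprogrammable

# Example: low-power edge inference lands on an NPU with these thresholds.
print(pick_compute(Workload(parallel=True, fixed_function=False,
                            training=False, site_power_w=15)))  # -> NPU
```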
Artificial Intelligence and Machine Learning can run on any of the above computing platforms; the difference lies in the performance, footprint, cost and power consumption profile of each. Lately we have seen GPUs emerge as the processing unit poised to replace CPUs, ASICs and FPGAs in specialized traffic functions, using the RAN and AI as their beachhead. GPUs are key to running AI workloads at scale, delivering the low latency and high throughput necessary for rapid time to insight.
Their cost and power consumption force network operators to find the right balance between the number of GPUs and their placement throughout the network: high processing power for model training in the private cloud, together with low latency for rapid inferencing and automation at the edge. While this architecture might provide the best basis for an automated or autonomous network, its cost and the rapid rate of change in GPU generations might give most operators pause.
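A quick sanity check on that placement argument, assuming the usual rule of thumb of roughly 200 km per millisecond for light in fibre; the site distances are invented for illustration.

```python
# Fibre propagation alone sets a floor on the control-loop latency.
FIBRE_KM_PER_MS = 200.0  # ~c / 1.47 (refractive index of fibre)

def round_trip_ms(distance_km: float) -> float:
    return 2 * distance_km / FIBRE_KM_PER_MS

for site, km in [("on-site edge", 5), ("regional DC", 300), ("central cloud", 1500)]:
    print(f"{site:>13}: {round_trip_ms(km):5.2f} ms RTT before any compute")

# on-site edge:  0.05 ms; regional DC: 3.00 ms; central cloud: 15.00 ms
# A sub-10 ms inference loop therefore cannot round-trip to a distant GPU farm.
```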
The main challenge becomes selecting the compute architecture that provides the most capacity and speed while remaining cost effective to procure and run. For this reason, many telco operators have decided, as a first step, to centralize their GPU farms to fine-tune their use cases, with only limited decentralized deployments. Another avenue for exploration is wholesaling the compute capacity to offset internal costs; we have seen a few GPUaaS and AIaaS initiatives announced recently.
In any case, most operators who have deployed high-capacity AI pods with GPUs find that the performance of the overall system requires further refinement, and look to connectivity as the next step in their AI-Native network journey. That will be the theme of our next post.