The AI-Native Telco Network I
The AI-Native Telco Network II
The AI-Native Telco Network III
The AI-Native Telco Network IV: Compute
The AI-Native Telco Network V: Network
As it turns out, a network that needs to run AI, whether to self-optimize or to offer wholesale AI-related services, requires some adjustments compared to a conventional telecom network. After looking at the compute and network functions, this post looks at storage.
Storage has, for the longest time, been an afterthought in telecom networks. Beyond IT workloads and data center management, storage needs were usually addressed as part of the compute functions sold by server vendors or, when necessary, as direct-attached storage appliances, usually OEM'd or resold by the same vendors.
In today's networks, each network function, whether physical, virtualized, or containerized, comes with its own dedicated storage. The data generated by each function (telemetry, alarms, user and control plane data, logs, events) is stored locally first; a portion is then exported to a data lake for cleaning and processing, and eventually to a data warehouse on a private or public cloud, so that OSS, BSS, and analytics functions can provide dashboards on the health, load, and usage of the network, along with optimization recommendations.
The extraction, cleaning, and processing of these disparate datasets takes time, anywhere from 30 minutes to several hours, before they accurately represent the network state.
One of the applications of AI/ML in telecom networks is to optimize the network, reactively when an event occurs or proactively when a change can be planned. This supposes a feedback loop between the analytics layer and the operational layer, whereby a recommendation to change network parameters can be executed programmatically and automatically.
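As a minimal sketch of such a closed loop in Python: the metric names, threshold, and configuration call (fetch_kpis, apply_config, load_balancing_offset) are hypothetical stand-ins, not any specific vendor API.

```python
import random
import time

ANOMALY_THRESHOLD = 0.95  # illustrative PRB utilization threshold

def fetch_kpis(cell_id: str) -> dict:
    """Stand-in for a near-real-time KPI query against the metrics store."""
    return {"prb_utilization": random.random()}

def apply_config(cell_id: str, params: dict) -> None:
    """Stand-in for a programmatic parameter change via the config API."""
    print(f"remediating {cell_id}: {params}")

def closed_loop(cell_ids: list[str], interval_s: int = 60) -> None:
    """Poll KPIs, detect a simple threshold anomaly, remediate automatically."""
    while True:
        for cell in cell_ids:
            if fetch_kpis(cell)["prb_utilization"] > ANOMALY_THRESHOLD:
                # The analytics recommendation executes without a human in
                # the loop; a real system would add guardrails and rollback.
                apply_config(cell, {"load_balancing_offset": 2})
        time.sleep(interval_s)
```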
Speed becomes essential, particularly to react to unpredicted events; reducing reaction time to an element outage is crucial. This supposes that the state of the network is observable in near real time, so that AI/ML engines can detect patterns and anomalies and provide root-cause analysis and remediation as fast as possible. The compute applied to these calculations and the speed of transmission have a direct effect on reaction time, but they are not the only factors.
Storage, as it turns out, is also a crucial element of an AI-native network. The large majority of AI/ML workloads rely on object storage, whereby each data element is stored independently and unstructured, irrespective of size, together with associated metadata that describes the element in detail, allowing easy association and manipulation for AI/ML.
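For illustration, a minimal sketch of writing an event record as an object with rich metadata through the de facto standard S3 API (boto3 here); the bucket, key, and metadata fields are hypothetical:

```python
import json
import boto3

# Any S3-compatible endpoint works here; bucket and fields are hypothetical.
s3 = boto3.client("s3")

event = {"cell_id": "gNB-0042", "type": "handover_failure", "count": 17}

s3.put_object(
    Bucket="ran-telemetry",
    Key="events/2025/06/01/gNB-0042/evt-000123.json",
    Body=json.dumps(event).encode("utf-8"),
    # Metadata travels with the object and lets AI/ML pipelines filter and
    # associate records without opening each payload.
    Metadata={
        "source": "gNB-0042",
        "domain": "ran",
        "schema-version": "1.2",
        "label": "handover_failure",
    },
)
```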
Why are traditional storage architectures not suitable for AI-Native Networks?
To enable an AI-native network, data elements must be extracted from their network functions quickly and transferred into a data repository that allows manipulation at scale. This is easier said than done. Legacy systems were originally built for block storage (databases and virtual machines: great for low latency, bad for high throughput). Objects are usually not natively supported and live in separate storage systems. Each vendor supports different protocols and interfaces, and each store is single-tenant, tied to its application.
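By contrast, an S3-compatible object store exposes one interface regardless of backend; a sketch, assuming a hypothetical on-premises endpoint and placeholder credentials:

```python
import boto3

# Endpoint URL and credentials are placeholders; the same client code works
# against an on-prem object store or a public cloud bucket, which removes
# the per-vendor protocol and single-tenant silo problem.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.core.example.net",  # hypothetical
    aws_access_key_id="TELCO_ACCESS_KEY",
    aws_secret_access_key="TELCO_SECRET_KEY",
)

for obj in s3.list_objects_v2(Bucket="ran-telemetry", Prefix="events/")["Contents"]:
    print(obj["Key"], obj["Size"])
```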
The data sets are increasingly varied, spanning large and small objects, data streams and files, and random and sequential read and write requirements. Legacy storage solutions require different systems for different use cases and data sets, which further lengthens the data amalgamation necessary for automation at scale.
Data needs to be properly labeled, with no limits on metadata, annotations, and tags, equally for billions of small objects (event records) and for very large ones (video files). Traditional storage solutions are designed either for small or for large objects and struggle to accommodate both in the same architecture. They also limit the amount of metadata per object. This increases cost and time to insight while reducing the capacity to evolve.
Datasets are live structures. They often exist in different formats and versions for different users. Traditional architectures cannot handle multiple formats simultaneously, and versions of the same dataset require separate storage elements. This leads to data inconsistencies, corruption, and divergence of insight.
Performance is key in AI systems, and it is multidimensional. Storage solutions need to accommodate high throughput, scale-out capacity, and low latency simultaneously. Traditional storage systems are built for capacity but not designed for high throughput and low latency, which dramatically reduces the performance of data pipelines.
Hybrid and multi-cloud become a key requirement for AI, as data needs to be exposed to the access, transport, core, and OSS/BSS domains at the edge, in the private cloud, and in the public cloud simultaneously. Traditional storage solutions require adaptation, translation, duplication, and migration to function across cloud boundaries, which significantly increases their cost while reducing their performance and capabilities.
As we have seen, the data storage architecture for a telecom network becomes a strategic infrastructure decision, and traditional storage solutions cannot accommodate AI and network automation at scale.
Storage Requirements for AI-Native Networks
Perhaps the most important attribute for AI project storage is agility: the ability to grow from a few hundred gigabytes to petabytes, to perform well with rapidly changing mixed workloads, to serve data to training and production clients simultaneously throughout a project's life, and to support the data models used by project tools.
The attributes of an ideal AI storage solution are:
Performance Agility
• I/O performance that scales with capacity.
• Rapid manipulation of billions of items, e.g., for randomization during training.
Capacity Flexibility
• Wide range (100s of gigabytes to petabytes).
• High performance with billions of data items.
• Range of cost points optimized for both active and seldom-accessed data.
Availability & Data Durability
• Continuous operation over decade-long project lifetimes.
• Protection of data against loss due to hardware, software, and operational faults.
• Non-disruptive hardware and software upgrade and replacement.
• Seamless data sharing by development, training, and production.
Space and Power Efficiency
• Low space and power requirements that free data center resources for power-hungry computation.
Security
• Strong administrative authentication.
• “Data at rest” encryption.
• Protection against malware (especially ransomware) attacks.
Operational Simplicity
• Non-disruptive modernization for continuous long-term productivity.
• Support for AI projects’ most-used interconnects and protocols.
• Autonomous configuration (e.g., device groups, data placement, protection).
• Self-tuning to adjust to rapidly changing mixed random/sequential I/O loads.
Hybrid and Multi-Cloud Natively
• Data agility to cross cloud boundaries.
• Centralized data lifecycle management.
• The ability to decide which data set is stored and processed where.
• Tiering from the edge for inference, to the private cloud for optimization and automation, to the public cloud for model training and replication (see the sketch below).
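As one illustration of centralized lifecycle management, a minimal sketch using an S3 lifecycle configuration to tier aging telemetry toward cheaper storage; the bucket name, prefix, transition targets, and retention periods are illustrative assumptions, not prescriptions:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical policy: keep recent telemetry hot for inference, demote older
# data to a cheaper tier for training, and expire raw records after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="ran-telemetry",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-training-data",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```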
Traditional "spinning disk" based storage have not been designed for AI/ML workloads. They lack the performance, agility, cost effectiveness, latency, power consumptions attributes necessary to enable AI networks at scale. Modern storage infrastructure, designed for high performance computing rely on Flash storage, an efficient, cost effective, low power, high performance technology that enables compute and network elements to perform at line rate for AI workloads.