
Wednesday, November 2, 2016

TIPping point

For those of you familiar with this blog, you know that I have been advocating for more collaboration between content providers and network operators for a long time (here and here for instance). 

In my new role at Telefonica, I support a number of teams of talented intrapreneurs tasked with inventing Telefonica's next-generation networks to serve the evolving needs of our consumers, enterprises and things at a global level. Additionally, connecting the unconnected and fostering sustainable, valuable connectivity services is a key mandate for our organization.

Very quickly, much emphasis has been put on delivering specific, valuable use cases through a process of hypothesis validation by prototyping, testing and commercial trials in compressed time frames. I will tell you more about Telefonica's innovation process in a future blog post.

What has been clear is that open source projects and SDN have been a huge contributing factor in our teams' early successes. It is quite impossible to have weekly releases, innovation sprints and rapid prototyping without the flexibility afforded by software-defined networking. What has become increasingly important as well, as projects grow and get transitioned to our live networks, is the necessity of preparing people and processes for this more organic and rapid style of development. There are certainly many methodologies and concepts to enhance team and development agility, but we have been looking for a hands-on approach best suited to our environment as a network operator.

As you might have seen, Telefonica joined Facebook's Telecom Infra Project earlier this year and we have found this collaboration helpful. We are renewing our commitment and increasing our areas of interest beyond the Media Friendly Group and the Open Cellular Project with the announcement of our involvement in the People and Processes group. Realizing that - beyond technology - agility, adaptability, predictability and accountability are necessary traits of our teams, we are committing ourselves to sustainably improving our methods in recruitment, training, development, operations and human capital.

We are joining other network operators that have started - or will start - this journey, and we look forward to sharing with the community the results of our efforts and the path we are taking to transform our capabilities and skills.

Tuesday, June 21, 2016

SDN / NFV: Enemy of the state

Extracted from my SDN and NFV in wireless workshop.

I want to talk today about an interesting subject that has been popping up over the last six months or so, including in many presentations in the stream I chaired at the NFV World Congress a couple of months ago.

In NFV, and to a certain extent in SDN as well, service availability is achieved through a combination of function redundancy and fast failover routing whenever a failure is detected in the physical or virtual fabric. Availability is a generic term, though, and covers different expectations depending on whether you are a consumer, operator or enterprise. The telecom industry has heralded the mythical 99.999%, or five nines, availability as the target for telecom equipment vendors.

This goal has led to networks and appliances that are super redundant at the silicon, server, rack and geographical levels, with complex routing, load balancing and clustering capabilities to guarantee that element failures do not catastrophically impact services. In today's cloud networks, one arrives at the conclusion that a single cloud, even tweaked, can't perform beyond three nines of availability, and that you need a multi-cloud strategy to attain five nines of service availability...
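To make the arithmetic behind that claim explicit, here is a quick back-of-envelope sketch in Python. It assumes the clouds fail independently and that any single surviving cloud can carry the service, which is of course a simplification:

```python
# Back-of-envelope: availability of a service deployed across N independent clouds.
# Assumes failures are independent and any single surviving cloud can carry the service.

def multi_cloud_availability(single_cloud_availability: float, clouds: int) -> float:
    """Probability that at least one of `clouds` independent clouds is up."""
    all_down_probability = (1.0 - single_cloud_availability) ** clouds
    return 1.0 - all_down_probability

three_nines = 0.999  # a single cloud, per the figure quoted above

for n in (1, 2, 3):
    a = multi_cloud_availability(three_nines, n)
    minutes_down_per_year = (1.0 - a) * 365 * 24 * 60
    print(f"{n} cloud(s): {a:.9f} availability, ~{minutes_down_per_year:.2f} min downtime/year")

# 1 cloud : 0.999       -> ~525.6 min/year (three nines)
# 2 clouds: 0.999999    -> ~0.53 min/year (already past five nines)
# 3 clouds: 0.999999999 -> seconds per year
```

In practice, failures across clouds are rarely fully independent (shared software bugs, shared operations, shared fibre routes), so real deployments land somewhere between the single-cloud and the idealized multi-cloud figures.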

Consumers, over the last ten years, have proven increasingly ready to accept a service that might not always be of the best quality if the price point is low enough. We all remember the early days of Skype, when we would complain of failed and dropped calls or voice distortion, but we all put up with it mostly because it was free-ish. As the service quality improved, new features and subscription schemes were added, allowing for new revenues as consumers adopted new services.
One could think from that example that maybe it is time to relax the five nines edict for telecom networks, but there are two data points that run counter to that assumption.


  1. The first and most prominent reason to keep a high level of availability is actually a regulatory mandate. Network operators run not only a commercial network but also critical infrastructure for emergency and government services. It is easy to think that 95 or 99% availability is sufficient until you have to deliver 911 calls, where that percentage difference means loss of life.
  2. The second reason is more innate to network operators themselves. Year after year, polls show that network operators believe the way they will outcompete each other and OTTs in the future is quality of service, where service availability is among the first table stakes.


As I write this, SDN and NFV in wireless have, over the last few years, struggled their way from demonstrating basic load balancing and static traffic routing to function virtualization and auto-scaling. What is left to get to commercial-grade (and telco-grade) offerings is resolving the orchestration bit (I'll write another post on the battles in this segment) and creating services that are both scalable and portable.

The portable bit is important, as a large part of the value proposition is to be able to place functions and services closer to the user or the edge of the network. To do that, an orchestration system has to be able to detect what needs to be consumed where and to place and chain relevant functions there.
Many vendors can demonstrate that part. The difficulty arises when it becomes necessary to scale a function in or down, or when there is a failure.
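To make that placement step concrete, here is a purely illustrative sketch; the site names, latency figures and capacity model are hypothetical and not taken from any particular orchestrator:

```python
# Hypothetical sketch of edge placement: pick the closest site with enough free
# capacity to host a whole service chain, then "instantiate" the functions there.

from dataclasses import dataclass, field

@dataclass
class EdgeSite:
    name: str
    latency_ms: float      # measured latency from the user to this site
    free_capacity: int     # abstract capacity units available
    hosted: list = field(default_factory=list)

def place_chain(sites, chain, cost_per_vnf=1):
    """Return the lowest-latency site that can host the whole chain, or None."""
    needed = cost_per_vnf * len(chain)
    candidates = [s for s in sites if s.free_capacity >= needed]
    if not candidates:
        return None                      # no edge placement possible; fall back to a central site
    best = min(candidates, key=lambda s: s.latency_ms)
    best.free_capacity -= needed
    best.hosted.extend(chain)            # stand-in for instantiating and chaining the functions
    return best

sites = [
    EdgeSite("central-dc", latency_ms=40.0, free_capacity=100),
    EdgeSite("edge-madrid", latency_ms=5.0, free_capacity=3),
    EdgeSite("edge-lima", latency_ms=6.0, free_capacity=1),
]
print(place_chain(sites, ["firewall", "video-optimizer", "nat"]).name)  # -> edge-madrid
```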

Physical and virtual function failures are to be expected. When they arise in today's systems, there is a loss of service, at least for the users that were using these functions. In some cases, the loss is transient and a new request / call will be routed to another element the second time around; in other cases, it is permanent and the session / service cannot continue until another one is started.

In the case of scaling in or down, most vendors today will starve the virtual function and route all new requests to other VMs until that function can be shut down without impacting live traffic. It is neither the fastest nor the most efficient way to manage traffic: you essentially lose all the elasticity benefits on the scale-down if you have to manage these moribund zombie VNFs until they are ready to die.
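Here is a minimal sketch of that starve-and-drain pattern; the session counter, polling loop and timings are hypothetical rather than any vendor's actual VNFM logic, but they show why the scaled-in instance lingers well after the scaling decision has been made:

```python
# Hypothetical sketch of the "starve then shut down" scale-in pattern described above.
# The draining instance takes no new sessions but must wait for existing ones to end.

import random
import time

class VnfInstance:
    def __init__(self, name: str, active_sessions: int):
        self.name = name
        self.active_sessions = active_sessions
        self.draining = False

    def start_drain(self):
        self.draining = True  # the load balancer stops sending new requests here

    def poll_sessions(self):
        # Stand-in for querying the VNF: existing sessions end gradually on their own.
        self.active_sessions = max(0, self.active_sessions - random.randint(0, 50))
        return self.active_sessions

def scale_in(instance: VnfInstance, poll_interval_s: float = 0.2, timeout_s: float = 30.0):
    instance.start_drain()
    deadline = time.monotonic() + timeout_s
    while instance.poll_sessions() > 0:
        if time.monotonic() > deadline:
            print(f"{instance.name}: drain timeout, terminating with sessions still active")
            return
        time.sleep(poll_interval_s)   # the "zombie" VNF keeps consuming resources meanwhile
    print(f"{instance.name}: drained, safe to terminate")

scale_in(VnfInstance("vnf-worker-7", active_sessions=300))
```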

Vendors and operators who have been looking at these issues have come to a conclusion: beyond the separation of control and data planes, it is necessary to further separate the state of each machine, function and service, and to centralize it, in order to achieve consistent availability and true elasticity and to manage disaster recovery scenarios.
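Conceptually, the instance then keeps no session state of its own, so any replica can pick up a session after a failure or a scale-in. Below is a minimal sketch of that idea, with a plain dictionary standing in for whatever centralized state store (distributed cache, dedicated state controller, etc.) a real design would use:

```python
# Sketch of state externalization: the worker holds no session state locally,
# so any replica can resume any session from the shared store.

class SharedStateStore:
    """Stand-in for a centralized / replicated state store."""
    def __init__(self):
        self._state = {}

    def load(self, session_id: str) -> dict:
        return dict(self._state.get(session_id, {"packets": 0}))

    def save(self, session_id: str, state: dict):
        self._state[session_id] = dict(state)

class StatelessWorker:
    def __init__(self, name: str, store: SharedStateStore):
        self.name = name
        self.store = store

    def handle_packet(self, session_id: str):
        state = self.store.load(session_id)   # pull state on every unit of work
        state["packets"] += 1
        self.store.save(session_id, state)    # push it back: nothing is kept locally
        return state["packets"]

store = SharedStateStore()
a, b = StatelessWorker("replica-a", store), StatelessWorker("replica-b", store)
a.handle_packet("session-42")
print(b.handle_packet("session-42"))  # -> 2: replica-b continues where replica-a left off
```

The catch discussed below is already visible in this toy example: every unit of work now carries a read and a write to the shared store, which is precisely the extra signalling traffic that tokenized inline state or dedicated state controllers try to keep under control.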

In most cases, this is a complete redesign for vendors. Many of them have already struggled to port their product to software, then to a hypervisor, then to optimize it for performance... separating state from the execution environment is not going to be just another port. It is going to require redesigning and re-architecting.

The cloud-native vendors who have designed their platforms with microservices and modularity in mind have a better chance, but there is still a series of challenges to be addressed. Namely, collecting state information from every call in every function, centralizing it and then redistributing it is going to create a lot of signalling traffic. Some vendors are advocating inline signalling capabilities to convey the state information in a tokenized fashion; others are looking at more sophisticated approaches, including state controllers that will collect, transfer and synchronize the relevant state across clouds.
In any case, it looks like there is still quite a lot of work to be done to create truly elastic and highly available virtualized, software-defined networks.

Monday, October 19, 2015

SDN world 2015: unikernels, compromises and orchestrated obsolescence

Last week's Layer123 SDN and OpenFlow World Congress brought its usual slew of announcements and claims.

From my perspective, I came away from the show with mixed impressions.

On one hand, it is clear that SDN has now transitioned from proof of concept to commercial trial, if not full commercial deployment, and operators increasingly understand the limits of open source initiatives such as OpenStack for carrier-grade deployments. The telling sign is the increasing number of companies specializing in high-performance, hardware-based switches for OpenFlow or other protocols.

It feels that Open vSwitch has not hit its stride, notably in terms of performance, and operators are left with a choice: go open source, which is cost efficient but neither scalable nor performant, or compromise with best-of-breed, hardware-based, hardened switches that offer high performance and scalability but not yet the agility of a software-based implementation. What is new, however, is that operators seem ready to compromise for time to market rather than wait for a possibly more open solution that may - or may not - deliver on its promises.

On the NFV front, I feel that many vendors have been forced to lower their silly claims in terms of performance, agility and elasticity. It is quite clear that many of them have been called to prove themselves in operators' labs and have failed to deliver. In many cases, vendors are able to demonstrate agility through VM porting / positioning, using either their VNFM or integration with an orchestrator; they are even, in some cases, able to show some level of elasticity with auto-scaling powered by their own EMS; and many have put out press releases claiming Gbps or Tbps of throughput or millions of simultaneous sessions of capacity...
... but few are able to demonstrate all three at the same time, since their performance achievements have, in many cases, relied on SR-IOV to bypass the hypervisor layer, which ties the VM to the underlying physical hardware in a manner that makes agility and elasticity extremely difficult to achieve.
Operators, here again, seem bound to choose between performance and agility if they want to accelerate their time to market.

Operators themselves came in droves to show their progress on the subject, but I felt a distinct change in tone regarding their capacity to get vendors to deliver on the promises of the successive NFV white papers. One issue lies squarely with the operators' own attitude. Many MNOs display unrealistic and naive expectations: they say they are investing in NFV as a means to attain vendor independence, but they are unwilling to perform any integration themselves. It is very unlikely that large Telecom Equipment Manufacturers will willingly help deconstruct their own value proposition by offering commoditized, plug-and-play, open-interfaced virtualized functions.

SDN and NFV integration is still dirty work. Nothing really performs at line rate without optimization, and no agility, flexibility or scalability is really attained without fine-tuned integration. Operators won't realize the benefits of the technology if they don't get in on the integration work themselves.

Lastly, what is still missing from my perspective is a service creation strategy that would make use of a virtualized network. Most network operators still mention service agility and time to market as key drivers, but when asked what they would launch if their network were fully virtualized and elastic today, they quote disappointing early examples such as virtual (!?) VPN, security or broadband on demand... timid translations of existing "services" into a virtualized world. I am not sure most MNOs realize their competition is not each other but Google, Netflix, Uber, Facebook and others...
By the time those players launch free and unlimited voice, data and messaging services underpinned by advertising or sponsored models, it will be quite late to think of new services, even if the network is fully virtualized. It feels like MNOs are orchestrating their own obsolescence.

Finally, the latest buzzwords you must have in your presentation this quarter are:
The pets and cattle analogy,
SD-WAN,
5G

...and if you haven't yet formulated a strategy with respect to containers (Docker, etc.), don't bother: they're dead and the next big thing is unikernels. This and more in my latest report and workshop on "SDN NFV in wireless networks 2015 / 2016".

Tuesday, August 26, 2014

SDN / NFV part V: flexibility or performance?


Early on in my investigations of how SDN and NFV are being implemented in mobile networks, I found that performance remains one of the largest stumbling blocks the industry has to overcome if we want to transition to next-generation networks.

Specifically, many vendors recognize behind closed doors that a virtualized environment today has many performance challenges. This probably explains why so many of the PoCs feature chipset vendors as participants. A silicon vendor as a main proponent of virtualization is logical, as the industry seeks to transition from purpose-built proprietary hardware to open COTS platforms. It does not fully explain, though, the heavy involvement of chipset vendors in these PoCs. Surely, if the technology were interoperable and open, chipset vendor integration would not be necessary?

Linux limitations

Linux as an operating system was originally developed for single-core systems. As multi-core and multithreaded architectures made their appearance, the operating system showed great limitations in managing particularly demanding data plane applications. When one looks at a virtualized network function, one has to contend with both the host OS and the guest OS.
In both cases, a major loss of performance is observed on entry to and exit from the VM and the OS. These software interrupts are necessary to pull packets from the data plane up to the application layer so that they can be processed. The cost of software interrupts for Linux kernel access ends up being prohibitive and creates bottlenecks and race conditions as traffic increases and more threads are involved. Specifically, every time the application needs to access the Linux kernel, the VM must be paused, its context saved and the application stalled while the kernel is accessed. For instance, a base station can generate over 100k software interrupts per second.
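A back-of-envelope calculation, taking the 100k interrupts per second quoted above and an assumed (illustrative, not measured) cost per VM exit and re-entry, shows how quickly this overhead eats into a core:

```python
# Back-of-envelope cost of software interrupts / VM exits on the data path.
# The per-exit cost below is an assumed illustrative value, not a measurement.

interrupts_per_second = 100_000   # figure quoted above for a base station
cost_per_exit_us = 4.0            # assumed: context save/restore + kernel entry, in microseconds

overhead_seconds_per_second = interrupts_per_second * cost_per_exit_us * 1e-6
print(f"CPU time lost to interrupt handling: {overhead_seconds_per_second:.2f} s per second "
      f"({overhead_seconds_per_second * 100:.0f}% of a core)")
# -> 0.40 s per second (40% of a core), before any useful packet processing is done
```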

Intel DPDK and SR-IOV

The Intel Data Plane Development Kit (DPDK), to which 6WIND has been a significant contributor, is used for I/O and packet forwarding functions. A “fast path” is created between the VM and the virtual network interface card (NIC) that improves data path processing performance. This implementation effectively bypasses the hypervisor's networking stack and provides fast processing of packets between the VM and the host.
At the host level, Single Root I/O Virtualization (SR-IOV) is also used in conjunction with DPDK to provide NIC-to-VM connectivity, bypassing the Linux host networking stack and improving packet forwarding performance. The trade-off is that each VM using SR-IOV must be tied to a physical network card.
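To see why these bypass techniques matter, it helps to work out the per-packet time budget at line rate. The sketch below uses the standard worst case of 10 Gbps with minimum-size 64-byte frames plus roughly 20 bytes of framing overhead per packet:

```python
# Per-packet time budget at 10 Gbps line rate with minimum-size (64-byte) frames.
# Each frame also carries ~20 bytes of preamble + inter-frame gap on the wire.

line_rate_bps = 10e9
bits_per_frame = (64 + 20) * 8           # 672 bits on the wire per minimum-size packet

packets_per_second = line_rate_bps / bits_per_frame
budget_ns = 1e9 / packets_per_second

print(f"{packets_per_second / 1e6:.2f} Mpps, ~{budget_ns:.0f} ns per packet")
# -> 14.88 Mpps, ~67 ns per packet: far less than the cost of a kernel crossing,
#    which is why kernel/hypervisor bypass (DPDK, SR-IOV) is needed at these rates.
```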

Performance or Flexibility?

The implementation of DPDK and SR-IOV has a cost. While VNFs implementing both techniques show performance close to that of a physical appliance, the trade-off is flexibility. In implementing these mechanisms, the VMs are effectively bound to the physical hardware resources they depend on. A perfect configuration, with completely identical replication of every element at the software and physical levels, is necessary for migration and scaling out. While Intel is working on a virtual DPDK integrated in the hypervisor, implementing SDN / NFV for data-plane-hungry network functions in wireless networks will force vendors and networks, in the short to medium term, to choose between performance and flexibility.

More content available here.