{Core Analysis}: availability

Tuesday, June 21, 2016

SDN / NFV: Enemy of the state

Extracted from my SDN and NFV in wireless workshop.

I want to talk today about an interesting subject I have seen popping up over the last six months or so, including in many presentations in the stream I chaired at the NFV World Congress a couple of months ago.

In NFV, and to a certain extent in SDN as well, service availability is achieved through a combination of function redundancy and fast failover routing whenever a failure is detected in the physical or virtual fabric. Availability is a generic term, though, and covers different expectations depending on whether you are a consumer, an operator or an enterprise. The telecom industry has heralded the mythical 99.999% or five nines availability as the target to reach for telecoms equipment vendors.

This goal has led to networks and appliances that are super redundant at the silicon, server, rack and geographical levels, with complex routing, load balancing and clustering capabilities to guarantee that element failures do not catastrophically impact services. In today's cloud networks, one arrives at the conclusion that a single cloud, even tweaked, can't perform beyond three nines availability and that you need a multi-cloud strategy to attain five nines of service availability...
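To put numbers on the three nines versus five nines gap, here is a minimal sketch of the arithmetic. It assumes the clouds fail independently and that failover between them is instantaneous, which is optimistic: correlated failures and failover time both eat into the result. The function names are mine, for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability (0.999 = three nines)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(availability: float, copies: int) -> float:
    """Availability of a service that is up as long as at least one of
    `copies` deployments is up. Assumes independent failures (illustrative)."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} minutes of downtime per year")
    two_clouds = combined_availability(0.999, 2)
    print(f"two independent three-nines clouds: {two_clouds:.6f}")
```

Three nines is roughly 526 minutes of downtime a year, five nines a little over 5 minutes. Under the independence assumption, two three-nines clouds together clear the five nines bar on paper, which is the intuition behind the multi-cloud argument.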

Consumers, over the last ten years, have proven increasingly ready to accept a service that might not always be of the best quality if the price point is low enough. We all remember the start of Skype, when we would complain of failed and dropped calls or voice distortions, but we all put up with it, mostly because it was free-ish. As the service quality improved, new features and subscription schemes were added, allowing for new revenues as consumers adopted new services.
One could think from that example that maybe it is time to relax the five nines edict for telecoms networks, but there are two reasons that run counter to that assumption.

The first and most prominent reason to keep a high level of availability is actually a regulatory mandate. Network operators operate not only a commercial network but also a series of critical infrastructures for emergency and government services. It is easy to think that 95 or 99% availability is sufficient until you have to deliver 911 calls, where that percentage difference means loss of life.
The second reason is more innate to network operators themselves. Year after year, polls show that network operators believe that the way they will outcompete each other and OTTs in the future is quality of service, where service availability is one of the first table stakes.

As I am writing this blog, SDN and NFV in wireless have struggled their way, over the last few years, from demonstrating basic load balancing and static traffic routing to function virtualization and auto scaling. What is left to get commercial grade (and telco grade) offerings is resolving the orchestration bit (I'll write another post on the battles in this segment) and creating a service that is both scalable and portable.

The portable bit is important, as a large part of the value proposition is to be able to place functions and services closer to the user or the edge of the network. To do that, an orchestration system has to be able to detect what needs to be consumed where and to place and chain relevant functions there.
Many vendors can demonstrate that part. The difficulty arises when it becomes necessary to scale in or down a function or when there is a failure.

Physical and virtual function failures are to be expected. When they arise in today's systems, there is a loss of service, at least for the users that were using these functions. In some cases, the loss is transient and a new request / call will be routed to another element the second time around. In other cases, it is permanent and the session / service cannot continue until another one is started.

In the case of scaling in or down, most vendors today will starve the virtual function and route all new requests to other VMs until this function can be shut down without impact to live traffic. It is not the fastest or the most efficient way to manage traffic. You essentially lose all the elasticity benefits on the scale down if you have to manage these moribund zombie-VNFs until they are ready to die.
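The starve-then-shut-down behaviour described above can be sketched in a few lines. This is a toy model, not any vendor's implementation: `VnfInstance`, `Pool` and their methods are hypothetical names, and sessions are just strings.

```python
from dataclasses import dataclass, field


@dataclass
class VnfInstance:
    name: str
    sessions: set = field(default_factory=set)
    draining: bool = False  # True once scale-in has been requested


class Pool:
    """Toy pool of VNF instances (hypothetical, for illustration only)."""

    def __init__(self, instances):
        self.instances = list(instances)

    def route_new_session(self, session_id):
        # New sessions only go to instances that are not being starved.
        candidates = [i for i in self.instances if not i.draining]
        if not candidates:
            raise RuntimeError("no instance accepts new sessions")
        target = min(candidates, key=lambda i: len(i.sessions))
        target.sessions.add(session_id)
        return target

    def start_scale_in(self, instance):
        # The instance keeps serving its live sessions but gets no new ones.
        instance.draining = True

    def end_session(self, session_id):
        for instance in self.instances:
            instance.sessions.discard(session_id)

    def reap(self):
        """Remove draining instances with no live sessions; return their names."""
        finished = [i for i in self.instances if i.draining and not i.sessions]
        self.instances = [
            i for i in self.instances if not (i.draining and not i.sessions)
        ]
        return [i.name for i in finished]
```

The point the sketch makes is in `reap`: a draining instance cannot be released until its last session ends, so the capacity stays allocated for as long as the longest-lived session holds on.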

Vendors and operators who have been looking at these issues have come to a conclusion. Beyond the separation of control and data plane, it is necessary to go further and separate the state of each machine, function and service, and to centralize it, in order to achieve consistent availability and true elasticity and to manage disaster recovery scenarios.
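As a rough illustration of what separating state from the execution environment buys you, here is a minimal sketch. `StateStore` stands in for whatever replicated, centralized state service a real design would use, and `StatelessWorker` for a function instance; both are hypothetical names, and the in-memory dictionary is an assumption made to keep the example self-contained.

```python
class StateStore:
    """Stand-in for a centralized state service (hypothetical).
    A real one would be replicated and reached over the network."""

    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        # Return a copy so that workers never hold the authoritative state.
        return dict(self._sessions.get(session_id, {}))

    def save(self, session_id, state):
        self._sessions[session_id] = dict(state)


class StatelessWorker:
    """A function instance that keeps no session state between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, nbytes):
        state = self.store.load(session_id)
        state["bytes"] = state.get("bytes", 0) + nbytes
        state["last_worker"] = self.name
        self.store.save(session_id, state)
        return state


if __name__ == "__main__":
    store = StateStore()
    first, second = StatelessWorker("w1", store), StatelessWorker("w2", store)
    first.handle("call-42", 100)
    # "w1" can now fail or be scaled in; "w2" picks up the same session.
    print(second.handle("call-42", 50))
```

Because the session lives in the store and not in the worker, an instance can be killed or scaled in at any time without waiting for its sessions to end. The cost is a load and a save on every request.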

In most cases, this is a complete redesign for vendors. Many of them have already struggled to port their product to software, then port it to a hypervisor, then optimize it for performance... Separating state from the execution environment is not going to be just another port. It is going to require redesign and re-architecting.

The cloud-native vendors who have designed their platform with microservices and modularity in mind have a better chance, but there is still a series of challenges to be addressed. Namely, collecting state information from every call in every function, centralizing it and then redistributing it is going to create a lot of signalling traffic. Some vendors are advocating inline signalling capabilities to convey the state information in a tokenized fashion. Others are looking at more sophisticated approaches, including state controllers that will collect, transfer and synchronize relevant controllers across clouds.
In any case, it looks like there is still quite a lot of work to be done in creating truly elastic and highly available virtualized, software defined networks.
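One possible reading of the tokenized, inline approach mentioned above is that the session state travels with the traffic itself, protected so that any instance can trust it without asking a central store. The sketch below is my own assumption of what that could look like, not a description of any vendor's design: the state is serialized, signed with a shared key and carried as a token. The key, the token format and the function names are all hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared key for the demo; a real system needs key management.
SECRET = b"demo-only-shared-key"


def issue_token(state: dict) -> str:
    """Serialize the session state and sign it so it can travel inline."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(tag).decode()
    )


def read_token(token: str) -> dict:
    """Verify the signature and recover the state on any instance."""
    payload_b64, tag_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    tag = base64.urlsafe_b64decode(tag_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("token failed integrity check")
    return json.loads(payload)
```

The trade-off against a central state controller is visible even at this scale: there is no extra signalling to a store, but every message grows by the size of the state, and the state is only as fresh as the last token issued.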

Tuesday, May 5, 2015

NFV world congress: thoughts on OPNFV and MANO

This week I am in sunny San Jose, California, at the NFV World Congress, where on Thursday I will chair the stream on Policy and orchestration - NFV management.
My latest views on SDN / NFV implementation in wireless networks are published here.

The show started today with a mini-summit on OPNFV, looking at the organization's mission, roadmap and contribution to date.

The workshop was well-attended, with over 250 seats occupied and a good number of people standing in the back. On the purpose of OPNFV, it feels that the organization is still trying to find its mark a little bit, hesitating between being a transmission belt between ETSI NFV and open source implementation projects and graduating to a prescriptive set of blueprints for NFV implementations in wireless networks.

If you have trouble following, you are not the only one. I am quite confused myself. I thought OpenStack had a mandate to create source code for managing cloud network infrastructure and that NFV was looking at managing services in a virtualized fashion, which could sit on premises, in clouds and in hybrid environments. Granted, NFV does not produce code, but why do we need OPNFV for that?

Admittedly, the organization is not necessarily deterministic in its roadmap, but rather works on what its members feel is needed. As a result, it has decided that its first release, code-named ARNO, will support KVM as the hypervisor environment and will feature an OpenStack architecture underpinned by an OpenDaylight-based SDN controller. ARNO should be released "this spring" and is limited in its scope, as a first attempt to provide an example of carrier-grade, ETSI NFV-based source code for managing an SDN infrastructure. Right now, ARNO is focused on the VIM (Virtualized Infrastructure Manager). Since the full MANO is not yet standardized and is felt to be too big a chunk to look at for a first release, it will be part of a later requirement phase. The organization is advocating pushing requirements and bug resolution upstream (read: to other open source communities) to make the whole SDN / NFV more "carrier-grade".

This is where, in my mind, the reasoning breaks down. There is a contradiction in terms and intent here. On one hand, OPNFV advocates that there should not be separate branches for carrier-specific requirements within implementation projects such as OpenStack, carrier-grade being the generic shorthand for high availability, scalability and high performance. The rationale is that these requirements could be beneficial to the whole OpenStack ecosystem. On the other hand, OPNFV seems to have been created primarily to implement and test NFV-based code for carrier environments. Why do we need OPNFV at all if we can push these requirements within OpenStack and ETSI NFV? The organization feels more like an attempt to supplement or even replace ETSI NFV with an open source collaborative project that would be out of ETSI's hands.

More importantly, if you have been to an OpenStack meeting, you know that you are probably twice as likely to meet people from the banking, insurance, media or automotive industries as from the telecommunications space. I have no doubt that theoretically, everyone would like more availability, scalability and performance, but practically, the specific needs of each enterprise segment rarely mean they are willing to pay for over-engineered networks. Telco carrier-grade was born from regulatory pressure to provide a public infrastructure service. Many enterprises wouldn't know what to do with the complications and constraints arising from it.

As a result, I personally have doubts about the ability of the telcos and forums such as OPNFV to influence larger groups such as OpenStack to deliver a "carrier-grade" architecture and implementation. I think that telco operators and vendors are a little confused by open source. They essentially treat it as a standard, submitting change requests, requirements and gap analyses, while not enough is done (by the operator community at least) to actually get their hands dirty and code. The examples of AT&T, Telefonica, Telecom Italia and some others are not, in my mind, reflective of the industry at large.

If ETSI were more effective, service orchestration in MANO would be the first agenda item, and plumbing such as VIM would be delegated to more advanced groups such as OpenStack. If a network has to become truly elastic, programmable, self-reliant and agile in a multi-vendor environment, then MANO is the brain and it has to be defined and implemented by the operators themselves. Otherwise, we will see Huawei, Nokialcatelucent, Ericsson, HP and others effectively become the app store of the networks (last I checked, it did not work very well for operators when Apple and Android took control of that value chain...). Vendors have no real incentive to make orchestration open and to fulfill the vendor-agnostic vision of NFV.
