Interview : Daniel Jacobson on Ephemeral APIs and Continuous Innovation at Netflix

November 17, 2015

Following his talk at the recent “I Love APIs” conference, InfoQ had the opportunity to interview Daniel Jacobson about ephemeral APIs, their relationship to experience-based APIs and when to consider them in your organization.

Daniel leads development of critical systems that are the front door of Netflix, servicing 1,000+ different device types and billions of requests per day. He also manages the Netflix playback experience which accounts for approximately one-third of Internet downstream traffic in North America during peak hours.

InfoQ: What is your current role at Netflix and your day-to-day responsibilities?

Daniel Jacobson: I run the edge engineering team which is responsible for handling all traffic for all devices around the world for signup, discovery and playback. On the playback side we are responsible for the functionality that supports the playback experience. The API side is responsible for handling the traffic directly from devices, fetching data from a broad set of mid-tier data services and then brokering the data back. Both teams are critical to the success of Netflix because nobody can stream if playback is not available and nobody can stream if the API is not available.

InfoQ: Can you explain what Ephemeral APIs are all about and how different they are from the Experience APIs that you have proposed before?

DJ: Experience APIs are trying to handle an optimized response for a given requesting agent. That’s orthogonal to the ephemeral APIs. The experience API is more about the requesting pattern and the payload. Ephemeral API is more about the process of iterating and evolving the experience APIs.

Traditionally, APIs get set up to make it easier for the API provider to support, which results in one-size-fits-all APIs. The problem with that approach is that the API ends up being harder to use for a wide array of consumers. In other words, the optimization in that model is to make things easier for the few but harder for the many. For experience APIs, the goal is to focus on the needs of the individual requesters and optimize the APIs for each of them. It means that you are essentially running a wide array of different APIs. This results in a more challenging environment for the API provider to support because the variability is higher, but it allows the API consumers to develop what is best for them and for the performance of their clients. Ultimately, this should translate into a better customer experience.

Ephemerality is part of our story in how we develop our APIs, but not essential for the experience API model. Ephemeral APIs mean that the endpoints and payloads should be able to be terminated and created with ease and flexibility with the expectation that this can happen at any moment and potentially very frequently. If we can support ephemerality, then we can innovate faster and continuously to support the product needs without being a bottleneck.

To give an example, if we are running an A/B test to evaluate a new feature in our SmartTV experience, the UI team working on that feature can iterate on the client code and the APIs without the API team’s involvement. As they develop the test, they may realize that the data needs change or can be optimized, which would result in them killing the endpoints and creating new ones. This can happen dozens of times over the course of the project and without the API team getting involved (as long as all of the data elements already exist in the pipeline).
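As a rough illustration of that workflow, here is a minimal Python sketch of endpoints that are cheap to create and kill. It is not Netflix's implementation: the `EndpointRegistry` class, the `BASE_DATA` fields and the endpoint paths are all hypothetical names chosen for the example. The one rule it encodes comes from the answer above: a new endpoint can be created freely as long as every data element it needs already exists in the pipeline.

```python
from typing import Callable, Dict, Sequence

# Hypothetical data elements that already exist "in the pipeline".
BASE_DATA = {
    "title": "Example Show",
    "synopsis": "A placeholder synopsis.",
    "box_art_url": "https://example.invalid/art.jpg",
    "rating": 4.5,
}

Handler = Callable[[], dict]


class EndpointRegistry:
    """Toy registry in which endpoints are cheap to create and destroy."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Handler] = {}

    def create(self, path: str, fields: Sequence[str]) -> None:
        selected = tuple(fields)
        missing = [f for f in selected if f not in BASE_DATA]
        if missing:
            # A brand-new data element is the case that needs the API team.
            raise KeyError(f"not in pipeline: {missing}")
        # The endpoint is only a projection over data that already exists.
        self._endpoints[path] = lambda: {f: BASE_DATA[f] for f in selected}

    def kill(self, path: str) -> None:
        self._endpoints.pop(path, None)

    def call(self, path: str) -> dict:
        if path not in self._endpoints:
            raise LookupError(f"endpoint gone: {path}")
        return self._endpoints[path]()


registry = EndpointRegistry()
registry.create("/tv/test-cell-1", ["title", "synopsis"])
first = registry.call("/tv/test-cell-1")

# The UI team decides it needs box art instead of the synopsis, so it
# kills the old endpoint and creates a new one, with no API team involved.
registry.kill("/tv/test-cell-1")
registry.create("/tv/test-cell-2", ["title", "box_art_url"])
second = registry.call("/tv/test-cell-2")
```

In this sketch the only request that fails at creation time is one asking for a field outside `BASE_DATA`, which mirrors the caveat in parentheses above.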

InfoQ: What is the best way to find the right granularity for experience-based APIs? Is it mostly based on the device capabilities or on team organization?

DJ: I’ve written a detailed blog post on this topic in the past, which includes the recipe for when experience-based APIs might be a good choice. Basically, it is likely many companies don’t need to go this route because it’s a scale question.

So, if you have a wide array of different interaction models that are diverging and a close relationship with those who are consuming the APIs, those are good indicators that you might want to optimize for this. The proximity to the consumer of the API is key because you have a tighter feedback loop and more understanding of what their individual needs are.

The difference with generic resource-based APIs is that you don’t know who is going to consume the APIs and how they will be consumed. If the consumers are in your organization, and if you understand those nuances, you can create an architecture that is optimized for them all.

Within Netflix, we have created the architecture as a set of Java APIs and all these different device teams can build their own experience-based web APIs that are optimized for their clients. We like to call our system a platform for API development, more than a traditional API.

InfoQ: Do you have a separate API for Netflix mobile app on Android and on iOS?

DJ: In the construct of the platform, we have base Java APIs that are method calls within a JVM. Then, we have an adapter layer that sits on top of that where web APIs can be developed in a device-specific way. So, we have mobile teams developing their corresponding adapters, those are different endpoints, request patterns, payloads and maybe different protocols.
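The layering described here can be sketched in a few lines. The example below is written in Python purely for brevity (the real base APIs are Java method calls and the adapters are scripts), and every name in it, including `get_videos`, `ios_home_adapter` and `android_home_adapter`, is an assumption made for illustration. It shows one shared base API feeding two device-specific adapters that return differently shaped payloads.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Video:
    video_id: int
    title: str
    synopsis: str
    box_art_url: str


# Base API: plain in-process calls (a stand-in for the base Java APIs).
_CATALOG: Dict[int, Video] = {
    1: Video(1, "Example Show", "A placeholder synopsis.", "https://example.invalid/1.jpg"),
    2: Video(2, "Another Show", "Another placeholder.", "https://example.invalid/2.jpg"),
}


def get_videos(video_ids: List[int]) -> List[Video]:
    """Return the known videos for the given ids, skipping unknown ids."""
    return [_CATALOG[i] for i in video_ids if i in _CATALOG]


# Adapter layer: each device team shapes its own endpoint and payload.
def ios_home_adapter(video_ids: List[int]) -> dict:
    # Hypothetical iOS payload: compact rows carrying artwork only.
    return {"rows": [{"id": v.video_id, "art": v.box_art_url} for v in get_videos(video_ids)]}


def android_home_adapter(video_ids: List[int]) -> dict:
    # Hypothetical Android payload: titles and synopses keyed by id.
    return {
        str(v.video_id): {"title": v.title, "synopsis": v.synopsis}
        for v in get_videos(video_ids)
    }
```

Both adapters call the same `get_videos` function, so the data gathering is shared while the request pattern and payload stay specific to each device.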

There used to be more overlap between iOS and Android, but now these experiences are indeed different. There are shared functions across all of this so we built a set of tools to allow for the shareability.

InfoQ: Do you rely on an API language to describe Netflix APIs?

DJ: Not at this point. This is something we discuss periodically, but have not pursued yet because of the challenges and costs in maintaining them. Most of the time, if you have language descriptors it means that you are trying to fix things in place, make them consistent for the API consumers. Because our web APIs are ephemeral, the descriptor would also need to be ephemeral, so using one would cost more and not be as helpful.

But another thing is you have many teams building these web APIs with different needs and those teams are iterating on their consumption of the web APIs. This iteration is happening continuously because we are always running A/B tests that require changes to the data being delivered. As the teams iterate, the same person or group is writing and consuming the web API and they are doing the development of both at the same time, which means they already know the nature of the interface, so there is no value.

Most of the discussion about description languages has been at the Java API level, but again, those APIs are changing frequently as well. If we can find a way to describe those APIs consistently at very low cost, we would like to add that to the system, but so far it seems as though the costs of maintenance exceed the benefit.

InfoQ: Do you rely on API tooling to accelerate the development of APIs by device teams?

DJ: We develop a suite of tools to allow people to manage, deploy, and view the health of their API scripts, and to determine which endpoints are active and not. We also have tools to support shareability of code around these scripts and we have tools to inspect the payloads. Also, there are tools that we still need to develop. For example, the difficulty in this world is debuggability and we need to improve in this area.

InfoQ: How does your move to Universal JavaScript for your main web site fit into the experience-based API platform?

DJ: The architecture and API for the web site team is different than most devices because they have a separate tier fronting their API calls. For typical devices, they call directly into the web API but for the web site, they call into their own cluster where they handle the traffic directly and then call into our API cluster to get the data. What’s happening in their cluster and above it is currently outside our view but they are still writing scripts in our adapter layer.

What’s interesting is that we are investigating now if we should apply similar constructs across the breadth of devices or some subsets, and evaluating the cost of doing this more broadly. Some things that we might gain in this approach would be process isolation and an easier path towards debuggability.

InfoQ: What is the place of Groovy and other scripting languages in the Netflix API platform?

DJ: Groovy is the only language in our API environment that people are writing adapter scripts with, but we are looking at other languages. The next one is likely going to be Node.js. Going to another JVM language would be easier, but there hasn’t been enough interest so far. If device teams want to use Scala or other languages, we would need to do more investigation and work to make it happen.

Node.js is not going to run integrated in the JVM so it’s an additional benefit of isolating that into another layer like we’ve done for the main web site.

InfoQ: How were the device teams able to adapt to such changes in their development flows?

DJ: The cultural change to the company was a lot harder than the technology changes. Even with teams willing to go this route, there were some challenges in getting people to think and operate differently in the new environment. For example, it took some time for them to adapt to writing Groovy and to the functional programming paradigm. But looking back it is definitely a net win.

InfoQ: In your talk, you mentioned an ongoing project to introduce containers at the API adapter layer. Will that effort have impact on the Nicobar open source project?

DJ: As we are investigating containers for the web site layer, we are thinking about how it could be applied to other devices as well. For the container-based model, Nicobar would not be a central player for us. In fact, when we designed Nicobar and the scriptability, it was in part to deploy the scripts in an isolated way. Containers take our original intent to the next level and obviate the need for Nicobar. That said, our system will continue to support the scripting and Nicobar for years to come, so we expect to continue to develop and evolve Nicobar for a while. As Nicobar evolves, it is likely that such changes will be made in the open source project as well.

InfoQ: The Netflix Falcor open source project was announced in August and its usage on Android recently explained. What does it offer and how does it relate to your broader API platform?

DJ: It helps us represent remote data sources as a single domain model through a virtual JSON graph. You code the same way no matter where the data is, whether in memory on the client or over the network on the server. Falcor also handles network communications between devices and servers, and can batch and deduplicate requests to make them more efficient.

Because Falcor is a more efficient data fetching mechanism between devices and servers, it’s going to continue to play a significant role in our platform even as our system evolves into a different architecture.

The main benefits we get out of Falcor are developer efficiency and improved application performance. We get the developer efficiency because the access patterns for the engineers writing the adapters are more consistent. That said, there is a steeper learning curve to use Falcor and it is a more challenging environment to debug.

InfoQ: What are the limitations that you found with AWS Auto Scaling Groups and how does Netflix Scryer help? Will it become open source?

DJ: AWS autoscaling is used widely at Netflix. It’s very useful and powerful. Amazon is responding to metrics like load average, determining that it’s time to add new servers when those metrics pass a certain threshold. Meanwhile, it can take 10 to 20 minutes to bring a new set of servers online. A lot of bad things can happen in a matter of minutes, so that adds risk to our availability profile.

That’s what prompted us to develop Scryer. What Scryer does is it looks at the historical data and incorporates a feedback loop of real-time data, evaluates what the needs will be in the near future for capacity, and then it adds servers in advance of that need. What we see is that response times and latencies are much more leveled with Scryer because load averages are not spiking and because the cluster can handle the traffic more effectively.
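The difference between reacting to current load and provisioning ahead of it can be shown with a small calculation. This Python sketch is not Scryer's algorithm, which is not described here in any detail; the averaging forecast, the `headroom` factor and all function names are assumptions made for illustration. It combines historical load for the upcoming time slot with a correction based on how today's traffic compares to history, then sizes the cluster for the forecast rather than for the present.

```python
import math
from typing import Sequence


def reactive_servers(current_rps: float, rps_per_server: float) -> int:
    """Capacity sized for the load that has already arrived."""
    return max(1, math.ceil(current_rps / rps_per_server))


def predicted_rps(history_next: Sequence[float],
                  history_now: Sequence[float],
                  observed_now: float) -> float:
    """Forecast load for the upcoming time slot.

    history_now and history_next hold the load seen on previous days at the
    current time slot and at the upcoming one. The ratio of today's observed
    load to the historical average is a simple real-time feedback term.
    """
    avg_now = sum(history_now) / len(history_now)
    avg_next = sum(history_next) / len(history_next)
    correction = observed_now / avg_now if avg_now > 0 else 1.0
    return avg_next * correction


def predictive_servers(history_next: Sequence[float],
                       history_now: Sequence[float],
                       observed_now: float,
                       rps_per_server: float,
                       headroom: float = 1.2) -> int:
    """Capacity to request now so it is online before the load arrives."""
    forecast = predicted_rps(history_next, history_now, observed_now)
    needed_soon = math.ceil(forecast * headroom / rps_per_server)
    # Never plan for less than current traffic already requires.
    return max(reactive_servers(observed_now, rps_per_server), needed_soon)
```

With a 10 to 20 minute server start-up time, the reactive figure only describes what is needed now, while the predictive figure is what would have to be requested now to be ready for the next slot.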

While we announced it via a blog post a couple of years ago, there is no plan right now to open source it.

InfoQ: Netflix Engineering is well known for its Chaos Monkey service. Can you tell more about other services that are part of your Simian Army?

DJ: There is a suite of monkeys that do different things. Here are some of these services:

  • Latency Monkey has various degrees of utility and was designed to inject errors and latencies into a service to see how the failure would cascade. That has since evolved into FIT (Failure Injection Testing).
  • Chaos Gorilla is similar to Chaos Monkey but instead of killing individual instances, it is killing AWS availability zones. The idea here is to test high availability across zones by redirecting traffic from a failed zone to a healthy one.
  • Conformity Monkey and Security Monkey make sure that builds conform to certain operational and security guidelines and shut down those that are not conforming.
  • Janitor Monkey cleans up unhealthy or dead instances.
  • Chaos Kong is a recent addition to the army, which simulates an outage in an entire AWS region and pushes traffic to a different region.
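The kind of experiment Latency Monkey was designed for, injecting errors and latency into a service call to see how the failure propagates, can be sketched as a wrapper around a dependency. The Python below is not the Simian Army or FIT code; `with_fault_injection`, `with_fallback` and `InjectedFailure` are hypothetical names, and the fallback wrapper is simply one common way a caller can keep an injected failure from cascading.

```python
import random
import time
from typing import Callable, Optional


class InjectedFailure(Exception):
    """Raised when the wrapper deliberately fails a call."""


def with_fault_injection(call: Callable[[], str], *,
                         error_rate: float = 0.0,
                         added_latency_s: float = 0.0,
                         rng: Optional[random.Random] = None,
                         sleep: Callable[[float], None] = time.sleep
                         ) -> Callable[[], str]:
    """Wrap a dependency call so it can be made slow or made to fail."""
    rng = rng or random.Random()

    def wrapped() -> str:
        if added_latency_s > 0:
            sleep(added_latency_s)  # simulate a slow dependency
        if rng.random() < error_rate:
            raise InjectedFailure("injected error")
        return call()

    return wrapped


def with_fallback(call: Callable[[], str], fallback: str) -> Callable[[], str]:
    """Caller-side protection: degrade instead of cascading the failure."""

    def wrapped() -> str:
        try:
            return call()
        except InjectedFailure:
            return fallback

    return wrapped
```

Wrapping a call with `error_rate=1.0` forces every request to fail, which makes it easy to check whether the calling code returns its fallback or lets the error spread upstream.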

InfoQ: Over the years, Netflix has launched many open source projects. What is the best way to know what is available and actively maintained, to take advantage of these contributions?

DJ: As our OSS strategy has evolved, we’ve released around 60 projects in total across a diverse set of categories including UI, cloud and tools. Some of them are more actively managed than others and we try to partition them in our developer website. Supporting the APIs directly, there are a range of tools including Zuul, Nicobar, Hystrix and RxJava.

InfoQ: Should a company new to APIs start with a one-size-fits-all API approach and progressively evolve like Netflix did, or start immediately with finer-grained ephemeral experience APIs?

DJ: If you are brand new to APIs, start with OSFA (one size fits all). There is a question of whether you will ever get to the scale needs that Netflix has. Experience APIs are more of a challenge. I believe that ephemerality should be part of the mindset of each company, regardless.

Going the experience-based API route is a function of opportunity and cost. You are adding more overall cost, but the efficiency and the optimization gains might be worth it. If you only have a few devices or a very small development team, or if you have a wide range of external parties that consume APIs, the cost of operating this more variable environment would likely not be recovered.

You really need to have a tipping point where the development efficiency of the API consumers is hindered by the fact that they are fighting against the rigid API. In other words, if you have different device teams that have to make inefficient API calls that are different from each other, and they have to compensate by doing additional parsing, error handling, etc., then the cost of all of that added energy can potentially be eliminated by creating an optimized interaction model. This benefit is only worth it if you have enough developers doing these inefficient activities.

InfoQ: In addition to developer efficiency, are there other benefits that you might be looking for with Experience APIs?

DJ: With an optimized set of APIs, you are building a solution to provide a better experience for the customer, such as improved system performance and improved velocity in getting changes into the product.

If you want to have this kind of ephemerality and optimization, you can’t set it up for public APIs. The experience APIs are excellent tactics but are geared towards private APIs because having a close relationship with a small set of developers allows you to have much more latitude in solving the needs of the API consumers.

InfoQ: What excites you the most right now about the API space?

DJ: We are most excited about things like containers, streaming data, HTTP 2.0, WebSockets and persistent connections, and the tooling and analytics behind supporting a massive-scale API. So we are investigating those kinds of things and experimenting.

Other things are emerging in this space like microservices, continuous integration and continuous deployment, and we are already doing them. At Netflix, we have a distributed architecture with specific functions for each microservice. But successful microservices inevitably grow in scope, potentially causing them to become more of a monolith over time. At that point, it makes sense to start breaking things down again.

InfoQ: Finally, how does continuous deployment relate to ephemeral APIs?

DJ: I often describe my team as being the skinny part of the hourglass that’s pushing data back and forth between the two fat parts. In one of the fat parts are all of the API consumers, the UI and device teams. In the other fat part we have all the distributed server-side microservices. Both of the fat parts are constantly changing (A/B testing, new features, new devices, etc.).

As those change, we need to ensure that data is flowing through the skinny part to support the product and any test that is being performed on the product. We need to change at a faster rate than the rest of the company because we need to handle the changes that many other teams make.

Several years ago we decided the only way to do this was to develop a fully automated deployment pipeline. From a continuous deployment perspective, it was important for us to be able to deploy rapidly, frequently and at low risk, with the ability to roll back quickly. The goal behind all of that is that we should not be the bottleneck to getting product change to the customer.

Like other things my team does, continuous deployment is a means to an end. And the end is continuous innovation. Having an environment that can rapidly and constantly change to the need of the business and the customer ties back to our ephemerality mindset.

Presentation : Maintaining the Netflix Front Door – Intuit Meetup

May 23, 2014

This presentation was given at a meetup at Intuit on May 23, 2014.

This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts such as Zuul, Scryer, Hystrix, RxJava and the Simian Army.

Presentation : Maintaining the Front Door to Netflix – To Zendesk Engineering Team

May 9, 2014

This presentation was given to the engineering organization at Zendesk on May 9, 2014. In this presentation, I talk about the challenges that the Netflix API faces in supporting the 1000+ different device types, millions of users, and billions of transactions. The topics include resiliency, scale, API design, failure injection, continuous delivery, and more.

Presentation : Netflix API – Separation of Concerns

April 8, 2014

This presentation was originally given at an API Meetup in SF on April 8, 2014.

Most API providers focus on solving all three of the key challenges for APIs: data gathering, data formatting and data delivery. All three of these functions are critical for the success of an API; however, not all should be solved by the API provider. Rather, the API consumers have a strong, vested interest in the formatting and delivery. As a result, API design should be addressed based on the true separation of concerns between the needs of the API provider and the various API consumers.

This presentation goes into the separation of concerns. It also goes into depth in how Netflix has solved for this problem through a very different approach to API design.

The Next Web : Engineering spirals: 10 philosophies to facilitate innovation

March 25, 2014

This article was first published on The Next Web on March 25, 2014

Daniel Jacobson is the VP of Edge Engineering for the Netflix API. Prior to Netflix, Daniel ran application development for NPR where, among other things, he created the NPR API. He is also the co-author of APIs: A Strategy Guide.

“Get busy living, or get busy dying” – Shawshank Redemption

Building great engineering teams is difficult, but it is also increasingly important as the world in which we live is more than ever driven by software. Because of this growing importance, it is essential for engineering leaders to maintain a culture of innovation within their teams to ensure high performance and to keep the company ahead of the curve.

In high performance cultures like at Netflix, there are basically two outcomes that will play out over time for engineering teams. Either the team will enjoy an upward spiral established by a strong culture of innovation or it will spiral in the downward direction, resulting in an inevitable decay of the team and its products.

Here are my experiences as an engineering leader and how I’ve worked to build a culture around innovation for my teams, virtually at all costs.

The downward spiral

For most engineering teams, it is easy to enter a steady state of development and maintenance as systems get off the ground and mature.

Accordingly, managers often slow or halt hiring as the amount of work is relatively well-understood. As a result, the engineers on the team enter a daily or weekly (or perhaps monthly) ritual of incremental improvements, responding to requests, and fixing bugs.

As engineers churn through task lists, however, they become bored, uninspired, and complacent, resulting in degradation in velocity and/or quality. That degradation will result in more churn around testing and/or support issues, which will further frustrate and bore the engineers while generating more potential for system failures that will increase the churn.

The more churn, the more turnover in staff; the more turnover in staff, the more additional churn. This downward spiral can play out very quickly or it can take quite a while.

In either case, there is a clear direction, it is inevitable, and it has a bad ending.

Upward spiral

The way out of the downward spiral is to make some very difficult decisions that have short-term ramifications for the benefit of the long term. I call this “taking your lumps.”

If you take your lumps now by deferring non-essential work, it frees the team up to think about the long-term and to seek patterns in their work, systems, and operations. Through these patterns, the team can potentially program away a class of work that otherwise would occupy the team’s time on an ongoing basis.

Eliminating a class of work enables the team to have more available time in the future to seek other such patterns or opportunities, which will create even more available time.

With the available time, not only is the team further alleviated from the daily churn of reacting to external needs, they are also able to pursue higher order projects that allow the team to make transformative leaps forward rather than churning to keep up or making minor incremental improvements.

Repeated enough, this will eventually become part of the team’s culture, resulting in higher quality work and greater velocity. Unlike the downward spiral, there will be positivity around the team that will be infectious and will create a breeding ground for attracting new talent.

Virtually every engineering team will find itself in one of the two aforementioned trajectories. It might not be obvious which way things are headed, but there will be a trend one way or the other.

It is the job of the engineering leader to ensure that the spiral is upward. Here are my 10 philosophies and approaches that I employ with my teams to strive for the upward trajectory:

1. Establish a strong identity

Be very clear on the identity of the team and establish a set of philosophies against which the team can operate. Be stubborn about adhering to the identity. The more that identity gets compromised by one-off requests, the more the architecture weakens, the more churn the team will have to deal with, and the more likely morale will suffer.

Be clear on what you will and won’t do and make sure the team knows these boundaries, lives them, and communicates them to others.

2. Important vs. Urgent

In “The 7 Habits of Highly Effective People,” Stephen R. Covey talks about the difference between urgent and important. Engineering organizations can very easily fall into the trap of being highly reactionary to externally imposed requests.

While many of these externally imposed requests are very important (and in fact, even if they are not), they tend to come to the team’s attention as both urgent and important. But there are many other tasks or efforts that are very important despite the fact that they are internally driven and elective.

Understanding this distinction and being able to distinguish which tasks fall into which category is paramount in getting out of the churn and enabling that first critical step: introspection.

3. Introspection

Introspection is the key to innovation. Handling requests from a range of external (or even internal) stakeholders is the natural, easy thing for a team to do. Taking a step back from those requests and looking for patterns across them while imagining what they might look like in the future will give a broader and more impactful perspective.

If the system gets refactored in some other way, will that eliminate a class of requests in the future? Given how the industry is evolving, can you anticipate weaknesses in the system’s architecture that should be examined now? These are examples of important questions that can help springboard your team out of their everyday churn of satisfying urgent requests.

4. Don’t throw good money at bad

During the introspection process, it is important to be future-oriented. Your team has a lot of functioning code and other system-oriented assets which should be considered.

That said, they should only be considered after evaluating the long-term needs of the team and its relationship to its constituents. Imagine starting from scratch and target that as your outcome. From there, it is much easier to see how, if at all, existing assets can play a role in that future state (or in the transition to get there).

5. Hire beyond your needs

The most important resource to enable introspection is time. Many companies and hiring managers work towards “right-sizing” their teams. That is, they project what the incoming requests will be for the team and attempt to staff the team based on those expectations.

This is perhaps the biggest mistake that a team manager can make when building and operating an innovation team because that will ultimately limit the amount of available time for introspection.

Instead, hiring managers should staff beyond the bandwidth needed for known tasks. This will give the team the ability to swell and contract its focus on such work while continually maintaining a reasonable amount of time towards introspection and innovation.

6. Great engineers NEED to be challenged

If staffing is such that your great engineers are spending the majority of their time handling very tactical work, they will slowly but surely lose interest in the job and eventually leave.

Of course, doing that kind of work is a necessary part of every engineering job, but there needs to be a balance for great engineers to remain happy and excited about their work. Engineers need to also have deep architectural challenges that allow them to think, to stretch their minds, and to have a greater value to the company than just keeping the lights on.

In fact, most of them want to have the freedom to identify and pursue these challenges in a way that helps them feel empowered and impactful. That is why engineers get into this field in the first place and if that is not available in their current job for too long, they will find those opportunities elsewhere.

7. Instill a culture of (good) laziness

There are two kinds of “lazy” in engineering: bad laziness and good laziness. Bad laziness is allowing yourself to repeat the same tasks over and over because that is easier than stepping back, looking for patterns, and spending the up-front time to program those tasks away. Manual deployment pipelines or manual tests are great examples. But ultimately, if a human can do it, a computer can (and should) do it too.

This is where good laziness comes in. Great engineers will ultimately be fed up with the arduous nature of the repeated task and seek to eliminate that work from their docket.

8. Innovation breeds innovation

Once an initial innovation occurs that liberates the team from some encumbering set of repeated tasks, the team now has some newly available time. That time can be used in any number of ways, but to maximize its utility the team should use that time for even more introspection which paves the way for the upward spiral.

The more such innovations that the team can yield, the more likely the team can yield more innovations. This is the case, not only because of the growth in available time, but also because it eventually becomes part of the team’s culture.

9. Don’t treat your systems like your baby

Many people in the engineering world grow very attached to the systems that they build. It is easy to establish that loyalty as engineers spend a lot of time working on a specific system. In fact, I have often heard people call their systems their baby (I may have been guilty of that in my past as well).

There is a value in growing so attached to the systems in that it does strengthen the bond and builds pride for the team as they strive for excellence with that system. That said, there is a long-term detriment to this as well.

Systems, like virtually any piece of technology, have a limited shelf life. At some point, the system will hit its limit and will need to be overhauled or replaced.

Loyalty to that system clouds one’s objectivity about what is best. We need to be able to treat our solutions as tactics towards a broader goal and if the tactic is no longer effective we need to abandon it.

10. There’s no such thing as maintenance mode

If a system is to go into maintenance mode, it really means one of two things: It is either not an important system anymore (which begs the question as to whether or not it should just be retired outright) or the business function is still important to the company even though the company no longer wants to invest in the system that supports it.

As part of the team’s culture, it is important to aspire to eliminate the idea of maintenance mode from the team’s vernacular.

Maintenance mode has two main detriments. First, it adversely affects the team’s morale and goes against the spirit of great engineers, which is to constantly be challenged. Second, most maintenance systems conflate the idea of supporting a legacy system with supporting its business function.

In fact, the latter is the real goal and an innovative team will seek ways to retire legacy systems in favor of future-oriented systems that still support the required business function. This is not always easy or feasible, but you should always be seeking opportunities to move on from the legacy system. Sometimes executing on that migration work is of equal or greater value to pursuing new innovations.

External risks

Ultimately, all of these principles depend on having excellent talent on the team. No amount of leadership can offset the challenges introduced by having the wrong skills or people.

Another risk is that many engineers like to chase the shiny new objects. There is a balance that needs to be maintained between enabling great engineers to experiment, innovate, and identify and pursue challenges, and their propensity to play with emerging technologies.

It is also worth noting that there are often external forces that prevent some organizations and/or leaders from achieving the above philosophies. For example, not all companies have enough available resources to staff beyond the needs or they may have a legacy of disparate and unrelated technologies that make it inherently more difficult to find a path out of the churn.

As a result, these philosophies require a strong company-level culture that puts leaders and teams in a position to achieve greatness. If the culture is there, however, these 10 philosophies, if truly embraced, will help springboard your team to being innovative and non-reactionary.

