This presentation was given to the engineering organization at Zendesk on May 9, 2014. In this presentation, I talk about the challenges that the Netflix API faces in supporting the 1000+ different device types, millions of users, and billions of transactions. The topics range from resiliency, scale, API design, failure injection, continuous delivery, and more.
Most API providers focus on solving all three of the key challenges for APIs: data gathering, data formatting and data delivery. All three of these functions are critical for the success of an API, however, not all should be solved by the API provider. Rather, the API consumers have a strong, vested interest in the formatting and delivery. As a result, API design should be addressed based on the true separation of concerns between the needs of the API provider and the various API consumers.
This presentation goes into the separation of concerns. It also goes into depth in how Netflix has solved for this problem through a very different approach to API design.
This article was first published on The Next Web on March 25, 2014
Daniel Jacobson (LinkedIn) is the VP of Edge Engineering for the NetflixAPI. Prior to Netflix, Daniel ran application development for NPR where, among other things, he created the NPR API. He is also the co-author of APIs: A Strategy Guide.
Building great engineering teams is difficult, but it is also increasingly important as the world in which we live is more than ever driven by software. Because of this growing importance, it is essential for engineering leaders to maintain a culture of innovation within their teams to ensure high performance and to keep the company ahead of the curve.
In high performance cultures like at Netflix, there are basically two outcomes that will play out over time for engineering teams. Either the team will enjoy an upward spiral established by a strong culture of innovation or it will spiral in the downward direction, resulting in an inevitable decay of the team and its products.
Here are my experiences as an engineering leader and how I’ve worked to build a culture around innovation for my teams, virtually at all costs.
The downward spiral
For most engineering teams, it is easy to enter a steady state of development and maintenance as systems get off the ground and mature.
Accordingly, managers often slow or halt hiring as the amount of work is relatively well-understood. As a result, the engineers on the team enter a daily or weekly (or perhaps monthly) ritual of incremental improvements, responding to requests, and fixing bugs.
As engineers churn through task lists, however, they become bored, uninspired, and complacent, resulting in degradation in velocity and/or quality. That degradation will result in more churn around testing and/or support issues, which will further frustrate and bore the engineers while generating more potential for system failures that will increase the churn.
The more churn, the more turnover in staff; the more turnover in staff, the more additional churn. This downward spiral can play out very quickly or it can take quite a while.
In either case, there is a clear direction, it is inevitable, and it has a bad ending.
Upward spiral
The way out of the downward spiral is to make some very difficult decisions that have short-term ramifications for the benefit of the long term. I call this “taking your lumps.”
If you take your lumps now by deferring non-essential work, it frees the team up to think about the long-term and to seek patterns in their work, systems, and operations. Through these patterns, the team can potentially program away a class of work that otherwise would occupy the team’s time on an ongoing basis.
Eliminating a class of work enables the team have more available time in the future to seek other such patterns or opportunities, which will create even more available time.
With the available time, not only is the team further alleviated from the daily churn of reacting to external needs, they are also able to pursue higher order projects that allow the team to make transformative leaps forward rather than churning to keep up or making minor incremental improvements.
Repeated enough, this will eventually become part of the team’s culture, resulting in higher quality work and greater velocity. Unlike the downward spiral, there will positivity around the team that will be infectious and will create a breeding ground for attracting new talent.
Virtually every engineering team will find itself in one of the two aforementioned trajectories. It might not be obvious which way things are headed, but there will be a trend one way or the other.
It is the job of the engineering leader to ensure that the spiral is upward. Here are my 10 philosophies and approaches that I employ with my teams to strive for the upward trajectory:
1. Establish a strong identity
Be very clear on the identity of the team and establish a set of philosophies against which the team can operate. Be stubborn about adhering to the identity. The more that identity gets compromised by one-off requests, the more the architecture weakens, the more churn the team will have to deal with, and the more likely morale will suffer.
Be clear on what you will and won’t do and make sure the team knows these boundaries, lives them, and communicates them to others.
2. Important vs. Urgent
In “The 7 Habits of Highly Effective People,” Stephen R. Covey talks about the difference between urgent and important. Engineering organizations can very easily fall into the trap of being highly reactionary to externally imposed requests.
While many of these externally imposed requests are very important (and in fact, even if they are not), they tend to team’s attention as both urgent and important. But there are many other tasks or efforts that are very important despite the fact that they are internally driven and elective.
Understanding this distinction and being able to distinguish which tasks fall into which category is paramount in getting out of the churn and enabling that first critical step: introspection.
3. Introspection
Introspection is the key to innovation. Handling requests from a range of external (or even internal) stakeholders is the natural, easy thing for a team to do. Taking a step back from those requests and looking for patterns across them while imagining what they might look like in the future will give a broader and more impactful perspective.
If the system gets refactored in some other way, will that eliminate a class of requests in the future? Given how the industry is evolving, can you anticipate weaknesses in the system’s architecture that should be examined now? These are examples of important questions that can help springboard your team out of their everyday churn of satisfying urgent requests.
4. Don’t throw good money at bad
During the introspection process, it is important to be future-oriented. Your team has a lot of functioning code and other system-oriented assets which should be considered.
That said, they should only be considered after evaluating the long-term needs of the team and its relationship to its constituents. Imagine starting from scratch and target that as your outcome. From there, it is much easier to see how, if at all, existing assets can play a role in that future state (or in the transition to get there).
5. Hire beyond your needs
The most important resource to enable introspection is time. Many companies and hiring managers work towards “right-sizing” their teams. That is, they project what the incoming requests will be for the team and attempt to staff the team based on those expectations.
This is perhaps the biggest flaw that a team manager can make when building and operating an innovation team because that will ultimately limit the amount of available time for introspection.
Instead, hiring managers should staff beyond the bandwidth needed for known tasks. This will give the team the ability to swell and contract its focus on such work while continually maintaining a reasonable amount of time towards introspection and innovation.
6. Great engineers NEED to be challenged
If staffing is such that your great engineers are spending the majority of their time handling very tactical work, they will slowly but surely lose interest in the job and eventually leave.
Of course, doing that kind of work is a necessary part of every engineering job, but there needs to be a balance for great engineers to remain happy and excited about their work. Engineers need to also have deep architectural challenges that allow them to think, to stretch their minds, and to have a greater value to the company than just keeping the lights on.
In fact, most of them want to have the freedom to identify and pursue these challenges in a way that help them feel empowered and impactful. That is why engineers get into this field in the first place and if that is not available in their current job for too long, they will find those opportunities elsewhere.
7. Instill a culture of (good) laziness
There are two kinds of “lazy” in engineering: bad laziness and good laziness. Bad laziness is allowing yourself to repeat the same tasks over and over because that is easier than stepping back, looking for patterns, and spending the up-front time to program those tasks away. Manual deployment pipelines or manual tests are great examples. But ultimately, if a human can do it, a computer can (and should) do it too.
This is where good laziness comes in. Great engineers will ultimately be fed up with the arduous nature of the repeated task and seek to eliminate that work from his/her docket.
8. Innovation breeds innovation
Once an initial innovation occurs that liberates the team from some encumbering set of repeated tasks, the team now has some newly available time. That time can be used in any number of ways, but to maximize its utility the team should use that time for even more introspection which paves the way for the upward spiral.
The more such innovations that the team can yield, the more likely the team can yield more innovations. This is the case, not only because of the growth in available time, but also because it eventually becomes part of the team’s culture.
9. Don’t treat your systems like your baby
Many people in the engineering world grow very attached to the systems that they build. It is easy to establish that loyalty as engineers spend a lot of time working on a specific system. In fact, I have often heard people call their systems their baby (I may have been guilty of that in my past as well).
There is a value in growing so attached to the systems in that is does strengthen the bond and builds pride for the team as they strive for excellence with that system. That said, there is a long-term detriment to this as well.
Systems, like virtually any piece of technology, have a limited shelf life. At some point, the system will hit its limit and will need to be overhauled or replaced.
Loyalty to that system clouds one’s objectivity about what is best. We need to be able to treat our solutions as tactics towards a broader goal and if the tactic is no longer effective we need to abandon it.
10. There’s no such thing as maintenance mode
If a system is to go into maintenance mode, it really means one of two things: It is either not an important system anymore (which begs the question as to whether or not it should just be retired outright) or the business function is still important to the company even though the company no longer wants to invest in the system that supports it.
As part of the team’s culture, it is important to aspire to eliminate the idea of maintenance mode from the team’s vernacular.
Maintenance mode has two main detriments. First, it adversely affects the team’s morale and goes against the spirit of great engineers, which is to constantly be challenged. Second, most maintenance systems conflate the idea of supporting a legacy system with supporting its business function.
In fact, the latter is the real goal and an innovative team will seek ways to retire legacy systems in favor of future-oriented systems that still supports the required business function. This is not always easy or feasible, but you should always be seeking opportunities to move on from the legacy system. Sometimes executing on that migration work is of equal or greater value to pursuing new innovations.
External risks
Ultimately, all of these principles depend on having excellent talent on the team. No amount of leadership can offset the challenges introduced by having the wrong skills or people.
Another risk is that many engineers like to chase the shiny new objects. There is a balance that needs to be maintained between enabling great engineers to experiment, innovate, and identify and pursue challenges with their propensity to play with emerging technologies.
It is also worth noting that there are often external forces that prevent some organizations and/or leaders from achieving the above philosophies. For example, not all companies have enough available resources to staff beyond the needs or they may have a legacy of disparate and unrelated technologies that make it inherently more difficult to find a path out of the churn.
As a result, these philosophies require a strong company-level culture that puts leaders and teams in a position to achieve greatness. If the culture is there, however, these 10 philosophies, if truly embraced, will help springboard your team to being innovative and non-reactionary.
The digital world is expanding at an amazing rate, giving us access to applications and content on myriad connected devices in your homes, offices, cars, pockets and on even on your body. The glue that allows all of this to happen, that connect the companies who provide these services to the devices that you use, is the API.
Because APIs have such a huge responsibility for so many people and companies, it is natural that API design is often one of the industry’s liveliest discussions, touching on a range of topics including resource modeling, payload format, how to version the system, and security.
While these are likely important areas to explore when designing virtually any API, the reality is that a much larger decision needs to be made first. That decision is based on a fundamental question: who are the primary audiences for this API and how can we optimize for those audiences?
This seems like an innocuous enough question, but don’t underestimate its importance or complexity in the growing world of APIs.
Years ago, this question was much simpler
At that time, many emerging APIs were being built as open or public APIs, targeting a large set of unknown developers (LSUDs) external to the providing company.
In a previous post, I referred to these as OSFA APIs (or one-size-fits-all APIs). Allowing for such granularity in the modeling means that any developer who wishes to use the API can mix and match the elements in whatever way they choose to satisfy their application without further API team involvement.
The resource-model approach to designing an API can be very powerful, especially for this type of audience. The problem with this approach, however, is that the way that many companies use APIs today is different than described above. While many are still supporting the use cases of LSUDs, more are using their APIs to support a growing mobile or device strategy.
For some of these implementations, the engagement with the developers is different. The audience is a small set of known developers (SSKDs). They may be engineers down the hall from the API team, a contracted company hired to develop an iPhone app, or an engineering team in a partnering company. In all of these cases, however, the API team knows who these people are (at least in the abstract sense).
More importantly, however, the API team and the providing company care about the success of these implementations in a different way than they might care about the applications developed by the LSUDs. In fact, the success of the SSKDs may very well be paramount to the success of the business as a whole, a model that is becoming increasingly more pervasive.
Because of this change in audience and the deep interest in their success, there is great opportunity to change the API design.
For the SSKDs, having granular resource-based APIs that closely represent the data model works, but it just isn’t as optimal as it could be. This is especially the case when you consider the growing number of device types in the world and the fact that more and more companies’ business strategies are dependent on providing value to customers on such devices.
So, all it takes is a couple of devices with diverging needs and/or capabilities, each of great import to the company, for the resource-based API to start to show some warts. Making the API better, more optimized, for each of these target applications is the next logical, and most critical, step.
Enter the Orchestration Layer
An API Orchestration Layer (OL) is an abstraction layer that takes generically-modeled data elements and/or features and prepares them in a more specific way for a targeted developer or application
To address this opportunity, more companies are employing orchestration layers into their API infrastructure. While there are many ways in which to implement this architectural construct, the concept remains the same across all of them.
Below, I will describe a few of the more common patterns that I have seen (and/or been involved in implementing). But first, here are a few key principles that need to be considered when building an OL:
1. Most APIs are designed by the API provider with the goal of maintaining data model purity. When building an OL, be prepared to sometimes abandon purity in favor of optimizations and/or performance.
2. Many APIs are designed by API teams to make it easier for the API team to support. When building an OL, be prepared to potentially add complexity for the API team (or other teams, depending on the way it is implemented).
While this sounds undesirable, the goal here is to dramatically improve efficiency and/or simplicity for other people at some mild cost to the API team. Also keep in mind that such costs can potentially be programmed away over time.
3. It is important to understand the breadth of the audiences for the API.Depending on those constituents, you may only need the OL. In other cases, you may need the OSFA foundation in addition to the OL.
Here are a few examples of how some OLs have been approached:
Device-specific wrappers
This is the most common pattern that I have seen because most companies that are experiencing the distress referenced above already have APIs that they still use, continue to support, and invest in. The result is to continue to offer the granular resources as they always have, but to offer a wrapper tier on top of them – with new endpoints that are tailored to specific developers, devices or device clusters.
In this model, the API team will work more closely with, for example, the iPhone team to write a custom wrapper that handles specific requests and deliver specific payloads that are optimized for the iPhone app. In this model, most often the team to build the endpoints and the wrappers is the API team although that doesn’t have to be the case.
Query-based APIs
In this model, the API team is putting the power in the hands of the requesting developer, although that power is limited. The goal here is to create a more flexible way in which the requester can make requests and tailor payloads without putting additional ongoing burden on the API team, as could be the case with the Device-Specific Wrappers.
This is achieved by breaking down the resource-based APIs and allowing them to be queried against like a database through flexible parameters and payloads that can contract, expand and possibly morph based on what is needed. The benefit here is that once the query language is set, the API team does not need to keep writing wrappers as new implementations are needed for different devices.
The detriment, however, is that the query-based API is still a set of rules to which the developer needs to adhere, although these rules are much more flexible than the resource-based API model.
Experienced-based APIs
This is the model that Netflix has implemented, which in some ways is a blend of the two above. In this model, we basically have device-specific wrappers but they are designed, implemented and owned by the device teams.
A key concept here is that we have put the API team in the position of gathering the data in a generic, reusable way while putting the device teams in the position of owning the data formatting and delivery. After all, the formatting needs evolve in concert with the UI changes so putting that effort in the hands of those closest to the changes eliminates additional steps.
(For more details on how this system operates, see the links at the bottom of the post.)
As I noted, the range of implementations is potentially much more diverse than these three, although these are some of the most consistent and interesting patterns that I have seen. Regardless of how this is achieved, however, the key is for the API team to stop supporting the API as a service that is designed independent of those SSKDs who consume it.
Rather, the API team needs to view the SSKDs as partners in the design with an interest in making the products as great as possible so the end-users can get the best experience possible. The API team has the opportunity to build services that help developers to be better at developing by focusing on optimizing for the developers’ needs rather than how to optimize the time spent supporting the API.
I am finding it increasingly common, when talking about working at Netflix, for people to tell me that their companies are embracing an unlimited vacation policy similar to the one Netflix has been employing for years. When I hear these kinds of statements, my first thought is that these programs are not likely to have the desired outcome. I think that because Netflix does not have an unlimited vacation policy. Rather, Netflix has a culture of “Freedom and Responsibility” (F&R), which is discussed extensively in our publicly available culture slides.
The culture of F&R extends much further than its application to vacations, sick days, and holidays. F&R is a mindset on how we treat each other and our jobs on a day-to-day basis. It is a simple mantra that basically states that we are all adults, so let’s all behave and treat each other like adults. And if a person is not consistently behaving like an adult, they should not work at Netflix.
So, what does it mean to “behave like an adult”? At the core, Netflix seeks out senior-level people who are seasoned, both in their primary competency as well as in how to work effectively with others. We expect people to live up to the values that we discuss in the culture slides, to be people on which we would want to trust the future of our business. In fact, we are trusting the future of the business on each and every person who works with us. And that is the central point of it all… trust. If we cannot trust you, you should not be working at Netflix. Period.
This kind of trust takes many shapes. One example is that there are no strict silos in our server access rights. Virtually every engineer at the company can easily get root-level access to virtually any system in the production environment. Another manifestation is the fact that virtually everyone in the company knows key strategic initiatives that we are focused on well before these initiatives go public. As a result, everyone in the company is subject to trading windows for our stock options. That level of openness, at a company the size of Netflix, is exceedingly rare. And we can do all of this because we hire (and fire), essentially, for that deep level of trust.
With respect to vacation, that is just another manifestation of our trust as it pertains to F&R. We trust our employees and peers to take care of what they are responsible for. As long as we are being responsible (ie. getting our part of the project done, communicating our situation to others, etc.), then we don’t have much interest or concern around how others budget the rest of their time. We are free to do as we please because we can be trusted to do what we are supposed to do. As a result, tracking people’s time for vacation, sick, holidays, etc. just does not make sense. Track the results of the output, not the amount of time that it takes to execute it (or even which hours of the day were used to complete it).
On a related note, for the companies that do track hours work and audit leave days, many of them do not also track late night hours working on production deployments, site outages, etc. If you are going to track one, you must track the other. But it really makes no sense to track either. Companies trust these people to deploy and maintain their entire digital presence at midnight on a Saturday, but they won’t trust these same people to be responsible with their own time. That makes no sense!
At the core, Netflix does not have an “unlimited vacation” policy, we have a trust policy. If we trust you, then we will trust you and will give you great liberty to accomplish amazing things. If not, we will part ways.