Author Archive

Pushpin

Pushpin’s primary value prop is that it is an open source solution (the open source version of Fanout) that enables real-time push — a requisite of evented APIs (GitHub Repo). At its core, it is a reverse proxy server that makes it easy to implement WebSocket, HTTP streaming, and HTTP long-polling services. Structurally, Pushpin communicates with backend web applications using regular, short-lived HTTP requests.

This architecture provides a few core benefits:

  • Backends can be written in any language and use any web server.
  • Data can be pushed to connected clients via a simple HTTP POST request to Pushpin’s private control API.
  • It is invisible to connected clients.
  • It manages the stateful elements of realtime connections, acting as the responsible party and requesting data from your backend server as needed.
  • It is horizontally scalable, since Pushpin instances do not need to communicate with each other.
  • It harnesses a publish-subscribe model for data transmission.
  • It can act as both a proxy server and a publish-subscribe broker.

Integrating Pushpin

From a more systemic perspective, there are a few ways you can integrate Pushpin into your stack. The most basic setup is to put Pushpin in front of a typical web service backend, where the backend publishes data directly to Pushpin. The web service itself might publish data in reaction to incoming requests, or there might be some kind of background process/job that publishes data.

[Diagram: Pushpin real-time reverse proxy]

Because Pushpin is a proxy server, it works with most API management systems, allowing you to keep doing real API development behind them. For instance, you can chain proxies together, placing Pushpin in the front so your API management system isn’t subjected to long-lived connections. More importantly, Pushpin can translate the WebSocket protocol to HTTP, allowing the API management system to operate on the translated data.

[Diagram: Pushpin real-time reverse proxy with an API management system]

Pushpin Technical Details

Pushpin makes it easy to create HTTP long-polling, HTTP streaming, and WebSocket services using any web stack as the backend. It’s compatible with any framework: Django, Rails, ASP, PHP, Node, and so on. Pushpin works as a reverse proxy, sitting in front of your server application and managing all of the open client connections.

[Diagram: Pushpin as a reverse proxy in front of the backend application]

Communication between Pushpin and the backend server is done using conventional short-lived HTTP requests and responses. There is also a ZeroMQ interface for advanced users.

The approach is powerful for several reasons:

  • The application logic can be written in the most natural way, using existing web frameworks.
  • Scaling is easy and also natural. If your bottleneck is the number of recipients you can push realtime updates to, then add more Pushpin instances.
  • It’s highly versatile. You define the HTTP/WebSocket exchanges between the client and server. This makes it ideal for building APIs.

How it works

Like any reverse proxy, Pushpin relays HTTP requests and responses between clients and a backend server. Unless and until the backend invokes any of Pushpin’s special realtime features, this proxying is purely a pass-through. The magic happens when the backend server decides to respond to a request with special instructions. For example, if the backend server wants to long-poll a request, it can respond with instructions telling Pushpin to hold the connection open and bind it to a channel. Pushpin acts on these instructions rather than forwarding them down to the requesting client. Later on, when the backend wants to respond to a request being held open, it makes a publish call to Pushpin’s local REST API containing the HTTP response data to be delivered.
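
As a rough sketch, such a publish call might look like the following. This assumes the publish port used by the example later in this article (5561) and the GRIP publish format; the channel name here is hypothetical:

POST /publish/ HTTP/1.1
Host: localhost:5561
Content-Type: application/json

{"items": [{"channel": "mychannel", "formats": {"http-response": {"body": "hello\n"}}}]}

A single POST like this completes immediately; Pushpin then writes the contained HTTP response to every held request bound to the “mychannel” channel.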

Below is a sequence diagram showing the network interactions:

[Sequence diagram: network interactions between client, Pushpin, and backend]

As you can see, the backend web application can either respond to an HTTP request normally, or it can respond with holding instructions and send data down the connection at a later time. Either way, the backend never maintains long-lived connections on its own. Instead, it is Pushpin’s job to maintain long-lived connections to clients.

The interfacing protocol between Pushpin and the backend server is called “GRIP”. You can read more about GRIP here.

An example

Let’s say you want to build an “incrementing counter” service that supports live updates. You could design a REST API as follows:

  • Single integer counter exists at resource /counter/value/.
  • POST /counter/value/ to increment and return the counter value (the value after incrementing).
  • GET /counter/value/ to retrieve the current counter value. Optionally, pass parameter last=N to specify the last value known by the client. If the server recognizes this value as the current value, then long-poll until the value changes.

Before we discuss how to implement this API with Pushpin, let’s go over the counter API design in more detail so it’s clear what we are trying to accomplish.

The POST action is straightforward. It’s the GET action that’s more complex, because it needs to long-poll or not, depending on the state of things. Suppose the current counter value is 120. Below, different GET requests are shown with the expected server behavior.

Client requests counter value, without specifying last known value:

GET /counter/value/ HTTP/1.1

Server immediately responds:

HTTP/1.1 200 OK
Content-Type: application/json

120

Client requests counter value, specifying a last known value that is not the current value:

GET /counter/value/?last=119 HTTP/1.1

Server immediately responds:

HTTP/1.1 200 OK
Content-Type: application/json

120

Client requests counter value, specifying last known value that is the current value:

GET /counter/value/?last=120 HTTP/1.1

The server will now wait (long-poll) before responding. Either the server will eventually respond with the next value:

HTTP/1.1 200 OK
Content-Type: application/json

121

Or, the server will time out the request because the counter has not changed within some timeout window. In this case we’ll say the server should respond with an empty JSON object:

HTTP/1.1 200 OK
Content-Type: application/json

{}

At this point we haven’t even gotten to the Pushpin part. We’re just designing and describing a counter API, and there is nothing necessarily Pushpin-specific about the above design. You might come to this same design regardless of how you actually planned to implement it. This helps showcase Pushpin’s versatility in being able to drive any API. In fact, if a counter service already existed with this API, it could be migrated to Pushpin and clients wouldn’t even notice the switch.

Normally, implementing any kind of custom long-polling interface would require using an event-driven framework such as Node.js, Twisted, Tornado, etc. With Pushpin, however, one can implement such an interface using any web framework, even those that are not event-driven. Below we’ll go over how one might implement the counter API using Django.

First, here’s the model code, which creates a database table with two columns, name (string) and value (integer):

from django.db import models
from django.db.models import F

class Counter(models.Model):
    name = models.CharField(max_length=32)
    value = models.IntegerField(default=0)

    @classmethod
    def inc(cls, name):
        # Atomic increment performed at the database level
        cls.objects.filter(name=name).update(value=F('value') + 1)

Just a basic model with an increment method. Our service will use a counter called “main”. Now for the view, where things get interesting:

import json

from django.http import HttpResponse, HttpResponseNotAllowed
from gripcontrol import GripPubControl, create_grip_channel_header

from .models import Counter  # assumed import path for the Counter model above

pub = GripPubControl({'uri': 'http://localhost:5561'})

def value(request):
    if request.method == 'GET':
        c = Counter.objects.get(name='main')
        last = request.GET.get('last')
        if last is None or int(last) < c.value:
            resp = HttpResponse(json.dumps(c.value) + '\n')
        else:
            # Instruct Pushpin to hold the request open, bound to the
            # "counter" channel, instead of responding immediately
            resp = HttpResponse('{}\n')
            resp['Grip-Hold'] = 'response'
            resp['Grip-Channel'] = create_grip_channel_header('counter')
        return resp
    elif request.method == 'POST':
        Counter.inc('main')  # DB-level atomic increment
        c = Counter.objects.get(name='main')
        pub.publish_http_response('counter', str(c.value) + '\n')
        return HttpResponse(json.dumps(c.value) + '\n')
    else:
        return HttpResponseNotAllowed(['GET', 'POST'])

Here we’re using the Python gripcontrol library to interface with Pushpin. It’s not necessary to use a special library to speak GRIP (it’s just headers/JSON over HTTP), but the library is a nice convenience. We’ll go over the key lines:

pub = GripPubControl({'uri': 'http://localhost:5561'})

The above line sets up the library to point at Pushpin’s local REST API. No remote accesses are performed on this line, but whenever we attempt to interact with Pushpin later on in the code, calls will be made against this base URI.

resp = HttpResponse('{}\n')
resp['Grip-Hold'] = 'response'
resp['Grip-Channel'] = create_grip_channel_header('counter')

The above code generates a hold instruction, sent as an HTTP response to a proxied request. Essentially this tells Pushpin to hold the HTTP request (to the client) open until we publish data on a channel named “counter”. If enough time passes without a publish occurring, then Pushpin should timeout the connection by responding to the client with an empty JSON object. Once we respond with these instructions, the HTTP request between Pushpin and the Django application is finished, even though the HTTP request between Pushpin and the client remains open.
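
On the wire, the hold response the Django app sends back to Pushpin would look roughly like this (a sketch; exact headers may vary):

HTTP/1.1 200 OK
Grip-Hold: response
Grip-Channel: counter
Content-Type: application/json

{}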

pub.publish_http_response('counter', str(c.value) + '\n')

The above call publishes an “HTTP response” to Pushpin, with the body of the response set to the value of the counter. This payload is published on the “counter” channel, causing Pushpin to deliver it to any requests that are currently open and bound to this channel.

That’s all there is to it!

Realtime is no longer special

The great part about being able to use existing web frameworks is that you don’t need separate codebases for realtime and non-realtime. It’s not uncommon for projects to implement the non-realtime parts of their API using a traditional web framework, and the realtime parts in a more customized way using a specialized server. Pushpin eliminates the need for multiple worlds here. Instead, your entire API, realtime or not, can be implemented using the same framework (e.g. entirely in Django). Any HTTP resource can be made to stream or long-poll on a whim. All facilities of your traditional web framework, such as authentication or debugging, will work within a realtime context.

Ideal for everyone

Finally, lest Pushpin be misunderstood solely as a way to shoehorn realtime capabilities onto non-event-driven web frameworks, it’s worth emphasizing that the proxying approach makes a lot of sense even if your backend is Node.js. The decoupling of application logic from connection management will make your overall application much easier to manage and maintain. Additionally, introducing proxying layers is the inevitable endgame for high scale data delivery (just look at the topology of a CDN).

Pushpin is open source and available on GitHub. For more information about the motivation and thought process behind Pushpin, see this article. And if you find yourself wishing there was a cloud service that worked like Pushpin, there is.

Fanout

Fanout is a real-time API development kit that helps you push data to connected devices easily. Fanout is a cross between a reverse proxy and a message broker. This unique design lets you delegate away the complexity and load of realtime data push, while leveraging your API stack for business logic.

The Case for a Push CDN

It is true that there are a bunch of software solutions that make pushing data in realtime easier, and if you’re enthusiastic about maintaining your own servers then a cloud service may not be that interesting. It’s important to recognize, though, that Fanout Cloud is about more than just making push easy. It’s about making it scalable.

The key to scaling is delegating work among many machines. Fanout Cloud achieves this by load balancing message deliveries across a set of servers. This delegation is even more important for push than it is for traditional request/response traffic. Aside from cases like a single tweet from Ashton Kutcher driving thousands of people to pounce on your website simultaneously, requests tend to be evenly distributed over a given period of time, with a rise during peak hours. This is because there is generally no coordination between clients, and in the case of web traffic people click links with a degree of random timing. Push traffic, on the other hand, is bursty. Suppose your single-server website handles 200 hits per second at maximum, but every once in a while you need to push a message to 5000 recipients. If we say the work needed to handle a hit is roughly the same as the work needed to push a message, then it would take 25 seconds to make all of the deliveries. That’s a long time. Ideally, pushes would be near instantaneous, but is this practical? According to the math, getting 5000 deliveries out in 1 second would require 25 servers!
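
The arithmetic behind those numbers is easy to sanity-check (a quick sketch using the figures above):

max_rate = 200       # deliveries per second one server can handle
recipients = 5000    # messages to push in a single burst

print(recipients / max_rate)   # 25.0 seconds for one server to finish the burst

target_seconds = 1
print(recipients / (max_rate * target_seconds))   # 25.0 servers for 1-second delivery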

This is where the power of a shared service can come into play. If you need to push a message to 5000 recipients instantly, once per hour, it is likely wasteful to invest in a heap of distributed server infrastructure that will spend 3599 of every 3600 seconds idle. On the other hand, what if 3600 organizations with identical requirements split the cost? Suddenly a state-of-the-art infrastructure becomes not only affordable, but a steal. This is the reason traditional content delivery networks (CDNs) are popular. Just look at Akamai or Amazon CloudFront. These are powerful services that you would have little hope of replicating on your own unless you are in the business of making CDNs.

I think cost alone makes the case here, but some people may point out that there are drawbacks that come with dependence on an external service. Certainly this is true. Often, the choice to use the cloud is not a pure win but a trade-off: simpler administration in exchange for some loss of control or increased latency. However, keep in mind that the trade-offs vary by service type. If you maintain your documents in Google Docs, you lose control, increase latency, and even limit accessibility (offline situations). If you use the Tumblr service instead of a WordPress install on your own server, you lose control, but latency and accessibility should remain about the same.

With Fanout Cloud, you lose a little bit of control in the way you route network transmissions, but that’s about it. You don’t lose control of your data, as Fanout Cloud is not a database and does not store your data. You might think Fanout Cloud would introduce latency, and while the truth is that it does, it’s the kind of necessary latency that is unavoidable as you grow. In other words, if your web service today consists of a single server in a single location, then Fanout Cloud will indeed add latency. If you’ve grown to the point where you need multiple servers in multiple locations (for example, one server in California and one server in Virginia), then suddenly you may be introducing a small but necessary amount of latency for the sake of scalability. To achieve Fanout-level scale or throughput on your own would require making the very same trade-offs, with your own infrastructure.

Fanout Overview

In a nutshell, clients connect to Fanout Cloud to listen for data, and API calls can be made to Fanout Cloud to send data to one or more connected clients. It’s like a publish-subscribe service, but with a twist: incoming client requests are proxied to a configured origin server (e.g. your API backend server), and Fanout Cloud’s behavior is determined by the responses it receives.

The network architecture looks like this:

[Diagram: Fanout Cloud network architecture]

Clients connect to Fanout Cloud, and Fanout Cloud communicates with the origin server using regular, short-lived HTTP requests. The origin server application can be written in any language and use any webserver. There are two main integration points:

  1. The origin server must handle proxied requests from Fanout Cloud. For HTTP, each incoming request is proxied to the origin server. For WebSockets, the activity of each connection is translated into a series of HTTP requests sent to the origin server. The responses from the origin server are used to control which publish-subscribe channels to associate with each connection, among other things.
  2. Your application must send data to Fanout Cloud whenever there is data to push out to listeners. This is done by making an HTTP POST request to Fanout Cloud’s Publish endpoint. The data will then be injected into any client connections as necessary.
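
For illustration, a publish request to Fanout Cloud’s Publish endpoint might look something like the sketch below. The endpoint path, realm placeholders, auth header, and payload shape are assumptions modeled on the GRIP publish format shown earlier, not definitive API details:

POST /realm/{realm-id}/publish/ HTTP/1.1
Host: api.fanout.io
Content-Type: application/json
Authorization: Bearer {realm-key}

{"items": [{"channel": "mychannel", "formats": {"http-stream": {"content": "hello\n"}}}]}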

Additionally, Fanout Cloud supports pushing data using Webhooks, in which case the receivers are not clients but servers able to accept HTTP requests.


Data Push

Data push is at the core of realtime. Simply speaking, it is when a data transaction is initiated by the publisher or generator, instead of by the receiver or client. A good way to think about this is with the publish/subscribe model, whereby a client “subscribes” to various information “channels” provided by a server; whenever new content is available on one of those channels, the server pushes that information out to the client.
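
As a minimal illustration of the model (an in-memory sketch, not a networked implementation):

class PubSub:
    def __init__(self):
        # channel name -> list of subscriber callbacks
        self.channels = {}

    def subscribe(self, channel, callback):
        self.channels.setdefault(channel, []).append(callback)

    def publish(self, channel, data):
        # Delivery is initiated by the publisher, not requested by clients
        for callback in self.channels.get(channel, []):
            callback(data)

broker = PubSub()
broker.subscribe('news', lambda data: print('received:', data))
broker.publish('news', 'new article available')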

Resources

Push technology

What is an API?

An application programming interface (API) is a set of routines, protocols, and tools for building software applications. An API often serves as the middleman that helps two programs communicate with one another. Many companies offer public APIs so other software developers can build products on top of their services.

One of the most common examples of explaining how APIs work is the “restaurant scenario.” When you’re at a restaurant, the waiter acts as the middleman who takes your requests and tells the kitchen what you want. In this scenario, the waiter is acting as an API, the messenger who helps two systems communicate – you, the client, and the kitchen, the system. When your food is ready, the waiter returns the food/response to you. Without the waiter, the restaurant’s service wouldn’t work, just as many applications wouldn’t work without APIs helping systems communicate with one another.
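
In code, the “waiter” is nothing more than a documented request/response contract. A minimal sketch (the endpoint here is hypothetical, for illustration only):

import json
import urllib.request

# Place an "order" with the API and wait for the "kitchen's" response
with urllib.request.urlopen('https://api.example.com/menu/items') as resp:
    items = json.loads(resp.read())
print(items)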

API Examples

Google Maps – The Google Maps API allows developers to embed maps in their own applications.

YouTube – YouTube’s API allows developers to integrate video into their websites.

Instagram – Instagram’s API allows developers to incorporate photos, pull tags, and view trending photos on their own applications.


Spotlight Article: How to Describe, Publish & Consume Real-Time Data by Phil Leggetter

In this article, Phil Leggetter discusses techniques for analyzing and processing realtime data. He goes through an example using RethinkDB. Check out the full article here.

In the first post in the series we covered discovering real-time data within your systems and applications. In part two we went through the use cases for your real-time data. In this final section we’ll cover the how: how to describe, publish & consume real-time data from your systems and expose the data so that you can build real-time features.

The main steps we’re going to cover involve doing the following with the real-time event data:

  • Analyse/Process
  • Describe
  • Publish
  • Consume and Use

Full Source

The Developer’s Guide to Building vs Buying Services

Defining a process for objectively selecting homegrown or purchased solutions

For almost every functional or architectural application component, there is a plethora of ‘as a service’ offerings. We see infrastructure as a service (IaaS), backend as a service (BaaS), SaaS, PaaS, and a new ‘aaS’ seems to be added daily.

What do all these services have in common? Well, they aspirationally promise to give you, the engineer, (1) more freedom to focus on your core product, (2) faster time to market, and (3) production-ready solutions for complex and repeatable engineering operations.

Sometimes this is the case. Sometimes it isn’t. The purpose of this guide is to provide a rational set of objective criteria to assess whether you should build or buy a particular service.

What is build? What is buy?

Build does not necessarily mean that you are making something from scratch. It means that you are combining custom code, open source libraries, and individual/community expertise to construct a solution for your use case. This solution is something that you will design, build, run, maintain, and scale internally.

On the other hand, buy does not necessarily mean that you are purchasing an end-to-end, out-of-the-box solution for your use case. It more accurately represents the purchase of a defined service that adds near-immediate value to your use case. Typically, the viability of the service itself will be guaranteed by the seller and you will not need to design and build the service itself. However, depending on the type of service purchased, you may choose to run and scale it internally. Generally, you will offload the running, maintenance, and scalability to the seller.

The Developer Mind

Before we continue, let’s reset our frame of mind.

Many developers have strong egos, and that’s generally an empowering attribute. Strong egos give devs the confidence to power through complex obstacles, focus for days and weeks at a time, and cultivate entirely new industries. However, there’s a fine line between reasonable and unreasonable confidence.

“I can build ____ in ____ days!”
“Ha! I can build a better ____ in a weekend!”
“This is so expensive. I’m just going to build it.”

We frequently see and hear these comments on dev forums, aggregators like Reddit and HackerNews, and in our day-to-day interactions. If we don’t say it, then some of us probably think it from time to time. Hey, sometimes we’re probably right, but oftentimes our initial ego-driven reaction distances us from the objective criteria we apply to our general practice of programming.

When assessing what to build vs buy, or which ratio we choose, it is critical that we reset our frame of mind and approach our solutioning as open-mindedly and objectively as possible. Excluding the purists, no one cares whether we built our product from scratch or cleverly integrated a series of purchased solutions. What people care about is whether our product works and delivers exceptional value to customers.

With the build vs buy decision-making process, we will answer the question: “How do we deliver exceptional value to our customers quickly, efficiently, and prudently?”

Build vs Buy Decision-Making Model

[Diagram: build vs buy decision-making model]

Step 1 – Identify and categorize your product’s functional scope

Your team has been tasked with building an ecommerce platform that allows users to upvote and downvote products. So, what are your product’s functional and architectural features?

Functional

  • Marketplace service
  • Voting service
  • Product display service
  • Inventory management service
  • Transaction service
  • Buyer, seller, and admin account management service
  • Search, filter, refine service

Architectural and Process

  • Databases
  • Servers
  • Load Balancers
  • Dev Environment / Version Control
  • Continuous Integration / Delivery Pipeline
  • REST / Realtime APIs
  • Frontend Framework
  • Deployment Controls / AB Testing

While these are not comprehensive feature sets, the important point is that there is a clear distinction between core product features (marketplace, voting), and necessary system & process architecture (server environment, CI/CD pipeline). There are features that are proprietary and unique to your product, and there are architectural features that are found in almost every modern application system.

Your job is to identify which of these features are proprietary to your platform and which are replicable proven solutions. To do this, ask the following questions:

  • What are the proprietary, core features that make my application unique?
  • What architectural services do I need for my platform scaffolding?
  • What is my ideal development pipeline going to look like?

Keep in mind, we are not solutioning yet or deciding what to build vs buy. We are identifying and categorizing our product’s functionality.

Step 2 – Define the scope of work and reconcile against constraints

Based on your feature categorization in step 1, it is time to define the scope of work to build each feature.

First, itemize and prioritize the detailed functionality for each feature:

  • What is the minimum functional scope for the feature to be viable?
  • What is the ideal functional scope for the feature?
  • Is this a feature I need now? Or can it wait?

Second, for each feature, answer the following build questions for the minimum and ideal functional scope:

  • How many developer resources do I have available to build this feature? Maintain this feature?
  • Can I harness any domain experts to help design this feature?
  • Has anyone on my team built this before?
  • How much time to design (A), build (B), test (C), deploy (D), maintain (E) this feature?
  • Will building this divert resources from something else?
  • Do I need to hire additional resources? If so, what is the cost breakdown?
  • What is the infrastructure cost to run this internally?

Third, for each core feature, answer the following buy questions for the minimum and ideal functional scope:

  • What is my monthly budget for this service?
  • How do I anticipate my budget changing over time?
  • Can I harness any domain experts to help me assess the best solution?
  • What developer resources do I have available to integrate and configure the solution?
  • If applicable, will I have the resources to self-host, run, maintain, and scale the service?

Step 3 – Solution divergence

Now we can get to the good stuff! In this step, we are not deciding what to build or buy; rather, we are aggregating an inventory of choices.

First, scour the interwebs, get referrals, and assess the solution ecosystem. Have other teams built this successfully? Have they bought it successfully? What are the horror and success stories?

Second, create a build vs buy comparison matrix. Make sure to note the monthly, infrastructure, and long-term maintenance costs. Note the total upfront and ongoing time needed for each build or buy solution (build/buy hybrids are great too!).

Step 4 – Solution convergence

Start narrowing down your options.

Remember that buying does not mean out-of-the-box instant magic. There are always build costs associated with buying:

  • Sandboxing and initial technical vetting
  • Integration and setup
  • Configuration and fine tuning
  • Operational training and staff onboarding

Similarly, building does not necessarily mean that everything is made from scratch, but it does mean that you will assume the costs of ongoing maintenance, scaling, and debugging. You will also need to train staff and develop new operational processes.

Step 5 – Build or buy or both

Choose a primary and secondary solution option for each feature. This way, you will have a backup plan if the primary solution does not pan out. It is absolutely critical that you involve your team during the selection process and make the selection criteria transparent.

Step 6 – Develop guidelines for reassessment

The solution that you’ve selected for day 1 of your product will likely not fit your product at day 600. This is okay, but we must be able to anticipate and preempt any future scaling issues. To do this, set both quantitative and qualitative benchmarks for triggering a build vs buy scaling reassessment. For example, we’re confident that our current architectural solution allows us to handle up to 500k concurrent connections with ease, but our current growth model forecasts 2m connections in 8 months. When we start to near the 300k mark, then this will trigger another build vs buy assessment so we can preempt any issues at scale. This reassessment should include:

  • What have we learned about the needs of our product in the past X months?
  • What has been more difficult than anticipated? What has been easier?
  • How has our resource and knowledge pool shifted?
  • Have our product’s core competencies shifted?
  • Is there anything new and better out there?

Final Thoughts – Try It Your Way

Well, this looks like a lot of work. It may even take a day or multiple days to assess a feature. But realistically, when you take into account the full lifecycle of your product, a few upfront days can save you months and lots of money down the road. Those few days may also make or break your product.

Customize your build vs buy assessment process to meet your organization’s needs. Though a large enterprise is very different from a startup, the assessment metrics remain very similar. Add or remove metrics, codify a more refined process, or make your own from scratch.

Either way, it is important to remember that building a successful product is very hard, so don’t make it harder on yourself than necessary. Let your decision be driven by choosing the right solution for your product, rather than the right solution for you.

Spotlight Article: Bringing The API Deployment Landscape Into Focus by Kin Lane

In this article, Kin Lane (API Evangelist) dives into the current landscape of APIs and the host of definitions that drive the industry.

I am finally getting the time to invest more into the rest of my API industry guides, which involves deep dives into core areas of my research like API definitions, design, and now deployment. The outline for my API deployment research has begun to come into focus and looks like it will rival my API management research in size.

With this release, I am looking to help onboard some of my less technical readers with API deployment. Not the technical details, but the big picture. So I wanted to start with some simple questions to help prime the discussion around API deployment.

Where? – Where are APIs being deployed? On-premise and in the cloud, using traditional website hosting, and even containerized and serverless API deployment.

How? – What technologies are being used to deploy APIs? From spreadsheets, document and file stores, or the central database, to thinking smaller with microservices, containers, and serverless.

Who? – Who will be doing the deployment? Of course, IT and developer groups will be leading the charge, but increasingly business users are leveraging new solutions to play a significant role in how APIs are deployed.

Full Source