How we retrofitted a framework into a scalable, self-installed SaaS tool. This is the story of Frame’s engineering journey: from building a framework for exploring conversational AI products to rapidly adapting our architecture into a scalable, self-installed product built to assist B2B customer success relationships.
Imagine several months of exciting discovery and experimentation building embedded conversational assistants for different use cases and courting potential customers. Then, as a pilot assisting B2B customer success conversations was taking off, a critical announcement from Slack cemented our confidence in going all-in on a SaaS product making team chat the ideal place to build B2B relationships.
Here’s how we made the drastic change from supporting a few teams to hundreds with 4 weeks of well-placed engineering effort.
Early on, we designed Frame’s architecture to enable rapid exploration in the conversational AI space. To support a variety of use cases, we built flexible, interconnecting components able to source messages from various platforms and drive stateful, human-assisting business logic modules triggered by contextual information. Using these components, we built a system for delivering English language tutoring over WhatsApp and Telegram, a platform for assisting conversations between buyers and real estate agents, a tool to get real-time mortgage quotes and support over Facebook Messenger, an integration for live chat with an auto insurance claims agent, and many other pilots. The generality and expressiveness built into our system dramatically accelerated our early stage hustle for product-market fit.
Each of the above product explorations leveraged many core technology components, but they also incorporated a fair amount of custom logic. We recognized early on the need to be able to develop, deploy, and maintain these divergent bloodlines of our product easily so we wouldn’t get bogged down with devops or the mental overhead of maintaining incompatible codebases. Our early architecture choices enabled rapid market exploration and later proved invaluable for doubling down on assisting customer success conversations.
Suddenly, committing to a self-installed product
In mid 2017, we were deep into an exploration of conversational assistance for customer success with our outstanding partner, Simon Data. Around the same time, Slack released Shared Channels, signaling loudly and clearly that their vision aligned with ours: team chat is the best place to conduct B2B support. This alignment was phenomenal news, but it meant we then needed a completely self-installable, SaaS-oriented version of our conversational assistant for customer success conversations available immediately.
We huddled and excitedly took stock of our tech stack with the aim of building a self-installed product. **Our single-app infrastructure model emerged clearly as the primary blocker to a public product launch.**
We needed to find a quick way to stand up a generic ring of service instances that could divide the work of serving multiple copies of our product as users installed it, rather than creating a specific application configuration and deployment for each new customer as we had been doing. In other words, we needed infrastructure standing ready to serve new customers on demand.
Not long after our huddle, we proposed the following sequence of steps as both a fast and desirable solution to our problem:
- Use consistent hashing to introduce a core service instance ring
- Push events to core service instances by consistent_hash(app_id)
- Create a cache of System objects in the core to receive event dispatches
- Modify our tools to deploy service rings or custom apps
- Create an app configuration template and associated behavior tree for the new product business logic
We selected these steps because they leveraged proven techniques for distributing application workloads and because our flexible configuration, transport, and business logic systems provided enough flexibility to implement a self-installed product quickly and without a major redesign.
For consistent hashing, we went with our own Python implementation because most of the existing libraries were more feature-rich than we needed. The algorithm is simple and elegant: it gave us a way to take an event tied to a specific Frame app and route it to a generic service instance by hashing the app id, rather than routing by app id directly as we had done previously. Consistent hashing is simple to implement using library functions for hashing and binary search.
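A minimal ring along those lines can be sketched with the standard library’s `hashlib` and `bisect` modules. The class and method names here are illustrative, not Frame’s actual code:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring mapping app ids to service instances."""

    def __init__(self, instances, replicas=64):
        # Virtual nodes (replicas) smooth out the distribution of apps
        # across instances.
        self._ring = []  # sorted list of (hash, instance) pairs
        for instance in instances:
            for i in range(replicas):
                h = self._hash(f"{instance}:{i}")
                bisect.insort(self._ring, (h, instance))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_instance(self, app_id):
        # Walk clockwise from the app id's hash to the next ring position.
        h = self._hash(app_id)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

As long as the instance set is unchanged, `get_instance` always maps a given app id to the same instance; when an instance is added or removed, only the keys adjacent to its ring positions move.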
All of our event producers already used the same library functions when publishing events to the stream for a particular app. Preceding each of these calls, we added a service instance lookup to discover the correct queue to receive the events.

*Discovering the proper service instance for every event using a consistent hash of the `app_id`.*
Moving on to the consumer at the core of our application server stack, we added a lookup for a System object encapsulating the entire application runtime state before dispatching an event to the application’s business logic. As long as the number of service instances stays fixed, each app maps to the same instance, keeping our cache consistent and avoiding costly missed lookups.

*Dispatching incoming events to the correct System instance.*
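The consumer-side cache might look like the following sketch, where `load_system` stands in for whatever builds an app’s full runtime state from its configuration (all names here are hypothetical):

```python
class SystemCache:
    """Lazily builds and caches one System object per app on this instance."""

    def __init__(self, load_system):
        self._load_system = load_system  # app_id -> System
        self._systems = {}

    def dispatch(self, event):
        app_id = event["app_id"]
        system = self._systems.get(app_id)
        if system is None:
            # First event for this app on this instance: build and cache.
            system = self._load_system(app_id)
            self._systems[app_id] = system
        system.handle(event)
```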
Finally, we modified our deployment, logging, and alerting tools to construct a service ring, embed the service instance id and app id on every log line, and manage alerts at the level of the service instance. Filtering logs so that each app appears to have its own dedicated instance became a simple query on the log output: a small extra cost and a win for easy debugging.

*High-level model of the per-app infrastructure versus the ring, where incoming messages are routed to service instances based on the target `app_id`.*
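Stamping every log line with those two identifiers can be done with a standard `logging.Filter`; this is a sketch under assumed identifier names, not Frame’s actual tooling:

```python
import logging

# Hypothetical deploy-time identifier for this service instance.
SERVICE_INSTANCE_ID = "core-2"

class InstanceContextFilter(logging.Filter):
    """Stamp every record with the service instance id (and a default
    app id) so per-app log views are a simple query on the output."""
    def filter(self, record):
        record.service_instance = SERVICE_INSTANCE_ID
        if not hasattr(record, "app_id"):
            record.app_id = "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s instance=%(service_instance)s app=%(app_id)s %(message)s"))
logger = logging.getLogger("frame.core")
logger.addHandler(handler)
logger.addFilter(InstanceContextFilter())
logger.setLevel(logging.INFO)

# Callers pass the app id per event via `extra`.
logger.info("event dispatched", extra={"app_id": "app-42"})
```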
In retrospect, several of our early investments in a framework sped along our adaptation to a self-installed SaaS product. They’re each interesting systems in their own right, and discussing them here might inspire others who are facing similar challenges.
A key feature that kept us speeding along in our market explorations was a configuration system, based on YAML syntax, that specifies the core and custom components of every Frame application. Using this flexible tool, we treated our codebase as a collection of generic capabilities for different messaging platforms and a suite of custom product modules, tied together using a single configuration object.
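Such a configuration might look roughly like the following; every field name here is invented for illustration, since the source doesn’t show Frame’s actual schema:

```yaml
# Hypothetical Frame app configuration (illustrative field names).
app_id: acme-support
transports:
  - type: slack
    channel_mode: shared
components:
  - module: core.conversation_tracker
  - module: products.customer_success.escalation
    config:
      sla_minutes: 60
feature_flags:
  daily_digest: true
```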
As our customer-facing surface area increased by running pilots alongside deployed contract commitments, we added more tools to our configuration system that made it easier to control feature flags. These improvements allowed us to deploy more often while achieving reliable uptime and SLA performance across a diverse set of product behaviors.
Separate transport layers
We also gained speed and flexibility by separating the transport layers serving our instances and the event protocols sitting on top of them.
We built a generic core service structured as a transport consumer to dispatch events to business logic controlling the state for each conversation. We also introduced a single System object holding all global and runtime state for each app.
Combined, these abstractions created a strong boundary around the logical unit of an application as a deployed event consumer and processing loop. These features made it easy to change the grouping of logical applications to physical service instances, which was good for maintainability while we explored product use cases and a primary ingredient when we built our self-installed product.
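The boundary described above can be sketched as a transport interface plus a generic consumer loop; the names and shapes below are assumptions for illustration, not Frame’s actual code:

```python
import json
from abc import ABC, abstractmethod

class Transport(ABC):
    """Delivers raw payloads; knows nothing about event semantics."""

    @abstractmethod
    def poll(self):
        """Return the next raw payload (bytes)."""

class CoreService:
    """Generic consumer loop: decode protocol events off any transport
    and hand them to the app's business logic via its System object."""

    def __init__(self, transport, decode, system):
        self._transport = transport
        self._decode = decode    # protocol layer, independent of transport
        self._system = system    # holds all global and runtime state

    def run_once(self):
        payload = self._transport.poll()
        event = self._decode(payload)
        self._system.handle(event)
```

Swapping messaging platforms then means swapping the `Transport` implementation; the protocol decoder and business logic stay untouched.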
We also built out a novel internal framework, combining the benefits of actor frameworks with behavior trees, for writing business logic components and composing them into trees of functionality invoked on each incoming message.
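A toy version of the behavior-tree half of that idea looks like the sketch below, with illustrative node types; Frame’s internal framework surely differs:

```python
# Minimal behavior-tree sketch: composable nodes ticked on each message.
SUCCESS, FAILURE = "success", "failure"

class Sequence:
    """Runs children in order; fails fast on the first failure."""
    def __init__(self, *children):
        self.children = children
    def tick(self, message):
        for child in self.children:
            if child.tick(message) == FAILURE:
                return FAILURE
        return SUCCESS

class Condition:
    """Leaf node: succeeds when its predicate holds for the message."""
    def __init__(self, predicate):
        self.predicate = predicate
    def tick(self, message):
        return SUCCESS if self.predicate(message) else FAILURE

class Action:
    """Leaf node: runs a side effect and succeeds."""
    def __init__(self, effect):
        self.effect = effect
    def tick(self, message):
        self.effect(message)
        return SUCCESS

# Example tree: escalate only messages that mention an outage.
alerts = []
tree = Sequence(
    Condition(lambda m: "outage" in m["text"]),
    Action(lambda m: alerts.append(m["text"])),
)
```

Reusable leaves plus a handful of composites give a single contract for error handling: every node ticks and reports a status, no matter what product it belongs to.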
Forging a library of reusable logic alongside custom explorations and having a single set of contracts and standards for error handling and reliability allowed us to develop new features and pilots faster. Packaging all of these up into a generic Docker image and injecting a few deploy-time variables to specify transports and application configuration gave us an easy-to-use pattern for deploying new (and different) Frame applications as often as we needed them.
While many engineering projects at Frame benefit from smooth execution and satisfactory outcomes, this one was especially sweet. We executed our conversion plan in a tight 4-week timeframe and established ourselves as a must-have add-on in the rapidly emerging B2B team chat space.
The flow for the self-installed product ended up being only a relatively small adjustment from the custom deployment pattern that came before it, with each of the planned steps leveraging the flexibility and layers of abstraction of our early designs to allow a single service instance to handle a large number of Frame applications rather than a single one at a time.
We’ve been able to focus our efforts on making our product easier to install and adopt because our early investments in flexible configuration and message processing allowed us to scale easily from a few pilots to hundreds of installed teams in very short order. An exciting possibility remains for using the ability to deploy custom apps as a QoS mechanism, by moving any high-traffic app ids to their own dedicated container deployment, and, if increased traffic warrants it, there’s also much more we can do to support zero-downtime deploys and service autoscaling.
Over the coming year, Frame will continue to introduce new capabilities that help our customers understand and improve their B2B relationships. We’ll increase Frame’s usability both inside and outside of messaging platforms, and add new reports and actionable insights to delight our customers.
If assisting B2B conversations and building exceptional technology with a supportive and talented team sounds appealing to you, then please check out our open positions and reach out today.