Spring 2024 - Technology Improvements
An overview of changes we've made to Outseta's tech stack and backend processes so we can build better software faster and more reliably.
This post is intended for developers—if you're not a software engineer, I suggest you skip this one.
Historically, we haven't shared much about Outseta's tech stack or the behind-the-scenes processes we use to run such a robust product. This post is sort of a toe in the water to see if our developer customers are interested in more of this type of content.
The timing feels right, as we've focused a good deal of our energy over the last few months on changes most customers won't notice. That said, these changes will enable us to deliver better software faster and more reliably well into the future. Our engineering team assured me that this is akin to "changing the engine on an airplane while flying it."
Without further ado, here's a breakdown of that work.
Outseta was originally built on a technology stack that included .NET Framework 4.8, Entity Framework 6, AngularJS 1.x, and MySQL 5.7, all hosted at AWS. Given the state of tools and technology, this stack made sense at the time we started Outseta, but some of these choices are starting to hold us back. Over the last quarter, we have been making a number of changes aimed at modernizing the platform. We’ve also spent some time making portions of the existing system more robust.
Database upgrade
At the end of 2023, we upgraded our database from MySQL 5.7 to MySQL 8. AWS made this upgrade fairly straightforward. After running some tools to test our application for any version-specific incompatibility issues, we set up replication between our production MySQL 5.7 database and a new instance running MySQL 8.0. Once the replication was fully caught up, we put the system in read-only mode for a brief period, and then switched over the primary database to be the MySQL 8 database. We tested this whole process in a separate environment first, but we were still very pleasantly surprised by how smoothly the cutover went.
One issue with this migration process was that we needed a point in time when nothing was being written to the database, so that replication could fully catch up before the cutover. We were concerned that putting the database in read-only mode would cause problems with users logging in, and that would impact our customers’ ability to run their applications. When a user logs in, we read information about the user to validate their credentials, but we also write to the database: we update when they last logged in, record the activity so we can call any webhooks tied to logins, and schedule work to check whether the login means the person should be added to or removed from segments that are based on last login date. Rather than writing any of this directly to the database, we changed the login path to record the event on an AWS SQS queue, and we process that queue message separately. During the database migration, we could still write to the queue, and we delayed the queue processing until the database was ready again.
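To make that concrete, here's a minimal sketch of the idea using the AWS SDK for .NET. The message shape and class names are hypothetical, not our actual code:

```csharp
// Sketch only: the queue, message shape, and class names are illustrative.
using System;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;
using Newtonsoft.Json;

public class LoginRecordedMessage
{
    public Guid PersonUid { get; set; }
    public DateTime OccurredUtc { get; set; }
}

public class LoginEventPublisher
{
    private readonly IAmazonSQS _sqs;
    private readonly string _queueUrl;

    public LoginEventPublisher(IAmazonSQS sqs, string queueUrl)
    {
        _sqs = sqs;
        _queueUrl = queueUrl;
    }

    // The login request no longer updates "last login", fires webhooks, or
    // re-evaluates segments directly; it only enqueues a small message.
    // A separate worker drains the queue and performs those writes later.
    public Task PublishAsync(LoginRecordedMessage login) =>
        _sqs.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl = _queueUrl,
            MessageBody = JsonConvert.SerializeObject(login)
        });
}
```

During the migration window, the publisher side kept accepting messages while the consumer was simply paused, which is what let logins keep working with the database in read-only mode.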
Queuing to respond to bursts of traffic
We also started using AWS SQS queues in another part of the application. When our customers send out email, our email provider, Sendgrid, calls back to us to tell us what has happened. They provide email events indicating when the email was processed, delivered, opened, clicked, etc. When we send out a large volume of emails in a short period of time, Sendgrid responds with a flood of calls back to us. This burst of traffic was causing problems for our system, so rather than trying to process these callback requests directly, we now send them all to a queue, and process them at a pace that does not overload the system.
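As a rough sketch of the consuming side (the handler, batch size, and polling values below are illustrative assumptions, not our production settings):

```csharp
// Illustrative consumer sketch; the handler and pacing values are hypothetical.
using System.Threading;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class EmailEventDrainer
{
    private readonly IAmazonSQS _sqs;
    private readonly string _queueUrl;

    public EmailEventDrainer(IAmazonSQS sqs, string queueUrl)
    {
        _sqs = sqs;
        _queueUrl = queueUrl;
    }

    // The webhook endpoint only enqueues Sendgrid's event payloads; this loop
    // pulls them off in small batches so a burst of callbacks can't overload
    // the database or the web servers.
    public async Task DrainAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = _queueUrl,
                MaxNumberOfMessages = 10,   // small batches = a controlled pace
                WaitTimeSeconds = 20        // long polling while the queue is empty
            }, cancellationToken);

            if (response.Messages == null || response.Messages.Count == 0)
                continue;

            foreach (var message in response.Messages)
            {
                await ProcessEmailEventAsync(message.Body); // parse + persist the event
                await _sqs.DeleteMessageAsync(_queueUrl, message.ReceiptHandle, cancellationToken);
            }
        }
    }

    private Task ProcessEmailEventAsync(string body) => Task.CompletedTask; // placeholder
}
```

Because the queue absorbs the burst, the endpoint Sendgrid calls only has to enqueue the payload and return.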
Robustness of back-end processes
Processing messages from our queues is something that happens separately from the web application. We have a set of virtual machines that run a variety of processes based on a centralized schedule. These processes handle things like consuming our queues, sending out emails, renewing subscriptions, making webhook calls, etc. We do not have the capacity to run all of these processes all of the time. Instead, each process needs to perform its work and then end, so that it can be directed to perform a different job when it starts up again. We’ve made a number of changes around how these jobs are scheduled and run so that we can prioritize time-sensitive tasks better, and also make sure everything that should be run is handled correctly.
One issue we found was that we would sometimes devote all of our capacity to important but not urgent jobs, and the time-sensitive work would be neglected. To address this, we permanently allocated some of our capacity to time-sensitive work.
Another issue we found was that sometimes our back-end processes would be focused entirely on a single customer’s work at the expense of our other customers. The issue was that we were processing work in the order it came in. If customer A sends a large burst of traffic and that creates a large number of work items in a short period of time, then customers B and C might find their work scheduled after customer A’s. To address this, we updated how we pick the work to perform. While there is work waiting from multiple customers, we select the work in a round-robin fashion. Customers A, B, and C all get equal time to have their work performed. If we finish all of the work for B and C, we then focus on the remaining work from customer A.
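Here's a hypothetical sketch of that selection logic, with in-memory structures standing in for the real work tables (the class names and shapes are illustrative, not our scheduler's actual code):

```csharp
// Hypothetical sketch of round-robin selection across customers.
using System;
using System.Collections.Generic;
using System.Linq;

public class WorkItem
{
    public Guid CustomerUid { get; set; }
    public DateTime QueuedUtc { get; set; }
    public string Payload { get; set; }
}

public class RoundRobinWorkPicker
{
    private readonly Queue<Guid> _customerRotation = new Queue<Guid>();

    // Instead of taking items strictly in arrival order (which lets one
    // customer's burst starve everyone else), rotate through the customers
    // that currently have pending work and take the oldest item for each.
    public WorkItem PickNext(IReadOnlyCollection<WorkItem> pending)
    {
        var byCustomer = pending
            .GroupBy(w => w.CustomerUid)
            .ToDictionary(g => g.Key, g => g.OrderBy(w => w.QueuedUtc).First());

        if (byCustomer.Count == 0)
            return null;

        // Add any customers we haven't seen yet to the rotation.
        foreach (var uid in byCustomer.Keys)
            if (!_customerRotation.Contains(uid))
                _customerRotation.Enqueue(uid);

        // Cycle until we find a customer that still has pending work.
        while (true)
        {
            var next = _customerRotation.Dequeue();
            if (!byCustomer.TryGetValue(next, out var item))
                continue; // no pending work right now; drop from the rotation
            _customerRotation.Enqueue(next); // back of the line for next time
            return item;
        }
    }
}
```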
All of the back-end processes coordinate with a central job scheduler. As each back-end process completes its job, it informs the scheduler that the job is complete. The job can then be scheduled to be started again later. There are cases where a process fails to tell the scheduler that the job is complete though. This might happen if the virtual machine that the process is running on is suddenly terminated. To handle this, we made a couple of changes to the jobs and scheduler. We set a value for each job to indicate how long it is expected to run. When it has been running for that duration, it knows to end, and any remaining work can be picked up the next time the job is run. Because we know the expected duration of any job, we can now recognize when a job failed without informing the scheduler. The scheduler can then restart these jobs without any human intervention. Previously, these jobs would simply be stalled until someone on the team was alerted that a job appeared to be running for too long.
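Here's a simplified sketch of both ideas, assuming a hypothetical ScheduledJob base class and a JobRecord the scheduler keeps for each job (the names and shapes are illustrative):

```csharp
// Hypothetical sketch: a job that stops itself once its expected duration
// elapses, and a scheduler check that restarts jobs which never reported back.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public abstract class ScheduledJob
{
    // How long this job is expected to run before handing control back.
    public abstract TimeSpan ExpectedDuration { get; }

    // Perform one unit of work; return false when there is nothing left to do.
    protected abstract Task<bool> DoWorkAsync();

    public async Task RunAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        // Stop once the time budget is used up; leftover work is picked up
        // the next time the scheduler hands this job out.
        while (stopwatch.Elapsed < ExpectedDuration)
        {
            if (!await DoWorkAsync())
                break;
        }
    }
}

public class JobRecord
{
    public string Name { get; set; }
    public DateTime? StartedUtc { get; set; }
    public DateTime? CompletedUtc { get; set; }
    public TimeSpan ExpectedDuration { get; set; }

    // A job whose process died (for example, the VM was terminated) never
    // reports completion, so the scheduler treats anything running well past
    // its expected duration as failed and schedules it to start again.
    public bool AppearsStalled(DateTime nowUtc, TimeSpan grace) =>
        StartedUtc.HasValue
        && !CompletedUtc.HasValue
        && nowUtc - StartedUtc.Value > ExpectedDuration + grace;
}
```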
Finally, we would like to have a lot more flexibility to scale our system up and down to meet demand on our back-end processes. Currently, because of our choice to build on the .NET Framework, our processes each run on virtual machines. Spinning up virtual machines is slow and potentially expensive. If we can move the processes to the latest version of .NET, we can have these processes run as AWS Lambda functions. At that point, AWS Lambda can scale up the capacity however we need. Moving from the .NET Framework to .NET 8 has its own challenges though.
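For a sense of where that could land, a queue consumer running as a .NET 8 Lambda function might look roughly like this (the handler body is a placeholder, not our actual code):

```csharp
// Sketch of a .NET 8 AWS Lambda function triggered by SQS; illustrative only.
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;

public class QueueWorker
{
    // Lambda invokes this handler with a batch of SQS messages and scales the
    // number of concurrent invocations with the depth of the queue, which is
    // the elasticity we can't get from a fixed pool of virtual machines.
    public async Task HandleAsync(SQSEvent sqsEvent, ILambdaContext context)
    {
        foreach (var record in sqsEvent.Records)
        {
            context.Logger.LogLine($"Processing message {record.MessageId}");
            await ProcessAsync(record.Body);
        }
    }

    private Task ProcessAsync(string body) => Task.CompletedTask; // placeholder
}
```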
Moving to .NET 8
Most of our code and most of the libraries we depend on are compatible with .NET 8. Even Entity Framework 6 (EF6), which handles how we write our entities to the database, is compatible with .NET 8. As we start to move towards .NET 8, though, we would also like to switch from EF6 to Entity Framework Core (EF Core).
We’ve started the process of swapping our EF6 implementation with an EF Core implementation. Our models need very little work, though we have found that EF Core and .NET 8 are more finicky about specifying which string-based fields can store null values. The biggest issue we have found has to do with how we tell Entity Framework what parts of the entity hierarchy should be updated.
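As an example of the nullability issue: with nullable reference types enabled, EF Core infers required versus optional columns from the property type, so each string property has to state whether null is allowed. The Plan entity below is a simplified placeholder, not our real model:

```csharp
// Minimal sketch of the nullability distinction under EF Core and .NET 8.
using System;
using Microsoft.EntityFrameworkCore;

public class Plan
{
    public Guid Uid { get; set; }
    public string Name { get; set; } = null!;  // non-nullable -> NOT NULL column
    public string? Description { get; set; }   // nullable -> column allows NULL
}

public class OutsetaDbContext : DbContext
{
    public DbSet<Plan> Plans => Set<Plan>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // The same thing can be stated explicitly in the model configuration,
        // which is how columns that EF6 happily left nullable get pinned down.
        modelBuilder.Entity<Plan>()
            .Property(p => p.Description)
            .IsRequired(false);
    }
}
```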
When a request comes in through our API, it often contains a partial representation of our entities. For instance, if we’re updating a Plan to say that some AddOns are available on the Plan, the PUT request may contain a top level entity that represents the Plan, and a collection of PlanAddOn entities where each entity has a reference to the AddOn. The Plan entity, the PlanAddOn entities, and the AddOn entity are each identified by their respective Uids. These entities that are submitted as part of the request might contain no other data besides the Uids. The Plan name, its rates, the AddOns’ names and rates, etc. may all be missing from the PUT request.

We currently handle this request through a library called GraphDiff. With GraphDiff, we indicate which entities are “owned” versus “associated”. If an entity is “owned”, we will update the database with any values that are submitted, but we will not touch any properties that were missing from the request. If an entity is “associated”, we will not touch any properties for that entity, even if the request contained them. In this case, the Plan and its PlanAddOns collection are “owned”. The AddOn itself is “associated”. When we process this request, we’ll update any data that was sent in about the Plan, and we’ll update which AddOns are associated with the Plan, but this request cannot modify the AddOn itself.
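Roughly, the EF6 code expresses that distinction with a GraphDiff mapping along these lines (the entity classes are simplified for illustration):

```csharp
// Simplified entity shapes for illustration only.
using System;
using System.Collections.Generic;
using System.Data.Entity;          // EF6
using RefactorThis.GraphDiff;      // GraphDiff extension methods

public class Plan { public Guid Uid { get; set; } public string Name { get; set; } public ICollection<PlanAddOn> PlanAddOns { get; set; } }
public class PlanAddOn { public Guid Uid { get; set; } public AddOn AddOn { get; set; } }
public class AddOn { public Guid Uid { get; set; } public string Name { get; set; } }

public static class PlanUpdater
{
    public static void UpdatePlan(DbContext db, Plan incomingPlan)
    {
        // The Plan and its PlanAddOns collection are "owned": submitted values
        // are written, properties missing from the request are left untouched.
        // Each referenced AddOn is "associated": the link is updated, but the
        // AddOn entity itself is never modified by this call.
        db.UpdateGraph(incomingPlan, map => map
            .OwnedCollection(p => p.PlanAddOns, with => with
                .AssociatedEntity(pa => pa.AddOn)));

        db.SaveChanges();
    }
}
```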
There isn’t a version of GraphDiff that works with EF Core. That’s because EF Core does some of this interpretation of what to update on its own. It doesn’t handle everything in the same way that GraphDiff does, so we’re still working through how to migrate the existing code to remove the GraphDiff dependency while maintaining the same logic of what can and can’t be updated by an incoming call.
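One possible direction, sketched here with EF Core's ChangeTracker.TrackGraph and reusing the simplified entity classes from the previous sketch, is to visit each entity in the submitted graph and decide its state ourselves. This is an illustration of the general shape of a replacement, not the approach we've settled on:

```csharp
// Sketch only: Plan, PlanAddOn, and AddOn as defined in the previous example.
using Microsoft.EntityFrameworkCore;

public static class PlanGraphUpdater
{
    public static void UpdatePlan(DbContext db, Plan incomingPlan)
    {
        db.ChangeTracker.TrackGraph(incomingPlan, node =>
        {
            switch (node.Entry.Entity)
            {
                case AddOn _:
                    // "Associated": reference it by key, never write its values.
                    node.Entry.State = EntityState.Unchanged;
                    break;
                default:
                    // "Owned" (Plan, PlanAddOn): write the submitted values.
                    // Marking the entity Modified writes every mapped property,
                    // so preserving "don't touch what was omitted" still needs
                    // per-property handling, which is the part we're working through.
                    node.Entry.State = EntityState.Modified;
                    break;
            }
        });

        db.SaveChanges();
    }
}
```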
Decoupling the AngularJS admin site from the .NET solution
When we first built the admin site with AngularJS, we served up the initial JavaScript files and resources from the same ASP.NET website that hosts the API. This makes deployment simple, and .NET has tools for managing the JavaScript dependencies and for minifying and packaging the release. We think it’ll be an improvement to separate the AngularJS admin site from the .NET solution. We’ve already seen some of these benefits when we migrated the embeds to React and separated them from the .NET solution. The embeds are served as static files through a CDN, which takes load off of our web servers, and the deployment of the embeds is considerably faster than releasing the entire .NET solution.
There is a lot more code associated with the AngularJS-based admin site. We’re not currently focused on moving away from AngularJS, but we are in the process of separating the site from the .NET solution. Similar to the embeds, the admin site will be served as static files through a CDN, and at that point the .NET solution will only contain the APIs and the back-end processes. One of the benefits of this change is that all of the JavaScript code can then be developed on non-Windows machines. Currently, the back-end and admin site require Windows to run because of the .NET Framework, but if we can develop the JavaScript code on macOS, a larger portion of the team will be able to work on it, and we should hopefully see faster changes to the application. Faster changes mean more features for our customers!
If more of this type of content is interesting to you, please let us know and we'll publish more similar updates that may be of interest to other development teams.