Serverless Functions Post-Mortem

sizeoftheuniverse@programming.dev · 1 year ago

Serverless Functions Post-Mortem

snowe@programming.dev · 1 year ago

Hm. this is a very strange article for me to read because in my experience, only 1 or 2 things in the whole article have been true for our company (17k employee company with 300 people in the tech org).

Users would ingress through the API Gateway technology, which handles everything from traffic management, CORS, authorization and API version management. It basically serves as the web server and framework all in one. Easy to test new versions with multiple versions of the same API at the same time, easy to monitor and easy to set up.

We don’t use API Gateway. The best use for lambdas is as a direct call, using the ARN. You don’t need to worry about CORS, permissions, etc. You either have access to call the lambda or you don’t. You can directly control exactly who can call your service, and you never need to set up IAM at all.

Local development. Typically a developer pulls down the entire application they’re working on and runs it on their device to be able to test quickly. With serverless, that doesn’t really work since the application is potentially thousands of different services written in different languages. You can do this with serverless functions but it’s way more complicated.

I think if you’re literally recreating your monolith in a lambda then you’re doing something fundamentally wrong. Our entire team only has a few lambdas (10-15) and they’re very easy to manage. But yes, testing locally is an issue. We’ve solved this with testcontainers, which would be the same solution if you were just deploying docker services to k8s or openshift or even directly to a VM. This is the first very large issue with lambdas that the article is correct about though.

Hard to set resources correctly. How much memory did this function need under testing can be very different from how much it needs under production. Developers tended to set their limits high to avoid problems, wiping out much of the cost savings. There is no easy way to adjust functions based on real-world data outside of doing it by hand one by one.

I do not understand this. How are your resources changing like that? We’ve only had to touch the resources for our very large functions, and even then we’ve touched them only once or twice in 3 years. This is absolutely a non-issue. Set it to the lowest to start, then when it times out update it to the next level. It really isn’t difficult.

Since even a medium sized application can be made up of 100+ functions, this is a non-trivial thing to do.

please. why in the world would you think a ‘medium sized application’ would have 100+ functions? That’s absolutely insane. there’s no way to manage that. That’s not using serverless properly. We have a medium sized ‘application’ (250k+ lines of kotlin, with tens of millions of lines of generated Java from Drools Rules) and it’s 10-15 lambdas. You should not have hundreds of lambdas for a medium sized app. That’s just idiotic, I’m sorry, but that was never what serverless was meant for.

Is it working? Observability is harder with a distributed system vs a monolith and serverless just added to that. Metrics are less useful as are old systems like uptime checks. You need, certainly in the beginning, to rely on logs and traces a lot more. For smaller teams especially, the monitoring shift from “uptime checks + grafana” to a more complex log-based profile of health was a rough adjustment.

I can agree with some of this partially, but I’m not sure why the author thinks that getting rid of uptime checks is a problem. I’ve never once had to worry about whether our lambdas are up. There’s no uptime! It either works every time or it doesn’t work at all. It’s pretty awesome actually. Of course you do need to test it when you deploy, but that’s a simple http call and boom you know whether the deploy worked or not.

Traces are also a great way of tracking where you have slowness in your system. I’m guessing a lot of this depends on which ‘ecosystem’ you choose, but with Quarkus and XRay, tracing is dead simple. Add a dependency, you’ve got tracing. Done.

Now, the big problem here is the error passing, which the author talks about later.

Latency. Traditional web frameworks and containers are fast at processing requests, typically hitting latency in database calls. Serverless functions were slow depending on the last time you invoked them. This led to teams needing to keep “functions warm.”

well sure, but if you have 100+ functions then you’re multiplying your instantiation costs by 100+. This is not a non-issue for fewer lambdas, but it’s much less of a problem.

Later Provisioned Concurrency was added, which is effectively…a server. It’s a VM where your code is already loaded. You are limited per account to how many functions you can have set to be Provisioned Concurrency, so it’s hardly a silver bullet. Again none of this happens automatically, so its up to someone to go through and carefully tune each function to ensure it is in the right category.

correct. but it’s a server you don’t have to manage. I don’t know why the author calls this out this way, you have to manage autoscaling servers at a much finer grained level. Provisioned Concurrency is literally “how many functions do you want to be running at any point in time by default”. There’s not much else to it, besides the next point.

But it is very possible for one function to eat all of the capacity for every other function. Again it requires someone to go through and understand what Reserved Concurrency each function needs and divide that up as a component of the whole.

this is a major issue. no other thing to say about it. I do not understand why this is the case, but yes it’s a huge problem, and makes scaling across an org very difficult. The one solution to this is to split your org into separate aws accounts (not sure how gcp manages it, but we do use gcp too), which helps with that, but it’s still a weird restriction.

In addition, serverless functions don’t magically get rid of database concurrency limits. So you’ll hit situations where a spike of traffic somewhere else kills your ability to access the database. This is also true of monoliths, but it is typically easier to see when this is happening when the logs and metrics are all flowing from the same spot.

Hm. maybe this depends on using RDS, because we’ve never seen this with Dynamo.

In practice it is far harder to scale serverless functions than an autoscaling group. With autoscaling groups I can just add more servers and be done with it. With serverless functions I need an in-depth understanding of each route of my app and where those resources are being spent. Traditional VMs give you a lot of flexibility in dealing with spikes, but serverless functions don’t.

This has not been our experience. Lambdas have been simple set it and forget it, allowing our team (and company) to focus on the business rather than infra. We only spend time configuring lambdas when we are creating new ones, which isn’t too often. It’s been 3 years since we started using lambdas, and I would say we create maybe 3-5 a year and they’re all for new features, not new individual functions. The companies business has grown more than 3x in that time and we have daily spikes and it all just works.

Teams switched from always having a detailed response from the API to just returning a 200 showing that the request had been received. That allowed teams to stick stuff into an SQS queue and process it later. This works unless there is a problem in processing, breaking the expectations from most clients that 200 means the request was successful, not that the request had been received.

this is by far the most annoying thing about lambdas. they are http under the covers, but you can’t modify any http headers, response codes, etc. It’s either ‘throw an exception’ or ‘200’. nothing in between. very annoying.

snowe@programming.dev · 1 year ago

continued

Functions often needed to be rewritten as you went, moving everything you could to the initialization phase and keeping all the connection logic out of the handler code. The initial momentem of serverless was crashing into the rewrites as teams learned painful lesson after painful lesson.

we have never encountered this. this is probably exacerbated by the fact that the author thinks having 100+ lambdas for a medium sized app is normal. you focus even more on the startup time, rather than solving business problems.

Price. Instead of being fire and forget, serverless functions proved to be very expensive at scale. Developers don’t think of routes of an API in terms of how many seconds they need to run and how much memory they use. It was a change in thinking and certainly compared to a flat per-month EC2 pricing, the spikes in traffic and usage was an unpleasant surprise for a lot of teams.

Combined with the cost of RDS and API Gateway and you are looking at a lot of cash going out every month.

lambdas are saving us tens of thousands of dollars a month because we don’t need to worry about massive monoliths and the required ec2 autoscaling instances needed, nor the insane costs of RDS.

The other cost was the requirement that you have a full suite of cloud services identical to production for testing. How do you test your application end to end with serverless functions? You need to stand up the exact same thing as production.

why do you need this? That’s not how most testing works. you mock what you need. Unless you’re using a monolith then this applies to any architecture.

Traditional applications you could test on your laptop and run tests against it in the CI/CD pipeline before deployment.

if by traditional you mean monoliths. Any sort of microservices, or even a slightly macroservices architecture.

Serverless stacks you need to rely a lot more on Blue/Green deployments and monitoring failure rates.

but why? this isn’t explained. We haven’t seen this. Maybe it’s how we use lambdas, but we use versioned lambdas and we deploy and immediately forget about it. There’s nothing to maintain about old versions, rollbacks are automatic.

Slow deployments. Pushing out a ton of new Lambdas is a time-consuming process. I’ve waited 30+ minutes for a medium-sized application. God knows how long people running massive stacks were waiting.

why are you ‘pushing out a ton of new lambdas’? The whole point is for things to be self contained. If you are needing to touch multiple things often then your lambdas should be a single thing, not multiple. This comes back to the ‘100+’ lambdas thing. That’s just bad design. Don’t blame lambdas for this.

We are able to build GraalVM Kotlin lambdas in less time than that, along with the deploy. The slowest part is literally the CDK synthesis. If we were using CF yaml then it would be half the time.

Security. Not running the server is great, but you still need to run all the dependencies. It’s possible for teams to spawn tons of functions with different versions of the same dependencies, or even choosing to use different libraries. This makes auditing your dependency security very hard, even with automation checking your repos. It is more difficult to guarantee that every compromised version of X dependency is removed from production than it would be for a smaller number of traditional servers.

This is going to completely depend on your team, your languages, and the frameworks you’re using. For us, it’s dead simple to keep up to date. Snyk helps us, it’s one click deploy for each lambda, and we can send to prod immediately due to having a very mature CI/CD pipeline. We are getting even better at this as we’ll be switching to gradle’s version catalogs which means that all of the applications can use the exact same version catalog and it will then require a single change whenever we need to update stuff, instead of hundreds.

The complexity of running a server in a modern cloud platform was massively overstated. Especially with containers, running a linux box of some variety and pushing containers to it isn’t that hard. All the cloud platform offer load balancers, letting you offload SSL termination, so really any Linux box with Podman or Docker can run listening on that port until the box has some sort of error.

so now you have to maintain your linux security, your autoscaling on linux, your deployment pipelines for linux, your nginx configs for linux, or if you’re using k8s you have to learn two stacks, k8s and AWS. If you’re adding loadbalancers then you should be deploying those with CDK anyway so now you’re both using cdk and k8s along with maintaining security on your ec2 instances.

Setting up Jenkins to be able to monitor Docker Hub for an image change and trigger a deployment is not that hard. If the servers are just doing that, setting up a new box doesn’t require the deep infrastructure skills that serverless function advocates were talking about. The “skill gap” just didn’t exist in the way that people were talking about.

O___o we literally have an entire infra team that is unable to manage Jenkins to the level that devs need due to how difficult it is to maintain a jenkins build pipeline. Not only that, but now you’re dependent on maintaining security for jenkins which is, and has always been, a nightmare. Jenkins pipelines aren’t testable locally (github actions you can use Act along with something like mock-github and act.js to even test your pipelines as part of ci/cd!). We’re currently switching the entire org to github actions due to how terrible jenkins is. And then you’re writing more pipelines to do monitoring! The author claims that serverless has more monitoring, but then goes on to say that you can ‘simply set up’ all this other stuff which is wayyyy harder to maintain in the long run.

People didn’t think critically about price. Serverless functions look cheap, but we never think about how many seconds or minute a server is busy. That isn’t how we’ve been conditioned to think about applications and it showed. Often the first bill was a shocker, meaning the savings from maintenance had to be massive and they just weren’t.

100+ lambdas once again.

Really hard to debug problems. Relying on logs and X-Ray to figure out what went wrong is just much harder than pulling the entire stack down to your laptop and triggering the same requests.

I don’t know what the author is doing, but it’s so dead simple to run lambdas locally and test locally that I really really don’t understand this. There’s only a single entrypoint. You know what the request was going in and out. It takes me way less time to debug something in a lambda than it ever did in a monolith (I’ve worked in a lot of monoliths and we still maintain a monolith on my team). If you have 100+ lambdas then maybe you should start blaming your architecture, rather than lambdas. It would be the exact same if you had 100+ microservices…a nightmare.

I am very sorry, but I honestly read through the whole article and agreed with a lot of it, and then when I went to write this up just got angrier and angrier because it’s very very clear that the author has a terrible architecture and is blaming it on lambdas. Lambdas don’t work for everything. In general don’t use them for web servers! But for a great solution to small self contained applications, or an architecture that might need one side to scale differently than others, or for step functions where you’re a state-flow diagram, the list goes on and on… then it’s a fantastic solution.