I believe the topic of launching Lambda functions on a schedule, when the system has no initiator, is very relevant for serverless solutions on AWS. In almost every project, I see the same single approach, and I really don’t like it. This has encouraged me to share a bit of my experience.
Let’s suppose we have the following serverless architecture of an application: the entry point is API GW, the computing service is lambda functions.
We are developing a mobile application for booking hotel rooms with keyless access to rooms. To open/close the door to the room, clients use a mobile application with Bluetooth.
A back-end developer has the following tasks:
Add an API for creating and managing reservations. The client can choose the date and time of the start and end of the reservation. Let’s assume that this already exists.
The system must check whether the client arrived at the hotel within 2 hours from the start of the reservation. If this didn’t happen, the reservation must be canceled.
The system must activate the client’s virtual key right at the time when the reservation starts.
This is where an unusual task for serverless solutions appears. We need to launch a function at a specific time to activate the key or cancel the reservation, but the system has no initiator: it is serverless, so we have no running servers, workers, or anything like that.
If we start looking for solutions, we mainly consider CloudWatch rules, SQS, SNS, and someone may think of Kinesis Stream.
Why these particular services? Simply because they can trigger lambda functions.
When considering these services, we discover the following:
Kinesis Stream triggers a function but does not allow delays before calling a function.
SNS doesn’t have built-in delay support.
SQS is a much more suitable service, since it lets us set a delay (the DelaySeconds parameter). But as soon as we look into the documentation, we see that DelaySeconds can be 15 minutes (900 seconds) at most. Our task needs delays of two hours or more, so SQS is as unsuitable as SNS or Kinesis.
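To make the SQS limit concrete, here is a minimal sketch that checks whether a required delay fits into DelaySeconds. The helper name fits_in_sqs_delay is hypothetical and used only for illustration; the 900-second cap is the documented maximum.

```python
from datetime import timedelta

# SQS caps DelaySeconds at 900 seconds (15 minutes).
SQS_MAX_DELAY = timedelta(seconds=900)


def fits_in_sqs_delay(delay: timedelta) -> bool:
    """Return True if a single SQS message delay can cover `delay`."""
    return timedelta(0) <= delay <= SQS_MAX_DELAY


# A 10-minute delay fits; the 2-hour arrival check from our task does not.
ten_minutes_ok = fits_in_sqs_delay(timedelta(minutes=10))
two_hours_ok = fits_in_sqs_delay(timedelta(hours=2))
```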
So what remains is CloudWatch rules, and I’ve observed that these rules are used to solve such problems on almost all projects.
How to solve this problem using CloudWatch rules
We create a rule that uses a cron schedule to run our lambda function, for example, every 5 minutes.
This solution works, but that is its only benefit.
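As a rough sketch of what such a rule looks like, the code below builds the parameters for a scheduled rule. It uses a rate expression instead of a cron expression for brevity, and the rule name is a made-up placeholder. The resulting dictionary is intended to be passed to the put_rule call of the boto3 events client, which accepts Name, ScheduleExpression and State; the call itself is left out so the sketch runs without AWS credentials.

```python
def rate_expression(minutes: int) -> str:
    """Build a schedule expression such as 'rate(5 minutes)'."""
    if minutes < 1:
        raise ValueError("the smallest supported interval is 1 minute")
    # The unit must be singular when the value is 1.
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def runs_per_day(minutes: int) -> int:
    """How many times a rule with this interval fires in 24 hours."""
    return 24 * 60 // minutes


def put_rule_params(name: str, minutes: int) -> dict:
    """Keyword arguments for a scheduled rule (rule name is hypothetical)."""
    return {
        "Name": name,
        "ScheduleExpression": rate_expression(minutes),
        "State": "ENABLED",
    }


params = put_rule_params("check-reservations", 5)
```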
And the drawbacks are the following:
A day is 24 hours long, 24 * 60 / 5 = 288, so as a result, we run our function 288 times a day. However, we do not know if we even have reservations in the system.
288 * 31 = 8,928 function starts per month. Many will say that for Lambda this is nothing. Indeed, the free tier provides 1 million requests and 400,000 GB-seconds per month.
If we do not go beyond the limit, this is a quick and free solution. But let’s suppose that a client has booked a room for 06:01 p.m. Our function launches every 5 minutes. Let’s say it started at 06:00 p.m., which means we won’t process the reservation at 06:01 p.m., and the next launch will only be at 06:05 p.m.
If the client arrives on time, they’ll have to stay around and wait. The client won’t understand why the key refuses to work and will try to do something: restart the application, turn Bluetooth off and on, etc.
This will be a tragedy for the client.
Of course, another solution immediately comes to mind: run the CloudWatch rule every minute (by the way, this is the minimum value, one can’t specify less than a minute). In this case, the client might see a delay of 59 seconds at most.
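The waiting time in both cases can be worked out with a small sketch. It assumes, for illustration only, that the rule fires on a grid aligned to midnight; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_poll(after: datetime, interval_minutes: int) -> datetime:
    """First scheduled run at or after `after`, with runs aligned to midnight."""
    midnight = after.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(minutes=interval_minutes)
    elapsed = after - midnight
    # Ceiling division: number of whole intervals needed to reach `after`.
    ticks = -(-elapsed // interval)
    return midnight + ticks * interval


def activation_delay(reservation_start: datetime, interval_minutes: int) -> timedelta:
    """How long the client waits between reservation start and the next run."""
    return next_poll(reservation_start, interval_minutes) - reservation_start


# Reservation at 06:01 p.m. with a 5-minute rule: processed at 06:05 p.m.
five_minute_wait = activation_delay(datetime(2024, 5, 1, 18, 1), 5)

# Reservation one second after a run with a 1-minute rule: 59 seconds of waiting.
one_minute_wait = activation_delay(datetime(2024, 5, 1, 18, 0, 1), 1)
```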
Then we get 24 * 60 * 31 = 44,640 starts of our lambda function per month, and again, this works. But our application is not limited to this functionality, so it’s worth considering what happens when we go beyond the limits of free use.
If you allocate 128 MB of memory for your function, launch it 44,640 times during the month, and each run takes 500 ms, the costs are calculated as follows:
Monthly compute charges:
The cost of computing is $0.00001667 per GB-s.
Total compute (in seconds) = 44,640 * 0.5 s = 22,320 s.
Total compute (in GB-s) = 22,320 * 128 MB / 1,024 MB = 2,790 GB-s.
Monthly compute charges = 2,790 * $0.00001667 = $0.0465093.
Monthly request charges = 0.04464 million * $0.20/mln = $0.008928.
Total charges for a month = Compute charges + Request charges = $0.0554373 per month.
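The same arithmetic can be written as a short sketch, so that other values for memory, duration and number of starts are easy to try. The prices are the ones quoted in this article, the free tier is deliberately ignored, and request charges are counted proportionally rather than in whole millions. The function name is hypothetical.

```python
from decimal import Decimal

# Prices as quoted in the article; check current AWS pricing before relying on them.
PRICE_PER_GB_SECOND = Decimal("0.00001667")
PRICE_PER_MILLION_REQUESTS = Decimal("0.20")


def lambda_monthly_cost(invocations: int, duration_ms: int, memory_mb: int) -> dict:
    """Monthly Lambda charges without the free tier, split into components."""
    seconds = Decimal(invocations) * Decimal(duration_ms) / 1000
    gb_seconds = seconds * Decimal(memory_mb) / 1024
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = Decimal(invocations) / Decimal(1_000_000) * PRICE_PER_MILLION_REQUESTS
    return {
        "gb_seconds": gb_seconds,
        "compute": compute,
        "requests": requests,
        "total": compute + requests,
    }


# One start per minute for a 31-day month, 500 ms per run, 128 MB of memory.
cost = lambda_monthly_cost(invocations=44_640, duration_ms=500, memory_mb=128)
```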
As a result, we have the following structure:
Let’s be honest: I don’t think anyone likes this solution. Besides, it can cause quite unpleasant problems. But given that even in the worst case, with a huge load, it would at first glance cost only a few cents a month, the solution remains acceptable.
Usually, we don’t get by with only one CloudWatch rule. There can be 10-20 of them, and problems arise as soon as a function runs for more than a minute. This happens very often, because we start to process all possible reservations in lambda functions. One function reads the list of reservations and fails to finish within a minute, so CloudWatch launches another instance in parallel. The new instance can fetch from the database the same reservations that the previous one, still running, has already fetched. This leads to collisions; on top of that, concurrency slots are consumed.
Yes, the concurrency limit is a soft limit that can be raised, but it is not infinite. AWS has clients who have this limit at the maximum, and it’s still not enough for them.
Let’s try to find a better solution
A while back, AWS Support offered two approaches:
Programmatically create a CloudWatch rule with a specified start time via Cron. But we have to give up this option, since CloudWatch has a limit of 100 rules, far fewer than one rule per reservation would require.
As an alternative, we can also use a third-party tool called Rundeck, which is a workload scheduler and can be used in the same scheme, but without limitation.
But then we go beyond AWS, which no longer bears any responsibility and will not help if something goes wrong. Therefore, I didn’t consider this approach either.
Finally, having looked at a bunch of services, I clapped my eyes on Step Functions, which in the end became a great solution for me. Step Functions allows us to work with AWS services at a higher level and offers many possibilities, but we will concentrate on what helps to solve our task.
Here is what is important to me:
I can launch a lambda function.
I can specify the exact time when this should be done.
So I got the following State Machine Definition:
Then, I can create an execution and pass it two values: delayTimestamp, the exact time when the function must run, and body, the data that will be passed to the function.
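A definition of this kind can be sketched as follows. This is an illustration under assumptions, not the original definition: the state names Delay and Invoke Lambda are taken from the description of the transitions in this article, the Lambda ARN is a placeholder, and the helper names are hypothetical. The Wait state reads the time from delayTimestamp via TimestampPath, and the Task state passes only body to the function via InputPath.

```python
import json
from datetime import datetime, timezone


def build_state_machine(lambda_arn: str) -> dict:
    """Amazon States Language definition: wait until a timestamp, then call Lambda."""
    return {
        "Comment": "Run a Lambda function at an exact time",
        "StartAt": "Delay",
        "States": {
            "Delay": {
                "Type": "Wait",
                # Wait until the time given in the execution input.
                "TimestampPath": "$.delayTimestamp",
                "Next": "Invoke Lambda",
            },
            "Invoke Lambda": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Only the `body` part of the input reaches the function.
                "InputPath": "$.body",
                "End": True,
            },
        },
    }


def build_execution_input(run_at: datetime, body: dict) -> str:
    """JSON input for one execution; `run_at` must be timezone-aware."""
    if run_at.tzinfo is None:
        raise ValueError("run_at must carry a timezone")
    timestamp = run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({"delayTimestamp": timestamp, "body": body})


# Placeholder ARN and reservation data, for illustration only.
definition = build_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:activate-key"
)
execution_input = build_execution_input(
    datetime(2024, 5, 1, 18, 1, tzinfo=timezone.utc),
    {"reservationId": "r-1", "action": "activateKey"},
)
```

The definition, serialized with json.dumps, is what one would supply when creating the state machine, and execution_input is what one would pass as the input of each new execution, for example through the start_execution call of the boto3 stepfunctions client.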
What are the limitations? An execution can last no longer than one year. For our task, this means there are effectively no limitations.
What about the cost? Let’s calculate it. The work process of a two-step application contains three transitions between states, indicated by arrows in the scheme:
From the Start state (the beginning) to the Delay state (Waiting for the exact time).
From the Delay state to the Invoke Lambda state.
From the Invoke Lambda state to the End state (Ending).
The price for a transition between states in the US East (N. Virginia) region is $0.000025, and the free tier covers 4,000 state transitions per month. If this process is executed 100,000 times during the month without errors, the cost is calculated as follows:
The number of state transitions in workflow * the number of executions of workflow = the total number of state transitions.
3 * 100,000 = 300,000
The total number of state transitions – Free Tier state transitions = the number of billable state transitions.
300,000 – 4,000 = 296,000
Monthly charges = 296,000 * $0.000025 = $7.40
Here we also need to add 100,000 lambda function calls.
128 MB of memory, 100,000 calls per month, 500 ms each = $0.1041875 for compute + $0.02 for requests = $0.1241875
Total Monthly Expenses: $7.5241875
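The Step Functions part of this calculation can be sketched as a small function. It uses the transition price and free tier quoted above and covers state transitions only, without the Lambda charges; the function name is hypothetical.

```python
from decimal import Decimal

# Values as quoted in the article for US East (N. Virginia).
PRICE_PER_TRANSITION = Decimal("0.000025")
FREE_TRANSITIONS_PER_MONTH = 4_000


def step_functions_monthly_cost(executions: int, transitions_per_execution: int = 3) -> Decimal:
    """Monthly charge for state transitions after the free tier."""
    total = executions * transitions_per_execution
    billable = max(total - FREE_TRANSITIONS_PER_MONTH, 0)
    return billable * PRICE_PER_TRANSITION


# 100,000 reservations per month, three transitions each.
monthly = step_functions_monthly_cost(100_000)
```

With few reservations the charge drops to zero, because everything fits into the free tier: 1,000 executions produce only 3,000 transitions.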
Someone may say, “Are you kidding? They’re asking for a whole $7?!” But there is more to it. With Step Functions, we create a task only after a booking is made. This means that the $7 corresponds to 100,000 reservations per month, while in the case of CloudWatch rules, we might have no reservations at all and still be running a huge number of functions. Even if we don’t pay for Lambda, we will pay a lot for CloudWatch Logs (which are far from cheap).
As a result, we get:
a far cleaner, more beautiful, and intuitive architecture,
a clear cost plan,
easy extensibility, as usual.
Adding SQS with the trigger of the Lambda function, so as not to lose some important reservation events in case of errors, we get the following scheme:
State Machine:
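A state machine for this scheme could look roughly like the sketch below. It is an assumption-based illustration, not the original definition: the state name Send To SQS, the helper name and the queue URL are placeholders. The second state uses the Step Functions direct integration with SQS (the sqs:sendMessage resource) to put body into the queue, and the queue then triggers the Lambda function.

```python
import json


def build_sqs_state_machine(queue_url: str) -> dict:
    """Wait until a timestamp, then send the payload to SQS without a Lambda in between."""
    return {
        "Comment": "Deliver a reservation event to SQS at an exact time",
        "StartAt": "Delay",
        "States": {
            "Delay": {
                "Type": "Wait",
                "TimestampPath": "$.delayTimestamp",
                "Next": "Send To SQS",
            },
            "Send To SQS": {
                "Type": "Task",
                # Direct service integration: Step Functions calls SQS itself.
                "Resource": "arn:aws:states:::sqs:sendMessage",
                "Parameters": {
                    "QueueUrl": queue_url,
                    # The ".$" suffix tells Step Functions to read the value from the input.
                    "MessageBody.$": "$.body",
                },
                "End": True,
            },
        },
    }


# Placeholder queue URL, for illustration only.
sqs_definition = build_sqs_state_machine(
    "https://sqs.us-east-1.amazonaws.com/123456789012/reservation-events"
)
sqs_definition_json = json.dumps(sqs_definition, indent=2)
```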
The Step Functions Developer Guide describes a Task Timer project that uses a lambda function. It is a great example of how to manage AWS Lambda with Step Functions, although it is less relevant now that Step Functions supports direct integration with SNS, just as it does with SQS in the example above.
Ivan Sarokin, JavaScript Software Engineer at Andersen