We all know that Python has risen above its humble beginnings such that it now powers billion dollar companies. Let’s not forget Python’s roots, though! It’s still an excellent language for running quick and dirty scripts that automate some task. While this works fine for automating my own tasks because I know how to navigate the command line, it’s a bit much to ask a layperson to somehow install python and dependencies, open Terminal on a Mac (god help you if they have a Windows computer), type a random string of characters, and hit enter. Ideally, you would give the layperson a button, they hit it, and they get their result.
I recently deployed a solution which allowed me to do just this. Even better - it was all free! In this post, I’ll talk about how I used Google Sheets as my input form, datasheets to convert Google Sheets to pandas, Zappa for deploying a serverless Flask app, and PuLP for solving a quick integer programming problem to make a simple and free ad hoc optimization service.
Note: all the code for this service is located on my github
FML
Every project should start as a problem, and mine was no different. My wife competes in fantasy movie league. This is like fantasy football for movie geeks. The rules are simple:
You are a fantasy movie theater owner. You must decide which movies to play on your 8 screens. Each movie costs a different amount to screen, and the goal is to generate the most box office revenue over the weekend given your available budget. Talking with her, I realized that, if one can do a good job predicting box office revenue for the weekend (the hard part of the problem), then deciding how many screens to play each movie becomes a simple integer programming allocation problem.
Requirements
Now that we have the problem, what are the requirements?
- A method for inputting a bunch of data:
- Movie name
- Expected revenue
- Cost to screen
- Ability to run the allocation problem from a browser.
- A view of the solution
What’s the easiest input form that data scientists hate?
Excel
What’s worse than Excel?
Google Sheets
Datasheets
Thankfully, Squarespace created datasheets. This is a nice library that makes interactions between pandas
and Google Sheets impressively painless. The library is worth it for the detailed OAuth page alone (I once spent 2 weeks struggling with Google OAuth pain and really wish this page had existed at that time). What’s particularly nice about the OAuth page is that it walks through setting up a service account which does not require the end-user to go through the typical OAuth dance of browser redirects to and from the Google login page. This is nce because these redirects can get messed up when moving from local development to production systems in the cloud (or at least they always get messed up when I try to do it!).
Anywho, the first step was to setup my Google Sheets credentials and download the client_secrets.json
and service_key.json
files. With these handy, I can now access my Google Sheets spreadsheet using datasheets
. The spreadsheet is named FML
, and the inputs
tab looks like
We can pull this into a pandas DataFrame by setting some datasheets
environment variables to point to our credentials and then creating a Client
import os
import datasheets
os.environ['DATASHEETS_SECRETS_PATH'] = 'client_secrets.json'
os.environ['DATASHEETS_SERVICE_PATH'] = 'service_key.json'
client = datasheets.Client(service=True)
If that goes well, we can now grab our workbook (aka the Google Sheets file) and download the tab of data
workbook = client.fetch_workbook('FML')
input_tab = workbook.fetch_tab('inputs')
input_data = input_tab.fetch_data()
input_data
movie | revenue | cost | |
---|---|---|---|
0 | Hotel Transylvania | 13600000.0 | 157.0 |
1 | Ant Man | 9100000.0 | 116.0 |
2 | Skyscraper | 5300000.0 | 61.0 |
3 | Incredibles 2 | 7900000.0 | 89.0 |
4 | Jurassic World | 6700000.0 | 76.0 |
5 | Purge | 2400000.0 | 28.0 |
6 | Sorry to Bother | 1800000.0 | 18.0 |
7 | MI: Fallout | 63600000.0 | 756.0 |
8 | Mamma Mia | 19800000.0 | 227.0 |
9 | Equalizer | 18300000.0 | 201.0 |
10 | Unfriended | 1600000.0 | 18.0 |
11 | Blindspotting | 3000000.0 | 41.0 |
12 | Teen Titans | 13400000.0 | 149.0 |
13 | Three Idential Strangers | 1100000.0 | 16.0 |
14 | Eighth Grade | 946000.0 | 26.0 |
Allocating Movies
I’ve written previously about integer programming in Python using the PuLP package, so I will avoid the introductions to integer programming and pulp
. For this post, I will just quickly summarize the optimization problem, as it’s reasonably simple!
We only have a single decision variable in our problem. In the code, I call this movie_counts
. In math, we can call it $S_{m}$ which corresponds to how many screens we will play movie $m$ on for the weekend. This is an integer decision variable with a lower bound of 0 and an upper bound of 8 (the number of screens we have available in our fantasy movie theater). It is an integer variable because we cannot screen a movie on 2.5 screens.
With our decision variable in hand, we must now define an objective function. We simply want to maximize expected revenue. Let’s define a quantity $r_{m}$ which is the amount of money that we expect movie $m$ to bring in (this is the revenue column in the above DataFrame). Our objective function is then simply
$$\sum_{m} r_{m} * S_{m}$$
Lastly, we need some constraints. We only have two, but, before I introduce them, I need to introduce one slight wrinkle in fantasy movie league. You get charged $2 million for every screen that you leave empty. We can incorporate this into our optimization problem by assuming that there is an _extra_ movie called “Empty Screen” and that the expected revenue for that movie is _negative_ $2 million. Our two constraints can now be defined:
- Every screen must be assigned a movie $$ \sum_{m} S_{m} = 8 $$
- We have a limited budget of $1000. Let’s say movie $m$ costs $c_{m}$ to screen. Our budget constraint is thus $$ \sum_{m} c_{m} * S_{m} \leq 1000 $$
And that’s it: one type of decision variable, a simple objective function, and two constraints. If you’re interested, I wrap all of the above steps into an Optimizer
class in the [fml code]((https://github.com/EthanRosenthal/fml/blob/master/fml/optimizer.py).
With the optimization problem complete, I can pack up the solution as a DataFrame and use datasheets
to write the data back to the outputs
tab of the spreadsheet
solution = ...
outputs_tab = workbook.fetch_tab('outputs')
outputs_tab.insert_data(solution)
Painless Serverless
The final step was to create a tiny Flask app with a button to launch the optimization. I made the simplest barebones site that I could, and then it was time to deploy.
Zappa is a ridiculously cool Python library that lets you run any Python application as an AWS Lambda function and make it all discoverable via API Gateway. What this means is that you can make a Python website and run it in the cloud without an actual server running the code (as long as your website runs quickly, and uses few resources). You only pay for each time the website runs, but the first million times per month are free. If my wife happens to run this more than 1 million times, then I’ll happily pay money.
I was blown away by how easy Zappa
was to use. Honestly, the hardest part was figuring out how to install python 3.6 on my linux computer because you have to use virtualenv
instead of conda
(though there’s a PR to fix that).
I’m just going to copy the documentation on how to get Zappa
working because this is literally all that I had to do:
pip install zappa
zappa init
zappa deploy
After all of your code gets zipped up and sent to the cloud, Zappa
tells you what cryptic URL at which you can now find your app. You can use custom domains and a gazillion other options, but this is quick and dirty serverless integer programming, remember?
With the website deployed, my wife can now input data into the spreadsheet, hit the Calculate
button on the website, and then watch the spreadsheet for the optimal movie screens with nary a command line in sight.