GitHub - wikimedia/research-api-endpoint-template at edit-types

Basic Cloud VPS API Endpoint setup

This repo provides the basics needed to get a robust and extensible API endpoint up and running. The prerequisites are as follows:

Cloud VPS instance: https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances
Cloud VPS web-proxy: https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet

With these in place, you can ssh onto your instance and use the cloudvps_setup.sh script to get a basic API set up, e.g.:

From local branch: scp model/config/cloudvps_setup.sh <your-shell-name>@<your-instance>.<your-project>.eqiad1.wikimedia.cloud:~/
ssh <your-shell-name>@<your-instance>.<your-project>.eqiad1.wikimedia.cloud
sudo chmod +x cloudvps_setup.sh
sudo ./cloudvps_setup.sh

The basic components of the API are as follows:

systemd: Linux service manager that we configure to start up nginx (listen for user requests) and uwsgi (listen for nginx requests). Controlled via systemctl utility. Configuration provided in config/model.service.
nginx: handles incoming user requests (someone visits your URL), does load balancing, and passes them via uwsgi to be handled. We keep this layer lightweight: it only passes messages and does no heavy processing, so one incoming request doesn't stall another. Configuration provided in config/model.nginx.
uwsgi: service / protocol through which requests are passed by nginx to the application. This happens via a unix socket. Configuration provided in config/uwsgi.ini.
flask: Python library that can handle uwsgi requests, do the processing, and serve back responses. Configuration provided in wsgi.py.
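
To make the chain from nginx through uwsgi to the Python application more concrete, the sketch below shows the WSGI callable interface that uwsgi invokes for each request. It deliberately uses only the Python standard library instead of Flask (a Flask app object exposes the same callable interface). The `lang` and `title` parameters and the response fields are hypothetical and are not taken from this repo's wsgi.py.

```python
import json
from urllib.parse import parse_qs


def application(environ, start_response):
    """Minimal WSGI app: read URL parameters, return a JSON response."""
    # The WSGI server fills `environ` with the details of the request.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    lang = params.get("lang", ["en"])[0]      # hypothetical parameter
    title = params.get("title", [None])[0]    # hypothetical parameter

    if title is None:
        status = "400 Bad Request"
        payload = {"error": "missing required parameter: title"}
    else:
        status = "200 OK"
        payload = {"lang": lang, "title": title, "results": []}

    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a quick local check without nginx or uwsgi, the same callable can be served with the standard library: `wsgiref.simple_server.make_server("localhost", 8000, application).serve_forever()`.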

Data collection

The default logging by nginx builds an access log located at /var/log/nginx/access.log that logs IP, timestamp, referer, request, and user_agent information. I have overridden that in this repository to remove IP and user-agent so as not to retain private data unnecessarily. This can be updated easily.
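
To illustrate which fields are kept and which are dropped, here is a small Python sketch that takes a line in nginx's default `combined` log format and retains only the timestamp, request, and referer. This is an illustration only: the repo makes the change in its nginx configuration rather than by post-processing logs, and the function name `scrub_log_line` is made up for this example.

```python
import re

# nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def scrub_log_line(line):
    """Return only the non-identifying fields of a combined-format line."""
    match = COMBINED.match(line.strip())
    if match is None:
        return None  # not in the combined format; skip rather than guess
    # Deliberately omit ip and user_agent.
    return {key: match.group(key) for key in ("time", "request", "referer")}
```

Lines that do not match the expected format return `None`, so a caller can drop them instead of risking that an IP address slips through in an unexpected position.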

Privacy and encryption

Encryption has two important components:

Cloud VPS handles all incoming traffic, enforces HTTPS, and maintains the certs to support this. This means that a user who visits the site will see an appropriately-certified, secure connection without any special configuration.
The traffic between Cloud VPS and our nginx server, however, is unencrypted and currently cannot be encrypted. This is not a large security concern because it's very difficult to snoop on this traffic, but be aware that it is not end-to-end encrypted.

Additionally, CORS is enabled so that any external site (e.g., your UI on toolforge) can make API requests. From a privacy perspective, this does not pose any concerns as no private information is served via this API.
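
As a rough illustration of what enabling CORS for any origin amounts to, the sketch below wraps a WSGI application so that every response carries an `Access-Control-Allow-Origin: *` header. This is an assumption about the mechanism, not the repo's actual code (a Flask app would more typically rely on an extension or an after-request hook). The names `allow_all_origins`, `demo_app`, and `app_with_cors` are hypothetical.

```python
def allow_all_origins(app):
    """Wrap a WSGI app so every response allows cross-origin requests."""
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Drop any existing value, then allow requests from any site.
            headers = [(k, v) for k, v in headers
                       if k.lower() != "access-control-allow-origin"]
            headers.append(("Access-Control-Allow-Origin", "*"))
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real API application."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"ok": true}']


app_with_cors = allow_all_origins(demo_app)
```

A wildcard origin is only appropriate because, as noted above, the API serves no private information; an API that did would need to restrict the allowed origins.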

Debugging

Several commands can help you see why your API isn't working:

sudo less /var/log/nginx/error.log: nginx errors
sudo systemctl status model: whether the uWSGI service started successfully and can pass nginx requests to flask (a failure here generally means a bad uwsgi.ini file)
sudo less /var/log/uwsgi/uwsgi.log: inspect the uWSGI log for startup and request handling (this is where you'll often find Python errors that crashed the service)

Adapting to a new model etc.

You will probably have to change the following components:

model/wsgi.py: this is the file with your model / Flask app, so you'll have to update it depending on your desired input URL parameters and output JSON result.
flask_config.yaml: any Flask config variables that need to be set.
model/config/cloudvps_setup.sh: you likely will have to change some of the parameters at the top of the file and how you download any larger data/model files. Likewise, model/config/release.sh and model/config/new_data.sh will need to be updated in a similar manner.
model/config/model.nginx: server name will need to be updated to your instance / proxy (set in Horizon)
model/config/uwsgi.ini: potentially update number of processes and virtualenv location
model/config/model.service: potentially update description, though this won't affect the API
requirements.txt: update to include your Python dependencies
Currently setup.py is not used, but it would need to be updated in a more complete package system.

Managing large files

A common dependency for these APIs is some sort of trained machine-learning model or database. The following scenarios assume the file originates on the stat100x machines and can be made public. If the file is a research dataset that would be valuable as a public resource, doing a formal data release and uploading to Figshare or a related site is likely the best solution.

Small (e.g., <1GB), temporary: probably easiest to just scp these files to your local laptop and then back up to the Cloud VPS instance.
Large (e.g., <20GB), temporary: use the web publication process to make available in the one-off folder and then wget the file to your Cloud VPS instance. You can then remove it from the web publication folder.
Really large: talk to analytics.

What this template is not

This repo does not include a UI for interacting with and contextualizing this API. For that, see: https://github.com/wikimedia/research-api-interface-template or the wiki-topic example.

For a much simpler combined API endpoint + UI for interacting with it, you can also set up a simple Flask app in Toolforge, though you will also have much less control over the memory / disk / CPUs available to you.

Acknowledgements

Built largely from a mixture of https://github.com/wikimedia/research-recommendation-api and https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-uwsgi-and-nginx-on-ubuntu-20-04.
