Setting Up a Python Environment on AWS for Machine Learning
Do you ever feel, as a Data Scientist, that managing a Python or conda environment (across dev and prod) is a real pain in the ass!!? If HELL YEAH, then this piece is for you. Let’s jump straight into the work.
Conda Environment
1. Let’s create a conda environment first on local, then we’ll discuss the ways to move this venv to AWS EC2 (a Linux machine).
So for Linux, I’ve created a conda venv named your_environment_name.
## Downloading (you can change your edition by visiting anaconda.com)
$ wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

# Installing
$ sh Anaconda3-2020.02-Linux-x86_64.sh

## Making a virtual environment for a specific project (recommended)
# For a specific python version
$ conda create -n your_environment_name python=3.7

# Activate your environment
$ conda activate your_environment_name
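A quick, optional sanity check at this point (just standard conda/Python commands) confirms the env exists and is active:
# List envs; the active one is marked with '*'
$ conda env list
# Confirm the Python version inside the activated env
$ python --version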
2. Install python libs (whatever you need) with pip
#Installing three libs
$ pip install transformers allennlp flask
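If you want to verify the installs, a one-liner like the following works, assuming the import names match the package names (they do for these three):
# Quick import check for the three libs installed above
$ python -c "import transformers, allennlp, flask; print('all good')"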
So we’re done with setting up the conda env on the local machine. But now we want to move the same env to an AWS EC2 machine, and this leads to two cases (mostly due to organization compliance issues):
1. EC2 with Internet: In this case, we first need to wrap our package distribution info in a file called requirements.txt, and then push it to the EC2 server (as sketched below).
# Freezing the env info
$ pip freeze > requirements.txt

# OR: use the pipreqs lib to create the requirements.txt file
$ pip install pipreqs
$ pipreqs /home/project/location
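For the actual push to EC2, something like scp works; the key path, user, and host below are placeholders, so swap in your own:
# Push requirements.txt to the EC2 machine (placeholders: key, user, host)
$ scp -i your_key.pem requirements.txt ec2-user@your-ec2-host:~/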
Once requirements.txt is on the EC2 machine, we need to run the command below to create a venv with the same distribution as local (make sure you’ve installed Anaconda on EC2):
## Run on EC2 where the requirements.txt is located
$ conda create --name your_ec2_environment_name --file requirements.txt
And TaDa 🎉… just activate your environment, DONE!
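One caveat worth flagging: conda create --file expects conda-style package specs, so a pip-frozen requirements.txt may not always resolve from conda channels. A hedged alternative (not part of the original flow) is conda's own YAML export, which keeps pip-only packages in a pip: section:
# On local: export the full env spec
$ conda env export -n your_environment_name > environment.yml
# On EC2 (with internet): recreate the env from the YAML spec
$ conda env create -f environment.yml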
2. EC2 without Internet: Because there is no internet on the machine, we need to literally wrap EVERYTHING 😭 and ship it there.
First, we’ll download the Anaconda shell installer, push it to EC2, and install it there.
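Roughly, that step looks like the sketch below; the key path, user, and host are placeholders:
# On local (with internet): download the installer and push it to EC2
$ wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
$ scp -i your_key.pem Anaconda3-2020.02-Linux-x86_64.sh ec2-user@your-ec2-host:~/
# On EC2: run the installer
$ sh Anaconda3-2020.02-Linux-x86_64.sh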
Second, wrap up the conda environment on local. For that, there are two methods: either download all the .whl files for your Python libs (with pip), or use the conda-pack CLI tool provided by Anaconda.
## Method-1: Wrapping up all .whl files on local

On local/source machine:
# Download all .whl files into a dir named 'dir_name'
1. $ mkdir dir_name && pip download -r requirements.txt -d dir_name
2. Copy requirements.txt into dir_name
3. Archive it: $ tar -zcf dir_name.tar.gz dir_name
4. Upload this archive to the target machine

On EC2/target machine:
1. Unzip: $ tar -zxf dir_name.tar.gz
2. Create a plain conda env there and activate it.
3. Install: $ pip install -r dir_name/requirements.txt --no-index --find-links dir_name
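One assumption hidden in Method-1 is that your local machine and EC2 share the same OS, architecture, and Python version; otherwise the downloaded wheels may not install on the target. If they differ, pip download can fetch wheels for a specific target instead; the tags below are illustrative:
# Download binary wheels for 64-bit Linux / CPython 3.7 instead of the local platform
$ pip download -r requirements.txt -d dir_name \
    --platform manylinux2014_x86_64 --python-version 37 --only-binary=:all: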
Now, on to Method-2.
## Method-2: Using the conda-pack tool

On local/source machine:
1. Install the tool:
$ pip install conda-pack
2. Pack environment your_environment_name into your_environment_name.tar.gz:
$ conda pack -n your_environment_name

On EC2/target machine:
1. Unpack the environment into a dir named 'your_environment_name':
$ mkdir -p your_environment_name
$ tar -xzf your_environment_name.tar.gz -C your_environment_name
2. Activate the environment:
$ source your_environment_name/bin/activate
So basically, conda-pack archives the conda environment on the source machine, and we unpack it on the target machine.
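One extra step the conda-pack docs recommend (worth double-checking for your version): after activating the unpacked env, run conda-unpack once to fix the hard-coded path prefixes inside it.
# Run once inside the activated env on EC2 to clean up path prefixes
(your_environment_name) $ conda-unpack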
Python Virtual Environment
As you know, Conda provides a complete suite of data science libs, but there is one problem: it takes up a lot of space. In prod, engineers try to keep the Python environment as minimal as possible. So if you prefer a plain Python virtual environment over conda, how do we manage all of this? Let’s see that too.
1. Let’s set up a Python virtual environment first on local, named ‘your_environment_name’.
# Installing pip
$ sudo apt-get install python-pip

# Creating a Python virtualenv
$ pip install virtualenv
$ virtualenv your_environment_name
# Or with a specific Python version
$ virtualenv -p /usr/bin/python3 your_environment_name

# Activating
$ source your_environment_name/bin/activate
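On newer setups with Python 3.3+, you can also skip the virtualenv package entirely and use the built-in venv module; a minimal sketch:
# Create and activate a venv using the standard library module
$ python3 -m venv your_environment_name
$ source your_environment_name/bin/activate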
2. Install some Python libs with pip (just like we did before)
3. Freeze the env info and create requirements.txt (with the same command as before)
4. Now, for EC2 without Internet: first download all the .whl files on local, wrap them up (along with requirements.txt) in an archive (dir_name.tar.gz), and push it to the target machine. Then run the installation steps below (the local-side commands are recapped right after them).
On EC2/target machine:
1. Unzip: $ tar -zxf dir_name.tar.gz
2. Create a plain virtualenv there and activate it.
3. Install: $ pip install -r dir_name/requirements.txt --no-index --find-links dir_name
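For reference, the local-side wrap-up mentioned in step 4 is the same as Method-1 in the conda section; roughly, with the same placeholder names:
# On local: download wheels, bundle them with requirements.txt, archive
$ mkdir dir_name && pip download -r requirements.txt -d dir_name
$ cp requirements.txt dir_name/
$ tar -zcf dir_name.tar.gz dir_name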
5. For EC2 with Internet, just push requirements.txt to the target machine and install with the command below.
## Install with pip
$ pip install -r requirements.txt
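Optionally, a couple of standard pip commands can confirm the install went through cleanly:
# Verify installed packages have compatible dependencies, and list what's there
$ pip check
$ pip list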
That’s all. This environment capsule approach will help you manage and transfer Python environments for Machine Learning and Data Science projects.
Thanks for the read. Anywho, if you have any feedback, I’m all ears.
Follow me: https://impyadav.medium.com
Connect with me: Twitter || www.impyadav.com