Configuring Python Workspace

I like Python. For the last several years, I have used it extensively in my research. There are a lot of useful libraries, and it is equally powerful for writing simple scripts, building large systems, and doing data analysis and machine learning. It is very concise and lets you use different programming paradigms. It is quite easy to start developing in Python: modern operating systems either come with a Python interpreter or provide an easy way to install one. However, when you start developing more professionally with this language, you discover that its ecosystem is quite complicated. In this article, I try to shed some light on how to configure a Python workspace.

Introduction

According to the TIOBE Index, as of January 2020 Python is in third place among the most popular programming languages, with a year-over-year gain of 1.4%. The popularity of the language explains why modern operating systems either come with a Python interpreter or provide an easy way to install one.

There are many different versions of the interpreter available. For instance, as of January 2020, four versions of Python 3 are supported (3.5, 3.6, 3.7, 3.8). Python 2, although no longer supported, is still popular and widely used (for example, it is one of the system interpreters installed in Ubuntu 18.04). The feature set differs even between Python 3 versions, let alone Python 2. Therefore, if you develop a package, you have to test it with all these interpreters because you do not know which one a user will run. Even more issues come with package management. There can be so many different versions of a library that you “could easily end up installing a version of a package that conflicts with the needs of another package” (source).

To deal with these issues, the Python community has developed a lot of tools: pyenv, virtualenv, venv, virtualenvwrapper, pipx, pipenv, poetry, hatch, etc. It is very easy to get lost among all these tools. In order to provide some breadcrumbs, I describe my approach to managing Python environments and how I arrived at it.

Background

Being self-taught in Python, I started out the way most non-professionals do: I used the default interpreter supplied with my Ubuntu operating system. I did not use any virtual environment management system: all packages were installed globally, and I often resorted to sudo pip install when there were file permission issues. At that time, although Python 3 had already been released, Python 2 was still the standard. There was no reason for me to use Python 3, so for a long time I simply ignored it. However, several years later the situation changed: while a large number of libraries still relied on Python 2, several packages appeared that required Python 3. That was the first time I realized I needed a tool that would let me have several versions of the Python interpreter installed and switch between them easily.

As usual with popular things, where there is no standard there is no single opinion on how to approach the issue. Some developers recommended having both Python 2 and Python 3 interpreters installed in the system. In this case, to execute a script you had to explicitly pick the interpreter, either python2 or python3. This approach requires you to remember which script should be called with which interpreter version. Other people preferred the update-alternatives tool (I use Ubuntu, so I talk about the tools for this operating system). With this approach you can run every script using the same python command, but you still have to remember which Python version to switch to before running it.
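
For illustration, the update-alternatives approach looks roughly like this (the interpreter paths and priorities here are examples; adjust them to your system):

$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 1
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.6 2
$ sudo update-alternatives --config python   # interactively switch the default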

At that time, I read a lot of articles on this topic and chose pyenv for managing Python versions. However, with respect to dependency management my workflow did not change much. I had several versions of Python installed (usually one version of Python 2 and one of Python 3), and I installed all dependencies globally (though now using pip only). Such an approach worked fine for me for several years. But recently, when I tried to replay a Jupyter notebook from a year-old research project, I ran into several errors caused by libraries that had become backward incompatible.

This issue forced me to reconsider my approach. At first, I read a number of articles on the topic and experimented with several approaches. However, none of them fit all my needs, so after some experiments I arrived at my own way of configuring a Python workspace. Still, some ideas are borrowed from the following articles (an interested reader may consult them to understand the foundation of my approach):

  1. The definitive guide to setup my Python workspace
  2. Anaconda vs {pyenv + pipenv}
  3. Pipenv and Poetry: Benchmarks & Ergonomics
  4. Pipenv and Poetry: Benchmarks & Ergonomics II

Requirements

It is good practice to start every activity with a list of requirements. A good list of requirements for managing a Python workspace is provided by Henrique Bastos in his article. However, I have some additional ones on top of them:

  1. I do not want to remember which version of the Python interpreter should be used to run a particular tool, i.e., I do not want to use commands like python2 or python3.8. I want to configure a project once and then always use the python command.
  2. I have some platform constraints. I use Kubuntu 18.04 as my desktop operating system. Usually, I use Visual Studio Code for Python coding, but for data analysis I use Jupyter Notebook (or Jupyter Lab) running in a web browser.
  3. My Python projects are located in different directories (I am a researcher, so the first-order directories are research projects). Therefore, I need to be able to create source code directories in any place.
  4. I should be able to lock the dependencies required to run a Jupyter notebook. It should be easy to replicate the environments for data analysis.
  5. I use some development dependencies. Therefore, it should be possible to specify development dependencies separately from project dependencies.

Python workspace management can be split into two phases:

  • Python interpreter versions management;
  • Package dependencies management.

Let me explain what I mean by each of them. Some tools require a specific version of the Python interpreter to run. For instance, for a long time scapy, which is used to manipulate network packets, was available only for Python 2 (of course, there were forks, e.g., scapy3k). Similarly, the rename tool is still available only for Python 2. For a long time, some scripts used to build the Android OS from source could be run only with the Python 2 interpreter. Thus, to use these tools I must have Python 2 installed on my laptop. Therefore, you first need a way to manage your Python interpreter versions.

Once you have found a way to manage interpreter versions, you also need a way to manage package dependencies. If you always install dependencies globally, it may lead to issues. For instance, one package may update a library that a second package depends on. If the old and new versions are incompatible (e.g., the new version of the library removes some deprecated functionality that the second package uses), this will make the second package unusable. The default package manager, pip, does not resolve dependencies well (if you are interested in this topic, please see this issue). In order to reduce the probability of such events, it is better to isolate each package, installing into a separate environment the correct versions of the libraries the project depends on.

These two phases are closely related, and often the same tools are used to handle both. Because of that, it is quite hard to grasp the problem and pick the right tool to facilitate each phase. To make things even harder, the tools usually used for this are themselves written in Python, which in turn requires some Python workspace.

Phase 1: Managing Python Interpreter Versions

Since I was already familiar with pyenv, I decided to continue using it to manage my Python interpreters. Actually, besides Linux’s update-alternatives, I am not aware of other tools that allow you to do this, so the choice was not that hard. Note that officially pyenv is available only for Linux and macOS. If you are a Windows user, please consider the pyenv-win project (it is recommended by the developers of pyenv as well); however, I have not tested my approach with it.

Installing pyenv

The pyenv tool is a set of shell scripts. It does not require an installed Python interpreter to work. Below, I provide instructions on how to install pyenv on Ubuntu-based OSes (e.g., I use Kubuntu). If you use another operating system, please see the website of the tool.

Before installing the tool, please fulfil the prerequisites for your operating system. I remember encountering issues several times because I had not installed the required dependencies. So, if you are a user of an Ubuntu-based distribution, before installing pyenv you have to install the following libraries and tools:

$ sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python-openssl git

After that, we can install pyenv. You can do this manually (the instructions can be found here) or using pyenv-installer. Personally, I prefer the second approach. Besides pyenv, pyenv-installer will also install a set of pyenv plugins, so I recommend you use it as well.

In order to install pyenv with pyenv-installer, execute the following command:

$ curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash

During the installation, pyenv-installer will prompt you to add the following lines (they could be different in your case) to .bashrc (where exactly to add these lines depends on your distribution; please check the instructions in the official repository):

export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)

These lines add the directory with pyenv binaries to the PATH and initialize pyenv itself and the pyenv-virtualenv plugin. Now you just need to restart your shell so that these lines are executed, and you can start using pyenv:

$ exec $SHELL

How pyenv Works

The pyenv tool operates in the following way. It prepends to the PATH variable the directory ~/.pyenv/shims, which stores shims for every Python command (e.g., pip or python) across every installed interpreter version. When you execute one of these Python commands, the operating system runs the corresponding shim from this directory, which proxies the call to the actual executable of the currently chosen Python interpreter (you can find more details here).
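
You can see the shim machinery for yourself with the following commands (the output is from my machine and is illustrative; on mine, the shims directory sits at the front of PATH):

$ which python
/home/yury/.pyenv/shims/python
$ echo "$PATH" | tr ':' '\n' | head -n 1
/home/yury/.pyenv/shims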

To run a script with a particular Python interpreter, we just need to tell pyenv what Python version to use. There are several ways to do this. The pyenv tool considers the following sources of information in order to find out what Python version to choose, in the following order (taken from the documentation):

  1. The PYENV_VERSION environment variable (if specified). You can use the pyenv shell command to set this environment variable in your current shell session.

  2. The application-specific .python-version file in the current directory (if present). You can modify the current directory’s .python-version file with the pyenv local command.

  3. The first .python-version file found (if any) by searching each parent directory, until reaching the root of your filesystem.

  4. The global $(pyenv root)/version file. You can modify this file using the pyenv global command. If the global version file is not present, pyenv assumes you want to use the “system” Python. (In other words, whatever version would run if pyenv weren’t in your PATH.)

Let’s consider each of these sources in a bit more depth. Right after you have installed pyenv, no Python version is chosen, and we see only the “system” Python interpreter installed by default:

$ pyenv versions
* system (set by /home/yury/.pyenv/version)

As you can see, there is only one “system” version of Python installed and currently active (marked with the * sign). You can verify this by running the pyenv which python command, which shows the real path to the python executable (note that the default Linux way to look up the path to an executable, which python, shows the path only to the shim).

$ pyenv which python
/usr/bin/python

Actually, there are two interpreters installed by default on Ubuntu 18.04-based systems. You can see this by executing the following commands:

$ python --version
Python 2.7.15+
$ python3 --version
Python 3.6.8

This listing shows that there are two Python interpreters installed (2.7.15 and 3.6.8). The default python command calls the Python 2 interpreter, but you can call the Python 3 interpreter by adding 3 at the end of the command, e.g., python3 <file>. Python 3 tools are also available with this suffix, e.g., pip3 or pydoc3. However, this way of calling them is not very convenient, especially now that Python 2 is not supported anymore.

Let’s install two Python interpreters using pyenv. To install a new version of a Python interpreter, use the pyenv install <interpreter_name> command. You can list all available interpreter names with the pyenv install --list command.

$ pyenv install --list
Available versions:
  2.1.3
  2.2.3
  ...
$ pyenv install 3.8.1
$ pyenv install 2.7.17

Now, if you run the pyenv versions command you should see the following output:

$ pyenv versions
* system (set by /home/yury/.pyenv/version)
  2.7.17
  3.8.1

As you can see, we now have three different versions of Python installed: “system”, 2.7.17 and 3.8.1. However, you can still use only the “system” one, because the newly installed versions are not activated. You can see the chosen versions of the Python interpreter (and their order) using the pyenv version command:

$ pyenv version
system (set by /home/yury/.pyenv/version)

Let’s activate our newly installed interpreters globally, giving the Python 3 interpreter precedence over Python 2:

$ pyenv global 3.8.1 2.7.17
$ pyenv versions
  system
* 2.7.17 (set by /home/yury/.pyenv/version)
* 3.8.1 (set by /home/yury/.pyenv/version)

$ pyenv version
3.8.1 (set by /home/yury/.pyenv/version)
2.7.17 (set by /home/yury/.pyenv/version)

As you can see, we now have two interpreters chosen. With pyenv it is possible to have several Python interpreters active simultaneously, though you can also activate just one.

Let’s check where these interpreters are installed. Remember, we have to use the pyenv which <python_interpreter> command instead of which <python_interpreter> to discover the real path to the interpreter executable:

$ pyenv which python
/home/yury/.pyenv/versions/3.8.1/bin/python
$ pyenv which python3
/home/yury/.pyenv/versions/3.8.1/bin/python3
$ pyenv which python2
/home/yury/.pyenv/versions/2.7.17/bin/python2

The Python 3 interpreter takes precedence over Python 2; therefore, you can simply use the python, pip and pydoc commands to run the tools of the Python 3 workspace.

However, for some of your projects you may still want to use the Python 2 workspace, for instance, if you have an old tool of yours that has not been ported to Python 3. In this case, when you enter the directory of such a project you would like the Python 2 tools to be used. We can achieve this behavior using the pyenv local command (I add the current path to the bash prompt to show this):

~/projects$ mkdir python2_proj
~/projects$ cd python2_proj
~/projects/python2_proj$ pyenv local 2.7.17
~/projects/python2_proj$ pyenv which python
/home/yury/.pyenv/versions/2.7.17/bin/python

Let’s run the pyenv version command in this directory:

~/projects/python2_proj$ pyenv version
2.7.17 (set by /home/yury/projects/python2_proj/.python-version)

As you can see, the output of the command is different: now the version is set by the .python-version file. Let’s check our directory:

~/projects/python2_proj$ ls -al .
total 12
drwxrwxr-x  2 yury yury 4096 Feb  7 17:40 .
drwxrwxr-x 10 yury yury 4096 Feb  7 17:40 ..
-rw-rw-r--  1 yury yury    7 Feb  7 17:40 .python-version
~/projects/python2_proj$ cat .python-version
2.7.17

It should be clear that the pyenv local 2.7.17 command creates the .python-version file in the current directory with the content 2.7.17. Now, every time you change into this directory, pyenv will activate this Python version automatically. Do you remember the line eval "$(pyenv virtualenv-init -)" that we added to our .bashrc file? It is what allows the pyenv-virtualenv plugin to automatically activate/deactivate the environment when you enter/leave a directory with a .python-version file.

Now imagine that you want to check whether your Python 2 tool can run on Python 3. In order to do this, you would like to temporarily activate a Python 3 workspace and run your tool; if there are any issues, you would like the workspace to reset to Python 2 automatically after your experiments. You can do this using the pyenv shell command:

~/projects/python2_proj$ pyenv shell 3.8.1 
~/projects/python2_proj$ pyenv versions
  system
  2.7.17
* 3.8.1 (set by PYENV_VERSION environment variable)

Once you execute the pyenv shell 3.8.1 command, pyenv activates the Python 3 workspace for the current shell session. However, once you close the shell session, pyenv will forget about this temporary activation.

Now you should understand why and in what situations you may need different versions of the Python interpreter, and how to switch between them using pyenv.

Phase 2: Package Dependencies Management

Usually, a Python package depends on functionality provided by third-party libraries maintained by different developers. All these libraries evolve with every new version: new functionality is added and deprecated code is removed. Now imagine that you want to install two packages that depend on different versions of the same library: package 1 depends on version “1.0” of the library, while package 2 depends on version “2.0”. If you install both of these packages, which version of the library ends up in your system? Currently, as far as I know, the Python package manager (pip) will keep the library version that the most recently installed package depends on. For instance, if you install package 1 first and then package 2, library version “2.0” will end up in your system. If package 1 depends on functionality removed in the “2.0” version, most probably it will not work.

This issue is not exclusive to the Python world; you can face similar issues in other areas as well. For instance, “DLL hell” is a problem of the same nature. The best solution to this issue is to create an isolated environment for every installed package, so that all the dependencies are contained within this environment and do not influence other packages. Nowadays, this containerisation idea is very popular: the Docker tool is built on it, as are the Linux snap and flatpak tools.

Over the last several years, the Python community has built a number of tools for virtual environments and package management: virtualenv, venv, virtualenvwrapper, pyenv-virtualenv, pipenv, poetry, hatch, pipx, etc. With such a variety of tools, it is difficult to understand which to use and when. I have not found this information in one place, so here I briefly fill this gap.

virtualenv is a Python package that is used to create isolated Python virtual environments. Its popularity led to the addition of the venv module to the standard library in Python 3. The functionality of these tools is similar; the main difference is that with virtualenv you can choose which version of the Python interpreter to use when creating a virtual environment.
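
For example, the difference looks roughly like this (a sketch; the directory name .venv is just my convention):

$ python3 -m venv .venv                 # venv uses the interpreter it is run with
$ virtualenv --python=python2.7 .venv   # virtualenv can target another interpreter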

The virtualenvwrapper tool appeared to facilitate the process of managing virtual environments. By default, virtualenv creates a special directory inside a project directory to store all its auxiliary data. If you have several projects, all of them will have such auxiliary directories inside them. If you forget to delete these directories after a project is finished, they waste hard drive space. To prevent this, virtualenvwrapper stores all virtual environment directories in one place. Thus, you can manage them easily, e.g., delete the unused ones, and your project directory is not contaminated with auxiliary data. In addition, virtualenvwrapper provides commands to speed up switching between different projects and activating their environments. Unsurprisingly, many developers like it. However, underneath, virtualenvwrapper still relies on the functionality provided by virtualenv.
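
A typical virtualenvwrapper session looks like this (environments live under $WORKON_HOME, ~/.virtualenvs by default; my_project is a hypothetical name):

$ mkvirtualenv my_project   # create a new environment
$ workon my_project         # activate it from any directory
$ deactivate
$ rmvirtualenv my_project   # delete it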

The pyenv-virtualenv tool is a pyenv plugin that also facilitates managing Python virtual environments. pyenv-virtualenv likewise stores all auxiliary virtual environment directories in one place. Moreover, it can automatically activate a virtual environment when you enter a project directory. In the previous section, we used the functionality provided by pyenv-virtualenv to activate a particular version of the Python interpreter. In fact, for pyenv-virtualenv, virtual environments and Python interpreter versions are very similar things; it even stores their auxiliary data in the same directory.
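
For instance (a sketch; my_env is a hypothetical name), you can create an environment based on a particular interpreter and bind it to a directory:

$ pyenv virtualenv 3.8.1 my_env   # stored under $(pyenv root)/versions
$ pyenv local my_env              # auto-activated when you enter this directory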

The pipenv tool, besides helping to manage virtual environments, also facilitates package dependency management. It uses special Pipfile and Pipfile.lock files. The former describes the dependencies and their version constraints; for instance, in this file you can specify that your project depends on the requests library of version “1.4” or higher. Then, during development, you can lock the exact versions. The locking operation creates a Pipfile.lock file that contains the exact version of the requests library used during development, e.g., “2.18.4”. Having this information, it is possible to reproduce the development environment exactly, so that you will not have dependency issues in your production environment. Moreover, with pipenv it is possible to mark some dependencies as development ones; such dependencies will not be included in the final build during packaging.
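
In commands, the pipenv workflow sketched above looks roughly like this:

$ pipenv install "requests>=1.4"   # recorded under [packages] in Pipfile
$ pipenv install --dev pylint      # recorded under [dev-packages]
$ pipenv lock                      # pins exact versions in Pipfile.lock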

The poetry and hatch tools go even further: besides virtual environment and dependency management, they can be used to create boilerplate for a new project, build the project, etc.

Lastly, pipx stands apart from the other tools because it is used only to install Python command-line (CLI) tools into separate virtual environments. Thus, the dependencies of these tools do not interfere with each other, and you can run each tool without any fear.
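
For example (youtube-dl here is just an illustration of a CLI tool):

$ pipx install youtube-dl   # gets its own isolated virtual environment
$ pipx list                 # shows tools installed via pipx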

My Toolset for Package Dependencies Management

It is quite hard to choose the right tool from this list. Recently, I experimented with some of them, but in the end I chose virtualenv even though there are tools with richer functionality. Below, I describe why I made this choice.

Obviously, pipx, poetry and hatch cannot be used given my requirements. Indeed, pipx is meant for installing CLI tools into separate virtual environments. The poetry and hatch tools cannot be used to specify dependencies when you work with Jupyter notebooks. Still, I would consider them if I were developing a Python package for release, because they can considerably ease this process.

Out of these tools, pipenv seems to cover all my needs. Indeed, it can be used to separate package and development dependencies, and to lock dependency versions. It does not impose any constraints on where you use it, so it also works with Jupyter notebooks. Besides that, pipenv is a recommended tool for managing library dependencies when developing Python applications. Unfortunately, during my experiments with this tool I found out that locking is very slow and sometimes even hangs in my setup. Not surprisingly, I decided to skip this tool, although at first glance it fits all my needs. If you still want to try it, you can read a good article by Chris Liatas on how to configure pyenv and pipenv together (I took some ideas from this article as well).

The virtualenvwrapper and pyenv-virtualenv tools create their auxiliary virtual environment directories in one central place. Unfortunately, VSCode, which I use for Python development, cannot automatically select the right virtual environment if they are all stored in one place. However, it can automatically activate an environment stored in the .venv directory inside the project root. Therefore, I have chosen virtualenv.

My Configuration

In the previous sections, I described which tools I use for my Python development, why I have chosen them, and how they work. Now, I want to describe how I configure them. The articles of Henrique Bastos and Chris Liatas laid the basis of this part, so I recommend referring to them if something is not clear.

Once pyenv is installed, you can use it to install different versions of Python. This tool is well maintained and updated whenever a new version of a Python interpreter appears. Therefore, if you have just installed pyenv, all versions of the Python interpreters should be available for you to install. If you installed pyenv a while ago, you may need to update it using the following command:

$ pyenv update

If you haven’t installed the required Python interpreters yet, this is the right time to do so. You can check the list of all possible targets using the pyenv install --list command. If you execute this command, you will find out that besides CPython it is possible to use pyenv to install conda, jython and other types of interpreters. For instance, this could be useful if you do data analysis and some packages are available only in the conda environment.

Since currently I use only CPython, I install two versions of this interpreter (the most recent Python 2 and the most recent Python 3) using the following commands:

$ pyenv install 3.8.1
$ pyenv install 2.7.17

Now, we can create several global environments using pyenv-virtualenv.

$ pyenv virtualenv 3.8.1 jupyter
$ pyenv virtualenv 2.7.17 ipython2
$ pyenv virtualenv 3.8.1 tools3
$ pyenv virtualenv 2.7.17 tools2

As is clear from the names, the jupyter environment is used to run Jupyter Notebook and related tools (e.g., jupyterlab or ipython) for both Python 2 and Python 3. When creating a notebook, you will be able to select which version of the interpreter to use. In the tools3 environment, I install all the Python 3 CLI tools I use. Usually, these tools use the most recent versions of their dependencies, so it is unlikely that there will be any conflicts. However, if you are afraid of such issues, you can still use pipx to install every tool in a separate virtual environment. The tools2 environment is used to install old tools that have not been ported to Python 3, e.g., rename. ipython2 is used to run the IPython console for Python 2 (if you do not need it, you may skip creating this environment).

Now, let’s install the necessary tools in each of these environments. First, the tools for our jupyter environment:

$ pyenv activate jupyter
$ pip install --upgrade pip
$ pip install jupyter
$ pip install jupyterlab
$ python -m ipykernel install --user
$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension --sys-prefix
$ pyenv deactivate 

Similarly, let’s install the tools required to run the IPython console for Python 2:

$ pyenv activate ipython2
$ pip install --upgrade pip
$ pip install ipykernel
$ python -m ipykernel install --user
$ pyenv deactivate

Now, let’s proceed with our global environments for the Python 3 tools (you may have other tools):

$ pyenv activate tools3
$ pip install --upgrade pip
$ pip install ansible
$ pip install youtube-dl
$ pip install scrapy
$ pyenv deactivate

And tools for Python 2:

$ pyenv activate tools2
$ pip install --upgrade pip
$ pip install rename
$ pyenv deactivate

Finally, it’s time to activate all our environments in the right order so that we can use the tools from all of them simultaneously. Without this step, the interpreters and tools installed in the environments will not be available.

$ pyenv global 3.8.1 2.7.17 jupyter ipython2 tools3 tools2

Now, we can check that everything works as expected:

$ pyenv which python
/home/yury/.pyenv/versions/3.8.1/bin/python
$ pyenv which python3
/home/yury/.pyenv/versions/3.8.1/bin/python3
$ pyenv which python2
/home/yury/.pyenv/versions/2.7.17/bin/python2
$ pyenv which jupyter
/home/yury/.pyenv/versions/jupyter/bin/jupyter
$ pyenv which ipython2
/home/yury/.pyenv/versions/ipython2/bin/ipython2
$ pyenv which ansible
/home/yury/.pyenv/versions/tools3/bin/ansible
$ pyenv which rename
/home/yury/.pyenv/versions/tools2/bin/rename

Since we are going to use virtualenv to manage virtual environments, let’s install this package into our 3.8.1 environment:

$ pip install virtualenv
$ pyenv which virtualenv
/home/yury/.pyenv/versions/3.8.1/bin/virtualenv

Since we want our Jupyter notebooks to have access to the packages installed in a project’s virtual environment, let’s use the hack script developed by Henrique Bastos:

$ ipython profile create
$ curl -L http://hbn.link/hb-ipython-startup-script > ~/.ipython/profile_default/startup/00-venv-sitepackages.py

After all these configuration steps, you should have a Python development workspace ready.

Development Workflows

Now that all our tools are configured, let’s consider how to use them. Currently, I use Python mostly to write proof-of-concept scripts and to do data analysis. I do not develop Python packages, so I do not need to deal with module packaging and installation; these steps are not considered in this tutorial.

Python Development Workflow

Usually, I use Python to develop proof-of-concept scripts. I want to be able to run these scripts after some time (in research, you can work on several projects in parallel, and there can be quite long periods when you do not touch a project). Moreover, my colleagues should be able to run these scripts later as well. With these requirements set, let’s consider how I reach this goal with my setup.

Let’s imagine that I start a new project and plan to develop a number of Python scripts for it. Usually, I create a new directory for the project, and inside it a subdirectory where I store all source code related to the project. Since it is common to develop tools in different languages, I create a directory specifically for the Python scripts:

$ mkdir -p new_project/sources/python
$ cd new_project/sources/python

Usually, the Python scripts stored in this directory are related and use the same dependencies, so we can create a single virtual environment for all of them. If your scripts are not related, you can still create a separate virtual environment for each one (however, you need to put them into different subdirectories). For the sake of example, I will create only one virtual environment for the python directory.

$ virtualenv .venv

This command creates a new virtual environment in the .venv directory, copying the auxiliary data from the default Python interpreter (according to our configuration, Python 3.8.1). I recommend this destination because VSCode looks for virtual environment data there by default. Moreover, if you always use the same directory name, it is easy to exclude it from being committed to your version control system. If you do not need to package your code, you may opt out of installing the setuptools and wheel packages (add the --no-wheel --no-setuptools parameters to the command).
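
For instance, with git you can exclude the environment directory once per project:

$ echo ".venv/" >> .gitignore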

Now, you have a directory ready for development. At this point, I just run VSCode from this directory:

$ code .

Now, if you create a Python file (with the .py extension) in this directory or a nested one, VSCode will automatically activate the created virtual environment.

If you need to work with the environment outside VSCode, you can activate it manually by running this command:

$ source .venv/bin/activate

After executing it, the virtual environment will be active in your shell session. Once you are done, you can deactivate it by executing the deactivate command.

Sometimes you may need a version of the Python interpreter that is not installed in your system. For instance, currently we use Python 3.8.1 to create virtual environments. However, it may happen that you need to work on a project that uses an older Python version, e.g., Python 3.7.6. In this case, you can do the following: use pyenv to install the required version of the interpreter, activate both Python versions (3.7.6 and 3.8.1) in a shell session, and then specify the required version during the creation of the virtual environment. The following commands exemplify this process:

$ pyenv install 3.7.6
$ pyenv shell 3.7.6 3.8.1
$ virtualenv .venv --python=python3.7
$ source .venv/bin/activate
(.venv) $ python --version
Python 3.7.6
(.venv) $ deactivate

Alternatively, you can provide virtualenv with the full path to the interpreter:

$ pyenv shell 3.7.6 
$ pyenv which python
/home/yury/.pyenv/versions/3.7.6/bin/python
$ pyenv shell --unset # this command reverts pyenv shell command

$ pyenv which python
/home/yury/.pyenv/versions/3.8.1/bin/python
$ virtualenv .venv --python /home/yury/.pyenv/versions/3.7.6/bin/python
$ source .venv/bin/activate
(.venv) $ python --version
Python 3.7.6

Now, every time you activate your environment your interpreter version will be 3.7.6, and you should not experience any issues.

Dependency Management

Often, Python scripts depend on external packages, which you have to install before using them. The most common way of installing Python dependencies is the Python package manager, pip. The pip install command downloads a package from the Python Package Index (PyPI) and installs it locally. However, this command does not record anywhere which package has been installed. Therefore, when you share your code, you need to specify somehow which modules your package depends on. Moreover, besides the names you should also specify the versions of the dependencies, because different versions may be incompatible.

The most common way of achieving this goal is by using the pip freeze command. This command lists all the packages with their versions installed in the current Python environment. Usually, developers create a virtual environment, install dependencies there, and dump information about currently installed packages into the requirements.txt text file using the command: pip freeze > requirements.txt. Thus, other developers may install exactly the same set of dependencies and their versions using the pip install -r requirements.txt command.

Unfortunately, this approach has a number of drawbacks. First, it dumps into the requirements.txt file not only the list of main dependencies but also all their subdependencies. For instance, imagine that you are working on a machine learning script that depends only on the scikit-learn library. You install this library into your virtual environment. However, scikit-learn itself depends on the joblib, numpy and scipy packages. Thus, when you freeze your environment, your requirements.txt file will list all these packages, although your script explicitly depends only on scikit-learn. This makes it quite hard to pick from the file the dependencies you want to update (see more on this problem in the article “A Better Pip Workflow” by Kenneth Reitz).

Second, during development you usually use a number of development tools. For instance, the Python extension for VSCode by default requires the pylint linter to be installed in order to check your Python scripts. Thus, if you develop using this editor (like me), you have to install this dependency into your virtual environment. However, it is not required to run the code in production; it only facilitates the development process. Some of these development dependencies (e.g., pylint or flake8) are tied tightly to your Python interpreter version and thus cannot be installed globally. Moreover, different projects may need different lists of development dependencies. Therefore, it is convenient to list them separately and not include them in requirements.txt.
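
To make the first drawback concrete, here is roughly what freezing looks like in the scikit-learn example (the exact versions will differ in your environment):

(.venv) $ pip install scikit-learn
(.venv) $ pip freeze
joblib==0.14.1
numpy==1.18.1
scikit-learn==0.22.1
scipy==1.4.1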

Other development dependencies that may help you develop your package include jedi (autocompletion, does not depend on the interpreter version), black (code formatting, does not depend on the interpreter version), isort (import sorting, does not depend on the interpreter version), flake8 (style guide enforcement, depends on the interpreter version), mypy (type checking, does not depend on the interpreter version) and pydocstyle (documentation style enforcement, does not depend on the interpreter version).

The pipenv tool is built to solve these issues. However, as I mentioned before, it is not usable in my case, so I had to find an alternative solution. Recently, I started learning Rust and found out that a state-of-the-art approach to these issues is implemented in its cargo package manager. There, in Cargo.toml, you specify crate and development dependencies separately, along with constraints on the dependencies’ versions. During the first build, cargo generates the Cargo.lock file, where it lists all dependencies and their exact versions. Sharing this file allows your collaborators to reproduce the same build environment.

In order to have something similar in Python, I use the following approach. I store package and development dependencies in two different files, requirements.txt and requirements-dev.txt respectively. These files contain only top-level dependencies; for instance, in the case of our example, requirements.txt will contain only one entry: scikit-learn. The requirements-dev.txt file may be created just once with the list of development dependencies you usually use. Then, you can copy this file into every project and install the development dependencies using the pip install -r requirements-dev.txt command.
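
For our example, the two files could look like this (the exact set of development tools is up to you):

$ cat requirements.txt
scikit-learn
$ cat requirements-dev.txt
pylint
black
isort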

However, if your project starts depending on many different packages, installing each package and then adding it to the requirements file by hand becomes inconvenient. Therefore, I have developed a bash function that automates this process. In particular, it uses pip to install a package and, based on the provided parameter (-d or --dev), adds the name of the package to requirements.txt or requirements-dev.txt. You can also provide the name of the file explicitly if you do not like the default ones. Note that the check for whether a dependency is already listed in the file is currently very basic; your requirements file may still end up with several entries for the same library.

Here is the code of this function:

function pip-install() {
    packages=()
    dev_dependency=0
    requirements_file=
    while [ $# -gt 0 ]
    do
        case "$1" in
            -h|--help)
                echo "Usage: pip-install [-d|--dev] [-r|--req <file>] <package1> <package2> ..." 1>&2
                echo ""
                echo "This function installs provided Python packages using pip"
                echo "and adds this dependency to the file listing requirements."
                echo "The name of the package is added to the file without"
                echo "concreate version only if it is absent there." 
                echo ""
                echo "-h|--help        - prints this message and exits."
                echo "-d|--dev         - if the dependency is development."
                echo "-r|--req <file>  - in which file write the dependency."
                echo "    If the filename is not provided by default the function"
                echo "    writes this information to requirements.txt or to"
                echo "    requirements-dev.txt if -d parameter is provided."
                echo "<package1> <package2> ..."
                return 0
                ;;
            -d|--dev)
                shift
                dev_dependency=1
                ;;
            -r|--req)
                shift
                requirements_file="$1"
                echo "Requirements file specified: $requirements_file"
                shift
                ;;
            *)
                packages+=( "$1" )
                echo "$1"
                shift
                ;;
        esac
    done

    if ! [ -x "$(command -v pip)" ]; then
        echo "Cannot find pip tool. Aborting!"
        exit 1
    fi

    echo "Requirements file: $requirements_file"
    echo "Development dependencies: $dev_dependency"
    echo "Packages: ${packages[@]}"

    if [ -z "$requirements_file" ]; then
        if [ $dev_dependency -eq 0 ]; then
            requirements_file="requirements.txt"
        else
            requirements_file="requirements-dev.txt"
        fi
    fi

    for p in "${packages[@]}"
    do
        echo "Installing package: $p"
        pip install $p
        if [ $? -eq 0 ]; then
            echo "Package installed successfully"
            echo "$p" >> $requirements_file
            if [ $(grep -Ec "^$p([~=!<>]|$)" "$requirements_file") -eq 0 ]; then
                echo "$p" >> $requirements_file
            else
                echo "Package $p is already in $requirements_file"
            fi
        else
            echo "Cannot install package: $p"
        fi
    done
}

You can add this function to your .bashrc (if you use bash), and it will be available each time you open your shell. For instance, if you want to use scikit-learn, you call this function in your virtual environment in the following way:

(.venv) $ pip-install scikit-learn

After you call this command, the scikit-learn library will be installed into your virtual environment and the dependency will be added to the requirements.txt.

Once you need to share your code with the collaborators, you can generate a lock file that will contain all exact versions of the top-level dependencies. The following bash function facilitates this process:

function pip-freeze() {
    dump_all=0
    dev_dependency=0
    requirements_file=
    while [ $# -gt 0 ]
    do
        case "$1" in
            -h|--help)
                echo "Usage: pip-freeze [-a|--all] [-d|--dev] [-r|--req <file>]" 1>&2
                echo ""
                echo "This function freezes only the top-level dependencies listed"
                echo "in the <file> and writes the results to the <filename>.lock file."
                echo "Later, the data from this file can be used to install all"
                echo "top-level dependencies." 
                echo ""
                echo "-h|--help        - prints this message and exits."
                echo "-d|--dev         - if the dependency is development."
                echo "-a|--all         - if we should freeze all dependencies"
                echo "  (not only top-level)."
                echo "-r|--req <file>  - what file to use to look for the list of"
                echo "    top-level dependencies. The results will be written to"
                echo "    the \"<file>.lock\" file." 
                echo "    If the <file> is not provided by default the function"
                echo "    uses \"requirements.txt\" or \"requirements-dev.txt\""
                echo "    if -d parameter is provided and writes the results to the"
                echo "    \"requirements.txt.lock\" or \"requirements-dev.txt.lock\""
                echo "    correspondingly."
                return 0
                ;;
            -d|--dev)
                shift
                echo "Development dependency"
                dev_dependency=1
                ;;
            -a|--all)
                shift
                dump_all=1 
                ;;
            -r|--req)
                shift
                requirements_file="$1"
                echo "Requirements file specified: $requirements_file"
                shift
                ;;
        esac
    done

    if ! [ -x "$(command -v pip)" ]; then
        echo "Cannot find pip tool. Aborting!"
        exit 1
    fi

    if [ -z "$requirements_file" ]; then
        if [ $dev_dependency -eq 0 ]; then
            requirements_file="requirements.txt"
        else
            requirements_file="requirements-dev.txt"
        fi
    fi

    fullname=$(basename -- "$requirements_file")
    filename="${fullname%.*}"
    lock_file="$filename.lock"
    if [ $dump_all -eq 1 ] 
    then
        pip freeze > "$lock_file"
        if [ $? -eq 0 ]; then
            echo "Locked all dependencies to: $lock_file"
        else
            echo "Error happened while locking all dependencies"
        fi
    else
        cmd_output=$(pip freeze -r "$requirements_file")
        if [ $? -eq 0 ]; then
            > "$lock_file"
            while IFS= read -r line; do
                if [ "$line" = "## The following requirements were added by pip freeze:" ]; then
                    break
                fi
                echo "$line" >> "$lock_file"
            done <<< "$cmd_output"
        fi
    fi
}

Thus, if you run the pip-freeze command, it will take all the top-level dependencies from the requirements.txt file, lock their versions, and write the result into the requirements.lock file. Then, your collaborators just need to run pip install -r requirements.lock to obtain an environment with the same versions of the top-level dependencies. Note that all dependencies from the requirements.txt file must be installed in the current virtual environment; otherwise the requirements.lock file will not list them.
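
For our running example, the result could look like this (the version number is illustrative):

(.venv) $ pip-freeze
(.venv) $ cat requirements.lock
scikit-learn==0.22.1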

I am not sure whether this guarantees that all subdependencies get the same versions (maybe someone with deep knowledge of the Python packaging process can confirm this). However, for my projects this approach seems to work.

Data Analysis Workflow

The data analysis workflow is similar to the Python development workflow. Imagine that you work on a project with a number of Jupyter notebooks, all stored in the same directory. In this directory, you create a virtual environment, activate it, install all the dependencies there, and run the Jupyter server:

$ virtualenv .venv
$ source .venv/bin/activate
(.venv) $ pip-install numpy
(.venv) $ jupyter notebook

Now, you can create a new notebook and check that everything is working by importing the numpy module and printing its version:

import numpy as np
print(np.__version__)

Conclusion

During the last several weeks, I was working on finding a better way to manage my Python interpreters and project dependencies. This article was born as a result of these explorations. I am not saying that I have found the best solution, but it seems solid enough to be used at least by developers who have similar requirements.

I developed a bash script called py_env_config.sh that facilitates the process of configuring my Python environment. If you run it, you should obtain an environment similar to the one described in this article. You can find this script in the accompanying GitHub repository.

Similarly, I put the pip-related functions into one file called pip_functions.sh. Just download this file, put it into the ~/.bash/ directory, and add the following lines to your .bashrc to make these functions available:

if [ -f ~/.bash/pip_functions.sh ]; then
    source ~/.bash/pip_functions.sh
fi

Happy Python coding!
