
02/26/2019

How to prepare Ubuntu for Data Science: Torus, Cookiecutter and Docker in Ubuntu Server LTS 18.04

I needed a faster, always-on computer with more storage, so I got a not-so-old (okay, it's old, but it has 4 cores and a 2TB HD) HP ProLiant ML110 running Ubuntu Linux LTS 18.04.

I had already installed everything on my Mac, and doing it all over again made no sense. I also tried Anaconda (a great tool to start with, but you still have to install several packages on every new install). Then I found Torus: A Toolkit For Docker-First Data Science, which you can download here, or just read further.

Torus and Cookiecutter

Using the project cookiecutter and Docker image together you can go from cold-steel to a new project working in a Jupyter notebook with all of the common libraries available to you in less than five minutes (and you didn’t have to pip install anything). -Alex Ng

Cookiecutter Data Science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. -from its repository

What is Torus?

Torus is based on the Cookiecutter Data Science project.

Helping Data Science teams easily move to a Docker-first development workflow to iterate and deliver projects faster and more reliably.
The Torus 1.0 package contains a Python cookiecutter template and a public Docker image. The goal of Torus is to help data science teams adopt Docker and apply Development Operations (DevOps) best practices to streamline machine learning delivery pipelines. -From their website

In other words, install Torus and everything will already be working for you (pandas, Jupyter Notebook, sklearn, matplotlib, etc.).

Should you want any package that is not included, you can always install it. The Docker image provided runs Ubuntu, so there’s plenty of information out there. You can also install packages directly from Jupyter Notebooks using pip, which is recommended to keep your code reproducible.
Also, you will find the data (files and folders) in the data folder inside your project’s folder. The Docker image mounts this folder inside Torus as /mnt, so you can use your normal tools (for example, Sublime Text on my Mac) to edit files which are later run inside Torus. You can also access Jupyter Notebooks from your browser.
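If you install extra packages from a shell rather than from a notebook, one way to keep the list reproducible is to write each package name into a requirements file that pip can read later. This is only a sketch: record_requirement is a hypothetical helper, and the file name requirements.txt is an assumption (the source does not say how Torus tracks extra packages).

```shell
#!/bin/sh
# record_requirement is a hypothetical helper (not part of Torus or pip):
# it adds a package name to a requirements file, but only once.
record_requirement() {
    pkg=$1
    file=${2:-requirements.txt}   # assumed file name, change as needed
    touch "$file"
    # -x matches the whole line, -F treats the name literally, -q prints nothing
    if ! grep -qxF "$pkg" "$file"; then
        printf '%s\n' "$pkg" >> "$file"
    fi
}

# Example use, from the project folder:
#   record_requirement seaborn
#   pip install -r requirements.txt
```

Calling the helper twice with the same package leaves a single line in the file, so it is safe to rerun.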

How to install Torus and Docker

These are the steps to prepare the system for Data Science stuff by installing Torus. Open your terminal and follow these steps:
  1. Install Docker CE and Python pip (this assumes you already have Python installed, which is the default).

# Add Docker's GPG key and the stable repository for your Ubuntu release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
# Refresh the package index, then install Docker and pip
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io python-pip
  2. Install cookiecutter
sudo pip install cookiecutter
  3. Add your user to the docker group so you can run Docker without sudo (taken from here). Log out and back in for the change to take effect.
sudo usermod -a -G docker $USER
  4. Install docker-compose (taken from here) and make it executable
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

  5. Run the following in the folder of your choice (you can change the folder name if you want)
mkdir Docker-DataScience && cd Docker-DataScience
cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science.git
  6. Enter the name of the project when cookiecutter asks for it. The next step assumes you called it data-science-docker.
  7. Enter the following (replace data-science-docker with the name you gave your project). This will download around 430 MB.
cd data-science-docker
./start.sh
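
Once the installation steps are done, a quick check like the following can confirm that the tools are reachable from your shell and that your user is in the docker group. Treat it as a sketch: check_tool is a hypothetical helper of my own, not something that ships with Docker or Torus.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of Docker or Torus):
# it reports whether a command can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# The three tools installed in the steps above.
for tool in docker docker-compose cookiecutter; do
    check_tool "$tool" || true
done

# Group changes only apply to new login sessions.
if id -nG | grep -qw docker; then
    echo "docker group: ok"
else
    echo "docker group: not active yet (log out and back in)"
fi
```

If a tool shows up as missing, repeat the matching install step before running start.sh.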

Things you should know about Torus

  1. All your notebooks are in <your-project>/data
  2. You can read more about this project structure at http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
  3. Get the name of the container and the port on which Torus is running (you need the port to access Jupyter Notebooks).
docker ps

  4. Docker uses resources, so your computer may run slower and get hot (it depends on the computer). If you are not using Torus, remember to stop the container.
docker stop <name-of-the-container>

Tip: if you type “docker stop ”, then the first letter of the container's name, and press Tab, the shell will fill in the rest of the name.
  5. You can start the container again with

docker start <name-of-the-container>

Remember to check which port it is available on after starting (run docker ps again). The port might change.
  6. Get a bash shell in the container. Replace <container name> with the name of your container.
docker exec -it <container name> /bin/bash
  7. Once in the bash shell, you can add a user (it is not a good idea to always be logged in as root).
adduser <new_user>
  8. Log in as that new user
su -l <new_user>
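
Since the port can change between restarts, it may help to print only the container names and port mappings instead of the full docker ps table. The sketch below uses the --format option of docker ps for that. host_port is a hypothetical helper that pulls the host port out of one mapping, and the port number 32768 in the comment is only an illustration.

```shell
#!/bin/sh
# host_port is a hypothetical helper: it extracts the host port from a
# Ports entry as printed by `docker ps`, e.g. "0.0.0.0:32768->8888/tcp".
host_port() {
    printf '%s\n' "$1" | sed -n 's/.*:\([0-9][0-9]*\)->.*/\1/p'
}

# List container names and port mappings only (skipped if Docker is absent).
if command -v docker >/dev/null 2>&1; then
    docker ps --format '{{.Names}}: {{.Ports}}' || echo "could not reach the Docker daemon"
fi
```

The number to the left of the arrow is the port on the host, which is the one to use in your browser. Running docker port <name-of-the-container> prints the same mappings for a single container.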