How to prepare Ubuntu for Data Science: Torus, Cookiecutter and Docker in Ubuntu Server LTS 18.04
Important: since 2019-12 Torus is deprecated, as stated in its GitHub repository:
This project is deprecated and has been archived. A new, significantly improved, cookiecutter for Dockerized machine learning and data science can be found here: https://github.com/manifoldai/orbyter-cookiecutter.
I am leaving the original post up just as a reference.
Original post:
I needed a faster, always-on computer with more storage, so I got a not-so-old (okay, it's old, but it has 4 cores and a 2 TB HD) HP ProLiant ML110 with Ubuntu Linux LTS 18.04. I had already installed everything on my Mac, and doing it all over again by hand made no sense. I had also tried Anaconda (a great tool to start with, but you still have to install several packages on every new install). Then I found Torus: A Toolkit For Docker-First Data Science, which you can download here or just read further.
Torus and Cookiecutter
Using the project cookiecutter and Docker image together you can go from cold-steel to a new project working in a Jupyter notebook with all the common libraries available to you in less than five minutes (and you didn’t have to pip install anything) -Alex Ng
Cookiecutter Data Science
A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. -from its repository
What is Torus?
Torus is based on the Cookiecutter Data Science project.
Helping Data Science teams easily move to a Docker-first development workflow to iterate and deliver projects faster and more reliably. The Torus 1.0 package contains a Python cookiecutter template and a public Docker image. The goal of Torus is to help data science teams adopt Docker and apply Development Operations (DevOps) best practices to streamline machine learning delivery pipelines. -From their website
In other words, install Torus and everything will already be working for you (pandas, Jupyter Notebook, sklearn, matplotlib, etc.). Should you want any package that is not included, you can always install it. The Docker image provided runs Ubuntu, so there's plenty of information out there. You can also install packages directly from Jupyter Notebooks using pip, which is recommended for reproducible code. You will find the data (files and folders) in the data folder inside your project's folder. The Docker image mounts this folder as a local folder inside Torus as /mnt, so you can use your normal tools (for example, Sublime Text on my Mac) to edit files which are later run inside Torus. You can also access Jupyter Notebooks from your browser.
How to install Torus and Docker
These are the steps to prepare the system for data science work by installing Torus. Open your terminal and follow these steps:
- Download Docker CE and Python pip (this assumes you already have Python installed, which is the default).
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io python-pip
- Install cookiecutter
sudo pip install cookiecutter
- Give permissions to your user (taken from here)
sudo usermod -a -G docker $USER
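- Install docker-compose (taken from here). The second command makes the downloaded binary executable.
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose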
- Run the following in the folder of your choice (you can change the folder name if you want)
mkdir Docker-DataScience && cd Docker-DataScience
cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science.git
- Enter the name of the Project. You can just use the following if you don’t know
- Enter the following (replace data-science-docker with the name you gave your project). This will download around 430 MB.
cd data-science-docker
./start.sh
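Once the steps above are done, a quick sanity check can save some debugging. The following is a minimal sketch and not part of Torus: the helper names `have` and `in_group` are made up for illustration, and it only relies on the standard `command -v` and `id` utilities. Keep in mind that the group change made by `usermod` only applies to new login sessions, so log out and back in (or run `newgrp docker`) before expecting the group check to pass.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Torus or Docker.

# Succeeds if the named command is on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if user $1 belongs to group $2.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx "$2"
}

# Fall back to `id -un` when $USER is not set.
me="${USER:-$(id -un)}"

# Report which of the tools installed above are reachable.
for tool in docker docker-compose cookiecutter; do
    if have "$tool"; then
        echo "ok: $tool found"
    else
        echo "missing: $tool"
    fi
done

# Report whether the current user can talk to Docker without sudo.
if in_group "$me" docker; then
    echo "ok: $me is in the docker group"
else
    echo "note: $me is not in the docker group yet"
fi
```

If every line starts with "ok", running `docker run --rm hello-world` is a reasonable final test that the Docker daemon itself is working.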
Things you should know about TORUS
- All your notebooks are in
/<your-project>/data
- You can read more about this project structure at http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
- Get the port on which Torus is running to access Jupyter Notebooks and get the name of the container.
docker ps
- Docker uses resources, so your computer may run slower and heat up (depending on the computer). If you are not using Torus, remember to stop the container.
docker stop <name-of-the-container>
Tip: if you type "docker stop ", then the first letter of the container's name, and press Tab, the shell will fill in the rest of the name.
- You can start the container with
docker start <name-of-the-container>
Remember to check which port it is available on after starting, since the port might change.
- Get a bash shell in the container. Replace <container name> with the name of your container.
docker exec -it <container name> /bin/bash
- Once in the bash shell, you can add a user (it's not a good idea to always be logged in as root)
adduser <new_user>
- Log in as that new user
su -l <new_user>
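Since the port can change between restarts, it may help to pull it out of the `docker ps` output rather than read it by eye. Here is a minimal sketch, assuming the PORTS column contains an entry shaped like `0.0.0.0:32768->8888/tcp`. The numbers are invented for illustration (8888 is just Jupyter's usual default and may differ in Torus), and `host_port` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the host port from one PORTS entry
# as printed by `docker ps`, e.g. "0.0.0.0:32768->8888/tcp".
# Prints nothing if the entry has no host-side mapping.
host_port() {
    printf '%s\n' "$1" | sed -n 's/^[^>]*:\([0-9][0-9]*\)->.*$/\1/p'
}

host_port '0.0.0.0:32768->8888/tcp'   # prints 32768
```

With a running container, `docker ps --format '{{.Names}} {{.Ports}}'` prints only the names and port mappings, and `docker port <name-of-the-container>` lists the mappings for a single container.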