This container was created to support various experiments in data science, mainly in the context of Kaggle competitions.
Bundled tools:
- Based on Ubuntu 16.04
- Python 3
- Jupyter
- TensorFlow (CPU and GPU flavors)
- Spark driver (set the SPARK_MASTER env var to point to your Spark master)
  - For creating a Spark cluster, you can check https://github.com/flaviostutz/spark-swarm-cluster
 
- Scoop, h5py, pandas, scikit, TFLearn, plotly
- pyexcel-ods, pydicom, textblob, wavio, trueskill, cytoolz, ImageHash...
- CPU only: create the docker-compose.yml below and run `docker-compose up`

```yml
version: "3"
services:
  datascience-tools:
    image: flaviostutz/datascience-tools
    ports:
      - 8888:8888
      - 6006:6006
    volumes:
      - /notebooks:/notebooks
    environment:
      - JUPYTER_TOKEN=flaviostutz
```
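Once the stack is up, Jupyter should answer on port 8888. A minimal sanity check, assuming you run it on the same host:

```bash
# expect an HTTP status line (200 or a redirect) from the Jupyter server
curl -sI http://localhost:8888 | head -n 1
```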
 
- GPU support for TensorFlow: prepare the host machine with the NVIDIA CUDA drivers

```bash
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo sh -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update && sudo apt-get install -y --no-install-recommends cuda-drivers
```

- For GPU instances on AWS, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html
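If the driver install succeeded (a reboot may be needed before the module loads), the GPU should be visible on the host. A quick check:

```bash
# the installed driver should list the GPU(s); reboot first if nvidia-smi is not found
nvidia-smi
```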
 
- Install nvidia-docker and nvidia-docker-plugin

```bash
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker_1.0.0-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
```

- For installation details, see https://github.com/NVIDIA/nvidia-docker
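Before running the full image, you can confirm that nvidia-docker exposes the GPU to containers, using the stock nvidia/cuda image as a throwaway test:

```bash
# nvidia-smi inside a disposable CUDA container should show the same GPU(s) as the host
nvidia-docker run --rm nvidia/cuda nvidia-smi
```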
 
- Run the GPU flavor of the container

```bash
nvidia-docker run -d -v /root:/notebooks -v /root/input:/notebooks/input -v /root/output:/notebooks/output -p 8888:8888 -p 6006:6006 --name jupyter flaviostutz/datascience-tools:latest-gpu
```
 
- If you wish this container to run automatically on host boot, add these lines to /etc/rc.local

```bash
cd /root/datascience-tools/run
./boot.sh >> /var/log/boot-script
```

- Change "/root/datascience-tools" to wherever you cloned this repo
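For reference, a sketch of how /etc/rc.local typically looks on Ubuntu 16.04 with those lines added; they must appear before the final exit 0 (the clone path here assumes the default /root/datascience-tools):

```bash
#!/bin/sh -e
# /etc/rc.local - executed at the end of boot
# start datascience-tools; assumes the repo was cloned to /root/datascience-tools
cd /root/datascience-tools/run
./boot.sh >> /var/log/boot-script

exit 0
```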
 
Access:
- http://[ip]:8888 for Jupyter
- http://[ip]:6006 for TensorBoard
- When this container starts, it runs:
  - Jupyter Notebook server on port 8888
  - TensorBoard server on port 6006
  - A custom script located at /notebooks/autorun.sh
    - If autorun.sh doesn't exist, it is ignored
    - If it exists, it is run once every time you start or restart the container
    - You can use this script when running large batch processes on servers that may boot or shut down at random (as happens with AWS Spot Instances), so that when the server restarts the script can resume the previous work (see the sketch after the example below)
    - Make sure you handle partial save/resume so that compute time is used optimally
    - On the host OS, you have to run this docker container with "--restart=always" so that it is started automatically during boot
    - You can edit this file with the Jupyter editor
    - Example script:

```bash
#!/bin/bash
python test.py
```
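A slightly larger sketch of an autorun.sh implementing the save/resume pattern mentioned above; the checkpoint file and train.py script are hypothetical placeholders for your own batch job:

```bash
#!/bin/bash
# /notebooks/autorun.sh - executed once on every container start/restart
# hypothetical example: resume a long-running job from its last checkpoint
cd /notebooks

if [ -f output/checkpoint.h5 ]; then
    echo "$(date) resuming from checkpoint" >> output/autorun.log
    python train.py --resume output/checkpoint.h5 >> output/autorun.log 2>&1
else
    echo "$(date) starting from scratch" >> output/autorun.log
    python train.py >> output/autorun.log 2>&1
fi
```

Combined with "--restart=always" on the docker/nvidia-docker run command, this lets a Spot Instance pick up its work again after an interruption.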
 
 
 
- To build the images yourself:

```bash
# CPU image
docker build . -f Dockerfile
# GPU image
docker build . -f Dockerfile-gpu
```
- A good practice is to store your notebook scripts in a Git repository
- Run the datascience-tools container and map the volume "/notebooks" inside the container to the path where you cloned your Git repository on your computer
- You can edit/save/run the scripts from the web interface (http://localhost:8888) or directly with other tools on your computer. You can commit and push your code to the repository directly (no copying from/to the container is needed because the volume is mapped). For example:
```yml
version: "3"
services:
  datascience-tools:
    image: flaviostutz/datascience-tools
    ports:
      - 8888:8888
      - 6006:6006
    volumes:
      - /Users/flaviostutz/Documents/development/flaviostutz/puzzler/notebooks:/notebooks
```
- For running in production, create a new image with "FROM flaviostutz/datascience-tools" and add your script files to "/notebooks", so that when you run the container it already has your custom scripts embedded in it. No volume mapping is needed for this container. During container startup, the script /notebooks/autorun.sh will run if present.
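A minimal sketch of such a production image; the local notebooks/ directory and the image tag below are placeholders for your own files and naming:

```Dockerfile
FROM flaviostutz/datascience-tools
# embed your scripts; /notebooks/autorun.sh will be executed on container startup if present
COPY notebooks/ /notebooks/
```

Build and run it with, for example, `docker build -t my-datascience-job .` and `docker run -d my-datascience-job`.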
Environment variables:
- JUPYTER_TOKEN - the token users must provide to open Jupyter. Defaults to '', so that no token or password will be asked of the user
- SPARK_MASTER - Spark master address. Used if you want to submit jobs to an external Spark cluster and still control the whole job from Jupyter Notebook itself.
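Both ENVs can be passed when starting the container; the token value and Spark master address below are just placeholders:

```bash
docker run -d -p 8888:8888 -p 6006:6006 \
  -v /notebooks:/notebooks \
  -e JUPYTER_TOKEN=mysecret \
  -e SPARK_MASTER=spark://my-spark-master:7077 \
  flaviostutz/datascience-tools
```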