Apache Toree is a Jupyter Notebook kernel. The main goal of Toree is to provide the foundation for interactive applications that connect to and use Apache Spark using the Scala language.
Toree provides an interface that allows clients to interact with a Spark cluster. Clients can send libraries and snippets of code that are interpreted and executed using a preconfigured Spark context. These snippets can do a variety of things:
- Define and run Spark jobs of all kinds
- Collect results from Spark and push them to the client
- Load necessary dependencies for the running code
- Start and monitor a stream
- ...
Apache Toree supports the Scala programming language. It implements version 5.0 of the Jupyter message protocol, so it can easily plug into Jupyter/IPython releases 3.2.x and later for quick, interactive data exploration.
This project uses make as the entry point for build, test, and packaging. To perform a local build, you need to install sbt, jupyter/ipython, and other development requirements locally on your machine.
To build and interact with Toree using Jupyter, run
make dev
This will start a Jupyter notebook server. Depending on your mode, it will be accessible at http://localhost:8888 or http://192.168.44.44:8888. From here you can create notebooks that use Toree configured for Spark local mode.
Tests can be run with make test.
NOTE: Do not use sbt directly.
To build and package up Toree, run
make release
This results in two packages:
- ./dist/toree-<VERSION>-binary-release.tar.gz is a simple package that contains the JAR and an executable.
- ./dist/toree-<VERSION>.tar.gz is a pip-installable package that adds Toree as a Jupyter kernel.
NOTE: make release uses Docker. Please refer to the Docker installation instructions for your system.
To build just the main Toree assembly jar (without spark-monitor-plugin):
sbt assembly
This creates: target/scala-2.12/toree-assembly-<VERSION>.jar
To build the spark-monitor-plugin as a separate jar:
sbt sparkMonitorPlugin/assembly
This creates: spark-monitor-plugin/target/scala-2.12/spark-monitor-plugin-<VERSION>.jar
To compile all projects including both the main assembly and spark-monitor-plugin:
sbt compile
Note: The spark-monitor-plugin is now built as a separate jar and is not included in the main Toree assembly.
To enable the Spark Monitor Plugin in your Toree application, you need to specify the path to the plugin JAR when starting Toree:
# Start Toree with spark-monitor-plugin enabled
java -jar target/scala-2.12/toree-assembly-<VERSION>.jar --magic-url file:///path/to/spark-monitor-plugin/target/scala-2.12/spark-monitor-plugin-<VERSION>.jar [other-options]
When installing Toree as a Jupyter kernel, you can specify the plugin:
jupyter toree install --spark_home=<YOUR_SPARK_PATH> --kernel_name=toree_with_monitor --toree_opts="--magic-url file:///path/to/spark-monitor-plugin-<VERSION>.jar"
You can also specify the plugin in a configuration file and use the --profile option:
{
"magic_urls": ["file:///path/to/spark-monitor-plugin-<VERSION>.jar"]
}
Then start with: java -jar toree-assembly.jar --profile config.json
Important:
- Make sure to use the absolute path to the spark-monitor-plugin JAR file and ensure the JAR is accessible from the location where Toree is running.
- The JAR file name deliberately omits the "toree" prefix so that it is not loaded automatically as an internal plugin. This lets you control when the SparkMonitorPlugin is enabled via the --magic-url parameter.
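The configuration-file approach can also be scripted. The following is a minimal shell sketch, not part of Toree itself: the helper names to_file_url and write_profile are hypothetical, and it assumes the jar path contains no characters that would need URL-encoding. It resolves the jar's directory to an absolute path, builds the file:// URL, and writes a profile using the magic_urls key.

```shell
# Hypothetical helper: print a file:// URL for the jar at path $1.
# Resolves the containing directory to an absolute path; no URL-encoding is done.
to_file_url() {
  jar_dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf 'file://%s/%s\n' "$jar_dir" "$(basename "$1")"
}

# Hypothetical helper: write a profile file ($2) that lists the jar ($1)
# under the "magic_urls" key.
write_profile() {
  jar_url=$(to_file_url "$1") || return 1
  printf '{\n  "magic_urls": ["%s"]\n}\n' "$jar_url" > "$2"
}
```

For example, write_profile "path/to/spark-monitor-plugin-<VERSION>.jar" config.json would produce a config.json that can then be passed to Toree with --profile config.json.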
To play with the example notebooks, run
make jupyter
A notebook server will be launched in a Docker container with Toree and some other dependencies installed.
Refer to your Docker setup for the IP address. The notebook will be at http://<ip>:8888/.
Installing Apache Toree requires a distribution of Apache Spark downloaded to the system where Apache Toree will run. The following commands install Apache Toree:
pip install --upgrade toree
jupyter toree install --spark_home=<YOUR_SPARK_PATH>
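Before running the install command, it can help to confirm that the path passed to --spark_home looks like a Spark distribution. The following is a minimal shell sketch, assuming a standard Spark layout that includes bin/spark-submit; check_spark_home is a hypothetical helper, not part of Toree.

```shell
# Hypothetical helper: succeed only if directory $1 looks like a Spark
# distribution, i.e. it contains an executable bin/spark-submit.
check_spark_home() {
  if [ -x "$1/bin/spark-submit" ]; then
    echo "Spark found at $1"
  else
    echo "No bin/spark-submit under $1" >&2
    return 1
  fi
}
```

One possible use is to chain it with the install step: check_spark_home "$SPARK_HOME" && jupyter toree install --spark_home="$SPARK_HOME"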
Dev snapshots of Toree are located at https://dist.apache.org/repos/dist/dev/incubator/toree. To install using one of those packages, you can use the following:
pip install <PIP_RELEASE_URL>
jupyter toree install --spark_home=<YOUR_SPARK_PATH>
where PIP_RELEASE_URL is one of the pip packages. For example:
pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
jupyter toree install --spark_home=<YOUR_SPARK_PATH>
Refer to existing issues and open new ones here.
You can reach us through Gitter or our mailing list.
We are working on publishing binary releases of Toree soon. As part of our move into the Apache Incubator, Toree will begin a new version sequence at 0.1.
Our goal is to keep master up to date with the latest version of Spark. When new versions of Spark require specific code changes to Toree, we will move support for older Spark versions to a separate branch.
As it stands, we maintain several branches for legacy versions of Spark. The table below shows what is available now.
Branch | Apache Spark Version |
---|---|
master | 3.x.x |
0.4.x | 2.x.x |
0.1.x | 1.6+ |
Please note that new features will mainly be added to the master branch.
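The branch table can be expressed as a small lookup, for instance in a setup script that decides which branch to check out. This sketch is illustrative only; toree_branch_for_spark is a hypothetical helper, and it treats any Spark version not listed in the table as unsupported.

```shell
# Hypothetical helper: print the Toree branch for Spark version $1,
# following the branch table (3.x.x -> master, 2.x.x -> 0.4.x, 1.6+ -> 0.1.x).
toree_branch_for_spark() {
  case "$1" in
    3.*)        echo "master" ;;
    2.*)        echo "0.4.x" ;;
    1.6|1.6.*)  echo "0.1.x" ;;
    *)          echo "no branch listed for Spark $1" >&2; return 1 ;;
  esac
}
```

For example, toree_branch_for_spark 2.4.8 prints 0.4.x.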
We are currently enhancing our documentation, which is available on our website.