
Commit 1b659dd

cmccarthy1, Dianeod and dmorgankx authored
Addition of AutoML/Clustering notebooks (#10)
* updated nlp
* additional utils and graphics for clustering and automl
* new clustering and automl notebooks
* changed namespace from .aml to .automl
* updated clustering notebook for new algo implementations
* updated automl notebook with telco data
* updated automl notebook with telco data
* addition of clustering and automl to readme
* Minor updates to wording in automl notebook
* typo
* automl demonstration added
* minor automl fixes
* automl updates
* clustering updates
* readme and requirement updates for clustering
* dendrogram removal from graphics
* pred true corrections
* Reviewed clustering
* updated notebooks and added changes for clustering
* updated clustering
* update links
* removed unused code
* tensorflow chk

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
1 parent e589613 commit 1b659dd

16 files changed, +12270 -745 lines

README.md

Lines changed: 37 additions & 13 deletions
@@ -1,28 +1,51 @@
-# Example notebooks
+# Kx Machine Learning Notebooks
 
-Throughout the machine learning notebooks we showcase the benefits of using Embedpy and JupyterQ to solve a range of machine learning problems, from feature engineering to the training and testing models.
+The example machine learning notebooks demonstrate the benefits of using kdb+/q alongside the Kx interfaces embedPy and JupyterQ, the Kx Natural Language Processing (NLP), Machine Learning Toolkit (ML-Toolkit) and Automated Machine Learning (AutoML) libraries. These notebooks showcase how to solve a range of machine learning problems, from feature engineering and neural-network design to model training and testing.
 
-EmbedPy allows users to access the rich eco-system of machine learning and visual libraries available in Python, while JupyterQ allows users to display results in a range of ways, giving a better undertanding of the data and results produced using kdb+/q.
+## embedPy
 
+embedPy is part of the Fusion for kdb+ initiative and allows the application of Python functions on kdb+ data within a q process. Python and kdb+/q developers can leverage the benefits of both technologies, pairing kdb+'s high-speed analytics with Python's rich ecosystem of machine learning libraries, including but not limited to scikit-learn, matplotlib and TensorFlow.
+
+## JupyterQ
+
+JupyterQ is also part of the Fusion for kdb+ initiative and provides users with a kdb+ kernel for the Jupyter project. The kernel allows users to create Jupyter notebooks and additionally to leverage JupyterHub and JupyterLab, technologies that are ubiquitous within the data science community.
+
+## NLP
+
+The Kx NLP library can be used to answer a variety of questions about unstructured text and can therefore be used to preprocess text data in preparation for model training. Input text, in the form of emails, tweets, articles or novels, can be transformed into vectors, dictionaries and symbols, which can be handled very effectively by q.
+
+## ML-Toolkit
+
+The toolkit contains libraries and scripts that provide kdb+/q users with general-purpose functions and procedures for performing machine learning tasks on a wide variety of datasets. It includes utility functions, the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm, cross-validation and grid-search procedures, and clustering algorithms.
+
+## AutoML
+
+The Automated Machine Learning framework automates the process of applying machine learning techniques to real-world problems in kdb+/q. The pipeline comprises preprocessing, feature engineering, cross validation, model selection, hyperparameter tuning and report generation. As shown in the associated notebook, the framework is designed to be flexible enough for novice and expert kdb+ and machine learning engineers alike.
+
+## Notebooks
 The contents of the notebooks are as follows:
 
 1. **Decision Trees**: A decision tree is trained to detect if a patient has either benign or malignant cancer. The performance of the model is measured by computing a confusion matrix and ROC curve.
 
-2. **Random Forests**: Random forest and XGBoost classifiers are trained to identify satisfied and unsatisfied bank clients. Different parameters are tuned and tested and the classifier performance is evaluated using the ROC curve.
+2. **Random Forests**: Random forest and XGBoost classifiers are trained to identify satisfied and unsatisfied financial clients. Different parameters are tuned and tested, with classifier performance evaluated using the ROC curve.
+
+3. **Neural Networks**: A neural network is trained to identify samples of handwritten digits from the MNIST database. Performance is calculated for a test set of images, with a variety of plots used to show the results.
+
+4. **Dimensionality Reduction**: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of the original dataset. Several plots are used to visualize the reduced features and infer differences between the distinct groups present in the data.
 
-3. **Neural Networks**: A neural network is trained to identify handwritten digits in a set of training images. Once the neural network has been trained, the performance is measured on the test dataset and different plots are used to show the results.
+5. **Feature Engineering**: Examples of data preprocessing that can greatly affect the performance of a model are demonstrated. The first section of the notebook focuses on the robustness of different scalers against k-nearest neighbours, while the second section demonstrates the importance of one-hot encoding labels when training a neural network.
 
-4. **Dimensionality Reduction**: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to try and reduce the dimensionality of the original dataset. Several plots are also employed to visualize the obtained reduced features and infer whether they are able to catch differences between the distict groups present in the data.
+6. **Feature Extraction and Selection**: The three examples provided explain how to effectively use the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm to extract features and determine how significant each feature is in predicting a target vector. The examples make use of both random forest and gradient boosting models.
 
-5. **Feature Engineering**: Examples of data preprocessing, such as feature scaling and one-hot categorical encoding, that can highly affect the performance of a model are demonstrated. The robustness of different scalers against KNN are demonstrated in the first part of the notebook while in a second part, the importance of one-hot encoding labels when training a neural network is shown.
+7. **Cross Validation**: Cross validation procedures are demonstrated against a random forest classifier, with the aim of classifying breast cancer data. Results produced for the different cross validation methods available in the toolkit are compared.
 
-6. **Feature Extraction and Selection**: 3 examples are provided explaining how to effectively use the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm to extract features and determine how significant each feature is in predicting a target vector. Random forest are train in the first and third examples, which a gradient boosting model is used in the second.
+8. **Natural Language Processing**: Parsing, clustering, sentiment analysis and outlier detection are demonstrated on a range of corpora, including the novel *Moby Dick*, the emails of the Enron CEOs and the 2014 IEEE VAST Challenge articles.
 
-7. **Cross Validation**: Different cross validation methods are used with a random forest classifer to see how results compare across the methods when classifying breast cancer data.
+9. **K-Nearest Neighbours**: The notebook details the steps to follow in a machine learning problem prior to model training: feature scaling, data splitting and parameter tuning, the last performed by measuring the accuracy of a k-nearest neighbours model for different values of the parameter k.
 
-8. **Natural Language Processing**: Parsing, clustering, sentiment analysis and outlier detection are demonstated on a range of corpora, including the novel *Moby Dick*, the emails of the Enron CEOs, and the 2014 IEEE Vast Challenge articles.
+10. **Automated Machine Learning**: The notebook looks at predicting how likely a telecommunications customer is to churn based on behaviour. The data and associated target are passed into the AutoML pipeline in both its default configuration and a custom user-defined configuration, with the steps in the pipeline explained throughout.
 
-9. **K Nearest Neighbours**: The basic steps to follow in a standard machine learning problem previous to final model training are performed: features are scaled, data is split into training and test datasets and parameter tuning is done by measuring accuracy of a K-Nearest Neighbours model for different values of parameter K.
+11. **Clustering**: Examples of how to use the k-means, DBSCAN, affinity propagation, hierarchical and CURE algorithms available within the ML-Toolkit are provided. The notebook demonstrates how to effectively visualize results and make use of the scoring functions contained within the toolkit. A real-world application is also included.
 
 ## Requirements
 

@@ -32,8 +55,9 @@ The contents of the notebooks are as follows:
 - [JupyterQ](https://github.com/KxSystems/jupyterq)
 - [NLP library](https://github.com/KxSystems/nlp) (v0.1.2)
 - [ML-Toolkit](https://github.com/KxSystems/ml) (v0.3.x)
+- [AutoML](https://github.com/KxSystems/automl) (v0.1.0)
 
-### Dependencies
+## Dependencies
 
 Install the Python dependencies with
 

@@ -64,4 +88,4 @@ For subsequent runs, you will not be prompted to redo the license setup when cal
     docker start -ai mymlnotebooks
 
 
-**N.B.** [build instructions for the image are available](docker/README.md)
+**N.B.** [build instructions for the image are available](docker/README.md)
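For illustration, a minimal sketch of the embedPy workflow described in the README above, assuming embedPy is installed and `p.q` is on the q load path; numpy stands in here for any Python library:

```q
\l p.q                 / load embedPy
np:.p.import`numpy     / import a Python module into the q session
x:np[`:arange;10]      / call numpy.arange; x is an embedPy object
y:x`                   / unary ` converts the Python result to q data
avg y                  / q analytics applied directly to the result: 4.5
```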

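The NLP section can be sketched similarly; the load path and the `.nlp.sentiment` call below reflect the v0.1.2 library layout and are assumptions, not taken from this diff:

```q
\l nlp/init.q          / load the NLP library (install path may vary)
/ score a sentence; returns a dictionary of compound,
/ positive, negative and neutral sentiment scores
.nlp.sentiment"The novel Moby Dick is a masterpiece"
```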
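A comparable sketch for the ML-Toolkit utilities, assuming v0.3.x is installed under QHOME (that release used lowercase function names such as `.ml.traintestsplit`):

```q
\l ml/ml.q                       / load the toolkit entry script
.ml.loadfile`:init.q             / load the toolkit libraries
x:100 2#200?1f                   / toy feature matrix
y:100?0b                         / toy binary target
tts:.ml.traintestsplit[x;y;.2]   / 80/20 train-test split
count each tts`xtrain`xtest      / 80 20
```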

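The AutoML pipeline can be exercised end to end with its defaults; the five-argument `.automl.run` below reflects the v0.1.0 interface as an assumption, and the data is purely illustrative:

```q
\l automl/automl.q               / load the AutoML framework
.automl.loadfile`:init.q         / load the framework libraries
features:([]100?1f;100?1f;100?5) / toy feature table
target:100?0b                    / toy binary target
/ `normal selects standard (non-FRESH) feature extraction,
/ `class marks a classification problem, :: keeps default parameters
.automl.run[features;target;`normal;`class;::]
```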
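Finally, the scale-split-tune steps listed for the k-nearest neighbours notebook can be combined with scikit-learn through embedPy; `.ml.minmaxscaler`, the column-wise scaling and the choice of k=3 are illustrative assumptions:

```q
\l p.q                                            / embedPy
\l ml/ml.q                                        / ML-Toolkit
.ml.loadfile`:init.q
x:flip .ml.minmaxscaler each flip 100 2#200?10f   / scale each feature column to [0,1]
y:100?2                                           / toy binary labels
tts:.ml.traintestsplit[x;y;.2]                    / hold out a 20% test set
knc:.p.import[`sklearn.neighbors]`:KNeighborsClassifier
clf:knc[`n_neighbors pykw 3]                      / k=3; vary k to tune the model
clf[`:fit;tts`xtrain;tts`ytrain];                 / train on the training set
clf[`:score;tts`xtest;tts`ytest]`                 / accuracy on the held-out test set
```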