README.md: 37 additions & 13 deletions
@@ -1,28 +1,51 @@
1
-
# Example notebooks
1
+
# Kx Machine Learning Notebooks
2
2
3
-
Throughout the machine learning notebooks we showcase the benefits of using Embedpy and JupyterQto solve a range of machine learning problems, from feature engineering to the training and testing models.
3
+
The example machine learning notebooks demonstrate the benefits of using kdb+/q alongside the Kx interfaces embedPy and JupyterQ and the Kx Natural Language Processing (NLP), Machine Learning Toolkit (ML-Toolkit) and Automated Machine Learning (AutoML) libraries. These notebooks show how to solve a range of machine learning problems, from feature engineering and neural network design to model training and testing.
4
4
5
-
EmbedPy allows users to access the rich eco-system of machine learning and visual libraries available in Python, while JupyterQ allows users to display results in a range of ways, giving a better undertanding of the data and results produced using kdb+/q.
5
+
## embedPy
6
6
7
+
embedPy is part of the fusion for kdb+ initiative and allows the application of Python functions on kdb+ data within a q process. Python and kdb+/q developers can leverage the benefits of both technologies, pairing kdb+’s high-speed analytics with Python’s rich ecosystem of machine learning libraries, such as scikit-learn, matplotlib and TensorFlow.
8
+
9
+
## JupyterQ
10
+
11
+
JupyterQ is also part of the fusion for kdb+ initiative and provides users with a kdb+ kernel for the Jupyter project. This kernel allows users to create Jupyter Notebooks and additionally to leverage JupyterHub and JupyterLab. These technologies are ubiquitous within the data science community.
12
+
13
+
## NLP
14
+
15
+
The Kx NLP library can be used to answer a variety of questions about unstructured text and can therefore be used to preprocess text data in preparation for model training. Input text data, in the form of emails, tweets, articles or novels, can be transformed to vectors, dictionaries and symbols which can be handled very effectively by q.
16
+
17
+
## ML-Toolkit
18
+
19
+
The toolkit contains libraries and scripts that provide kdb+/q users with general-use functions and procedures to perform machine-learning tasks on a wide variety of datasets. This includes utility functions, the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm, cross validation and grid search procedures, and clustering algorithms.
20
+
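The cross validation and grid search procedures mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than q, and it is not the ML-Toolkit's API: the names `k_fold_indices`, `grid_search` and `score_fn` are hypothetical and exist only for this illustration.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def grid_search(score_fn, param_grid, n_samples, k):
    """Return (best_params, best_mean_score) over every parameter combination.

    score_fn(train_idx, test_idx, params) is a hypothetical user-supplied
    callable that fits on the training indices and scores on the test indices.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(train, test, params)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

In use, `score_fn` would train a model on the training indices and return a metric such as accuracy on the test indices; `grid_search` then averages that metric across the folds for each parameter combination and keeps the best one.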
21
+
## AutoML
22
+
23
+
The Automated Machine Learning framework provides users with the ability to automate the process of applying machine learning techniques to real-world problems in kdb+/q. The pipeline comprises preprocessing, feature engineering, cross validation, model selection, hyperparameter tuning and report generation. As shown in the associated notebook, the framework is designed to be flexible enough for both novice and expert kdb+ users and machine learning engineers.
24
+
25
+
## Notebooks
7
26
The contents of the notebooks are as follows:
8
27
9
28
1. **Decision Trees**: A decision tree is trained to detect if a patient has either benign or malignant cancer. The performance of the model is measured by computing a confusion matrix and ROC curve.
10
29
11
-
2. **Random Forests**: Random forest and XGBoost classifiers are trained to identify satisfied and unsatisfied bank clients. Different parameters are tuned and tested and the classifier performance is evaluated using the ROC curve.
30
+
2. **Random Forests**: Random forest and XGBoost classifiers are trained to identify satisfied and unsatisfied financial clients. Different parameters are tuned and tested, with classifier performance evaluated using the ROC curve.
31
+
32
+
3. **Neural Networks**: A neural network is trained to identify samples of handwritten digits from the MNIST database. Performance is calculated for a test set of images, with a variety of plots used to show the results.
33
+
34
+
4. **Dimensionality Reduction**: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of the original dataset. Several plots are used to visualize reduced features and infer differences between the distinct groups present in the data.
12
35
13
-
3.**Neural Networks**: A neural network is trained to identify handwritten digits in a set of training images. Once the neural network has been trained, the performance is measured on the test dataset and different plots are used to show the results.
36
+
5. **Feature Engineering**: Examples of data preprocessing that can strongly affect the performance of a model are demonstrated. The first section of the notebook focuses on the robustness of different scalers against k-nearest neighbours, while the second section demonstrates the importance of one-hot encoding labels when training a neural network.
14
37
15
-
4.**Dimensionality Reduction**: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to try and reduce the dimensionality of the original dataset. Several plots are also employed to visualize the obtained reduced features and infer whether they are able to catch differences between the distict groups present in the data.
38
+
6. **Feature Extraction and Selection**: The three examples provided explain how to effectively use the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm to extract features and determine how significant each feature is in predicting a target vector. The examples make use of both random forest and gradient boosting models.
16
39
17
-
5.**Feature Engineering**: Examples of data preprocessing, such as feature scaling and one-hot categorical encoding, that can highly affect the performance of a model are demonstrated. The robustness of different scalers against KNN are demonstrated in the first part of the notebook while in a second part, the importance of one-hot encoding labels when training a neural network is shown.
40
+
7. **Cross Validation**: Cross validation procedures are demonstrated against a random forest classifier, with the aim of classifying breast cancer data. Results produced for the different cross validation methods available in the toolkit are compared.
18
41
19
-
6.**Feature Extraction and Selection**: 3 examples are provided explaining how to effectively use the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm to extract features and determine how significant each feature is in predicting a target vector. Random forest are train in the first and third examples, which a gradient boosting model is used in the second.
42
+
8. **Natural Language Processing**: Parsing, clustering, sentiment analysis and outlier detection are demonstrated on a range of corpora, including the novel Moby Dick, the emails of the Enron CEOs and the 2014 IEEE VAST Challenge articles.
20
43
21
-
7.**Cross Validation**: Different cross validation methods are used with a random forest classifer to see how results compare across the methods when classifying breast cancer data.
44
+
9. **K-Nearest Neighbours**: The notebook details the steps to follow in a machine learning problem, prior to model training. These include feature scaling, data splitting and parameter tuning, which is performed by measuring the accuracy of a k-nearest neighbours model for different values of parameter k.
22
45
23
-
8.**Natural Language Processing**: Parsing, clustering, sentiment analysis and outlier detection are demonstated on a range of corpora, including the novel *Moby Dick*, the emails of the Enron CEOs, and the 2014 IEEE Vast Challenge articles.
46
+
10. **Automated Machine Learning**: The notebook looks at predicting how likely a telecommunications customer is to churn based on behaviour. The data and associated target are used throughout the notebook and are passed into the AutoML pipeline in both its default configuration and a custom user-defined configuration, with the steps in the pipeline explained throughout.
24
47
25
-
9.**K Nearest Neighbours**: The basic steps to follow in a standard machine learning problem previous to final model training are performed: features are scaled, data is split into training and test datasets and parameter tuning is done by measuring accuracy of a K-Nearest Neighbours model for different values of parameter K.
48
+
11. **Clustering**: Examples of how to use the k-means, DBSCAN, affinity propagation, hierarchical and CURE algorithms available within the ML-Toolkit are provided. The notebook demonstrates how to effectively visualize results produced and make use of scoring functions contained within the toolkit. A real-world application is also included.
26
49
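Several of the notebooks listed above evaluate classifiers with a confusion matrix and a ROC curve. As a rough illustration of what those metrics compute, here is a minimal sketch in plain Python. It is not taken from the notebooks, which use kdb+/q and embedPy; the function names `confusion_matrix`, `roc_points` and `auc` are hypothetical, and binary labels with 1 as the positive class are assumed.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score threshold, starting from (0, 0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more samples are predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fp, _, _ = confusion_matrix(y_true, y_pred)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive sample above every negative one yields an area of 1, while random scoring tends towards 0.5, which is why the area under the curve is a common single-number summary of the plot.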
27
50
## Requirements
28
51
@@ -32,8 +55,9 @@ The contents of the notebooks are as follows: