diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index 08a20fe..ab39be4 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -11,7 +11,7 @@ Contributing Contributions from the community are highly appreciated. Even small contributions improve the software's quality. -Even if you are not familiar with programming languages and tools, +If you are not familiar with programming languages and tools, you may contribute by filing bugs or any problems as a `GitHub issue `_. @@ -25,13 +25,14 @@ If you are not familiar with git, there are lots of tutorials on All the important basics are covered in the `GitHub Git handbook `_. -Development of `scikit-hubness` (mostly) follows the -`git flow branching model `_. -There are two main branches: master and develop. +There is one main branch: ``main``. For any changes, a new branch should be created. -If you want to add a new feature, fix a noncritical bug, etc. one should -branch off `develop`. -Only if you want to fix a critical bug, branch off `master`. +If you want to add a new feature, fix a noncritical bug, or fix a critical bug, +branch off ``main``, introduce your changes, and create a pull request. + +(Previously, development of `scikit-hubness` (mostly) followed the +`git flow branching model `_. +This was found to be unnecessarily complicated for a project of this size). Workflow @@ -55,10 +56,10 @@ you can - of course - directly submit a pull request (PR). #. Create feature/bugfix branch. In case of feature or noncritical bugfix: - $ ``git checkout develop && git checkout -b featureXYZ develop`` + $ ``git checkout main && git checkout -b featureXYZ`` In case of critical bug: - $ ``git checkout -b bugfix123 master`` + $ ``git checkout -b bugfix123 main`` #. Implement feature/fix bug/fix typo/... Happy coding! @@ -76,13 +77,13 @@ you can - of course - directly submit a pull request (PR). #. Wait... Several devops checks will be performed automatically - (e.g. continuous integration (CI) with Travis, AppVeyor). + (e.g. 
continuous integration (CI) with Github Actions). The authors will get in contact with you, and may ask for changes. #. Respond to code review. - If there were issues with continous integration, + If there were issues with continuous integration, or the authors asked for changes, please create a new commit locally, and simply push again to GitHub as you did before. The PR will be updated automatically. @@ -123,9 +124,9 @@ Code style and further guidelines Testing ======= -In `scikit-hubness`, we aim for high code coverage. As of September 2019, -between 98% and 99% of all code lines are visited at least once when -running the complete test suite. This is primarily to ensure: +In `scikit-hubness`, we aim for high code coverage. Between 90% and 100% of all code lines +should be visited at least once when running the complete test suite. +This is primarily to ensure: * correctness of the code (to some extent) and * maintainability (new changes don't break old code). diff --git a/docs/changelog.md b/docs/changelog.md index 40d956c..ad971f2 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -3,10 +3,37 @@ ## [Next release] ... + +## [0.30.0] - 2022-04-xx + +### Major changes +- Compatibility with up-to-date scikit-learn versions +- Building upon the KNeighborsTransformer API as outlined in + [our paper's Outlook section](https://joss.theoj.org/papers/10.21105/joss.01957). +- `skhubness.neighbors` rewritten from scratch. Previously, this was a drop-in + replacement for `sklearn.neighbors` heavily relying on a specific scikit-learn + version. This was hard to maintain. Now, this package only contains lightweight + wrappers for approximate nearest neighbor search tools (`nmslib`, `ngt`, etc.) + that return `KNeighborsTransformer`-compatible k-neighbors graphs. These can be + reused in numerous scikit-learn classes and functions. +- `skhubness.analysis` uses `KNeighborsTransformer`-compatible k-neighbors graphs. 
+- `skhubness.reduction` uses `KNeighborsTransformer`-compatible k-neighbors graphs. + ### Added or enhanced +- Python 3.9 and Python 3.10 support +- Additional metrics available for ANN search with `nmslib` #87 +- Additional metrics available for ANN search with `ngtpy` #95 - Lower memory footprint for sparse targets in multilabel classification (previously converted to dense arrays) #61 +### Removed +- `falconn` removed (not maintained for 5+ years; index structures cannot be serialized; no Windows support) #94 +- Radius neighbor search. Few ANN packages provide radius search. No currently supported + ANN package supports this on Windows. Radius search is not of particular interest to + hubness research. Thus, we decided to drop radius search for the time being to speed up + development. Later releases might re-introduce radius search. Users interested in this + are asked to file an issue on GitHub. + ### Fixes - Hubness estimation could fail when ANN does not return enough neighbors #59 - Heuristic to choose memory for Puffinn LSH. @@ -14,6 +41,10 @@ ### Maintenance - Switch to modern Python packaging with `pyproject.toml` and `setup.cfg` - Switch to Github Actions, dropping Travis CI and AppVeyor +- Renamed 0.22 to 0.30. Previous versions reflected the compatibility with specific + scikit-learn versions by matching version numbers. The bump to 0.30 indicates that + this tight coupling is gone and future scikit-hubness releases should be compatible + with multiple scikit-learn versions. 
## [0.21.2] - 2020-01-14 @@ -100,7 +131,8 @@ It already contains the following features: * HNSW provided by [nmslib](https://github.com/nmslib/nmslib) * LSH provided by [falconn](https://github.com/FALCONN-LIB/FALCONN) -[Next release]: https://github.com/VarIr/scikit-hubness/compare/v0.21.2...HEAD +[Next release]: https://github.com/VarIr/scikit-hubness/compare/v0.30.0...HEAD +[0.30.0]: https://github.com/VarIr/scikit-hubness/releases/tag/v0.30.0 [0.21.2]: https://github.com/VarIr/scikit-hubness/releases/tag/v0.21.2 [0.21.1]: https://github.com/VarIr/scikit-hubness/releases/tag/v0.21.1 [0.21.0]: https://github.com/VarIr/scikit-hubness/releases/tag/v0.21.0 diff --git a/docs/conf.py b/docs/conf.py index 4763c27..0d9e0a2 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -14,21 +14,22 @@ import sys sys.path.insert(0, os.path.abspath('../')) -import mock -MOCK_MODULES = ['falconn', - 'nmslib', - 'annoy', - 'ngt', - 'ngtpy', - 'puffinn', - ] +from unittest.mock import Mock +MOCK_MODULES = [ + 'nmslib', + 'annoy', + 'ngt', + 'ngtpy', + 'numba', + 'puffinn', +] for mod_name in MOCK_MODULES: - sys.modules[mod_name] = mock.Mock() + sys.modules[mod_name] = Mock() # -- Project information ----------------------------------------------------- project = 'scikit-hubness' -copyright = '2020, Roman Feldbauer' +copyright = '2022, Roman Feldbauer' author = 'Roman Feldbauer' # The full version, including alpha/beta/rc tags @@ -41,24 +42,25 @@ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. 
-extensions = ['recommonmark', - 'numpydoc', - 'sphinx_automodapi.automodapi', - 'sphinx.ext.autodoc', - 'sphinx.ext.autosectionlabel', - 'sphinx.ext.autosummary', - 'sphinx.ext.graphviz', - 'sphinx.ext.inheritance_diagram', - 'sphinx.ext.todo', - 'sphinx.ext.napoleon', - 'sphinx.ext.githubpages', - 'sphinx.ext.mathjax', - 'sphinx.ext.doctest', - 'sphinx.ext.intersphinx', - 'sphinx.ext.linkcode', - 'sphinx_gallery.gen_gallery', # to automatically generate example pages from scripts - 'sphinx_search.extension', # readthedocs-sphinx-search with ElasticSearch - ] +extensions = [ + 'recommonmark', + 'numpydoc', + 'sphinx_automodapi.automodapi', + 'sphinx.ext.autodoc', + 'sphinx.ext.autosectionlabel', + 'sphinx.ext.autosummary', + 'sphinx.ext.graphviz', + 'sphinx.ext.inheritance_diagram', + 'sphinx.ext.todo', + 'sphinx.ext.napoleon', + 'sphinx.ext.githubpages', + 'sphinx.ext.mathjax', + 'sphinx.ext.doctest', + 'sphinx.ext.intersphinx', + 'sphinx.ext.linkcode', + 'sphinx_gallery.gen_gallery', # to automatically generate example pages from scripts + 'sphinx_search.extension', # readthedocs-sphinx-search with ElasticSearch +] # Due to sphinx-automodapi numpydoc_show_class_members = False @@ -75,8 +77,11 @@ # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. 
-exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store', - ] +exclude_patterns = [ + '_build', + 'Thumbs.db', + '.DS_Store', +] # Mock packages that are not installed on rtd autodoc_mock_imports = MOCK_MODULES @@ -90,25 +95,24 @@ # The following is used by sphinx.ext.linkcode to provide links to github from docs.github_link import make_linkcode_resolve -linkcode_resolve = make_linkcode_resolve('skhubness', - 'https://github.com/VarIr/' - 'scikit-hubness/blob/{revision}/' - '{package}/{path}#L{lineno}') +linkcode_resolve = make_linkcode_resolve( + 'skhubness', 'https://github.com/VarIr/scikit-hubness/blob/{revision}/{package}/{path}#L{lineno}', +) # sphinx gallery: where to take scripts from and where to save output to sphinx_gallery_conf = { - 'examples_dirs': # path to your example scripts: - ['../examples/sklearn', - '../examples/hubness_reduction', - '../examples/approximate_neighbors', - '../examples/approximate_hub_red', - ], - 'gallery_dirs': # path to where to save gallery generated output: - ['documentation/auto_examples', - 'documentation/auto_examples_hr', - 'documentation/auto_examples_ann', - 'documentation/auto_examples_ahr', - ], + 'examples_dirs': [ # path to your example scripts: + '../examples/sklearn', + '../examples/hubness_reduction', + '../examples/approximate_neighbors', + '../examples/approximate_hub_red', + ], + 'gallery_dirs': [ # path to where to save gallery generated output: + 'documentation/auto_examples', + 'documentation/auto_examples_hr', + 'documentation/auto_examples_ann', + 'documentation/auto_examples_ahr', + ], } # suppress numerous "duplicate label" warnings from sphinx-gallery diff --git a/docs/development/contributing.rst b/docs/development/contributing.rst index 8962395..ab39be4 100644 --- a/docs/development/contributing.rst +++ b/docs/development/contributing.rst @@ -11,7 +11,7 @@ Contributing Contributions from the community are highly appreciated. Even small contributions improve the software's quality. 
-Even if you are not familiar with programming languages and tools, +If you are not familiar with programming languages and tools, you may contribute by filing bugs or any problems as a `GitHub issue `_. @@ -25,13 +25,14 @@ If you are not familiar with git, there are lots of tutorials on All the important basics are covered in the `GitHub Git handbook `_. -Development of `scikit-hubness` (mostly) follows the -`git flow branching model `_. -There are two main branches: master and develop. +There is one main branch: ``main``. For any changes, a new branch should be created. -If you want to add a new feature, fix a noncritical bug, etc. one should -branch off `develop`. -Only if you want to fix a critical bug, branch off `master`. +If you want to add a new feature, fix a noncritical bug, or fix a critical bug, +branch off ``main``, introduce your changes, and create a pull request. + +(Previously, development of `scikit-hubness` (mostly) followed the +`git flow branching model `_. +This was found to be unnecessarily complicated for a project of this size). Workflow @@ -55,10 +56,10 @@ you can - of course - directly submit a pull request (PR). #. Create feature/bugfix branch. In case of feature or noncritical bugfix: - $ ``git checkout develop && git checkout -b featureXYZ develop`` + $ ``git checkout main && git checkout -b featureXYZ`` In case of critical bug: - $ ``git checkout -b bugfix123 master`` + $ ``git checkout -b bugfix123 main`` #. Implement feature/fix bug/fix typo/... Happy coding! @@ -76,7 +77,7 @@ you can - of course - directly submit a pull request (PR). #. Wait... Several devops checks will be performed automatically - (e.g. continuous integration (CI) with GitHub Actions). + (e.g. continuous integration (CI) with Github Actions). The authors will get in contact with you, and may ask for changes. @@ -123,9 +124,9 @@ Code style and further guidelines Testing ======= -In `scikit-hubness`, we aim for high code coverage. 
As of September 2019, -between 98% and 99% of all code lines are visited at least once when -running the complete test suite. This is primarily to ensure: +In `scikit-hubness`, we aim for high code coverage. Between 90% and 100% of all code lines +should be visited at least once when running the complete test suite. +This is primarily to ensure: * correctness of the code (to some extent) and * maintainability (new changes don't break old code). diff --git a/docs/getting_started/example.rst b/docs/getting_started/example.rst index dba9431..3b58677 100644 --- a/docs/getting_started/example.rst +++ b/docs/getting_started/example.rst @@ -7,6 +7,7 @@ Users of ``scikit-hubness`` typically want to 1. analyse, whether their data show hubness 2. reduce hubness 3. perform learning (classification, regression, ...) +4. or simply perform fast approximate nearest neighbor search regardless of hubness The following example shows all these steps for an example dataset from the text domain (dexter). @@ -27,20 +28,24 @@ Therefore, we assess the actual degree of hubness. .. code-block:: python - from skhubness import LegacyHubness - hub = LegacyHubness(k=10, metric='cosine') + from skhubness import Hubness + hub = Hubness(k=10, metric='cosine') hub.fit(X) k_skew = hub.score() print(f'Skewness = {k_skew:.3f}') + As a rule-of-thumb, skewness > 1.2 indicates significant hubness. Additional hubness indices are available, for example: .. code-block:: python - print(f'Robin hood index: {hub.robinhood_index:.3f}') - print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}') - print(f'Hub occurrence: {hub.hub_occurrence:.3f}') + hub = Hubness(k=10, return_value="all", metric='cosine') + scores = hub.fit(X).score() + print(f'Robin hood index: {scores.get("robinhood"):.3f}') + print(f'Antihub occurrence: {scores.get("antihub_occurrence"):.3f}') + print(f'Hub occurrence: {scores.get("hub_occurrence"):.3f}') + There is considerable hubness in dexter. 
Let's see, whether hubness reduction can improve @@ -49,18 +54,26 @@ kNN classification performance. .. code-block:: python from sklearn.model_selection import cross_val_score - from skhubness.neighbors import KNeighborsClassifier + from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer + + from skhubness.neighbors import NMSlibTransformer + from skhubness.reduction import MutualProximity + + + knn = KNeighborsTransformer(n_neighbors=50, metric="cosine") + # Alternatively, create an approximate KNeighborsTransformer, e.g., + # knn = NMSlibTransformer(n_neighbors=50, metric="cosine") + kneighbors_graph = knn.fit_transform(X, y) - # vanilla kNN - knn_standard = KNeighborsClassifier(n_neighbors=5, - metric='cosine') - acc_standard = cross_val_score(knn_standard, X, y, cv=5) + # vanilla kNN without hubness reduction + clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed') + acc_standard = cross_val_score(clf, kneighbors_graph, y, cv=5) - # kNN with hubness reduction (mutual proximity) - knn_mp = KNeighborsClassifier(n_neighbors=5, - metric='cosine', - hubness='mutual_proximity') - acc_mp = cross_val_score(knn_mp, X, y, cv=5) + # kNN with hubness reduction (mutual proximity) reuses the + # precomputed graph and works in sklearn workflows: + mp = MutualProximity(method="normal") + mp_graph = mp.fit_transform(kneighbors_graph) + acc_mp = cross_val_score(clf, mp_graph, y, cv=5) print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}') print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}') @@ -71,26 +84,13 @@ But did MP actually reduce hubness? .. 
code-block:: python - hub_mp = LegacyHubness(k=10, metric='cosine', - hubness='mutual_proximity') - hub_mp.fit(X) - k_skew_mp = hub_mp.score() - print(f'Skewness after MP: {k_skew_mp:.3f} ' - f'(reduction of {k_skew - k_skew_mp:.3f})') - print(f'Robin hood: {hub_mp.robinhood_index:.3f} ' - f'(reduction of {hub.robinhood_index - hub_mp.robinhood_index:.3f})') + mp_scores = hub.fit(mp_graph).score() + print(f'k-skewness after MP: {mp_scores.get("k_skewness"):.3f} ' + f'(reduction of {scores.get("k_skewness") - mp_scores.get("k_skewness"):.3f})') + print(f'Robinhood after MP: {mp_scores.get("robinhood"):.3f} ' + f'(reduction of {scores.get("robinhood") - mp_scores.get("robinhood"):.3f})') Yes! -The neighbor graph can also be created directly, -with or without hubness reduction: - -.. code-block:: python - - from skhubness.neighbors import kneighbors_graph - neighbor_graph = kneighbors_graph(X, - n_neighbors=5, - hubness='mutual_proximity') - -You may want to precompute the graph like this, -in order to avoid computing it repeatedly for subsequent hubness estimation and learning. +The neighbor graphs can be reused for various purposes, like classification, hubness estimation, +hubness reduction, etc. This avoids expensive re-calculation for each individual step. diff --git a/docs/getting_started/installation.rst b/docs/getting_started/installation.rst index cfa9ce0..89d96b1 100644 --- a/docs/getting_started/installation.rst +++ b/docs/getting_started/installation.rst @@ -26,7 +26,6 @@ Building and installing is straight-forward: git clone https://github.com/puffinn/puffinn.git cd puffinn - python3 setup.py build pip install . @@ -65,13 +64,11 @@ All exact nearest neighbor algorithms (as provided by scikit-learn) are availabl +---------+-------------+-------+-------+---------+ | library | algorithm | Linux | MacOS | Windows | +---------+-------------+-------+-------+---------+ -| nmslib | hnsw | x | x | x | +| nmslib | hnsw, ... 
| x | x | x | +---------+-------------+-------+-------+---------+ | annoy | rptree | x | x | x | +---------+-------------+-------+-------+---------+ -| ngtpy | nng | x | x | | -+---------+-------------+-------+-------+---------+ -| falconn | falconn_lsh | x | x | | +| ngtpy | onng, ... | x | x | | +---------+-------------+-------+-------+---------+ | puffinn | lsh | x | x | | +---------+-------------+-------+-------+---------+ diff --git a/docs/index.rst b/docs/index.rst index cdd99e5..48768f0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -6,10 +6,10 @@ `scikit-hubness`: high-dimensional data mining ================================================ -``scikit-hubness`` is a Python package for analysis of hubness -in high-dimensional data. It provides hubness reduction and -approximate nearest neighbor search via a drop-in replacement for -`sklearn.neighbors `_. +``scikit-hubness`` is a Python package for analysis and reduction of hubness +in high-dimensional data. It also provides approximate nearest neighbor search +compatible with scikit-learn's `KNeighborsTransformer +`_. .. toctree:: :maxdepth: 1 diff --git a/requirements-rtd.txt b/requirements-rtd.txt index 5aec3d5..beb4a8a 100644 --- a/requirements-rtd.txt +++ b/requirements-rtd.txt @@ -9,11 +9,10 @@ pytest-cov codecov nose flake8 -git+https://github.com/readthedocs/readthedocs-sphinx-search@master # TODO update to PyPI when it becomes available +readthedocs-sphinx-search sphinx>=2.1 sphinx-automodapi sphinx-gallery sphinx-pdj-theme -mock graphviz numpydoc
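
Note on the `docs/conf.py` hunk: it replaces the third-party `mock` package with the standard library's `unittest.mock`, which is also why `mock` can be dropped from `requirements-rtd.txt`. The following is a minimal sketch of that pattern, not the project's actual configuration. The module name `skhubness_demo_backend` and the call `init(method='hnsw')` are hypothetical and exist only for illustration; `conf.py` lists real optional dependencies such as `nmslib` or `puffinn` instead.

```python
import sys
from unittest.mock import Mock

# Hypothetical module name, used only for this illustration.
MOCK_MODULES = ['skhubness_demo_backend']

for mod_name in MOCK_MODULES:
    # Register a stand-in object, so that a later import statement
    # finds it in sys.modules instead of searching for a real package.
    sys.modules[mod_name] = Mock()

# Resolved from sys.modules; nothing needs to be installed.
import skhubness_demo_backend

# Attribute access and calls on the stand-in return further Mock objects,
# so code that merely imports and references the module can be documented.
index = skhubness_demo_backend.init(method='hnsw')  # hypothetical call
print(isinstance(index, Mock))
```

This is sufficient for building documentation on a host where the ANN libraries are not installed, but the mocked modules perform no real computation.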