Lazy KNN Search

This web page contains supplementary material for the conference paper:

Mülâyim, M. O., & Arcos, J. L. (2018). Perks of Being Lazy: Boosting Retrieval Performance. In M. T. Cox, P. Funk, & S. Begum (Eds.), International Conference on Case-Based Reasoning, LNAI Vol. 11156 (pp. 309–322).

Abstract. Case-Based Reasoning (CBR) is a lazy learning method and, being such, when a new query is made to a CBR system, the swiftness of its retrieval phase proves to be very important for the overall system performance. The availability of ubiquitous data today is an opportunity for CBR systems as it implies more cases to reason with. Nevertheless, this availability also introduces a challenge for the CBR retrieval since distance calculations become computationally expensive. A good example of a domain where the case base is subject to substantial growth over time is the health records of patients where a query is typically an incremental update to prior cases. To deal with the retrieval performance challenge in such domains where cases are sequentially related, we introduce a novel method which significantly reduces the number of cases assessed in the search of exact nearest neighbors (NNs). In particular, when distance measures are metrics, they satisfy the triangle inequality and our method leverages this property to use it as a cutoff in NN search. Specifically, the retrieval is conducted in a lazy manner where only the cases that are true NN candidates for a query are evaluated. We demonstrate how a considerable number of unnecessary distance calculations is avoided in synthetically built domains which exhibit different problem feature characteristics and different cluster diversity.
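The triangle-inequality cutoff described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual algorithm: it assumes a single pivot case `p` with precomputed distances to all cases (in the paper's setting a query updates a prior case, so such distances are at hand) and skips any candidate `c` whose lower bound `|d(q, p) - d(p, c)|` cannot beat the current kth-nearest distance.

```python
def lazy_knn(query, cases, dist, pivot_dists, pivot_query_dist, k=3):
    """Exact kNN search with triangle-inequality pruning (illustrative sketch).

    dist             : a metric, so d(q, c) >= |d(q, p) - d(p, c)| holds
    pivot_dists[i]   : precomputed distance from a pivot case p to cases[i]
    pivot_query_dist : distance from p to the query q
    Returns the k nearest (distance, index) pairs and the number of
    candidates actually evaluated.
    """
    # Visit candidates in order of their lower bound, so true NNs come early
    order = sorted(range(len(cases)),
                   key=lambda i: abs(pivot_query_dist - pivot_dists[i]))
    best = []       # up to k (distance, index) pairs, kept sorted
    evaluated = 0
    for i in order:
        lb = abs(pivot_query_dist - pivot_dists[i])  # lower bound on d(q, cases[i])
        if len(best) == k and lb >= best[-1][0]:
            break   # no remaining candidate can improve the current kNN
        d = dist(query, cases[i])
        evaluated += 1
        best = sorted(best + [(d, i)])[:k]
    return best, evaluated
```

With 1-D cases `[0.1, 0.2, 5.0, 9.0, 10.0]`, pivot 0.0, and query 0.15, the search for k=2 stops after only two distance evaluations, since the lower bound of the third-closest candidate already exceeds the current 2nd-NN distance.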

Keywords. Exact nearest neighbor search; Lazy retrieval; Triangle inequality

Execution example
# A visual, step-by-step execution example of the algorithm.
Poster
# ICCBR 2018 Poster.
Datasets
# See the online repository below for the datasets.
Follow-up article
# Anytime Lazy KNN (ALK), the anytime version of Lazy KNN, is introduced in the article:

Mülâyim, M. O., & Arcos, J. L. (2020). Fast Anytime Retrieval with Confidence in Large-Scale Temporal Case Bases. Knowledge-Based Systems, 206, 106374.

Abstract. ALK finds exact kNNs when allowed to run to completion with the same gain as its predecessor Lazy KNN in terms of avoided similarity assessments. For applications where the gain in exact kNN search may not suffice, ALK can be interrupted earlier and it returns best-so-far kNNs together with a confidence value attached to each neighbor. Furthermore, it can automatically interrupt the search upon reaching a given confidence threshold and resume if so asked. For experimentation, ALK uses publicly available time series datasets to generate temporal case bases.
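The anytime behavior sketched in the abstract can be illustrated with a generator that yields best-so-far neighbors after each similarity assessment, so the caller may interrupt at any point, and that auto-interrupts at a confidence threshold. This is a simplified, hypothetical sketch, not ALK itself; in particular, the confidence here is a naive stand-in (the fraction of candidates assessed so far), whereas ALK attaches a proper confidence value to each neighbor.

```python
def anytime_knn(query, cases, dist, k=3, conf_threshold=None):
    """Interruptible kNN search (simplified anytime sketch, not ALK).

    Yields (best_so_far, confidence) after each distance evaluation.
    confidence is a naive proxy: the fraction of candidates assessed.
    If conf_threshold is given, the search auto-interrupts on reaching it.
    """
    best = []  # up to k (distance, case) pairs, kept sorted
    for n, case in enumerate(cases, start=1):
        d = dist(query, case)
        best = sorted(best + [(d, case)])[:k]
        confidence = n / len(cases)
        yield best, confidence          # caller may stop iterating here
        if conf_threshold is not None and confidence >= conf_threshold:
            return                      # auto-interrupt at the threshold
```

Run to completion, the generator's last yield holds the exact kNNs; stopped early, the caller keeps the last best-so-far result together with its confidence.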

Keywords. Large-scale case-based reasoning; Exact and approximate k-nearest neighbor search; Anytime algorithms

PhD thesis
# The work that started with Lazy KNN culminated in the PhD thesis:

Mülâyim, M. O. (2020). Anytime Case-Based Reasoning in Large-Scale Temporal Case Bases. PhD Thesis, Universitat Autònoma de Barcelona.

Abstract. Case-Based Reasoning (CBR) methodology's approach to problem-solving that "similar problems have similar solutions" has proved quite favorable for many industrial artificial intelligence applications. However, CBR's very advantages hinder its performance as case bases (CBs) grow larger than moderate sizes. Searching for similar cases is expensive. This handicap often makes CBR less appealing for today's ubiquitous data environments while, actually, there is ever more reason to benefit from this effective methodology. Accordingly, the CBR community's traditional approach of controlling CB growth to maintain performance is shifting towards finding new ways to deal with abundant data. As a contribution to these efforts, this thesis aims to speed up CBR by leveraging both problem and solution spaces in large-scale CBs that are composed of temporally related cases, as in the example of electronic health records. For the occasions when the speed-up we achieve for exact results may still not be feasible, we endow the CBR system with anytime algorithm capabilities to provide approximate results with confidence upon interruption. Exploiting the temporality of cases allows us to reach superior gains in execution time for CBs of millions of cases. Experiments with publicly available real-world datasets encourage the continued use of CBR in domains where it historically excels, like healthcare; and this time, not suffering from, but enjoying big data.

Supervisor. Josep Lluís Arcos

Online code repository
# The complete source code of ALK, the ALK Classifier, and alternative search methods for kNN candidates is publicly available at the online repository: github.com/IIIA-ML/alk. The repository also provides simple instructions for launching the scripts that repeat all experiments and reproduce all plots and result tables detailed in the journal article and the thesis.