Class: Hanny::LSHIndex

Inherits:

Object

Object
Hanny::LSHIndex

show all

Defined in:: lib/hanny/lsh_index.rb

Overview

LSHIndex is a class that builds a search index with Locality Sensitive Hashing (LSH) [1]. It is known that if the code length is sufficiently long (ex. greater than 128-bit), LSH can obtain higher search performance than many popular hashing methods [2]. In search process, LSHIndex obtains search results by sorting the data stored in hash table with Hamming distances between query binary code and binary hash keys.

References:

Moses S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” Proc. of the 34-th Annual ACM Symposium on Theory of Computing, pp. 380–388, (2002).
Deng Cai, “A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search,” CoRR abs/1612.07545 (2016).

Examples:

# Prepare vector data for search targets and queries with Numo::DFloat (shape: [n_samples, n_features]).
targets = Numo::DFloat.new(5000, 512).rand
queries = Numo::DFloat.new(10, 512).rand

# Build a search index with 256-bit binary code via LSH.
# Although LSHIndex works without setting random_seed, it recommends setting random_seed for reproducibility.
index = Hanny::LSHIndex.new(code_length: 256, random_seed: 1)
index.build_index(targets)

# Obtain the Array<Integer> that has the data indices of 10-neighbors for each query.
candidates = index.search_knn(queries, n_neighbors: 10)

# Save and load the search index with Marshal.
File.open('index.dat', 'wb') { |f| f.write(Marshal.dump(index)) }
index = Marshal.load(File.binread('index.dat'))

Instance Attribute Summary collapse

#code_length ⇒ Integer readonly

Return the code length of hash key.
#hash_codes ⇒ Numo::Bit readonly

Return the binary hash codes.
#hash_table ⇒ Hash readonly

Return the hash table.
#n_features ⇒ Integer readonly

Return the number of features of indexed data.
#n_keys ⇒ Integer readonly

Return the number of hash keys.
#n_samples ⇒ Integer readonly

Return the number of samples of indexed data.
#random_seed ⇒ Integer readonly

Return the seed to initialize random number generator.
#rng ⇒ Random readonly

Return the random generator to generate random matrix.

Instance Method Summary collapse

#append_data(x) ⇒ Array<Integer>

Append new data to the search index.
#build_index(x) ⇒ LSHIndex

Build a search index.
#hash_function(x) ⇒ Numo::Bit

Convert data into binary codes.
#initialize(code_length: 256, random_seed: nil) ⇒ LSHIndex constructor

Create a new nearest neighbor index.
#remove_data(data_ids) ⇒ Array<Integer>

Remove data from the search index.
#search_knn(q, n_neighbors: 10) ⇒ Array<Integer>

Perform k-nearest neighbor search.
#search_radius(q, radius: 1.0) ⇒ Array<Integer>

Perform hamming radius nearest neighbor search.

Constructor Details

#initialize(code_length: 256, random_seed: nil) ⇒ `LSHIndex`

Create a new nearest neighbor index.

Parameters:

code_length (Integer) (defaults to: 256) —

The length of binary code for hash key.
random_seed (Integer/NilClass) (defaults to: nil) —

The seed value using to initialize the random generator.

# File 'lib/hanny/lsh_index.rb', line 66

def initialize(code_length: 256, random_seed: nil)
  @code_length = code_length
  @last_id = nil
  @weight_mat = nil
  @random_seed = random_seed
  @random_seed ||= srand
  @rng = Random.new(@random_seed)
end

Instance Attribute Details

#code_length ⇒ `Integer` (readonly)

Return the code length of hash key.

Returns:

(Integer)



33
34
35

# File 'lib/hanny/lsh_index.rb', line 33

def code_length
  @code_length
end

#hash_codes ⇒ `Numo::Bit` (readonly)

Return the binary hash codes.

Returns:

(Numo::Bit)



53
54
55

# File 'lib/hanny/lsh_index.rb', line 53

def hash_codes
  @hash_codes
end

#hash_table ⇒ `Hash` (readonly)

Return the hash table.

Returns:

(Hash)



49
50
51

# File 'lib/hanny/lsh_index.rb', line 49

def hash_table
  @hash_table
end

#n_features ⇒ `Integer` (readonly)

Return the number of features of indexed data.

Returns:

(Integer)



41
42
43

# File 'lib/hanny/lsh_index.rb', line 41

def n_features
  @n_features
end

#n_keys ⇒ `Integer` (readonly)

Return the number of hash keys.

Returns:

(Integer)



45
46
47

# File 'lib/hanny/lsh_index.rb', line 45

def n_keys
  @n_keys
end

#n_samples ⇒ `Integer` (readonly)

Return the number of samples of indexed data.

Returns:

(Integer)



37
38
39

# File 'lib/hanny/lsh_index.rb', line 37

def n_samples
  @n_samples
end

#random_seed ⇒ `Integer` (readonly)

Return the seed to initialize random number generator.

Returns:

(Integer)



57
58
59

# File 'lib/hanny/lsh_index.rb', line 57

def random_seed
  @random_seed
end

#rng ⇒ `Random` (readonly)

Return the random generator to generate random matrix.

Returns:

(Random)



61
62
63

# File 'lib/hanny/lsh_index.rb', line 61

def rng
  @rng
end

Instance Method Details

#append_data(x) ⇒ `Array<Integer>`

Append new data to the search index.

Parameters:

x (Numo::DFloat) —

(shape: [n_samples, n_features]) The dataset to append to search index.

Returns:

(Array<Integer>) —

The indices of appended data in search index

# File 'lib/hanny/lsh_index.rb', line 114

def append_data(x)
  # Initialize some variables.
  n_new_samples = x.shape[0]
  bin_x = hash_function(x)
  added_data_ids = []
  # Store samples to binary hash table.
  new_codes = []
  n_new_samples.times do |m|
    bin_code = bin_x[m, true]
    hash_key = symbolized_hash_key(bin_code)
    unless @hash_table.key?(hash_key)
      new_codes.push(bin_code.to_a)
      @hash_table[hash_key] = []
    end
    new_data_id = @last_id + m
    @hash_table[hash_key].push(new_data_id)
    added_data_ids.push(new_data_id)
  end
  # Update hash codes.
  unless new_codes.empty?
    new_codes = Numo::Bit.cast(new_codes)
    @hash_codes = @hash_codes.concatenate(new_codes)
    @n_keys = @hash_codes.shape[0]
  end
  @last_id += n_new_samples
  @n_samples += n_new_samples
  added_data_ids
end

#build_index(x) ⇒ `LSHIndex`

Build a search index.

Parameters:

x (Numo::DFloat) —

(shape: [n_samples, n_features]) The dataset for building search index.

Returns:

(LSHIndex) —

The search index itself that has constructed the hash table.

# File 'lib/hanny/lsh_index.rb', line 85

def build_index(x)
  # Initialize some variables.
  @n_samples = x.shape[0]
  @n_features = x.shape[1]
  @hash_table = {}
  @weight_mat = Utils.rand_normal([@n_features, @code_length], @rng)
  # Convert samples to binary codes.
  bin_x = hash_function(x)
  # Store samples to binary hash table.
  codes = []
  @n_samples.times do |m|
    bin_code = bin_x[m, true]
    hash_key = symbolized_hash_key(bin_code)
    unless @hash_table.key?(hash_key)
      codes.push(bin_code.to_a)
      @hash_table[hash_key] = []
    end
    @hash_table[hash_key].push(m)
  end
  @hash_codes = Numo::Bit.cast(codes)
  # Update some variables.
  @n_keys = @hash_codes.shape[0]
  @last_id = @n_samples
  self
end

#hash_function(x) ⇒ `Numo::Bit`

Convert data into binary codes.

Parameters:

x (Numo::DFloat) —

(shape: [n_samples, n_features]) The data to be converted to binary codes.

Returns:

(Numo::Bit) —

The binary codes converted from given data.



78
79
80

# File 'lib/hanny/lsh_index.rb', line 78

def hash_function(x)
  x.dot(@weight_mat).ge(0.0)
end

#remove_data(data_ids) ⇒ `Array<Integer>`

Remove data from the search index. The indices of removed data will never be assigned unless the search index is rebuilt.

Parameters:

data_ids (Array<Integer>) —

The data indices to be removed.

Returns:

(Array<Integer>) —

The indices of removed data in search index

# File 'lib/hanny/lsh_index.rb', line 147

def remove_data(data_ids)
  removed_data_ids = []
  data_ids.each do |query_id|
    # Remove data id from hash table.
    hash_key = @hash_table.keys.find { |k| @hash_table[k].include?(query_id) }
    next if hash_key.nil?

    @hash_table[hash_key].delete(query_id)
    removed_data_ids.push(query_id)
    # Remove the hash key if there is no data.
    next unless @hash_table[hash_key].empty?

    target_id = distances_to_hash_codes(decoded_hash_key(hash_key)).index(0)
    @hash_codes = @hash_codes.delete(target_id, 0)
  end
  @n_samples -= removed_data_ids.size
  removed_data_ids
end

#search_knn(q, n_neighbors: 10) ⇒ `Array<Integer>`

Perform k-nearest neighbor search.

Parameters:

q (Numo::DFloat) —

(shape: [n_queries, n_features]) The data for search queries.
n_neighbors (Integer) (defaults to: 10) —

The number of neighbors.

Returns:

(Array<Integer>) —

The data indices of search result.

# File 'lib/hanny/lsh_index.rb', line 170

def search_knn(q, n_neighbors: 10)
  # Initialize some variables.
  n_queries = q.shape[0]
  candidates = Array.new(n_queries) { [] }
  # Binarize queries.
  bin_q = hash_function(q)
  # Find k-nearest neighbors for each query.
  n_queries.times do |m|
    sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n|
      candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])]
      # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error.
      # break if candidates[m].size >= n_neighbors
      break [[d, n]] if candidates[m].size >= n_neighbors
    end
    candidates[m] = candidates[m].shift(n_neighbors)
  end
  candidates
end

#search_radius(q, radius: 1.0) ⇒ `Array<Integer>`

Perform hamming radius nearest neighbor search.

Parameters:

q (Numo::DFloat) —

(shape: [n_queries, n_features]) The data for search queries.
radius (Float) (defaults to: 1.0) —

The hamming radius for search range.

Returns:

(Array<Integer>) —

The data indices of search result.

# File 'lib/hanny/lsh_index.rb', line 193

def search_radius(q, radius: 1.0)
  # Initialize some variables.
  n_queries = q.shape[0]
  candidates = Array.new(n_queries) { [] }
  # Binarize queries.
  bin_q = hash_function(q)
  # Find k-nearest neighbors for each query.
  n_queries.times do |m|
    sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n|
      # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error.
      # break if d > radius
      break [[d, n]] if d > radius

      candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])]
    end
  end
  candidates
end

Class: Hanny::LSHIndex

Overview

Examples:

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(code_length: 256, random_seed: nil) ⇒ LSHIndex

Instance Attribute Details

#code_length ⇒ Integer (readonly)

#hash_codes ⇒ Numo::Bit (readonly)

#hash_table ⇒ Hash (readonly)

#n_features ⇒ Integer (readonly)

#n_keys ⇒ Integer (readonly)

#n_samples ⇒ Integer (readonly)

#random_seed ⇒ Integer (readonly)

#rng ⇒ Random (readonly)

Instance Method Details

#append_data(x) ⇒ Array<Integer>

#build_index(x) ⇒ LSHIndex

#hash_function(x) ⇒ Numo::Bit

#remove_data(data_ids) ⇒ Array<Integer>

#search_knn(q, n_neighbors: 10) ⇒ Array<Integer>

#search_radius(q, radius: 1.0) ⇒ Array<Integer>

#initialize(code_length: 256, random_seed: nil) ⇒ `LSHIndex`

#code_length ⇒ `Integer` (readonly)

#hash_codes ⇒ `Numo::Bit` (readonly)

#hash_table ⇒ `Hash` (readonly)

#n_features ⇒ `Integer` (readonly)

#n_keys ⇒ `Integer` (readonly)

#n_samples ⇒ `Integer` (readonly)

#random_seed ⇒ `Integer` (readonly)

#rng ⇒ `Random` (readonly)

#append_data(x) ⇒ `Array<Integer>`

#build_index(x) ⇒ `LSHIndex`

#hash_function(x) ⇒ `Numo::Bit`

#remove_data(data_ids) ⇒ `Array<Integer>`

#search_knn(q, n_neighbors: 10) ⇒ `Array<Integer>`

#search_radius(q, radius: 1.0) ⇒ `Array<Integer>`