Class: Hanny::LSHIndex

Inherits:
Object
  • Object
show all
Defined in:
lib/hanny/lsh_index.rb

Overview

LSHIndex is a class that builds a search index with Locality Sensitive Hashing (LSH) [1]. It is known that if the code length is sufficiently long (ex. greater than 128-bit), LSH can obtain higher search performance than many popular hashing methods [2]. In search process, LSHIndex obtains search results by sorting the data stored in hash table with Hamming distances between query binary code and binary hash keys.

References:

  1. Moses S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” Proc. of the 34-th Annual ACM Symposium on Theory of Computing, pp. 380–388, (2002).

  2. Deng Cai, “A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search,” CoRR abs/1612.07545 (2016).

Examples:

# Prepare vector data for search targets and queries with Numo::DFloat (shape: [n_samples, n_features]).
targets = Numo::DFloat.new(5000, 512).rand
queries = Numo::DFloat.new(10, 512).rand

# Build a search index with 256-bit binary code via LSH.
# Although LSHIndex works without setting random_seed, it recommends setting random_seed for reproducibility.
index = Hanny::LSHIndex.new(code_length: 256, random_seed: 1)
index.build_index(targets)

# Obtain the Array<Integer> that has the data indices of 10-neighbors for each query.
candidates = index.search_knn(queries, n_neighbors: 10)

# Save and load the search index with Marshal.
File.open('index.dat', 'wb') { |f| f.write(Marshal.dump(index)) }
index = Marshal.load(File.binread('index.dat'))

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(code_length: 256, random_seed: nil) ⇒ LSHIndex

Create a new nearest neighbor index.

Parameters:

  • code_length (Integer) (defaults to: 256)

    The length of binary code for hash key.

  • random_seed (Integer/NilClass) (defaults to: nil)

    The seed value using to initialize the random generator.



66
67
68
69
70
71
72
73
# File 'lib/hanny/lsh_index.rb', line 66

def initialize(code_length: 256, random_seed: nil)
  @code_length = code_length
  @last_id = nil
  @weight_mat = nil
  @random_seed = random_seed
  @random_seed ||= srand
  @rng = Random.new(@random_seed)
end

Instance Attribute Details

#code_lengthInteger (readonly)

Return the code length of hash key.

Returns:

  • (Integer)


33
34
35
# File 'lib/hanny/lsh_index.rb', line 33

def code_length
  @code_length
end

#hash_codesNumo::Bit (readonly)

Return the binary hash codes.

Returns:

  • (Numo::Bit)


53
54
55
# File 'lib/hanny/lsh_index.rb', line 53

def hash_codes
  @hash_codes
end

#hash_tableHash (readonly)

Return the hash table.

Returns:

  • (Hash)


49
50
51
# File 'lib/hanny/lsh_index.rb', line 49

def hash_table
  @hash_table
end

#n_featuresInteger (readonly)

Return the number of features of indexed data.

Returns:

  • (Integer)


41
42
43
# File 'lib/hanny/lsh_index.rb', line 41

def n_features
  @n_features
end

#n_keysInteger (readonly)

Return the number of hash keys.

Returns:

  • (Integer)


45
46
47
# File 'lib/hanny/lsh_index.rb', line 45

def n_keys
  @n_keys
end

#n_samplesInteger (readonly)

Return the number of samples of indexed data.

Returns:

  • (Integer)


37
38
39
# File 'lib/hanny/lsh_index.rb', line 37

def n_samples
  @n_samples
end

#random_seedInteger (readonly)

Return the seed to initialize random number generator.

Returns:

  • (Integer)


57
58
59
# File 'lib/hanny/lsh_index.rb', line 57

def random_seed
  @random_seed
end

#rngRandom (readonly)

Return the random generator to generate random matrix.

Returns:

  • (Random)


61
62
63
# File 'lib/hanny/lsh_index.rb', line 61

def rng
  @rng
end

Instance Method Details

#append_data(x) ⇒ Array<Integer>

Append new data to the search index.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The dataset to append to search index.

Returns:

  • (Array<Integer>)

    The indices of appended data in search index



114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
# File 'lib/hanny/lsh_index.rb', line 114

def append_data(x)
  # Initialize some variables.
  n_new_samples = x.shape[0]
  bin_x = hash_function(x)
  added_data_ids = []
  # Store samples to binary hash table.
  new_codes = []
  n_new_samples.times do |m|
    bin_code = bin_x[m, true]
    hash_key = symbolized_hash_key(bin_code)
    unless @hash_table.key?(hash_key)
      new_codes.push(bin_code.to_a)
      @hash_table[hash_key] = []
    end
    new_data_id = @last_id + m
    @hash_table[hash_key].push(new_data_id)
    added_data_ids.push(new_data_id)
  end
  # Update hash codes.
  unless new_codes.empty?
    new_codes = Numo::Bit.cast(new_codes)
    @hash_codes = @hash_codes.concatenate(new_codes)
    @n_keys = @hash_codes.shape[0]
  end
  @last_id += n_new_samples
  @n_samples += n_new_samples
  added_data_ids
end

#build_index(x) ⇒ LSHIndex

Build a search index.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The dataset for building search index.

Returns:

  • (LSHIndex)

    The search index itself that has constructed the hash table.



85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
# File 'lib/hanny/lsh_index.rb', line 85

def build_index(x)
  # Initialize some variables.
  @n_samples = x.shape[0]
  @n_features = x.shape[1]
  @hash_table = {}
  @weight_mat = Utils.rand_normal([@n_features, @code_length], @rng)
  # Convert samples to binary codes.
  bin_x = hash_function(x)
  # Store samples to binary hash table.
  codes = []
  @n_samples.times do |m|
    bin_code = bin_x[m, true]
    hash_key = symbolized_hash_key(bin_code)
    unless @hash_table.key?(hash_key)
      codes.push(bin_code.to_a)
      @hash_table[hash_key] = []
    end
    @hash_table[hash_key].push(m)
  end
  @hash_codes = Numo::Bit.cast(codes)
  # Update some variables.
  @n_keys = @hash_codes.shape[0]
  @last_id = @n_samples
  self
end

#hash_function(x) ⇒ Numo::Bit

Convert data into binary codes.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The data to be converted to binary codes.

Returns:

  • (Numo::Bit)

    The binary codes converted from given data.



78
79
80
# File 'lib/hanny/lsh_index.rb', line 78

def hash_function(x)
  x.dot(@weight_mat).ge(0.0)
end

#remove_data(data_ids) ⇒ Array<Integer>

Remove data from the search index. The indices of removed data will never be assigned unless the search index is rebuilt.

Parameters:

  • data_ids (Array<Integer>)

    The data indices to be removed.

Returns:

  • (Array<Integer>)

    The indices of removed data in search index



147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
# File 'lib/hanny/lsh_index.rb', line 147

def remove_data(data_ids)
  removed_data_ids = []
  data_ids.each do |query_id|
    # Remove data id from hash table.
    hash_key = @hash_table.keys.find { |k| @hash_table[k].include?(query_id) }
    next if hash_key.nil?

    @hash_table[hash_key].delete(query_id)
    removed_data_ids.push(query_id)
    # Remove the hash key if there is no data.
    next unless @hash_table[hash_key].empty?

    target_id = distances_to_hash_codes(decoded_hash_key(hash_key)).index(0)
    @hash_codes = @hash_codes.delete(target_id, 0)
  end
  @n_samples -= removed_data_ids.size
  removed_data_ids
end

#search_knn(q, n_neighbors: 10) ⇒ Array<Integer>

Perform k-nearest neighbor search.

Parameters:

  • q (Numo::DFloat)

    (shape: [n_queries, n_features]) The data for search queries.

  • n_neighbors (Integer) (defaults to: 10)

    The number of neighbors.

Returns:

  • (Array<Integer>)

    The data indices of search result.



170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
# File 'lib/hanny/lsh_index.rb', line 170

def search_knn(q, n_neighbors: 10)
  # Initialize some variables.
  n_queries = q.shape[0]
  candidates = Array.new(n_queries) { [] }
  # Binarize queries.
  bin_q = hash_function(q)
  # Find k-nearest neighbors for each query.
  n_queries.times do |m|
    sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n|
      candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])]
      # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error.
      # break if candidates[m].size >= n_neighbors
      break [[d, n]] if candidates[m].size >= n_neighbors
    end
    candidates[m] = candidates[m].shift(n_neighbors)
  end
  candidates
end

#search_radius(q, radius: 1.0) ⇒ Array<Integer>

Perform hamming radius nearest neighbor search.

Parameters:

  • q (Numo::DFloat)

    (shape: [n_queries, n_features]) The data for search queries.

  • radius (Float) (defaults to: 1.0)

    The hamming radius for search range.

Returns:

  • (Array<Integer>)

    The data indices of search result.



193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
# File 'lib/hanny/lsh_index.rb', line 193

def search_radius(q, radius: 1.0)
  # Initialize some variables.
  n_queries = q.shape[0]
  candidates = Array.new(n_queries) { [] }
  # Binarize queries.
  bin_q = hash_function(q)
  # Find k-nearest neighbors for each query.
  n_queries.times do |m|
    sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n|
      # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error.
      # break if d > radius
      break [[d, n]] if d > radius

      candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])]
    end
  end
  candidates
end