Class: Hanny::LSHIndex
- Inherits:
-
Object
- Object
- Hanny::LSHIndex
- Defined in:
- lib/hanny/lsh_index.rb
Overview
LSHIndex is a class that builds a search index with Locality Sensitive Hashing (LSH) [1]. It is known that if the code length is sufficiently long (ex. greater than 128-bit), LSH can obtain higher search performance than many popular hashing methods [2]. In search process, LSHIndex obtains search results by sorting the data stored in hash table with Hamming distances between query binary code and binary hash keys.
References:
-
Moses S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” Proc. of the 34-th Annual ACM Symposium on Theory of Computing, pp. 380–388, (2002).
-
Deng Cai, “A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search,” CoRR abs/1612.07545 (2016).
Instance Attribute Summary collapse
-
#code_length ⇒ Integer
readonly
Return the code length of hash key.
-
#hash_codes ⇒ Numo::Bit
readonly
Return the binary hash codes.
-
#hash_table ⇒ Hash
readonly
Return the hash table.
-
#n_features ⇒ Integer
readonly
Return the number of features of indexed data.
-
#n_keys ⇒ Integer
readonly
Return the number of hash keys.
-
#n_samples ⇒ Integer
readonly
Return the number of samples of indexed data.
-
#random_seed ⇒ Integer
readonly
Return the seed to initialize random number generator.
-
#rng ⇒ Random
readonly
Return the random generator to generate random matrix.
Instance Method Summary collapse
-
#append_data(x) ⇒ Array<Integer>
Append new data to the search index.
-
#build_index(x) ⇒ LSHIndex
Build a search index.
-
#hash_function(x) ⇒ Numo::Bit
Convert data into binary codes.
-
#initialize(code_length: 256, random_seed: nil) ⇒ LSHIndex
constructor
Create a new nearest neighbor index.
-
#remove_data(data_ids) ⇒ Array<Integer>
Remove data from the search index.
-
#search_knn(q, n_neighbors: 10) ⇒ Array<Integer>
Perform k-nearest neighbor search.
-
#search_radius(q, radius: 1.0) ⇒ Array<Integer>
Perform hamming radius nearest neighbor search.
Constructor Details
#initialize(code_length: 256, random_seed: nil) ⇒ LSHIndex
Create a new nearest neighbor index.
66 67 68 69 70 71 72 73 |
# File 'lib/hanny/lsh_index.rb', line 66 def initialize(code_length: 256, random_seed: nil) @code_length = code_length @last_id = nil @weight_mat = nil @random_seed = random_seed @random_seed ||= srand @rng = Random.new(@random_seed) end |
Instance Attribute Details
#code_length ⇒ Integer (readonly)
Return the code length of hash key.
33 34 35 |
# File 'lib/hanny/lsh_index.rb', line 33 def code_length @code_length end |
#hash_codes ⇒ Numo::Bit (readonly)
Return the binary hash codes.
53 54 55 |
# File 'lib/hanny/lsh_index.rb', line 53 def hash_codes @hash_codes end |
#hash_table ⇒ Hash (readonly)
Return the hash table.
49 50 51 |
# File 'lib/hanny/lsh_index.rb', line 49 def hash_table @hash_table end |
#n_features ⇒ Integer (readonly)
Return the number of features of indexed data.
41 42 43 |
# File 'lib/hanny/lsh_index.rb', line 41 def n_features @n_features end |
#n_keys ⇒ Integer (readonly)
Return the number of hash keys.
45 46 47 |
# File 'lib/hanny/lsh_index.rb', line 45 def n_keys @n_keys end |
#n_samples ⇒ Integer (readonly)
Return the number of samples of indexed data.
37 38 39 |
# File 'lib/hanny/lsh_index.rb', line 37 def n_samples @n_samples end |
#random_seed ⇒ Integer (readonly)
Return the seed to initialize random number generator.
57 58 59 |
# File 'lib/hanny/lsh_index.rb', line 57 def random_seed @random_seed end |
#rng ⇒ Random (readonly)
Return the random generator to generate random matrix.
61 62 63 |
# File 'lib/hanny/lsh_index.rb', line 61 def rng @rng end |
Instance Method Details
#append_data(x) ⇒ Array<Integer>
Append new data to the search index.
114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 |
# File 'lib/hanny/lsh_index.rb', line 114 def append_data(x) # Initialize some variables. n_new_samples = x.shape[0] bin_x = hash_function(x) added_data_ids = [] # Store samples to binary hash table. new_codes = [] n_new_samples.times do |m| bin_code = bin_x[m, true] hash_key = symbolized_hash_key(bin_code) unless @hash_table.key?(hash_key) new_codes.push(bin_code.to_a) @hash_table[hash_key] = [] end new_data_id = @last_id + m @hash_table[hash_key].push(new_data_id) added_data_ids.push(new_data_id) end # Update hash codes. unless new_codes.empty? new_codes = Numo::Bit.cast(new_codes) @hash_codes = @hash_codes.concatenate(new_codes) @n_keys = @hash_codes.shape[0] end @last_id += n_new_samples @n_samples += n_new_samples added_data_ids end |
#build_index(x) ⇒ LSHIndex
Build a search index.
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 |
# File 'lib/hanny/lsh_index.rb', line 85 def build_index(x) # Initialize some variables. @n_samples = x.shape[0] @n_features = x.shape[1] @hash_table = {} @weight_mat = Utils.rand_normal([@n_features, @code_length], @rng) # Convert samples to binary codes. bin_x = hash_function(x) # Store samples to binary hash table. codes = [] @n_samples.times do |m| bin_code = bin_x[m, true] hash_key = symbolized_hash_key(bin_code) unless @hash_table.key?(hash_key) codes.push(bin_code.to_a) @hash_table[hash_key] = [] end @hash_table[hash_key].push(m) end @hash_codes = Numo::Bit.cast(codes) # Update some variables. @n_keys = @hash_codes.shape[0] @last_id = @n_samples self end |
#hash_function(x) ⇒ Numo::Bit
Convert data into binary codes.
78 79 80 |
# File 'lib/hanny/lsh_index.rb', line 78 def hash_function(x) x.dot(@weight_mat).ge(0.0) end |
#remove_data(data_ids) ⇒ Array<Integer>
Remove data from the search index. The indices of removed data will never be assigned unless the search index is rebuilt.
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 |
# File 'lib/hanny/lsh_index.rb', line 147 def remove_data(data_ids) removed_data_ids = [] data_ids.each do |query_id| # Remove data id from hash table. hash_key = @hash_table.keys.find { |k| @hash_table[k].include?(query_id) } next if hash_key.nil? @hash_table[hash_key].delete(query_id) removed_data_ids.push(query_id) # Remove the hash key if there is no data. next unless @hash_table[hash_key].empty? target_id = distances_to_hash_codes(decoded_hash_key(hash_key)).index(0) @hash_codes = @hash_codes.delete(target_id, 0) end @n_samples -= removed_data_ids.size removed_data_ids end |
#search_knn(q, n_neighbors: 10) ⇒ Array<Integer>
Perform k-nearest neighbor search.
170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 |
# File 'lib/hanny/lsh_index.rb', line 170 def search_knn(q, n_neighbors: 10) # Initialize some variables. n_queries = q.shape[0] candidates = Array.new(n_queries) { [] } # Binarize queries. bin_q = hash_function(q) # Find k-nearest neighbors for each query. n_queries.times do |m| sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n| candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])] # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error. # break if candidates[m].size >= n_neighbors break [[d, n]] if candidates[m].size >= n_neighbors end candidates[m] = candidates[m].shift(n_neighbors) end candidates end |
#search_radius(q, radius: 1.0) ⇒ Array<Integer>
Perform hamming radius nearest neighbor search.
193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 |
# File 'lib/hanny/lsh_index.rb', line 193 def search_radius(q, radius: 1.0) # Initialize some variables. n_queries = q.shape[0] candidates = Array.new(n_queries) { [] } # Binarize queries. bin_q = hash_function(q) # Find k-nearest neighbors for each query. n_queries.times do |m| sort_with_index(distances_to_hash_codes(bin_q[m, true])).each do |d, n| # TODO: Investigate the cause of the steep Ruby::BreakTypeMismatch error. # break if d > radius break [[d, n]] if d > radius candidates[m] = candidates[m] | @hash_table[symbolized_hash_key(@hash_codes[n, true])] end end candidates end |