Class: Rumale::FeatureExtraction::FeatureHasher

Inherits:
Base::Estimator
Includes:
Base::Transformer
Defined in:
rumale-feature_extraction/lib/rumale/feature_extraction/feature_hasher.rb

Overview

Encodes an array of feature-value hashes into vectors using feature hashing (the hashing trick). The encoder turns an array of mappings (Array<Hash>) from feature names to values into a Numo::NArray. It uses signed 32-bit MurmurHash3 as the hash function.

Examples:

require 'rumale/feature_extraction/feature_hasher'

encoder = Rumale::FeatureExtraction::FeatureHasher.new(n_features: 10)
x = encoder.transform([
  { dog: 1, cat: 2, elephant: 4 },
  { dog: 2, run: 5 }
])

# > pp x
# Numo::DFloat#shape=[2,10]
# [[0, 0, -4, -1, 0, 0, 0, 0, 0, 2],
#  [0, 0, 0, -2, -5, 0, 0, 0, 0, 0]]
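
The hashing trick described in the overview can be sketched with the standard library alone. This is a minimal illustration, not Rumale's code: `signed_hash32` and `hash_encode` are hypothetical helpers, plain arrays stand in for Numo::DFloat, and CRC32 is swapped in for MurmurHash3, so the column indices and signs will differ from the output shown above.

```ruby
require 'zlib'

# Stand-in for a signed 32-bit hash. Rumale uses MurmurHash3; CRC32 is an
# assumption made here so the sketch runs without any gems.
def signed_hash32(key)
  [Zlib.crc32(key.to_s)].pack('L').unpack1('l')
end

# Encode one feature-value hash into a fixed-length array of floats.
def hash_encode(sample, n_features:, alternate_sign: true)
  row = Array.new(n_features, 0.0)
  sample.each do |name, value|
    next if value.zero?

    h = signed_hash32(name)
    sign = (alternate_sign && h.negative?) ? -1 : 1
    # The column is chosen by the hash, not by a learned vocabulary.
    row[h.abs % n_features] = (sign * value).to_f
  end
  row
end

samples = [{ dog: 1, cat: 2, elephant: 4 }, { dog: 2, run: 5 }]
samples.each { |s| p hash_encode(s, n_features: 10) }
```

Because the column depends only on the feature name and `n_features`, the same name lands in the same column for every sample, which is why no vocabulary has to be built beforehand.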

Instance Attribute Summary

Attributes inherited from Base::Estimator

#params

Instance Method Summary

Constructor Details

#initialize(n_features: 1024, alternate_sign: true) ⇒ FeatureHasher

Create a new encoder that converts an array of hashes of feature names and values into vectors using the feature hashing algorithm.

Parameters:

  • n_features (Integer) (defaults to: 1024)

    The number of features of encoded samples.

  • alternate_sign (Boolean) (defaults to: true)

    Whether to apply the sign of the hash value to the feature value.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/feature_hasher.rb', line 35

def initialize(n_features: 1024, alternate_sign: true)
  super()
  @params = {
    n_features: n_features,
    alternate_sign: alternate_sign
  }
end
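
The effect of `alternate_sign` can be illustrated in isolation. The sketch below is an assumption-laden stand-in rather than the library's implementation: `sign_demo_hash32` and `signed_value` are hypothetical names, and CRC32 reinterpreted as a signed integer replaces MurmurHash3, so which features flip sign will not match Rumale.

```ruby
require 'zlib'

# Stand-in signed 32-bit hash built from CRC32; an assumption made so the
# sketch needs no gems. Rumale itself uses MurmurHash3.
def sign_demo_hash32(key)
  [Zlib.crc32(key.to_s)].pack('L').unpack1('l')
end

# Value that would be written for one feature under each setting.
def signed_value(name, value, alternate_sign:)
  return value unless alternate_sign

  sign_demo_hash32(name).negative? ? -value : value
end

%i[dog cat elephant run].each do |name|
  with_sign = signed_value(name, 1, alternate_sign: true)
  without   = signed_value(name, 1, alternate_sign: false)
  puts "#{name}: alternate_sign=true -> #{with_sign}, false -> #{without}"
end
```

With `alternate_sign: false` every value keeps its original sign; with `true` the magnitude is unchanged and only the sign may flip, depending on the hash of the feature name.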

Instance Method Details

#fit(x) ⇒ FeatureHasher

This method does not do anything. The encoder does not require training.

Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes mapping feature names to values.

Returns:

  • (FeatureHasher)

    The encoder itself.


# File 'rumale-feature_extraction/lib/rumale/feature_extraction/feature_hasher.rb', line 48

def fit(_x = nil, _y = nil)
  self
end

#fit_transform(x) ⇒ Numo::DFloat

Encode the given array of feature-value hashes. This method produces the same output as the transform method because the encoder does not require training.

Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes mapping feature names to values.

Returns:

  • (Numo::DFloat)

    (shape: [n_samples, n_features]) The encoded sample array.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/feature_hasher.rb', line 59

def fit_transform(x, _y = nil)
  fit(x).transform(x)
end
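
Since fit only returns the receiver, fit_transform and transform give the same result. The pattern can be sketched with a hypothetical stand-in class (`StatelessLengthEncoder` is not part of Rumale; its transform is deliberately trivial and only serves to show the call chain):

```ruby
# Hypothetical stand-in, not the Rumale class: a stateless encoder whose
# transform depends only on its input.
class StatelessLengthEncoder
  # Nothing to learn, so fit just returns the receiver for chaining.
  def fit(_x = nil, _y = nil)
    self
  end

  # Toy transform: the number of features in each sample.
  def transform(x)
    x.map { |sample| sample.size.to_f }
  end

  # Same shape as the listing above: fit, then transform.
  def fit_transform(x, _y = nil)
    fit(x).transform(x)
  end
end

encoder = StatelessLengthEncoder.new
data = [{ dog: 1, cat: 2 }, { run: 5 }]
p encoder.fit_transform(data) == encoder.transform(data)
```

Keeping fit and fit_transform on a stateless encoder lets it be used wherever code expects the usual fit/transform interface.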

#transform(x) ⇒ Numo::DFloat

Encode the given array of feature-value hashes. A single Hash is also accepted and is treated as a one-element array.

Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes mapping feature names to values.

Returns:

  • (Numo::DFloat)

    (shape: [n_samples, n_features]) The encoded sample array.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/feature_hasher.rb', line 67

def transform(x)
  x = [x] unless x.is_a?(Array)
  n_samples = x.size

  z = Numo::DFloat.zeros(n_samples, n_features)

  x.each_with_index do |f, i|
    f.each do |k, v|
      k = "#{k}=#{v}" if v.is_a?(String)
      val = v.is_a?(String) ? 1 : v
      next if val.zero?

      h = Mmh3.hash32(k)
      fid = h.abs % n_features
      val *= h >= 0 ? 1 : -1 if alternate_sign?
      z[i, fid] = val
    end
  end

  z
end
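
Two details of the listing above are easy to miss: a String value is folded into the key as "name=value" and contributes 1, and each entry is written with plain assignment, so when two features of one sample hash to the same column the later one replaces the earlier. The sketch below reproduces both behaviours under stated assumptions: `demo_hash32`, `normalize_feature`, and `encode_row` are hypothetical helpers, CRC32 stands in for Mmh3.hash32, and a plain array stands in for Numo::DFloat.

```ruby
require 'zlib'

# Stand-in signed 32-bit hash (CRC32 reinterpreted as signed); the listing
# above calls Mmh3.hash32 instead.
def demo_hash32(key)
  [Zlib.crc32(key.to_s)].pack('L').unpack1('l')
end

# Mirror of the key/value normalization in the listing: a String value is
# folded into the key and contributes 1.
def normalize_feature(name, value)
  value.is_a?(String) ? ["#{name}=#{value}", 1] : [name, value]
end

# Encode one sample with plain assignment, as the listing does.
def encode_row(sample, n_features)
  row = Array.new(n_features, 0.0)
  sample.each do |name, value|
    key, val = normalize_feature(name, value)
    next if val.zero?

    h = demo_hash32(key)
    row[h.abs % n_features] = (h.negative? ? -val : val).to_f
  end
  row
end

p normalize_feature(:color, 'red') # => ["color=red", 1]
p encode_row({ a: 1, b: 2 }, 1)    # one column: only b's contribution remains
```

Choosing a larger `n_features` makes such collisions less likely, at the cost of wider output vectors.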