Class: Rumale::FeatureExtraction::HashVectorizer

Inherits:
Base::Estimator
Includes:
Base::Transformer
Defined in:
rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb

Overview

Encode an array of feature-value hashes into vectors. This encoder turns an array of mappings (Array<Hash>), each holding pairs of feature names and values, into a Numo::NArray.

Examples:

require 'rumale/feature_extraction/hash_vectorizer'

encoder = Rumale::FeatureExtraction::HashVectorizer.new
x = encoder.fit_transform([
  { foo: 1, bar: 2 },
  { foo: 3, baz: 1 }
])

# > pp x
# Numo::DFloat#shape=[2,3]
# [[2, 0, 1],
#  [0, 1, 3]]

x = encoder.fit_transform([
  { city: 'Dubai',  temperature: 33 },
  { city: 'London', temperature: 12 },
  { city: 'San Francisco', temperature: 18 }
])

# > pp x
# Numo::DFloat#shape=[3,4]
# [[1, 0, 0, 33],
#  [0, 1, 0, 12],
#  [0, 0, 1, 18]]
# > pp encoder.inverse_transform(x)
# [{:city=>"Dubai", :temperature=>33.0},
#  {:city=>"London", :temperature=>12.0},
#  {:city=>"San Francisco", :temperature=>18.0}]

Instance Attribute Summary

Attributes inherited from Base::Estimator

#params

Instance Method Summary

Constructor Details

#initialize(separator: '=', sort: true) ⇒ HashVectorizer

Create a new encoder that converts an array of hashes of feature names and values into vectors.

Parameters:

  • separator (String) (defaults to: '=')

    The separator string used to construct new feature names for categorical features.

  • sort (Boolean) (defaults to: true)

    The flag indicating whether to sort feature names.
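
The separator only comes into play for String values, which are treated as categorical. The following is a minimal plain-Ruby sketch of that naming rule, not the library's own code path; `derived_name` is a hypothetical helper introduced here for illustration, modeled on the key construction visible in the #fit source.

```ruby
# Stand-alone sketch (plain Ruby, Rumale not required).
# `derived_name` is a hypothetical helper, not part of Rumale.
def derived_name(key, value, separator: '=')
  # String values are categorical: the key and value are joined by the separator.
  # Any other value keeps the original key unchanged.
  value.is_a?(String) ? :"#{key}#{separator}#{value}" : key
end

default_name = derived_name(:city, 'London')                   # joined with '='
custom_name  = derived_name(:city, 'London', separator: '::')  # joined with '::'
numeric_name = derived_name(:temperature, 12)                  # key left as-is
```

A separator that cannot appear in the feature names themselves is probably the safer choice, since the derived name is later used to recover the original key and value.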



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 55

def initialize(separator: '=', sort: true)
  super()
  @params = {
    separator: separator,
    sort: sort
  }
end

Instance Attribute Details

#feature_names ⇒ Array (readonly)

Return the list of feature names.

Returns:

  • (Array)

    (size: [n_features])



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 45

def feature_names
  @feature_names
end

#vocabulary ⇒ Hash (readonly)

Return the hash consisting of pairs of feature names and indices.

Returns:

  • (Hash)

    (size: [n_features])



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 49

def vocabulary
  @vocabulary
end

Instance Method Details

#fit(x) ⇒ HashVectorizer

Fit the encoder with the given training data.

Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes consisting of feature names and values.

Returns:

  • (HashVectorizer)



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 68

def fit(x, _y = nil)
  @feature_names = []
  @vocabulary = {}

  x.each do |f|
    f.each do |k, v|
      k = :"#{k}#{separator}#{v}" if v.is_a?(String)
      next if @vocabulary.key?(k)

      @feature_names.push(k)
      @vocabulary[k] = @vocabulary.size
    end
  end

  if sort_feature?
    @feature_names.sort!
    @feature_names.each_with_index { |k, i| @vocabulary[k] = i }
  end

  self
end
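
To make the effect of `sort` on the resulting indices concrete, the steps of #fit can be traced in plain Ruby. This is a sketch under stated assumptions, not the library implementation: `build_vocabulary` is a hypothetical helper, and it returns only the name-to-index hash.

```ruby
# Stand-alone sketch (plain Ruby, Rumale not required).
# `build_vocabulary` is a hypothetical helper that mirrors the steps of #fit.
def build_vocabulary(samples, separator: '=', sort: true)
  names = []
  samples.each do |sample|
    sample.each do |key, value|
      # String values produce a derived "key<separator>value" feature name.
      key = :"#{key}#{separator}#{value}" if value.is_a?(String)
      names << key unless names.include?(key)
    end
  end
  # With sort enabled, indices follow sorted name order;
  # otherwise they follow the order of first appearance.
  names.sort! if sort
  names.each_with_index.to_h
end

samples  = [{ foo: 1, bar: 2 }, { foo: 3, baz: 1 }]
sorted   = build_vocabulary(samples)              # {bar: 0, baz: 1, foo: 2}
unsorted = build_vocabulary(samples, sort: false) # {foo: 0, bar: 1, baz: 2}
```

The sorted ordering matches the column order of the first example in the Overview, where `bar`, `baz`, and `foo` occupy columns 0, 1, and 2.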

#fit_transform(x) ⇒ Numo::DFloat

Fit the encoder with the given training data, then return the encoded data.


Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes consisting of feature names and values.

Returns:

  • (Numo::DFloat)

    (shape: [n_samples, n_features]) The encoded sample array.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 95

def fit_transform(x, _y = nil)
  fit(x).transform(x)
end

#inverse_transform(x) ⇒ Array<Hash>

Decode a sample matrix into an array of feature-value hashes.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The encoded sample array.

Returns:

  • (Array<Hash>)

    The array of hashes consisting of feature names and values.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 126

def inverse_transform(x)
  n_samples = x.shape[0]
  reconst = []

  n_samples.times do |i|
    f = {}
    x[i, true].each_with_index do |el, j|
      feature_key_val(@feature_names[j], el).tap { |k, v| f[k.to_sym] = v } unless el.zero?
    end
    reconst.push(f)
  end

  reconst
end
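
One consequence of the `unless el.zero?` guard is that a feature whose encoded value is 0 does not reappear in the decoded hash. The sketch below illustrates this in plain Ruby with nested arrays instead of Numo::DFloat. `decode` is a hypothetical helper, and its way of splitting a derived name on the separator is an assumption about what the private `feature_key_val` method does, inferred from the example output in the Overview.

```ruby
# Stand-alone sketch (plain Ruby, Rumale not required).
# `decode` is a hypothetical helper; the split-on-separator rule is an
# assumption about the private feature_key_val method, which is not shown here.
def decode(rows, feature_names, separator: '=')
  rows.map do |row|
    sample = {}
    row.each_with_index do |value, j|
      next if value.zero? # zero entries are skipped, as in the source above

      name, category = feature_names[j].to_s.split(separator, 2)
      # Assumption: a name containing the separator marks a categorical feature,
      # so the category string is restored instead of the numeric 1.
      sample[name.to_sym] = category.nil? ? value : category
    end
    sample
  end
end

feature_names = [:"city=Dubai", :"city=London", :temperature]
decoded = decode([[1.0, 0.0, 33.0], [0.0, 1.0, 0.0]], feature_names)
# The second sample has temperature 0.0, so that key is absent from its hash.
```

If a zero must be distinguishable from a missing feature after a round trip, it likely has to be tracked separately from the encoded matrix.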

#transform(x) ⇒ Numo::DFloat

Encode the given array of feature-value hashes.

Parameters:

  • x (Array<Hash>)

    (shape: [n_samples]) The array of hashes consisting of feature names and values.

Returns:

  • (Numo::DFloat)

    (shape: [n_samples, n_features]) The encoded sample array.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/hash_vectorizer.rb', line 103

def transform(x)
  x = [x] unless x.is_a?(Array)
  n_samples = x.size
  n_features = @vocabulary.size
  z = Numo::DFloat.zeros(n_samples, n_features)

  x.each_with_index do |f, i|
    f.each do |k, v|
      if v.is_a?(String)
        k = :"#{k}#{separator}#{v}"
        v = 1
      end
      z[i, @vocabulary[k]] = v if @vocabulary.key?(k)
    end
  end

  z
end
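
Two details of the source above are easy to miss: a single hash is wrapped into a one-element array, and features absent from the fitted vocabulary are silently ignored. The following plain-Ruby sketch traces the same steps using nested arrays in place of Numo::DFloat. `encode` is a hypothetical helper and the vocabulary literal is made up for illustration.

```ruby
# Stand-alone sketch (plain Ruby, Rumale not required).
# `encode` is a hypothetical helper that mirrors the steps of #transform,
# returning nested Ruby arrays rather than a Numo::DFloat.
def encode(samples, vocabulary, separator: '=')
  samples = [samples] unless samples.is_a?(Array) # a lone hash becomes one sample
  samples.map do |sample|
    row = Array.new(vocabulary.size, 0.0)
    sample.each do |key, value|
      if value.is_a?(String)
        # Categorical value: look up the derived name and store 1.
        key = :"#{key}#{separator}#{value}"
        value = 1
      end
      # Features that were not seen during fitting are skipped.
      row[vocabulary[key]] = value.to_f if vocabulary.key?(key)
    end
    row
  end
end

# Illustrative vocabulary, as if fitted on Dubai and London samples only.
vocabulary = { :"city=Dubai" => 0, :"city=London" => 1, temperature: 2 }

rows = encode([{ city: 'London', temperature: 12 },
               { city: 'Paris',  temperature: 20 }], vocabulary)
# 'Paris' was never fitted, so the second row has no city column set.

single = encode({ city: 'Dubai' }, vocabulary)
```

Because unseen categories leave an all-zero block, refitting on data that covers every expected category is worth considering before encoding new samples.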