Class: Rumale::FeatureExtraction::TfidfTransformer

Inherits:
Base::Estimator
Includes:
Base::Transformer
Defined in:
rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb

Overview

Transforms a sample matrix of term frequencies (tf) into a normalized tf-idf (term frequency-inverse document frequency) representation.

Reference

  • Manning, C. D., Raghavan, P., and Schütze, H., “Introduction to Information Retrieval,” Cambridge University Press, 2008.

Examples:

require 'rumale/feature_extraction/hash_vectorizer'
require 'rumale/feature_extraction/tfidf_transformer'

encoder = Rumale::FeatureExtraction::HashVectorizer.new
x = encoder.fit_transform([
  { foo: 1, bar: 2 },
  { foo: 3, baz: 1 }
])

# > pp x
# Numo::DFloat#shape=[2,3]
# [[2, 0, 1],
#  [0, 1, 3]]

transformer = Rumale::FeatureExtraction::TfidfTransformer.new
x_tfidf = transformer.fit_transform(x)

# > pp x_tfidf
# Numo::DFloat#shape=[2,3]
# [[0.959056, 0, 0.283217],
#  [0, 0.491506, 0.870874]]

Instance Attribute Summary

Attributes inherited from Base::Estimator

#params

Instance Method Summary

Constructor Details

#initialize(norm: 'l2', use_idf: true, smooth_idf: false, sublinear_tf: false) ⇒ TfidfTransformer

Create a new transformer for converting tf vectors to tf-idf vectors.

Parameters:

  • norm (String) (defaults to: 'l2')

    The normalization method to be used (‘l1’, ‘l2’, or ‘none’).

  • use_idf (Boolean) (defaults to: true)

    The flag indicating whether to use inverse document frequency weighting.

  • smooth_idf (Boolean) (defaults to: false)

    The flag indicating whether to apply idf smoothing by log((n_samples + 1) / (df + 1)) + 1.

  • sublinear_tf (Boolean) (defaults to: false)

    The flag indicating whether to perform sublinear tf scaling by 1 + log(tf).



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb', line 49

def initialize(norm: 'l2', use_idf: true, smooth_idf: false, sublinear_tf: false)
  super()
  @params = {
    norm: norm,
    use_idf: use_idf,
    smooth_idf: smooth_idf,
    sublinear_tf: sublinear_tf
  }
end
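
The two optional formulas from the parameter list can be tried out on single values. The following is a minimal plain-Ruby sketch, not the library's implementation: the helper names `idf_value` and `sublinear_tf` are hypothetical, and it works on scalars instead of Numo arrays.

```ruby
# Hypothetical helpers illustrating the documented formulas on scalars.

# idf for one term. With smooth: true, one is added to both the sample
# count and the document frequency: log((n_samples + 1) / (df + 1)) + 1.
def idf_value(n_samples, df, smooth: false)
  if smooth
    n_samples += 1
    df += 1
  end
  Math.log(n_samples.to_f / df) + 1
end

# Sublinear tf scaling: 1 + log(tf) for non-zero counts; zero stays zero.
def sublinear_tf(tf)
  tf.zero? ? 0.0 : Math.log(tf) + 1
end

plain    = idf_value(2, 1)               # log(2 / 1) + 1
smoothed = idf_value(2, 1, smooth: true) # log(3 / 2) + 1
scaled   = sublinear_tf(3)               # 1 + log(3)
```

Smoothing pulls the idf of rare terms down (here from about 1.69 to about 1.41), and sublinear scaling dampens large raw counts.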

Instance Attribute Details

#idf ⇒ Numo::DFloat (readonly)

Returns the vector of inverse document frequencies.

Returns:

  • (Numo::DFloat)

    (shape: [n_features])



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb', line 41

def idf
  @idf
end

Instance Method Details

#fit(x) ⇒ TfidfTransformer

Calculate the inverse document frequency for weighting.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The samples used to calculate the idf values.

Returns:

  • (TfidfTransformer)



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb', line 65

def fit(x, _y = nil)
  return self unless @params[:use_idf]

  n_samples = x.shape[0]
  df = x.class.cast(x.gt(0.0).count(0))

  if @params[:smooth_idf]
    df += 1
    n_samples += 1
  end

  @idf = Numo::NMath.log(n_samples / df) + 1

  self
end
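
To see what the default (unsmoothed) fit computes, the document-frequency count can be reproduced with plain Ruby arrays. This is an illustrative sketch only: `idf_vector` is a hypothetical helper, and nested Arrays stand in for the Numo::DFloat matrix used by the library.

```ruby
# Hypothetical plain-Ruby counterpart of the unsmoothed idf computation.
def idf_vector(rows)
  n_samples = rows.size
  # Document frequency: for each column, the number of rows with a positive value.
  df = rows.transpose.map { |col| col.count { |v| v > 0 } }
  # idf = log(n_samples / df) + 1 for each feature.
  df.map { |d| Math.log(n_samples.to_f / d) + 1 }
end

# The term-frequency matrix from the example at the top of this page.
x = [[2, 0, 1],
     [0, 1, 3]]
idf = idf_vector(x)
```

The first two features each occur in one of the two samples, so their idf is log(2) + 1 (about 1.693); the third occurs in both, so its idf is exactly 1.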

#fit_transform(x) ⇒ Numo::DFloat

Calculate the idf values, and then transform samples to the tf-idf representation.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The samples used to calculate the idf values and to be transformed to the tf-idf representation.

Returns:

  • (Numo::DFloat)

    The transformed samples.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb', line 87

def fit_transform(x, _y = nil)
  fit(x).transform(x)
end

#transform(x) ⇒ Numo::DFloat

Transform the given samples to the tf-idf representation.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The samples to be transformed.

Returns:

  • (Numo::DFloat)

    The transformed samples.



# File 'rumale-feature_extraction/lib/rumale/feature_extraction/tfidf_transformer.rb', line 95

def transform(x)
  z = x.dup

  z[z.ne(0)] = Numo::NMath.log(z[z.ne(0)]) + 1 if @params[:sublinear_tf]
  z *= @idf if @params[:use_idf]
  case @params[:norm]
  when 'l2'
    ::Rumale::Utils.normalize(z, 'l2')
  when 'l1'
    ::Rumale::Utils.normalize(z, 'l1')
  else
    z
  end
end
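
The default path through transform (idf weighting followed by L2 normalization of each row) can be sketched in plain Ruby to check the numbers in the example at the top of this page. `tfidf_rows` is a hypothetical helper, and leaving all-zero rows unchanged is an assumption of this sketch, not a statement about how Rumale::Utils.normalize behaves.

```ruby
# Hypothetical plain-Ruby sketch: weight each tf by its idf, then scale
# every row to unit Euclidean (L2) length.
def tfidf_rows(rows, idf)
  rows.map do |row|
    weighted = row.zip(idf).map { |tf, w| tf * w }
    norm = Math.sqrt(weighted.sum { |v| v * v })
    # Assumption of this sketch: rows with zero norm are returned as they are.
    norm.zero? ? weighted : weighted.map { |v| v / norm }
  end
end

# idf values for the example matrix: log(2 / 1) + 1 twice, then log(2 / 2) + 1.
idf = [Math.log(2.0) + 1, Math.log(2.0) + 1, 1.0]
x = [[2, 0, 1],
     [0, 1, 3]]
x_tfidf = tfidf_rows(x, idf)
# Approximately [[0.959056, 0.0, 0.283217], [0.0, 0.491506, 0.870874]]
```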