categoricalInformationValue

Introduced in: v20.1.0

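Calculates the information value (IV) of categorical features with respect to a binary target variable.

For each category, the function calculates: (P(tag = 1) - P(tag = 0)) × (log(P(tag = 1)) - log(P(tag = 0)))

where:

P(tag = 1) is the probability that the target variable equals 1 for the given category
P(tag = 0) is the probability that the target variable equals 0 for the given category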
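As an illustration of the per-category expression above, here is a minimal Python sketch. It is not ClickHouse's implementation: the helper name `iv_term` is hypothetical, the natural logarithm is an assumption (the text above only says "log"), and both probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def iv_term(p1: float, p0: float) -> float:
    """Evaluate (p1 - p0) * (log(p1) - log(p0)) for one category.

    p1 stands for P(tag = 1) and p0 for P(tag = 0) as described above.
    Natural log is assumed; both inputs must be strictly positive.
    """
    if p1 <= 0 or p0 <= 0:
        raise ValueError("both probabilities must be strictly positive")
    return (p1 - p0) * (math.log(p1) - math.log(p0))


# Equal probabilities carry no information: the term is exactly zero.
equal = iv_term(0.5, 0.5)

# The further apart the probabilities, the larger the term.
skewed = iv_term(0.8, 0.2)

print(equal, skewed)
```

Because the two factors always share the same sign, the term is never negative in this sketch, and swapping the two probabilities gives the same value.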

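Information value is a statistic used in predictive modeling to measure the strength of the relationship between a categorical feature and a binary target variable. The larger the absolute value, the stronger the predictive power.

The result indicates how much each discrete (categorical) feature [category1, category2, ...] contributes to a learning model that predicts the value of tag.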

Syntax

categoricalInformationValue(category1[, category2, ...], tag)

Arguments

category1, category2, ...: One or more categorical features to analyze. Each category must contain discrete values. UInt8
tag: The binary target variable to predict. Must contain the values 0 and 1. UInt8

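Returned value

Returns an array of Float64 information values. Each value indicates the predictive strength of the corresponding category feature with respect to the target variable; in the examples on this page, the array holds one value per category argument. Array(Float64)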

Examples

Basic usage: analyzing age group against mobile usage

-- Using the metrica.hits dataset (available on https://sql.clickhouse.com/) to analyze age-mobile relationship
SELECT categoricalInformationValue(Age < 15, IsMobile)
FROM metrica.hits;

[0.0014814694805292418]

Multiple categorical features with user demographics

SELECT categoricalInformationValue(
    Sex,                 -- 0=male, 1=female
    toUInt8(Age < 25),   -- 0=25+, 1=under 25
    toUInt8(IsMobile)    -- 0=desktop, 1=mobile
) AS iv_values
FROM metrica.hits
WHERE Sex IN (0, 1);

[0.00018965785460692887,0.004973668839403392]
