Onehot Encoding Inverse Transform

๊ธฐ๋ณธ์ ์œผ๋กœ python์—์„œ pandas๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ OneHot Encoding์„ ํ•œ๋‹ค.

pd.get_dummies(df)
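
As a small illustration of what this call produces, the sketch below encodes a toy DataFrame. The column names `color` and `size` are made up for the example; the point is that each categorical column is expanded into columns named `<column>_<value>`, while numeric columns are left as they are.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Only the object-typed column "color" is expanded into dummy columns.
encoded = pd.get_dummies(df)

# Inspect the generated column names (e.g. "color_red", "color_blue").
encoded_columns = list(encoded.columns)
print(encoded_columns)
```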

๊ทธ๋Ÿฐ๋ฐ ์ด๋ฅผ ์—ญ๋ณ€ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜๋Š” ํŒจํ‚ค์ง€๋ฅผ ํ†ตํ•ด ์ œ๊ณต๋˜๊ณ  ์žˆ์ง€๋Š” ์•Š๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์•„๋ž˜์˜ reference์—์„œ ์ข‹์€ ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด์„œ ๊ณต์œ ํ•ด์ฃผ์…”์„œ ์ด๋ฅผ ๊ณต์œ ํ•˜๊ณ ์ž ํ•œ๋‹ค.

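import pandas as pd


def undummify(df, prefix_sep="_"):
    # Map each prefix (the text before the first separator) to a flag that
    # tells whether it came from dummy columns and must be collapsed.
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            # Select only columns that start with "<col><prefix_sep>".
            # df.filter(like=col) would also pick up unrelated columns
            # whose names merely contain col as a substring.
            dummy_cols = [c for c in df.columns if c.startswith(col + prefix_sep)]
            undummified = (
                df[dummy_cols]
                .idxmax(axis=1)  # name of the column holding the 1 in each row
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            # Columns without the separator were never encoded; keep them.
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df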
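
For comparison, here is a minimal sketch of the same round trip using `pd.from_dummies`, assuming pandas 1.5 or later is installed. The toy columns `color` and `shape` are hypothetical. This is an alternative to the function above, not the method from the reference.

```python
import pandas as pd

# Hypothetical toy data with categorical columns only.
df = pd.DataFrame(
    {"color": ["red", "blue", "red"], "shape": ["circle", "square", "square"]}
)

# Encode with the default separator "_".
encoded = pd.get_dummies(df)

# Decode: sep tells pandas where the prefix ends and the category begins.
restored = pd.from_dummies(encoded, sep="_")
print(restored)
```

As far as I know, `pd.from_dummies` expects every column passed to it to be a dummy column containing the separator, so numeric columns that were never encoded should be set aside first. The `undummify` function above handles such mixed frames directly.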

์œ„์—์„œ prefix_sep="_"๋กœ ์„ค์ •ํ•ด๋‘” ์ด์œ ๋Š”, pd.get_dummies()์—์„œ ๋ช…๋ชฉํ˜• ๋ณ€์ˆ˜๋ฅผ ๋ณ€์ˆ˜๋ช…_๋‚ด์šฉ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ฆ‰, ํ˜น์‹œ ๋ชจ๋ฅผ ์—๋Ÿฌ๋ฅผ ๋ฐฉ์ง€ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ์‹ค์‹œํ•ด์ค„ ํ•„์š”์„ฑ์ด ์žˆ๋‹ค.

data.columns = data.columns.str.replace('_', '.', regex=False)
# If a column name in data contains '_', replace it with '.'.
# This prevents errors later when inverting the one-hot encoding.
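
The following sketch shows why the renaming matters. It uses a hypothetical column named `home_team`: without renaming, splitting the dummy column names on "_" recovers the prefix `home` instead of the real column name, while after renaming the prefix is recovered intact.

```python
import pandas as pd

# Hypothetical data: "home_team" contains the separator in its own name.
data = pd.DataFrame({"home_team": ["A", "B", "A"], "score": [1, 2, 3]})

# Without renaming: "home_team_A".split("_")[0] yields "home", not "home_team".
raw_dummies = pd.get_dummies(data)
raw_prefixes = {c.split("_")[0] for c in raw_dummies.columns}

# With renaming: the only "_" left is the one inserted by get_dummies.
data.columns = data.columns.str.replace("_", ".", regex=False)
dummies = pd.get_dummies(data)
prefixes = {c.split("_")[0] for c in dummies.columns}

print(raw_prefixes)
print(prefixes)
```

After decoding, the "." can be turned back into "_" in the same way if the original column names are needed.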

Reference

[1] https://preservsun.tistory.com/entry/%EB%8D%94%EB%AF%B8%EB%B3%80%EC%88%98-%EC%A0%84%ED%99%98-%EC%A0%84%ED%99%98-%EB%90%98%EB%8F%8C%EB%A6%AC%EA%B8%B0-python-code