NA Imputation

Data description

The acs dataset contains demographic and laboratory data for 857 patients with acute coronary syndrome (ACS).

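# Assumption: acs is the dataset shipped with the moonBook package
library(moonBook)
data(acs)

# Count NA values per variable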
colSums(is.na(acs))
##              age              sex cardiogenicShock            entry 
##                0                0                0                0 
##               Dx               EF           height           weight 
##                0              134               93               91 
##              BMI          obesity               TC             LDLC 
##               93                0               23               24 
##             HDLC               TG               DM              HBP 
##               23               15                0                0 
##          smoking 
##                0
colSums(is.na(acs))[colSums(is.na(acs))>0]
##     EF height weight    BMI     TC   LDLC   HDLC     TG 
##    134     93     91     93     23     24     23     15
# Names of the variables that contain at least one NA
na.var <- names(colSums(is.na(acs))[colSums(is.na(acs))>0])

# Plot the missing-value counts and patterns (aggr() comes from the VIM package)
library(VIM)
aggr(acs, prop=FALSE)

# Correlation between the missingness indicators (TRUE where a value is NA)
acs.na <- is.na(acs[,na.var])
round(cor(acs.na),2)
##          EF height weight  BMI   TC LDLC HDLC   TG
## EF     1.00   0.46   0.45 0.46 0.13 0.12 0.13 0.11
## height 0.46   1.00   0.99 1.00 0.20 0.19 0.20 0.21
## weight 0.45   0.99   1.00 0.99 0.20 0.19 0.20 0.21
## BMI    0.46   1.00   0.99 1.00 0.20 0.19 0.20 0.21
## TC     0.13   0.20   0.20 0.20 1.00 0.98 1.00 0.75
## LDLC   0.12   0.19   0.19 0.19 0.98 1.00 0.98 0.73
## HDLC   0.13   0.20   0.20 0.20 1.00 0.98 1.00 0.75
## TG     0.11   0.21   0.21 0.21 0.75 0.73 0.75 1.00
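
In the matrix above, the indicators for height, weight, and BMI correlate at 0.99 to 1.00, and those for TC, LDLC, and HDLC at 0.98 to 1.00, which suggests that these values tend to be missing on the same rows. The sketch below reproduces that kind of pattern on made-up data; the toy data frame and its values are assumptions for illustration and are not taken from acs.

```r
# Toy data (hypothetical, not the acs data): BMI is derived from height and
# weight, so it is missing whenever either of them is missing.
set.seed(1)
n <- 200
toy <- data.frame(
  height = rnorm(n, 165, 8),
  weight = rnorm(n, 65, 10),
  TC     = rnorm(n, 185, 40)
)
toy$height[sample(n, 20)] <- NA        # 20 rows lose their height
toy$weight[is.na(toy$height)] <- NA    # weight is missing on the same rows
toy$TC[sample(n, 15)] <- NA            # TC goes missing independently
toy$BMI <- toy$weight / (toy$height / 100)^2

# Indicator matrix: TRUE where a value is missing
toy.na <- is.na(toy)
round(cor(toy.na), 2)

# How often each distinct missingness pattern occurs
# (one digit per column, 1 = missing)
pattern <- apply(toy.na, 1, function(r) paste(as.integer(r), collapse = ""))
table(pattern)
```

A correlation near 1 between two indicators only says that the values are missing together; it does not say why they are missing.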

Types of missing data

  1. MCAR (missing completely at random): the missingness is unrelated both to the other variables and to the missing value itself
  2. MAR (missing at random): the missingness is related to other observed variables, but not to the missing value itself
  3. MNAR (missing not at random): the missingness has a cause tied to the missing value itself
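
The three mechanisms above can be sketched with a small simulation. This is only an illustration under stated assumptions: the values of age and EF are randomly generated (not the acs data), and the missingness probabilities and the helper make.missing() are hypothetical choices.

```r
set.seed(42)
n   <- 1000
age <- round(runif(n, 30, 90))
EF  <- rnorm(n, mean = 55, sd = 10)   # made-up values, not the acs data

# MCAR: every EF value has the same 15% chance of being missing
p.mcar <- rep(0.15, n)
# MAR: the chance depends only on an observed variable (age), not on EF
p.mar  <- plogis(-4 + 0.04 * age)
# MNAR: the chance depends on the EF value itself (low EF is missing more often)
p.mnar <- plogis(-2 - 0.1 * (EF - 55))

# Hypothetical helper: set x[i] to NA with probability p[i]
make.missing <- function(x, p) ifelse(runif(length(x)) < p, NA, x)

EF.mcar <- make.missing(EF, p.mcar)
EF.mar  <- make.missing(EF, p.mar)
EF.mnar <- make.missing(EF, p.mnar)

# Compare the mean of the observed values with the true mean
c(true = mean(EF),
  MCAR = mean(EF.mcar, na.rm = TRUE),
  MAR  = mean(EF.mar,  na.rm = TRUE),
  MNAR = mean(EF.mnar, na.rm = TRUE))
```

Under the MNAR setup, low EF values are dropped more often, so the mean of the observed values tends to sit above the true mean. In this toy setup EF is generated independently of age, so the MAR case does not shift the mean; with real data the variables are usually related, and ignoring the mechanism can bias the result.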

# na.omit() and complete.cases() select the same rows (those without any NA)
nrow(na.omit(acs)) == nrow(acs[complete.cases(acs),])
## [1] TRUE
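
The row counts agree, but the two functions return different things. A minimal sketch on a small made-up data frame (the toy values are assumptions, not acs data): complete.cases() returns a logical vector that is used for indexing, while na.omit() returns the filtered data frame and also records which rows were dropped.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(id = 1:5,
                  EF = c(55, NA, 60, 48, NA),
                  TC = c(180, 200, NA, 170, 190))

keep <- complete.cases(toy)   # logical vector, one element per row
keep

by.index <- toy[keep, ]       # filter by indexing
by.omit  <- na.omit(toy)      # filter in one call

# Same rows and values ...
all.equal(by.index, by.omit, check.attributes = FALSE)

# ... but na.omit() also stores the indices of the removed rows
attr(by.omit, "na.action")
```

Either form performs listwise deletion: every row with at least one NA is discarded, including its observed values in the other columns.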

Topics worth exploring further

  1. NA imputation with Gibbs Sampler
  2. NA imputation with GAN (Generative Adversarial Network)