Data description
A dataset containing demographic and laboratory data for 857 patients with acute coronary syndrome (ACS).
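# Count the NA values in each variable
# (assumes `acs` is already loaded, e.g. with data(acs) from the moonBook package)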
colSums(is.na(acs))
##              age              sex cardiogenicShock            entry
##                0                0                0                0
##               Dx               EF           height           weight
##                0              134               93               91
##              BMI          obesity               TC             LDLC
##               93                0               23               24
##             HDLC               TG               DM              HBP
##               23               15                0                0
##          smoking
##                0
colSums(is.na(acs))[colSums(is.na(acs))>0]
##     EF height weight    BMI     TC   LDLC   HDLC     TG
##    134     93     91     93     23     24     23     15
na.var <- names(colSums(is.na(acs))[colSums(is.na(acs))>0])
# View the missing-value pattern as a graph
library(VIM)  # aggr() is provided by the VIM package
aggr(acs, prop=FALSE)

# Correlation between the missingness indicators
acs.na <- is.na(acs[,na.var])
round(cor(acs.na),2)
##          EF height weight  BMI   TC LDLC HDLC   TG
## EF     1.00   0.46   0.45 0.46 0.13 0.12 0.13 0.11
## height 0.46   1.00   0.99 1.00 0.20 0.19 0.20 0.21
## weight 0.45   0.99   1.00 0.99 0.20 0.19 0.20 0.21
## BMI    0.46   1.00   0.99 1.00 0.20 0.19 0.20 0.21
## TC     0.13   0.20   0.20 0.20 1.00 0.98 1.00 0.75
## LDLC   0.12   0.19   0.19 0.19 0.98 1.00 0.98 0.73
## HDLC   0.13   0.20   0.20 0.20 1.00 0.98 1.00 0.75
## TG     0.11   0.21   0.21 0.21 0.75 0.73 0.75 1.00
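The near-perfect correlations among height, weight, and BMI suggest that these values go missing together, presumably because BMI is computed from the other two. A minimal sketch of the same indicator-correlation idea on made-up data follows; the data frame `toy` and its values are hypothetical and are not taken from acs.

```r
# Toy data (hypothetical, not the acs dataset): BMI is derived from
# height and weight, so it is NA whenever either of them is NA.
toy <- data.frame(
  height = c(170, NA, 165, 180, NA, 175),
  weight = c(70, NA, 60, NA, 55, 80)
)
toy$BMI <- toy$weight / (toy$height / 100)^2

# Missingness indicators: TRUE where a value is NA
toy.na <- is.na(toy)

# Correlation between indicator columns (logicals are treated as 0/1)
miss.cor <- round(cor(toy.na), 2)
miss.cor
```

A high value in this matrix says only that two variables tend to be missing in the same rows; it says nothing about the correlation of the measured values themselves.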
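Types of missing data
- MCAR (missing completely at random): missingness is unrelated to any variable, whether to which variable it is or to the values themselves
- MAR (missing at random): missingness is related to other observed variables, but not to the missing value itself
- MNAR (missing not at random): missingness has a cause tied to the unobserved value itself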
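The three mechanisms can be illustrated with a small simulation. This is only a sketch on simulated data: the variables `age` and `ef`, the 20% rate, and the logistic coefficients are all assumptions chosen for illustration, not properties of acs.

```r
set.seed(1)
n   <- 1000
age <- rnorm(n, mean = 60, sd = 10)  # always observed
ef  <- rnorm(n, mean = 55, sd = 10)  # variable that will receive NAs

# MCAR: every value has the same 20% chance of being missing
miss.mcar <- runif(n) < 0.2

# MAR: the chance of missingness depends on the observed variable age
miss.mar <- runif(n) < plogis((age - 60) / 5 - 1.5)

# MNAR: the chance of missingness depends on the unobserved ef value itself
# (low ef values are more likely to be missing)
miss.mnar <- runif(n) < plogis(-(ef - 55) / 5 - 1.5)

# Mean of the observed ef values under each mechanism, next to the full mean
obs.means <- c(
  full = mean(ef),
  MCAR = mean(ef[!miss.mcar]),
  MAR  = mean(ef[!miss.mar]),
  MNAR = mean(ef[!miss.mnar])
)
round(obs.means, 2)
```

Under MNAR the observed mean is pushed upward because low values are preferentially dropped. In this sketch `age` and `ef` are simulated independently, so the MAR case leaves the mean of `ef` roughly unbiased; that would not hold if the two were correlated.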
# na.omit and complete.cases do the same job: both keep only the rows without any NA.
nrow(na.omit(acs)) == nrow(acs[complete.cases(acs),])
## [1] TRUE
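The two functions reach the same rows in slightly different ways, which a small example makes concrete. The data frame `toy` below is hypothetical.

```r
# Toy data (hypothetical): rows 2 and 3 each contain one NA
toy <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, "d"))

# complete.cases() returns a logical vector: TRUE for rows without any NA
keep <- complete.cases(toy)
by.cases <- toy[keep, ]

# na.omit() returns the filtered data frame directly and records the
# dropped rows in its "na.action" attribute
by.omit <- na.omit(toy)

# Both approaches keep the same rows
same.rows <- identical(rownames(by.omit), rownames(by.cases))

# Row indices that na.omit() removed
dropped <- as.integer(attr(by.omit, "na.action"))
```

`complete.cases()` is convenient when the logical mask itself is needed, for example to inspect the incomplete rows with `toy[!keep, ]`.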
Further topics worth exploring
- NA imputation with a Gibbs sampler
- NA imputation with a GAN (Generative Adversarial Network)
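As a starting point for the first topic, here is a minimal sketch of Gibbs-style imputation (data augmentation) in base R. It assumes a simple normal linear model y = b0 + b1 * x + noise with a noninformative prior, and it runs on simulated data rather than acs. All names and settings (`y.obs`, `n.iter`, `burn.in`, and so on) are hypothetical.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Treat 40 randomly chosen y values as missing
mis <- sample(n, 40)
y.obs <- y
y.obs[mis] <- NA

X <- cbind(1, x)                 # design matrix with intercept
XtX.inv <- solve(crossprod(X))

# Crude starting values for the missing entries
y.cur <- y.obs
y.cur[mis] <- mean(y.obs, na.rm = TRUE)

n.iter  <- 500
burn.in <- 100
draws <- matrix(NA_real_, nrow = n.iter, ncol = length(mis))

for (it in seq_len(n.iter)) {
  # Step 1: draw (sigma2, beta) given the currently completed y
  beta.hat <- drop(XtX.inv %*% crossprod(X, y.cur))
  rss <- sum((y.cur - drop(X %*% beta.hat))^2)
  sigma2 <- rss / rchisq(1, df = n - ncol(X))
  beta <- beta.hat + drop(t(chol(sigma2 * XtX.inv)) %*% rnorm(ncol(X)))

  # Step 2: draw the missing y values given (beta, sigma2)
  mu.mis <- drop(X[mis, , drop = FALSE] %*% beta)
  y.cur[mis] <- mu.mis + rnorm(length(mis), sd = sqrt(sigma2))

  draws[it, ] <- y.cur[mis]
}

# Posterior-mean imputations after discarding the burn-in draws
imputed <- colMeans(draws[-seq_len(burn.in), ])
```

Averaging the draws gives a single imputed value per missing entry. Keeping several individual draws instead would preserve the imputation uncertainty, which is the idea behind multiple imputation.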