[Question 1]

code:
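# Keep only the predictor columns; nonpredictors is assumed to be a
# character vector of column names to exclude (e.g. IDs and the label).
features <- training[, !colnames(training) %in% nonpredictors]
labels <- as.factor(training$label)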

output:
none


[Question 2]

code:
table(labels)

output:
labels
  0   1
500 500


[Question 3]

code:
library(caret)

set.seed(0)
model <- train(
    x = features,
    y = labels,
    method = 'rpart',
    trControl = trainControl(
        method = 'boot', # 'boot' is the default method and can be omitted
        number = 5       # 5 bootstrap resamples, matching the output below
        )
    )
model

output:
CART

1000 samples
 408 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (5 reps)
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000
Resampling results across tuning parameters:

  cp     Accuracy   Kappa
  0.028  0.5945917  0.18955871
  0.061  0.5647587  0.13218871
  0.188  0.5182033  0.04084188

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.028.


[Question 4]

code:
print(varImp(model), top = 10)

output:
rpart variable importance

  only 10 most important variables shown (out of 408)

                  Overall
H3K36me3..window.  100.00
CTCF..window.       95.29
SPI1..window.       88.81
SMC3..window.       88.39
RAD21..window.      85.40
H4K20me1..window.   81.11
CTCFL..window.      79.53
CTCF..promoter.     65.12
SMC3..promoter.     52.47
SMC3..enhancer.     49.01


[Question 5]

code:
set.seed(0)
model <- train(
    x = features,
    y = labels,
    method = 'rf',
    ntree = 50,
    trControl = trainControl(
        method = 'cv',
        number = 5
        )
    )
model

output:
Random Forest

1000 samples
 408 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:

  mtry  Accuracy  Kappa
    2   0.733     0.466
  205   0.734     0.468
  408   0.751     0.502

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 408.


[Question 6]

code:
print(varImp(model), top = 5)

output:
rf variable importance

  only 5 most important variables shown (out of 408)

                  Overall
CTCF..window.      100.00
CTCF..promoter.     55.87
H4K20me1..window.   53.53
SMC3..promoter.     52.12
RAD21..window.      46.48


[Question 7]

CTCF, SMC3, RAD21


[Question 8]

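Genome-wide, false positives would likely increase and cause a large decrease in precision, since precision = TP / (TP + FP). The training set is balanced (500 examples per class), but if negatives greatly outnumber positives genome-wide, even a modest false positive rate yields far more false positives than true positives.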
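
The effect of class imbalance on precision can be illustrated with a small calculation. This is only a sketch: the sensitivity (0.75), false positive rate (0.25), and genome-wide prevalence (0.001) are made-up values chosen for illustration, not numbers taken from the models above, and precision_at is a hypothetical helper.

```r
# Precision for a classifier with fixed sensitivity (tpr) and false positive
# rate (fpr), evaluated at a given fraction of true positives (prevalence).
# All numeric values below are hypothetical.
precision_at <- function(prevalence, tpr, fpr) {
    tp <- tpr * prevalence        # expected true positives per example
    fp <- fpr * (1 - prevalence)  # expected false positives per example
    tp / (tp + fp)
}

tpr <- 0.75  # assumed sensitivity
fpr <- 0.25  # assumed false positive rate

balanced <- precision_at(0.5, tpr, fpr)    # balanced classes, as in training
rare     <- precision_at(0.001, tpr, fpr)  # assumed rare positives genome-wide

print(balanced)  # 0.75
print(rare)      # about 0.003
```

With the same classifier, precision falls from 0.75 on balanced data to roughly 0.003 when only 1 in 1000 examples is positive, because the FP term in TP / (TP + FP) grows with the number of negatives.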