Dummy variable coding

A dummy variable is a binary flag (0 or 1) that designates the presence or absence of a feature. To represent a categorical variable with dummy variables, you need one fewer dummy variable than the variable has levels. For example, if you have a category designating humidity with only two levels, such as High and Low, you only need to create one dummy variable. Let's say it is called is.humid. If the value of humidity is High, is.humid=1. If humidity is Low, is.humid=0.

Many predictive analytics functions create dummy variables internally, so there is less need to code them manually than there used to be. However, you may still want to create flags that designate the levels of a categorical variable, which can be useful for plotting, for creating customized transformations, and for manually creating interactions in a statistical model. There are several ways to do this. You can use the dummies package, which will create dummy variables automatically, or you can accomplish the same thing in code with the model.matrix function.
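As a minimal sketch of the humidity example, the flag can also be built by hand with a logical comparison. The weather data frame and its values are hypothetical and exist only for illustration; is.humid is the name used in the text.

```r
# Hypothetical data: a humidity factor with two levels, High and Low
weather <- data.frame(humidity = factor(c("High", "Low", "Low", "High")))

# A two-level category needs only one dummy variable:
# 1 when humidity is High, 0 when it is Low
weather$is.humid <- as.integer(weather$humidity == "High")

weather
```

With more than two levels, writing each flag by hand quickly becomes tedious, which is where model.matrix helps.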

The following code takes the segment categorical variable, which contains five levels (A-E), and expands it into four separate dummy variables:

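# Fix the random seed so the example is reproducible
set.seed(10)

# sample(LETTERS[1:5]) returns five letters, which data.frame() recycles to fill 10 rows
model <- data.frame(y = runif(10), x = runif(10), segment = as.factor(sample(LETTERS[1:5])))
head(model)

# Expand segment into dummy columns; the first level (A) becomes the baseline
A <- model.matrix(y ~ x + segment, model)
head(A)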

> head(model)
           y         x segment
1 0.50747820 0.6516557       E
2 0.30676851 0.5677378       C
3 0.42690767 0.1135090       D
4 0.69310208 0.5959253       A
5 0.08513597 0.3580500       B
6 0.22543662 0.4288094       E

> A <- model.matrix(y ~ x + segment, model)

> head(A)
  (Intercept)         x segmentB segmentC segmentD segmentE
1           1 0.6516557        0        0        0        1
2           1 0.5677378        0        1        0        0
3           1 0.1135090        0        0        1        0
4           1 0.5959253        0        0        0        0
5           1 0.3580500        1        0        0        0
6           1 0.4288094        0        0        0        1
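
In the output above, row 4 (segment A) has zeros in all four dummy columns because A is treated as the reference level. If you want an explicit flag for every level, for example for plotting, one possible approach is to drop the intercept from the formula. The sketch below rebuilds the same data frame so that it runs on its own; the name B is only illustrative.

```r
# Rebuild the example data so this block is self-contained
set.seed(10)
model <- data.frame(y = runif(10), x = runif(10),
                    segment = as.factor(sample(LETTERS[1:5])))

# Removing the intercept (- 1) keeps one indicator column per level,
# including segmentA, instead of folding A into the baseline
B <- model.matrix(~ segment - 1, model)
colnames(B)

# Each row belongs to exactly one segment, so each row sums to 1
rowSums(B)
```

Note that the five columns together are redundant with an intercept, so this full set is better suited to plotting or custom transformations than to a regression that already includes a constant term.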