Generating the data

Next, generate the membership file with 1,000 members and assign it to the member dataframe. The wakefield package uses specialized functions to generate typical values for each of the specified variables:

  • The gender function will generate M, or F with a 50% chance of any individual row being a male or female
  • The set.seed(1010) directive guarantees that the results will be the same no matter how many times you run the code
  • The r_sample_replace() function will generate a unique member ID with a value from 1 to 1000
  • The income, children, employment, level, grad, year, state, and zip_code variables are all randomly generated without supplying any specialized parameters

Open up a new script window, and run the following code:

#GENERATE MEMBER
set.seed(1010)
member <- r_data_frame(
n=1000,
r_sample_replace(x = 1:1000,replace=FALSE,name="memberid"),
age,
gender(x = c("M","F"), prob = c(.5,.5),name="Gender"),
dob,
income,
children,
employment,
level,
grade,
year,
state,
zip_code
)

Similarly, generate the purchases file, and assign it to the purchases dataframe. The total amount purchases will come from a normal distribution with a mean purchases amount of 20,000 and a standard deviation of 1,000. The Product variable is a random product name coming from the letters A-Z.

This next code snippet will generate the purchases:

 
#GENERATE PURCHASES
set.seed(1010)

purchases <- r_data_frame(
n=1000,
r_sample_replace(x = 1:1000,replace=TRUE,name="memberid2"),
purch=rpois(lambda=1),
normal(mean = 20000, sd = 1000, min = NULL, max = NULL, name = "TotalAmount"),
upper(x = LETTERS, k=3, prob = NULL, name = "Product")
)
purchases$purch <- purchases$purch + 1
str(purchases)

After generating the purchases dataframe, we will write it to an external CSV file. The reason we are doing this is to demonstrate how we can filter rows from an external file using read.csv.sql:

#WRITE PURCHASES TO FILE

write.csv(purchases, "purchases.csv", quote = FALSE, row.names = FALSE)

Now we will read the purchases.csv back into R. But instead of reading the whole file, we will only read in those records where the TotalAmount > 20,000:

 
#read it back in

purchases_filtered <- read.csv.sql("purchases.csv",sql = "select * from file where TotalAmount > 20000 ")

You may be asking why we did it this way instead of just reading the whole file and then filtering it later. The answer is efficiency. Let's assume that in reality, the purchases file would be much larger than the 1,000 records it really is, and would contain lots of small purchases. We wouldn't want to take up processing time by reading all of these small transactions, since we are only interested in looking at high-value members. So, you can filter the data while reading the file, and only read in those members who had purchases greater than 20,000.