Preamble: I will use a few stylistic conventions in this code:

  1. variables will have names in all lower case with words separated by underscores, e.g. test_var
  2. I will use = for assignment like a reasonable person
  3. I will generally use single quotes for strings unless I need to include literal single quote characters in a string

SECTION 1 - Set-up

  1. Open a new R Script file in RStudio by clicking on the white square/green cross button in the upper left-hand corner.

  2. Start your file with a comment describing what the code is, your name, and today’s date.

  3. Look at the folder structure on your computer and identify where your data file is.

SECTION 2 - Loading Data

Load the StateIncomeData.csv file into memory

NOTE: Pay attention to which directory R is currently treating as its working directory and where the data file is stored

## Look at the current working directory
curr_dir = getwd()

Assuming your data is in your working directory, load it.

If it isn’t, set your working directory to where your data is located.

The “./” directory label tells R to look in the current working directory

By default, the read.csv function tries to infer the type of each column in a table. This can be dangerous. You can use the “colClasses” argument to force R to use specific types for each column.

#setwd("")
# change this to where your data is located

df_from_csv = read.csv(file='C:/Users/juliegil/University of Michigan Dropbox/SPH-MICOM/PROJECTS/MISUPPORT/workshop_set/StateIncomeData.csv')

## Examine the dataset. Make sure every column looks the way it's supposed to
# Note the column names. 
print(head(df_from_csv))
##   Rank                 State Per.capita.income Median.household.income
## 1    0 District of Columbia              45877                   71648
## 2    1          Connecticut              39373                   70048
## 3    2           New Jersey              37288                   69160
## 4    3        Massachusetts              36593                   71919
## 5    4             Maryland              36338                   73971
## 6    5        New Hampshire              34691                   66532
##   Median.family.income Population Number.of.households Number.of.families
## 1                84094     658893               277378             117864
## 2                88819    3596677              1355817             887263
## 3                87951    8938175              2549336            1610581
## 4                88419    6938608              3194844            2203675
## 5                89678    5976407              2165438            1445972
## 6                80581    1326813               519756             345901
## also take note of the column types
# check a single column
class(df_from_csv$Per.capita.income) # and/or
## [1] "numeric"
typeof(df_from_csv$Per.capita.income)
## [1] "double"
# check all the columns
str(df_from_csv)
## 'data.frame':    51 obs. of  8 variables:
##  $ Rank                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ State                  : chr  "District of Columbia " " Connecticut " " New Jersey " " Massachusetts " ...
##  $ Per.capita.income      : num  45877 39373 37288 36593 36338 ...
##  $ Median.household.income: num  71648 70048 69160 71919 73971 ...
##  $ Median.family.income   : num  84094 88819 87951 88419 89678 ...
##  $ Population             : int  658893 3596677 8938175 6938608 5976407 1326813 8326289 19746227 739482 736732 ...
##  $ Number.of.households   : int  277378 1355817 2549336 3194844 2165438 519756 3083820 7282398 305431 249659 ...
##  $ Number.of.families     : int  117864 887263 1610581 2203675 1445972 345901 2058820 4621954 187800 165015 ...
# if you need to change a column type, try as.numeric() or as.character()
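
Here is a small sketch of why type guessing can be dangerous and how colClasses helps. The rows below are made up for illustration (they are not from StateIncomeData.csv) and are typed inline with the text argument so the example runs without a file.

```r
# Made-up mini-dataset, typed inline so the example runs without a file
csv_text = 'state,zip_code,income
Michigan,48109,28000
Maine,04330,27000'

# Default behavior: R guesses the types, so zip_code becomes an integer
# and the leading zero in 04330 is silently dropped
guessed = read.csv(text=csv_text)
class(guessed$zip_code)

# Forcing the types with colClasses keeps zip_code as text
forced = read.csv(text=csv_text,
                  colClasses=c('character', 'character', 'numeric'))
class(forced$zip_code)
forced$zip_code

# Converting a column after loading it
guessed$income = as.numeric(guessed$income)
class(guessed$income)
```

The colClasses vector has one entry per column, in the same order as the columns in the file.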

What if your data is an Excel spreadsheet? You may need additional packages for different file formats. The code below checks whether a package is installed, installs it if it isn’t, and loads it if it is. We’ll talk about how this sort of logic works later

if (!require('readxl')) {
    install.packages('readxl')
}
## Loading required package: readxl
## Warning: package 'readxl' was built under R version 4.3.2

Once you install a package, you don’t have to install it again. From then on, you just need to load it in when you want to use it (or directly reference it).

Side note on packages: Sometimes, it’s difficult to know which packages you’ll need ahead of time! Usually, I start with some basics that I know I use a lot. Then I add more later as I need them. If you’re using code from someone else and it uses functions from a package you don’t have installed, a banner will show up in RStudio to let you know.

library(readxl)

The read_excel function from the readxl package will parse .xls and .xlsx files. You can specify which sheet you want to load from a multi-sheet document using the “sheet” argument

The :: indicates that you’re using a function from a specific package:

package::function

I like to use it so I don’t lose track of which function came from which package

#setwd("")
# change this to where your data is located

df_from_xls = readxl::read_excel(path='C:/Users/juliegil/University of Michigan Dropbox/SPH-MICOM/PROJECTS/MISUPPORT/workshop_set/StateIncomeData.xlsx')

# but the code will also work without directly referencing the package:
# df_from_xls = read_excel(path='./StateIncomeData.xlsx')

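# Note that the two dataframes should contain the same data, provided the
# input files did. They may still differ in small ways: read_excel returns
# a tibble rather than a plain data.frame, and read.csv replaces characters
# like spaces in column names with dots.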
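
To try out the “sheet” argument without a multi-sheet file of your own, here is a rough sketch using the demo workbook that ships with readxl. It assumes readxl is installed; demo_path, sheet_names, df_by_name and df_by_pos are just names I picked for this example.

```r
library(readxl)

# readxl_example() returns the path to a demo workbook installed with readxl
demo_path = readxl::readxl_example('datasets.xlsx')

# List the names of the sheets in the workbook
sheet_names = readxl::excel_sheets(demo_path)
print(sheet_names)

# Load the second sheet by its name...
df_by_name = readxl::read_excel(path=demo_path, sheet=sheet_names[2])

# ...or by its position
df_by_pos = readxl::read_excel(path=demo_path, sheet=2)
```

Both calls load the same sheet, so use whichever is easier to read in your own code.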

SECTION 2b - Saving and re-opening an R Script file

  1. Save your R Script.

  2. Close RStudio.

  3. Open RStudio again and check what is still there (your file) and what has gone away (the Environment has been cleared).

  4. Now close your R Script by clicking the small x on the right side of your script file’s tab.

  5. There are two options for re-opening your file: (1) use the File > Open File feature in RStudio, or (2) open it directly from its folder location on your computer.

Now run all your code again so everything is loaded back to where we were!

(Note: If you haven’t talked about it already, talk about how to run a single line of code vs. a block of code using RStudio buttons & keyboard shortcuts)

SECTION 2c - Accessing data in a dataframe

Accessing columns:

You can access a column in a dataframe either using its name, or its index (its number starting from 1 for the leftmost column)

The $ indicates that you’re accessing a column in a dataframe:

df$colname

income_per_cap_by_name = df_from_csv$Per.capita.income

Per capita income is the 3rd column in the dataframe

Use square brackets to access elements of a dataframe by index.

The first argument in brackets is the row index, the second is the column index

If you leave either position blank, you’ll get every element in that axis

e.g. df[,1] gets the first column, df[3,] gets the third row

income_per_cap_by_idx = df_from_csv[,3]
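
Here is the same bracket indexing on a tiny made-up dataframe, which makes it easier to see what comes back. toy_df and its values are invented for this example.

```r
# A tiny made-up dataframe with 3 rows and 3 columns
toy_df = data.frame(state=c('A', 'B', 'C'),
                    income=c(100, 200, 300),
                    population=c(10, 20, 30))

first_col = toy_df[,1]     # every row of column 1 (comes back as a vector)
third_row = toy_df[3,]     # every column of row 3 (a one-row dataframe)
one_cell = toy_df[2,3]     # row 2, column 3

# Accessing a column by name or by index gives the same values
identical(toy_df$income, toy_df[,2])
```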

Creating columns

You can make new columns in a dataframe pretty easily

Let’s add a column that multiplies income per capita by 2

# First I'll do this using the vector I assigned from the per capita income col
df_from_csv$multiplied_income_from_vec = income_per_cap_by_name*2

# Next I'll do it by accessing the income per capita column directly
df_from_csv$multiplied_income_from_col = df_from_csv$Per.capita.income*2

Making new columns is also extremely useful as an intermediate step in data analysis

For example, here I’ll make a column that has the value TRUE for a state with a median household income over $60,000 and FALSE otherwise

df_from_csv$over_60k = df_from_csv$Median.household.income > 60000

Later on, you may want to rename a column. There are lots of ways to do that, but one is:

names(df)[names(df) == 'old.var.name'] <- 'new.var.name'

This code does the following:

  1. names(df) gets all the column names in the dataframe
  2. [names(df) == 'old.var.name'] picks out the name that matches the one you want to change
  3. <- 'new.var.name' assigns the new name in its place (<- is R’s other assignment operator; = works here too)

[http://stackoverflow.com/questions/7531868/how-to-rename-a-single-column-in-a-data-frame]

names(df_from_csv)[names(df_from_csv) == 'over_60k'] <- 'median_over_60k'

You can also access elements in a column by using vectors containing TRUE and FALSE. The result keeps only the elements that line up with a TRUE

In this example I’ll get all the states with median household incomes over $60k

states_over_60k = df_from_csv$State[df_from_csv$median_over_60k]
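
To see the TRUE/FALSE subsetting step by step, here is a small sketch with made-up numbers. incomes, labels and toy_df are invented for this example.

```r
incomes = c(45000, 72000, 58000, 66000)
labels = c('A', 'B', 'C', 'D')

# The comparison gives one TRUE or FALSE per element
over_60k = incomes > 60000

# Only the labels lined up with a TRUE are kept ('B' and 'D')
labels_over_60k = labels[over_60k]

# The same idea works for the rows of a dataframe:
# put the TRUE/FALSE vector in the row position and leave the column blank
toy_df = data.frame(label=labels, income=incomes)
rows_over_60k = toy_df[toy_df$income > 60000, ]
```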

SECTION 3a - Loops

Often we need to do the same task many times. Loops are programming structures that let us accomplish this

A for loop uses some special syntax:

The first variable in the for loop parentheses is called the control variable. It takes each value in the given range in turn, which determines where the loop starts and stops. We can also use it within the loop itself

In this case we’ll use the control variable to keep track of which iteration of the loop we’re on

Let’s write a for loop that squares a sequence of numbers

seq = c(5,6,7,8,9,10,11,12,13)

# In this example 1:length(seq) represents the range of values that i can
# take during the loop. length(seq) gets the length of the variable seq
# so this loop will go from 1 to 9, as there are 9 elements in seq

for (i in 1:length(seq)) {
    print(seq[i]^2)
}
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
## [1] 121
## [1] 144
## [1] 169
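
Printing is fine for a demo, but usually you’ll want to keep the results. Here is a minimal sketch that stores each square in a vector instead. nums and squares are names I made up, and seq_along() is a base R function I’m swapping in for 1:length().

```r
nums = c(5, 6, 7)

# Make a numeric vector of the right length to hold the results
squares = numeric(length(nums))

# seq_along(nums) does the same job as 1:length(nums), but it is safer:
# if nums is empty, the loop simply doesn't run
for (i in seq_along(nums)) {
    squares[i] = nums[i]^2
}

print(squares)
## [1] 25 36 49
```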

SECTION 3b - Conditional Statements

Let’s use a conditional statement to check whether a variable is greater than 5

The syntax for a conditional statement looks a bit like a for loop

The check itself is in parentheses, then the action that the code should take is within curly braces

An else statement tells R what to do if the check evaluates to FALSE

x = 7
check_var = NA

if (x > 5) {
    check_var = TRUE
} else {
    check_var = FALSE
}
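
Conditionals and loops work well together. As a rough sketch, here is the same check run on several made-up values, so you can see both branches in action. values and results are names invented for this example.

```r
values = c(3, 5, 7)

# One slot per value to hold the result of each check
results = logical(length(values))

for (i in 1:length(values)) {
    if (values[i] > 5) {
        results[i] = TRUE
    } else {
        # 5 itself ends up here, because 5 > 5 is FALSE
        results[i] = FALSE
    }
}

print(results)
## [1] FALSE FALSE  TRUE
```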

SECTION 3c - Functions

Let’s write a function to convert knots to mph

We define a function like we’re defining a variable

The name of the function goes on the left, then the function() statement tells R that we’re defining a function within the following curly braces.

The variable name(s) within the parentheses are the arguments of the function. They are the inputs that the function uses to generate its output.

Functions can have as many arguments as you want!

The return statement tells the function what should come out the other end

kn_to_mph = function(kn) {
    # 1 knot is about 1.15078 miles per hour
    mph = 1.15078*kn
    return(mph)
}

Make sure to define your function before you use it in your code.

Otherwise, functions you write behave just like functions that are built in to R or that come with packages.

speed_in_knots = 32
speed_in_mph = kn_to_mph(speed_in_knots)
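
Since functions can take more than one argument, here is a sketch of a more general version. convert_speed is a hypothetical helper I made up for this example, and giving the second argument a default value is my own addition.

```r
# Hypothetical helper: convert a speed using any conversion factor.
# The default factor of 1.15078 is the knots-to-mph factor used above
convert_speed = function(speed, factor=1.15078) {
    converted = speed*factor
    return(converted)
}

# Uses the default factor, so this converts knots to mph
speed_default = convert_speed(32)

# Overrides the default with a different (made-up) factor
speed_doubled = convert_speed(32, factor=2)
```

If you leave out an argument that has a default value, R uses the default.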