1 Level 1: Diamonds are Forever

Use the diamonds data from the ggplot2 package to perform the following tasks:

1.1 Load the diamonds data and tidyverse package

1.2 What proportion of diamonds are between 0.40 and 1.04 carats?

1.3 How many of the diamonds have equal x and y dimensions?

1.4 How many of the diamonds have a depth less than the mean?

1.5 How many diamonds have a Very Good cut or better?

1.6 Which diamond has the highest price per carat? What is the value?

1.7 Make boxplots of the diamond price for each cut.

1.8 Find the 95th percentile for diamond price.

1.9 What proportion of the diamonds with a price above the 95th percentile have the color D or J?

1.10 What proportion of diamonds with a clarity of VS2 have a Fair cut and a table below 56.1?

1.11 What is the average price per carat for each cut?


2 Level 2: Diamonds Keep their Value

Use the diamonds data from the ggplot2 package to answer the following questions.

2.1 Group diamonds by their cut and display the average price of each group.

2.2 Then create a visualization for the last exercise to show the average price of each group.

2.3 Group diamonds by color and display the average depth and average table of each group.

2.4 Then create a visualization for the last exercise to show the average depth and average table of each group.

2.5 Add two columns to the diamonds data set. The first column should give the average depth of diamonds in the diamond's color group. The second column should give the average depth of diamonds in the diamond's cut group. Show only the columns cut, color, and the two new columns.

2.6 Group diamonds by cut, color, and clarity. Show the average price of the diamonds in each group. Arrange by average price (highest to lowest).

2.7 What is the average price of the diamonds with the best cut, color, and clarity? Do the results from the previous question show that the diamonds with the best attributes (i.e., best cut, color, and clarity) have the highest average price? Please explain. See ?diamonds for a description of the characteristics.

2.8 Add another column to the diamonds data named ppc that records the price per carat of each diamond. Then group the diamonds by cut, color, and clarity and display the average price and average ppc in each group. Arrange by average ppc (highest to lowest).

2.9 Create a scatterplot to show the relationship between carat and ppc. Also, create another scatterplot to show the relationship between price and ppc. Does carat or price have more predictive power?

2.10 Create a rough confidence interval for the true mean ppc in each group by showing 2 standard errors above and below the sample average ppc. Recall the standard error is standard deviation divided by square root of sample size (\(se(\bar{x}) = s_x/\sqrt{n}\)). Specifically, create new columns named lower and upper that give 2 standard errors below and above (respectively) the sample average ppc. Arrange by lower (highest to lowest).
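The interval arithmetic in 2.10 can be sketched on a small vector first. This is only an illustration of the formula \(\bar{x} \pm 2 \cdot s_x/\sqrt{n}\); the vector x below is a hypothetical toy sample, not taken from the diamonds data, and the grouped version is left for the exercise.

```r
# Hypothetical toy sample (an assumption for illustration, not diamonds data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)        # sample size
x_bar <- mean(x)          # sample average
se    <- sd(x) / sqrt(n)  # standard error: s_x / sqrt(n)

# Rough interval: 2 standard errors below and above the sample average
lower <- x_bar - 2 * se
upper <- x_bar + 2 * se

c(lower = lower, mean = x_bar, upper = upper)
```

The same three quantities (mean, standard deviation, and count) are what a grouped summary would need to produce per group.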

2.11 Add a column named quartile to the diamonds data set that displays the quartile of each diamond's size (in carats). Label the quartiles Q1, Q2, Q3, Q4. Display all columns except x, y, and z.

2.12 Show the number and percentage of diamonds in each quartile (from previous question).

2.13 Make a boxplot of diamond price for each quartile of carat.

2.14 What are the average and minimum price per carat of the diamonds that cost more than $10000?

2.15 Of the diamonds costing less than $5000, what cut, color, and clarity combination has the most diamonds? What percentage of diamonds costing less than $5000 does this represent?


3 Exercises from R4DS book

3.1 R4DS: 5.7.1 Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  2. Which plane (tailnum) has the worst on-time record?

  3. What time of day should you fly if you want to avoid delays as much as possible?

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

  6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

  7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

  8. For each plane, count the number of flights before the first delay of greater than 1 hour.

3.2 R4DS: 7.5.1.1 Exercises

  1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

  2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

  3. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

3.3 R4DS: 7.5.2.1 Exercises

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

  2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

  3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

3.4 R4DS: 7.5.3.1 Exercises

  1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

  2. Visualise the distribution of carat, partitioned by price.

  3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

  4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

3.5 R4DS: 15.3.1 Exercises

  1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

  2. What is the most common relig in this survey? What’s the most common partyid?

  3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?

3.6 R4DS: 15.4.1 Exercises

  1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

  2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

  3. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

3.7 R4DS: 15.5.1 Exercises

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

  2. How could you collapse rincome into a small set of categories?

3.8 R4DS: 16.3.4 Exercises

  1. How does the distribution of flight times within a day change over the course of the year?

  2. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.

  3. Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)

  4. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

  5. On what day of the week should you leave if you want to minimise the chance of a delay?

  6. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

  7. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

3.9 R4DS: 19.2.1 Practice

  1. Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?

  2. In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.

  3. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?

    mean(is.na(x))

    x / sum(x, na.rm = TRUE)

    sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

  4. Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
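
Questions 1 and 2 above refer to rescale01(), which this sheet does not define. For convenience, here is a sketch of the two variants as they appear in R4DS section 19.2, written from memory, so please check them against the book. The name rescale01_finite is a label introduced here only to keep both variants side by side; the book reuses the name rescale01 for both.

```r
# First variant: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Second variant: ignore infinite values when computing the range,
# so -Inf and Inf pass through unchanged
# (rescale01_finite is a hypothetical name used only in this sheet)
rescale01_finite <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
rescale01_finite(c(1:10, Inf))
```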