Use the diamonds data from the ggplot2 package to perform the following tasks:
diamonds data and tidyverse package
Very Good cut or better?
quantile() function.
D or J?
Use the diamonds data from the ggplot2 package to answer the following questions.
cut, color, and the two new columns.
See ?diamonds for a description of the characteristics.
Create a column ppc that records the price per carat of each diamond. Then group the diamonds by cut, color, and clarity and display the average price and average ppc in each group. Arrange by average ppc (highest to lowest).
Create a scatterplot to show the relationship between carat and ppc. Also, create another scatterplot to show the relationship between price and ppc. Does carat or price have more predictive power?
Add columns lower and upper that give 2 standard errors below and above (respectively) the sample average ppc. Arrange by lower (highest to lowest).
Add a column quartile to the diamonds data set that displays the quartile of each diamond's size (in carats). Label the quartiles Q1, Q2, Q3, Q4. Display all columns except x, y, and z.
cut_number().
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
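As a starting point for the lag() exercise, here is a minimal sketch on a small made-up data frame rather than the real flights data. The tibble toy_flights and the column prev_dep_delay are assumptions for illustration only. With the real data you would probably want to group by origin and sort by departure time before lagging, as the sketch does.

```r
library(dplyr)

# Toy data (hypothetical): five consecutive departures from one airport.
toy_flights <- tibble(
  origin    = rep("JFK", 5),
  dep_time  = c(600, 615, 630, 645, 700),
  dep_delay = c(0, 30, 25, 10, 0)
)

lagged <- toy_flights %>%
  group_by(origin) %>%
  arrange(dep_time, .by_group = TRUE) %>%
  # prev_dep_delay is the delay of the flight that left just before this one
  mutate(prev_dep_delay = lag(dep_delay)) %>%
  ungroup()

# Correlation between each delay and the preceding one.
# The first flight has no predecessor, so its lagged value is NA and is dropped.
cor(lagged$dep_delay, lagged$prev_dep_delay, use = "complete.obs")
```

A scatterplot of dep_delay against prev_dep_delay is another reasonable way to look at the same relationship.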
Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.
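One possible approach to counting flights before the first long delay is a cumulative sum over a logical condition. The sketch below uses a made-up tibble (toy and n_before are hypothetical names) and assumes dep_delay has no missing values; with the real data, cancelled flights would need to be filtered out first.

```r
library(dplyr)

# Toy data (hypothetical): two planes, flights listed with departure times.
toy <- tibble(
  tailnum   = c("N1", "N1", "N1", "N1", "N2", "N2"),
  dep_time  = c(500, 900, 1300, 1700, 600, 1000),
  dep_delay = c(5, 20, 90, 0, 75, 10)
)

before_first_long_delay <- toy %>%
  group_by(tailnum) %>%
  arrange(dep_time, .by_group = TRUE) %>%
  # cumsum() stays at 0 until the first delay above 60 minutes is seen,
  # so counting the zeros counts the flights before that first long delay.
  summarise(n_before = sum(cumsum(dep_delay > 60) == 0))

before_first_long_delay
```

In this toy data, plane N1 has two flights before its first delay of more than an hour, and plane N2 has none because its first flight is already delayed by 75 minutes.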
Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Compare geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
Visualise the distribution of carat, partitioned by price.
How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
What is the most common relig in this survey? What’s the most common partyid?
Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
There are some suspiciously high numbers in tvhours. Is the mean a good summary?
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
How could you collapse rincome into a small set of categories?
How does the distribution of flight times within a day change over the course of the year?
Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.
Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)
How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
On what day of the week should you leave if you want to minimise the chance of a delay?
What makes the distribution of diamonds$carat and flights$sched_dep_time similar?
Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.
Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?
In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.
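One way to approach the infinite-value rewrite is sketched below. It assumes the second variant computes the range over finite values only (via the finite argument of range()); treat it as one possible answer rather than the intended one.

```r
rescale01 <- function(x) {
  # Range of the finite values only; NA and +/-Inf are ignored here.
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  y <- (x - rng[1]) / (rng[2] - rng[1])
  # which() drops NA comparisons, so missing values are left untouched.
  y[which(x == -Inf)] <- 0
  y[which(x == Inf)] <- 1
  y
}

rescale01(c(-Inf, 0, 5, 10, Inf, NA))
# 0.0 0.0 0.5 1.0 1.0 NA
```

Note that -Inf and the smallest finite value both map to 0 under this choice, so the rescaled vector no longer distinguishes them.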
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))
x / sum(x, na.rm = TRUE)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
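As a hedged illustration of what such functions might look like, the sketch below wraps each of the three snippets. The function names (prop_na, prop_of_total, coef_variation) are hypothetical choices, not names from the original exercise.

```r
# Proportion of missing values in x.
prop_na <- function(x) {
  mean(is.na(x))
}

# Each element as a share of the total, ignoring missing values in the sum.
prop_of_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# Coefficient of variation: standard deviation relative to the mean.
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

x <- c(1, 2, NA, 5)
prop_na(x)        # 0.25
prop_of_total(x)  # 0.125 0.250 NA 0.625
coef_variation(x)
```

Each function takes a single argument; whether na.rm should be exposed as a parameter is a design choice worth considering as part of the exercise.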
Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
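A minimal sketch of the two functions follows. The variance uses the usual n - 1 denominator, and the skew uses one common moment-based definition (third central moment divided by the 3/2 power of the second central moment, both with denominator n). The linked page may use different denominators, so treat the exact formula as an assumption.

```r
# Sample variance with the n - 1 denominator (the same convention as var()).
variance <- function(x) {
  n <- length(x)
  m <- mean(x)
  sum((x - m)^2) / (n - 1)
}

# Moment-based skewness: positive for a long right tail, 0 for symmetric data.
skew <- function(x) {
  n <- length(x)
  m <- mean(x)
  second_moment <- sum((x - m)^2) / n
  third_moment <- sum((x - m)^3) / n
  third_moment / second_moment^(3 / 2)
}

variance(c(1, 2, 3, 4))
skew(c(1, 2, 3, 10))
```

Neither function handles missing values; adding an na.rm argument would be a natural extension.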