Level 1: Diamonds are Forever
Use the diamonds data from the ggplot2 package to perform the following tasks (when appropriate, use round() to show 3 decimal places):
library(tidyverse)
data(diamonds) # load the diamonds data
What proportion of diamonds are between .40 and 1.04 carats?
How many of the diamonds have equal x and y dimensions?
How many of the diamonds have a depth less than the mean?
How many diamonds have a Very Good cut or better?
- Note that cut is an ordered factor so the levels are in order.
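Because the levels of an ordered factor have a defined order, comparison operators can be applied to it directly. The following is a minimal sketch on a small made-up vector; the name grade and its levels are hypothetical and are not taken from diamonds.

```r
# Hypothetical toy data: an ordered factor with three levels, lowest to highest
grade <- factor(
  c("Low", "High", "Medium", "High", "Low"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Comparisons respect the level order, so this flags "Medium" and "High"
at_least_medium <- grade >= "Medium"

# Summing a logical vector counts the TRUE values
n_at_least_medium <- sum(at_least_medium)
n_at_least_medium
```

The same pattern should carry over to any ordered factor, provided the comparison value is one of its levels.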
Which diamond has the highest price per carat? What is the value?
Make boxplots of the diamond price for each cut.
Find the 95th percentile for diamond price.
- Try the quantile() function.
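As a rough illustration of how quantile() is called, here is a sketch on made-up numbers rather than on the diamond prices. The vector x and the names q95 and prop_above are assumptions introduced only for this example.

```r
# Hypothetical toy data: the integers 1 to 100
x <- 1:100

# probs is given on the 0-1 scale, so 0.95 requests the 95th percentile
q95 <- quantile(x, probs = 0.95)

# Proportion of values strictly above that cutoff
prop_above <- mean(x > q95)
prop_above
```

Note that quantile() returns a named vector; unname() can be used if only the number is wanted.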
What proportion of the diamonds with a price above the 95th percentile have the color D or J?
Notice that this is asking for a conditional probability: \(\Pr(\text{color} \in \{D, J\} \mid \text{price} > q_{0.95})\), where \(q_{0.95}\) is the 95th percentile of price.
What proportion of diamonds with a clarity of VS2 have a Fair cut and a table below 56.1?
Notice that this is asking for a conditional probability: \(\Pr(\text{cut} = \text{Fair} \cap \text{table} < 56.1 \mid \text{clarity} = \text{VS2})\)
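One way to compute a conditional proportion is to keep only the rows that satisfy the conditioning event and then take the mean of a logical vector. The sketch below uses a hypothetical toy table; the columns grade, shape, and width are invented for illustration and do not appear in diamonds.

```r
library(dplyr)

# Hypothetical toy data, not taken from diamonds
toy <- tibble(
  grade = c("A", "A", "B", "A", "B", "A"),
  shape = c("round", "square", "round", "round", "square", "oval"),
  width = c(1.0, 2.5, 3.0, 0.5, 1.5, 4.0)
)

# Pr(shape = round AND width < 2 | grade = A):
# filter() keeps the conditioning event, mean() of a logical gives the proportion
prop_and <- toy %>%
  filter(grade == "A") %>%
  summarise(prop = mean(shape == "round" & width < 2)) %>%
  pull(prop)

# Pr(shape is round OR oval | grade = A): %in% expresses the union of values
prop_or <- toy %>%
  filter(grade == "A") %>%
  summarise(prop = mean(shape %in% c("round", "oval"))) %>%
  pull(prop)
```

The denominator is the number of rows that satisfy the condition, not the number of rows in the full table.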
What is the average price per carat for each cut?
There are two ways to look at this problem. One is to consider price per carat as a measure of each diamond and simply compute the mean for each group. The second way is to take the total price for each group and divide by the total carats.
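The two approaches generally give different answers. A minimal sketch on hypothetical data shows both side by side; the columns group, price, and weight and their values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: two groups with made-up prices and weights
toy <- tibble(
  group  = c("a", "a", "b", "b"),
  price  = c(100, 300, 200, 600),
  weight = c(1, 2, 1, 2)
)

both_ways <- toy %>%
  group_by(group) %>%
  summarise(
    mean_of_ratios  = mean(price / weight),     # average the per-item ratios
    ratio_of_totals = sum(price) / sum(weight)  # total price over total weight
  )
both_ways
```

The second version weights each item by its size, so larger items influence the group average more than they do in the first version.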
Level 2: Diamonds Keep their Value
Use the diamonds data from the ggplot2 package to answer the following questions.
library(tidyverse)
data(diamonds) # load the diamonds data
Group diamonds by their cut and display the average price of each group.
Then visualize the result of the previous exercise, showing the average price of each group. (Stats students are encouraged to include uncertainty.)
[Plot panels: Basic Bar Plot; Confidence Intervals]
Group diamonds by color and display the average depth and average table of each group.
Then visualize the result of the previous exercise, showing the average depth and average table of each group.
Add two columns to the diamonds data set. The first column should give the average depth of diamonds in the diamond's color group. The second column should give the average depth of diamonds in the diamond's cut group. Show only the columns cut, color, and the two new columns.
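Using mutate() after group_by() attaches a group-level summary to every row rather than collapsing the data. A rough sketch on a hypothetical toy table follows; the columns shape, shade, and depth are invented for this example.

```r
library(dplyr)

# Hypothetical toy data, not taken from diamonds
toy <- tibble(
  shape = c("round", "round", "oval", "oval"),
  shade = c("light", "dark", "light", "dark"),
  depth = c(10, 20, 30, 60)
)

with_means <- toy %>%
  group_by(shape) %>%
  mutate(avg_depth_shape = mean(depth)) %>%  # mean within each shape
  group_by(shade) %>%                        # replaces the previous grouping
  mutate(avg_depth_shade = mean(depth)) %>%  # mean within each shade
  ungroup() %>%
  select(shape, shade, avg_depth_shape, avg_depth_shade)
with_means
```

The result keeps one row per original observation, which is the main difference from summarise().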
Group diamonds by cut, color, and clarity. Show the average price of the diamonds in each group. Arrange by average price (highest to lowest).
What is the average price of the diamonds with the best cut, color, and clarity? Do the results from the previous question show that the diamonds with the best attributes (i.e., best cut, color, and clarity) have the highest average price? Please explain. See ?diamonds for a description of the characteristics.
Add another column to the diamonds data named ppc that records the price per carat of each diamond. Then group the diamonds by cut, color, and clarity and display the average price and average ppc in each group. Arrange by average ppc (highest to lowest).
Create a scatterplot to show the relationship between carat and ppc. Also, create another scatterplot to show the relationship between price and ppc. Does carat or price have more predictive power?
[Plot panels: Carat and ppc; Price and ppc]
Create a rough confidence interval for the true mean ppc in each group by showing 2 standard errors above and below the sample average ppc. Recall that the standard error is the standard deviation divided by the square root of the sample size (\(se(\bar{x}) = s_x/\sqrt{n}\)). Specifically, create new columns named lower and upper that give 2 standard errors below and above (respectively) the sample average ppc. Arrange by lower (highest to lowest).
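The formula \(\bar{x} \pm 2\, s_x/\sqrt{n}\) can be sketched with summarise(). The example below uses hypothetical data with a single grouping column; the names group, value, avg, and se are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data: two groups of four made-up values
toy <- tibble(
  group = rep(c("a", "b"), each = 4),
  value = c(1, 2, 3, 4, 10, 10, 20, 20)
)

intervals <- toy %>%
  group_by(group) %>%
  summarise(
    avg   = mean(value),
    se    = sd(value) / sqrt(n()),  # standard error of the mean
    lower = avg - 2 * se,           # 2 standard errors below the average
    upper = avg + 2 * se            # 2 standard errors above the average
  ) %>%
  arrange(desc(lower))
intervals
```

Within summarise(), later expressions can refer to columns created earlier in the same call, which is why lower and upper can use avg and se.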
Add a column named quartile to the diamonds data set that gives the quartile of each diamond's size (in carats). Label the quartiles Q1, Q2, Q3, Q4. Display all columns except x, y, and z.
- Hint: use the function cut_number().
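As a small illustration of the hint, cut_number() from ggplot2 splits a numeric vector into groups with roughly equal numbers of observations. The sketch below runs on made-up numbers; the names size and size_group are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: eight made-up sizes
size <- c(0.2, 0.3, 0.5, 0.7, 0.9, 1.0, 1.5, 2.0)

# n = 4 requests four groups; labels is passed on to base::cut()
size_group <- cut_number(size, n = 4, labels = c("Q1", "Q2", "Q3", "Q4"))

# Number of observations in each group
table(size_group)
```

With heavily tied data the groups may not be exactly equal in size, so it is worth checking the counts.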
Show the number and percentage of diamonds in each quartile (from previous question).
Make a boxplot of diamond price for each quartile of carat.
What are the average and minimum price per carat of the diamonds that cost more than $10000?
Of the diamonds costing less than $5000, what cut, color, and clarity combination has the most diamonds? What percentage of diamonds costing less than $5000 does this represent?
Level 3: Finding Inequalities
Universities (and corporations) are often concerned about gender inequality. The file http://mdporter.github.io/ST597/data/ucb.csv contains admission decisions for a sample of \(n=4526\) students at a university, by academic department.
Task 1
Using only aggregate data (i.e., ignore department), do you think there is evidence of gender discrimination?
- Read in the data using read_csv().
- Create a new data frame that contains the frequency and relative frequency of admittance by Gender. Show the data frame.
- Make a relative frequency bar graph of admittance by Gender.
- Based on this information, indicate if you think there is gender discrimination. Provide justification.
- Especially for statistics students: Use statistical methods (confidence intervals or hypothesis tests) to support your claims. See: binom.test() and prop.test()
This code will read in the data:
library(tidyverse)
ucb = read_csv("http://mdporter.github.io/ST597/data/ucb.csv")
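The frequency table and the test of proportions from the task list can be sketched on hypothetical data. The counts below are made up and are not the ucb data; the column names group and outcome are assumptions, since the real file's columns should be checked after reading it in.

```r
library(dplyr)

# Hypothetical counts, NOT the ucb data: 40 of 100 in group "g1" and
# 55 of 100 in group "g2" have a "yes" outcome
toy <- tibble(
  group   = rep(c("g1", "g2"), each = 100),
  outcome = c(rep(c("yes", "no"), times = c(40, 60)),
              rep(c("yes", "no"), times = c(55, 45)))
)

# Frequency (n) and relative frequency (prop) of each outcome within each group
freq <- toy %>%
  count(group, outcome) %>%
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Two-sample test of equal proportions; x = number of "yes", n = group sizes
yes_counts <- filter(freq, outcome == "yes")
group_sizes <- count(toy, group)
test <- prop.test(x = yes_counts$n, n = group_sizes$n)

# Confidence interval for the difference in the proportion of "yes"
test$conf.int
```

If the interval excludes zero, the data are not consistent with equal proportions at the chosen confidence level; that alone does not establish the cause of the difference.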
Task 2
Now use the department information. Do you find evidence of gender discrimination for a particular department?
- Create a new data frame that contains the frequency and relative frequency of admittance by Gender for each Dept. Show the full data frame.
- Make a bar graph with Gender on the x-axis, proportion admitted on the y-axis, and faceted by Dept.
- Make a side-by-side bar graph with Dept on the x-axis, proportion admitted on the y-axis, and Gender as the side-by-side.
- Based on the Department level information, indicate if you think there is gender discrimination. If so, list the guilty departments.
- Hint: Always keep uncertainty of the estimates in mind (and hence sample size).
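The two bar graph layouts described above can be sketched with ggplot2 on a small table of proportions. The data below are hypothetical and are not the ucb data; the column names unit, group, and prop are invented for illustration.

```r
library(ggplot2)

# Hypothetical proportions, NOT the ucb data
toy <- data.frame(
  unit  = rep(c("U1", "U2"), each = 2),
  group = rep(c("g1", "g2"), times = 2),
  prop  = c(0.60, 0.65, 0.30, 0.25)
)

# Faceted layout: one panel per unit, group on the x-axis
p_facet <- ggplot(toy, aes(x = group, y = prop)) +
  geom_col() +
  facet_wrap(~ unit)

# Side-by-side layout: unit on the x-axis, groups dodged next to each other
p_dodge <- ggplot(toy, aes(x = unit, y = prop, fill = group)) +
  geom_col(position = "dodge")

p_facet
p_dodge
```

The dodged layout makes within-unit comparisons easier, while the faceted layout keeps each unit visually separate. Neither shows sample size, so the hint about uncertainty still applies.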
Task 3
Summarize your findings.