Recently Published
Data 606 Final Project
The final report should be presented in a more formal format. Consider your audience to be non-data analysts. Fellow data analysts (i.e. students) will be able to access your R Markdown file for details on the analysis. Submit a Zip file with your R Markdown file, the HTML output, and any supplementary files (e.g. data, figures, etc.). You must address the following five sections:
Introduction: What is your research question? Why do you care? Why should others care?
Data: Write about the data from your proposal in text form. Address the following points:
Data collection: Describe how the data were collected.
Cases: What are the cases? (Remember: case = units of observation or units of experiment)
Variables: What are the two variables you will be studying? State the type of each variable.
Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.
Scope of inference - generalizability: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.
Scope of inference - causality: Can these data be used to establish causal links between the variables of interest? Explain why or why not.
Exploratory data analysis: Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.
Inference: If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.
Check conditions
Theoretical inference (if possible) - hypothesis test and confidence interval
Simulation based inference - hypothesis test and confidence interval
Brief description of methodology that reflects your conceptual understanding
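The simulation-based items above (hypothesis test and confidence interval) can be sketched as follows. This is a minimal illustration, not the course's prescribed method: it uses a permutation test and a percentile bootstrap for a difference in means, is written in Python rather than the R the project calls for, and the two groups are made-up numbers. The function names `permutation_test` and `bootstrap_ci` are hypothetical.

```python
import random
import statistics


def permutation_test(a, b, n_iter=5000, seed=1):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_iter


def bootstrap_ci(a, b, n_iter=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        # Resample each group with replacement, keeping its size.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    low = diffs[int((1 - level) / 2 * n_iter)]
    high = diffs[int((1 + level) / 2 * n_iter) - 1]
    return low, high


# Made-up illustrative data, not taken from any real study.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 16.0, 15.1]
group_b = [10.2, 11.8, 12.5, 11.1, 12.9, 10.7, 11.4, 12.0]

observed, p_value = permutation_test(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a, group_b)
print(f"observed difference: {observed:.3f}")
print(f"permutation p-value: {p_value:.4f}")
print(f"bootstrap interval: ({ci_low:.3f}, {ci_high:.3f})")
```

When the conditions for a theoretical method also hold, comparing its interval with the bootstrap interval is a quick check that the two approaches agree.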
Conclusion: Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.
Data 607 Final Project
Data 607_DS in Context
Data Science in Context Presentation for Data 607
Data 606_Ch 7 Homework
Linear Regression
Data 607 Project 4
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
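One common way to approach this kind of task is a naive Bayes classifier over word counts. The sketch below is only an illustration under stated assumptions: the project does not name a method, the training documents are invented toy examples rather than a real spam/ham corpus, the helper names (`tokenize`, `train`, `predict`) are hypothetical, and it is written in Python rather than R.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real data needs more cleaning."""
    return text.lower().split()


def train(docs):
    """docs is a list of (text, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in docs:
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training set for illustration only.
training_docs = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
    ("please review the project report", "ham"),
]

model = train(training_docs)
print(predict("claim your free money", model))
print(predict("project meeting on monday", model))
```

Holding out part of the labeled data, as the description suggests, gives an honest estimate of how well the classifier handles new documents.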
Data 606_Ch 6 Homework
Inference for Categorical Data
Data 606: Lab6 Inference for Categorical Data
Inference for categorical data
Data 607 - Assignment 9 - Web APIs
The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs
- You’ll need to start by signing up for an API key.
- Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.
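The JSON-to-table step can be sketched without calling any API. The payload below is a hypothetical stand-in: its field names (`results`, `title`, `section`, `published`, `byline`) are invented for illustration and are not the actual New York Times response schema. The sketch is in Python rather than R and only shows the general idea of flattening nested records into rows with consistent columns.

```python
import json

# Hypothetical response body; the real API's fields will differ.
payload = """
{"status": "OK",
 "results": [
   {"title": "Example headline A", "section": "Science",
    "published": "2024-01-05", "byline": {"author": "A. Writer"}},
   {"title": "Example headline B", "section": "Arts",
    "published": "2024-01-06", "byline": {"author": "B. Reporter"}}
 ]}
"""


def flatten(record, parent_key="", sep="."):
    """Turn nested dictionaries into one flat dictionary with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# One flat dictionary per article is the row-wise form of a data frame.
rows = [flatten(item) for item in json.loads(payload)["results"]]
columns = sorted({key for row in rows for key in row})

print(columns)
for row in rows:
    print([row.get(col) for col in columns])
```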
Data606_ Ch5 HW
Exercises from the following text:
OpenIntro Statistics 3rd Ed. Chapter 5: Inference for Numerical Data
Data 606 Lab 5: Inference for numerical data
Inference for numerical data
Data 607: Project 2
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to choose any three of the “wide” datasets. For each of the three chosen datasets:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
- Perform the analysis requested in the discussion item.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
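The tidying step described above, reshaping a "wide" table so each row holds one observation, can be sketched as follows. The assignment asks for tidyr and dplyr in R; this Python version only illustrates the reshaping idea. The CSV contents are invented, and `pivot_longer` here is a hypothetical helper loosely modeled on the tidyr function of the same name.

```python
import csv
import io

# Invented wide-format data: one column per year.
wide_csv = """city,2019,2020,2021
Albany,10,12,9
Buffalo,7,8,11
"""


def pivot_longer(rows, id_col, names_to, values_to):
    """Reshape wide rows into one row per (id, column name, value)."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            long_rows.append({id_col: row[id_col], names_to: col, values_to: int(value)})
    return long_rows


wide_rows = list(csv.DictReader(io.StringIO(wide_csv)))
long_rows = pivot_longer(wide_rows, "city", "year", "count")

for row in long_rows:
    print(row)

# Once the data are long, grouped summaries are a simple loop.
totals = {}
for row in long_rows:
    totals[row["city"]] = totals.get(row["city"], 0) + row["count"]
print(totals)
```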
DATA 607_week7_web technologies
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
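The question of whether the three loaded tables are identical can be explored with a small sketch. It uses Python's standard library rather than R, placeholder book data ("Book One", "Author A"), and hypothetical helper names; the file layouts are only one of many reasonable choices. It suggests that the raw results usually differ in types (strings versus numbers, lists versus delimited text) until they are normalized.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Placeholder book data in three formats (not real titles).
books_json = """[
  {"title": "Book One", "authors": ["Author A", "Author B"], "year": 2001},
  {"title": "Book Two", "authors": ["Author C"], "year": 2010}
]"""

books_xml = """<books>
  <book><title>Book One</title><authors><author>Author A</author><author>Author B</author></authors><year>2001</year></book>
  <book><title>Book Two</title><authors><author>Author C</author></authors><year>2010</year></book>
</books>"""

books_html = """<table>
  <tr><th>title</th><th>authors</th><th>year</th></tr>
  <tr><td>Book One</td><td>Author A; Author B</td><td>2001</td></tr>
  <tr><td>Book Two</td><td>Author C</td><td>2010</td></tr>
</table>"""


def from_json(text):
    return [dict(book) for book in json.loads(text)]


def from_xml(text):
    rows = []
    for book in ET.fromstring(text).findall("book"):
        rows.append({
            "title": book.findtext("title"),
            "authors": [a.text for a in book.iter("author")],
            "year": book.findtext("year"),  # XML text is always a string
        })
    return rows


class TableParser(HTMLParser):
    """Collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def from_html(text):
    parser = TableParser()
    parser.feed(text)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def normalize(row):
    """Coerce a row to a common shape: list of authors, integer year."""
    authors = row["authors"]
    if isinstance(authors, str):
        authors = [a.strip() for a in authors.split(";")]
    return {"title": row["title"], "authors": authors, "year": int(row["year"])}


json_rows = from_json(books_json)
xml_rows = from_xml(books_xml)
html_rows = from_html(books_html)

print("raw JSON == raw XML:", json_rows == xml_rows)
print("normalized all equal:",
      [normalize(r) for r in json_rows]
      == [normalize(r) for r in xml_rows]
      == [normalize(r) for r in html_rows])
```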
Data606 Assignment5- Ch4: Foundations for Inference
Foundations for Inference
Data606 Lab4b: Confidence Levels
Confidence Levels
Data606_Lab4a: Sampling Distributions
In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
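The idea of a sampling distribution can be illustrated with a short simulation. This is a sketch under stated assumptions, not the lab's own code: it is written in Python rather than R, the population is synthetic (drawn from an exponential distribution to make it skewed), and `sampling_distribution` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic, right-skewed population with a mean near 50.
population = [rng.expovariate(1 / 50) for _ in range(10000)]


def sampling_distribution(population, sample_size, n_samples, rng):
    """Means of repeated simple random samples (without replacement)."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


means_10 = sampling_distribution(population, 10, 2000, rng)
means_100 = sampling_distribution(population, 100, 2000, rng)

print(f"population mean: {statistics.mean(population):.2f}")
print(f"n=10:  mean of sample means {statistics.mean(means_10):.2f}, "
      f"spread {statistics.stdev(means_10):.2f}")
print(f"n=100: mean of sample means {statistics.mean(means_100):.2f}, "
      f"spread {statistics.stdev(means_100):.2f}")
```

Larger samples give sample means that cluster more tightly around the population mean, which is why the spread for n=100 comes out smaller than for n=10.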
Data606_Lab3: Normal Distributions
In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
Data607_Assignment4 (week5): Tidy and Transform Data
1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
3. Perform analysis to compare the arrival delays for the two airlines.
Data607_Project1-Chess
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.