Recently Published
Document
Finding environmental differences/distance.
Circular Statistics to Understand Random Expectations for Directional Data
Kristen Noelle Finch
3/28/2019
Ideas for this analysis came from astrostatistics.psu.edu/RLectures/day5.pdf, shared by Rich Cronn. The document lists no author or citation.
Data
I’m using predicted origins of individual Cedrela odorata s. s. trees based on 119 SNP genotypes. The error of each origin estimate is the Haversine distance, converted to km, between the true origin and the predicted origin. I also calculated a bearing (angular direction) from the true origin to the estimated origin. Both calculations were completed with the R package geosphere.
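A minimal Python sketch of the two quantities described above, for illustration only. It is not the geosphere implementation: the Earth radius constant, the function names, and the example coordinates are all assumptions.

```python
import math

# Assumed mean Earth radius in km; a different radius rescales every distance.
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees clockwise from north, in [0, 360), from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    y = math.sin(dlambda) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlambda)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0

# Hypothetical true and predicted origins (latitude, longitude), not real data.
true_origin = (10.0, -84.0)
predicted_origin = (8.5, -80.0)
error_km = haversine_km(*true_origin, *predicted_origin)
bearing = initial_bearing_deg(*true_origin, *predicted_origin)
print(error_km, bearing)
```

The bearing is measured at the true origin and points toward the estimate, matching the direction convention in the description.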
Assumption
I assume that the bearings of the observed origin estimations will be significantly different from a uniform distribution, because the origin estimations are biased by the geographical distribution of my specimens. Similarly, the origin estimations I generated with randomized genotypes should also be affected by this bias.
Question
Are our origin estimations from observed genotypes and randomized genotypes different from a uniform distribution of angles (or bearings) around any point?
Analyses
Rayleigh test - “This test is based on the fact that if the angles are equally scattered in all directions, then the resultant should be close to zero.” or “tests uniformity as opposed to too many angles in one direction.”
Watson’s test - “This is another test based on a similar idea: if the resultant is too large, then most possibly the directions are not uniform. Watson’s test provides an approximate threshold for the length of the resultant to be considered too large.”
Kuiper’s test - “This is a more sophisticated test that is based on the Kolmogorov-Smirnov idea.”
Rao’s spacing test - “This test is based on the fact that if the angles are uniformly scattered in all directions, then the arc lengths between any two of the angles should have a particular type of distribution.”
von Mises distribution - “So far we are talking about only the uniform distribution of angles. This is a special case of the von Mises distribution, which also allows the angles to crowd more towards a certain mean direction. This distribution takes two parameters: the mean direction and κ, which measures the concentration of the angles around the mean direction.” In this case we can test whether the given data follow a von Mises distribution with Watson’s test (use argument: dist='vm'). According to Wikipedia, the von Mises distribution is also known as the circular normal distribution.
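The resultant-length idea behind the Rayleigh test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R routine used in the analysis: the p-value uses only the first-order large-sample approximation exp(-Z), and the sample sizes, seed, mean direction, and concentration are arbitrary choices.

```python
import math
import random

def mean_resultant_length(angles_rad):
    """Length of the mean resultant vector: near 0 for uniform angles, near 1 for tightly clustered ones."""
    n = len(angles_rad)
    c = sum(math.cos(a) for a in angles_rad)
    s = sum(math.sin(a) for a in angles_rad)
    return math.hypot(c, s) / n

def rayleigh_test(angles_rad):
    """Return (Z, approximate p) where Z = n * Rbar**2 and p is roughly exp(-Z).

    The approximation is an assumption of this sketch; it is only reasonable for large n.
    """
    n = len(angles_rad)
    r_bar = mean_resultant_length(angles_rad)
    z = n * r_bar ** 2
    return z, min(1.0, math.exp(-z))

rng = random.Random(1)  # fixed seed so the illustration is repeatable
# Simulated bearings in radians: one uniform sample, one von Mises sample crowded around pi/2.
uniform_bearings = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(200)]
clustered_bearings = [rng.vonmisesvariate(math.pi / 2, 4.0) for _ in range(200)]

print(rayleigh_test(uniform_bearings))
print(rayleigh_test(clustered_bearings))
```

Bearings recorded in degrees would need converting with math.radians before calling either function.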
Chapter 1 Analysis
Dissertation Research.
Learning Module Two: Species Classification with Random Forests
Draft of Bioinformatics Workshop day 3.
Assessing Bait Efficiency
Dissertation Research: The data here were collected August 28, 2017 and samples were sequenced on or near August 14, 2017 at University of Oregon. I have a pool of paired-end 100 sequences from *Cedrela*, *Swietenia*, *Guarea*, and *Trichilia* species (Meliaceae). These sequences were obtained via hybridization capture, targeted enrichment, and short-read sequencing on the Illumina HiSeq 4000. Baits were designed from the transcriptome of *Cedrela odorata*. Here I am testing how many reads were captured by the baits across these species.
Chloroplast Assembly Stats
Dissertation Research: This is a comparison of the chloroplast assembly protocols ABySS and SPAdes. The graphs show sequential changes to the assembly, and these data were generated using GAEMR basic_assembly_stats.py.
DNASeq data from *C. odorata* from HiSeq 3000 PE-100.
Data Exploration: Climate
The purpose of this RPub is to aid in partitioning the data set for Fst tests. The vegan-generated PCs may aid in grouping samples according to climate similarity for Fst testing. For example, I could partition samples into high, moderate, and low values on PC1, PC2, PC3, etc.
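One way to read the high/moderate/low partition is as terciles of a single PC. The Python sketch below illustrates that reading. The tercile cut points, the function name, and the PC1 scores are assumptions, not values from the analysis.

```python
import statistics

def label_terciles(scores):
    """Label each score 'low', 'moderate', or 'high' using the two tercile cut points."""
    lower_cut, upper_cut = statistics.quantiles(scores, n=3)
    labels = []
    for score in scores:
        if score <= lower_cut:
            labels.append("low")
        elif score <= upper_cut:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels

# Hypothetical PC1 scores for nine samples (not real data).
pc1_scores = [-2.1, -1.4, -0.9, -0.2, 0.1, 0.4, 1.0, 1.7, 2.5]
groups = label_terciles(pc1_scores)
print(list(zip(pc1_scores, groups)))
```

The same function could be applied to PC2 or PC3 scores, giving one grouping per axis.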
Climate Space
Dissertation Research: The purpose of this analysis is to assess how much of the climate space is captured by my samples.
Climate data from WorldClim.
Citation:
*Fick, S.E. and R.J. Hijmans, 2017. Worldclim 2: New 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology.*
*Cedrela odorata* observation data from GBIF
Citation:
*GBIF (2012). Recommended practices for citation of the data published through the GBIF Network. Version 1.0 (Authored by Vishwas Chavan), Copenhagen: Global Biodiversity Information Facility. Pp.12, ISBN: 87-92020-36-4.*
Fst Tests for SNP Selection
Dissertation Research: The purpose of this analysis is to identify SNPs that are spatially informative.
Locus Maps
Dissertation research
June Data Analysis: Fsts and MAFs
Dissertation Data Analysis. Cedrela SNPs.
Random Forests Reanalyzed for Revision 1
These data pertain to Finch et al. 2017 *Applications in Plant Sciences*.
Random Forests and Unbalanced Classes
Testing if random forest classification is sensitive to unbalanced class sizes.
Misclassification of Cores
Finch et al. 2017
Applications in Plant Sciences
K-mer Frequency Distribution
This is an R Markdown document. The data for this analysis were collected on 13 January 2017. I have a pool of paired-end 100 sequences from Cedrela species. These sequences were obtained via hybridization capture, targeted enrichment, and short-read sequencing on the Illumina HiSeq 3000.
I used kmercountexact.sh from bbtools to produce a k-mer frequency distribution.
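For readers unfamiliar with the term, a k-mer frequency distribution counts how many distinct k-mers occur once, twice, and so on. The Python sketch below is a toy illustration of that idea, not a reimplementation of kmercountexact.sh. The function names and reads are hypothetical, and it does not merge reverse-complement k-mers.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_frequency_distribution(reads, k):
    """Map each depth (times a k-mer was seen) to the number of distinct k-mers at that depth."""
    kmer_counts = count_kmers(reads, k)
    return dict(sorted(Counter(kmer_counts.values()).items()))

# Hypothetical short reads, used only to show the shape of the result.
example_reads = ["ACGTACGT", "CGTACGTA"]
print(kmer_frequency_distribution(example_reads, k=4))
```

Plotting depth against the number of distinct k-mers gives the kind of histogram the description refers to.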
Alignment Assessment & Comparison
I'm comparing alignments for the same individuals to the bait sequences, the ced132 SPAdes de novo assembly, and the transcriptome reference that was used to select the bait sequences.
SPAdes de novo assembly of ced132 length distribution
Length distribution for contigs resulting from SPAdes assembly and post filtering. CEOD 300/ced132 Peru.
Target Coverage
R plot showing bait coverage by individual. This shows a random 10% subset of all baits.
Non-target Sequences; chloroplast
Graph showing alignment of non-target reads to two chloroplast genomes, by individual and by species.