From small test harness sample of O2 binding proteins
Using `caret` and `caretEnsemble`
Title: EM test. Summary: Model-based clustering: Expectation Maximization
KNN=modelknn, C50=modelc50, SVMR=modelSvm, NB=modelnb, CART=modelCart, LDA=modellda
* Calculate the correlation matrix
* Find attributes that are highly correlated (ideally > 0.75)
* Learning Vector Quantization (LVQ) model: construct an LVQ model, then use `varImp` to estimate the variable importance, which is printed and plotted
* Estimate variable importance using `filterVarImp`
* Recursive feature selection
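The first two steps above (build a correlation matrix, then flag attribute pairs above the 0.75 cutoff) can be sketched in a few lines. This is a minimal Python illustration, not the notebook's R/`caret` code; the helper names `pearson` and `highly_correlated` and the toy columns are assumptions made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def highly_correlated(columns, cutoff=0.75):
    """Return (name_a, name_b, r) for every pair with |r| above the cutoff."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > cutoff:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy data (hypothetical): B tracks A closely, C does not.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.9],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = highly_correlated(toy)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Only the pair A/B exceeds the cutoff here; in practice one attribute from each flagged pair would be a candidate for removal.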
Random forest using 7 x 500 test-harness with 5 x 10 fold cross-validation.
Take a test-harness (randomly sampled file of protein classes) and produce the percent amino acid composition from the proteins listed. In this case there are 3500 proteins.
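As a rough illustration of what "percent amino acid composition" means for a single protein, here is a minimal Python sketch. It assumes the 20 standard one-letter amino acid codes and ignores any other character; the function name `percent_composition` and the example sequence are hypothetical and not taken from the notebook.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(sequence):
    """Return {amino_acid: percent of total} for one protein sequence."""
    sequence = sequence.upper()
    counts = Counter(ch for ch in sequence if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Made-up 10-residue peptide: each residue appears once.
comp = percent_composition("MVLSPADKTN")
print(comp["M"], comp["W"])
```

Applying the same function to every protein in the test harness would give one 20-column row per protein, which is the shape a classifier expects.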
This program samples 500 polypeptides from each of the files and combines them into a test-harness file. This test-harness file (4 MB) is much smaller than the full 27 MB dataset.
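The sampling step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: each class is already loaded as a list of records, sampling is without replacement, and the name `build_test_harness`, the seed, and the toy class names are all hypothetical.

```python
import random


def build_test_harness(class_records, per_class=500, seed=0):
    """Sample `per_class` records from each class and combine them.

    class_records maps a class label to a list of records
    (for example, protein identifiers or sequences).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    harness = []
    for label, records in class_records.items():
        if len(records) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} records")
        for record in rng.sample(records, per_class):  # without replacement
            harness.append((label, record))
    return harness


# Toy input (made up): 7 classes of 600 placeholder records each.
toy_classes = {
    f"class_{i}": [f"protein_{i}_{j}" for j in range(600)] for i in range(7)
}
harness = build_test_harness(toy_classes)
print(len(harness))
```

With seven classes and 500 records per class, the combined harness holds 3500 labeled records.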
A. Plot of Variances (x 10^5) Vs Amino Acid Type B. Plot of Means (% of Total) Vs Amino Acid Type
One important question I would like to answer with this exploratory data analysis: is this data set sufficient to successfully classify the seven classes of protein using a Random Forest machine learning approach?
What is Fib(37)?
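One way to answer the question is an iterative computation. The sketch below is in Python and assumes the common indexing Fib(0) = 0, Fib(1) = 1; the notebook may use a different method or indexing.

```python
def fib(n):
    """Return the n-th Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(fib(37))  # 24157817 under the indexing assumed above
```

The loop runs in linear time, unlike the naive doubly recursive definition, whose call count grows exponentially with n.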
Produces 5000 files in the FASTA format then compresses them into a bz2 archive. This program will be the basis for producing all CLI Tests for any number of students.
See: https://rpubs.com/oaxacamatt/R-cran-TM This notebook is a continuation of part A, which analyzes the words from R-cran. This notebook instead looks at the words from Bioconductor packages. We will then take the output from these two sets of data and do a simple set comparison, looking at A intersect B, A-B, and B-A.
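The three set comparisons named above map directly onto set operators. A minimal Python sketch, using two tiny made-up word sets in place of the real R-cran and Bioconductor vocabularies:

```python
# Hypothetical word sets; the real ones come from the two text-mining notebooks.
cran_words = {"data", "model", "plot", "regression"}
bioc_words = {"data", "genome", "plot", "sequence"}

shared = cran_words & bioc_words     # A intersect B
only_cran = cran_words - bioc_words  # A - B
only_bioc = bioc_words - cran_words  # B - A

print(sorted(shared))
print(sorted(only_cran))
print(sorted(only_bioc))
```

The three results are disjoint and together cover every word seen in either source.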
## Text mining R-cran
Work with the text from ONLY the second column of the 'R-cran' table found at [R-cran package short description](https://cran.r-project.org/), followed by 'Packages', then 'Table of available packages, sorted by name'.
For Part B see: https://rpubs.com/oaxacamatt/Bioconductor-TM
The take home message is that 'ceiling' is better than 'as.integer' due to the fact that 'as.integer' returns NAs by coercion. Also using 'ceiling' is slightly faster. ;)
Basic histogram: you need to practice this, using ggplot2.
I found this script on R-bloggers.com, liked it, and tweaked it a little.
title: "M.L. Using O2-Binding AA"
subtitle: "1. Exploratory Investigation of Percent AA composition"
author: "Matthew Curcio"
This module uses Random Forest on the percent amino acid (%AA) composition of oxygen-binding proteins (the 20 single amino acids, not dipeptides) to predict protein class.
This is the first portion of my oxygen binding proteins machine learning project. This portion only displays the random forest generation using the percent amino acid composition.
This is the start of my project to ?? first step produce data frames of o2 binders.
This does not give the same values as the script produced by J Leek.
What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?
Linear Regression Modelling
Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors.
Linear Models with Indicator variable for M sex.
Singular Value Analysis and Percent Variance
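The link between singular values and percent variance is that the share of variance explained by component i is d_i^2 divided by the sum of all squared singular values. A minimal Python sketch, using made-up singular values rather than any from the course data:

```python
def percent_variance(singular_values):
    """Percent of total variance explained by each singular value.

    Each component's share is d_i^2 / sum_j(d_j^2), expressed as a percent.
    """
    squares = [d * d for d in singular_values]
    total = sum(squares)
    return [100.0 * s / total for s in squares]


# Hypothetical singular values, already sorted in decreasing order.
pv = percent_variance([4.0, 2.0, 2.0, 1.0])
print(pv)
```

Here the squares are 16, 4, 4, and 1 (total 25), so the first component accounts for 64 percent of the variance.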
Statistics for Genomic Data Science
Epigenomics roadmap data
Coursera author: "MCC" date: "September 30, 2016"
This example sentiment analysis applies Natural Language Processing (NLP) in R with the tm library (text mining package, https://cran.r-project.org/web/packages/tm/). The corpus consists of 81 job postings found with the keywords 'computational biology' or 'bioinformatics' on http://www.biospace.com.
This is an attempt to answer HW 2 question #3