gravatar

oaxacamatt

Matt Curcio

Recently Published

Produce % AA Composition
From small test harness sample of O2 binding proteins
EM - Learning
Document test
Kmeans test
Multiple Caret Run: Naive Bayes, KNN, C5.0 with AA Test-harness dataset
Using `caret` and `caretEnsemble`
EM test
Title: EM test Summary: Model based clustering:Expectation Maximization
Iris-caret-6ML-models
KNN=modelknn, C50=modelc50, SVMR=modelSvm, NB=modelnb, CART=modelCart, LDA=modellda
Cubist Example
Feature_Selection_Using_Caret
* Calculate correlation matrix * Find attributes that are highly corrected (ideally > 0.75) * Learning Vector Quantization (LVQ) model * Construct an Learning Vector Quantization (LVQ) model. The varImp is then used to estimate the variable importance, which is printed and plotted. * Estimate variable importance using filterVarImp * Recursive feature selection
Tuning RF Using ntree
Random-Forest-of-AA-on-Test-Harness-w-Scale-Center
Random forest using 7 x 500 test-harness with 5 x 10 fold cross-validation.
Produce_Single_AA_Percent_Composition_For_Test-Harness.rmd
Take a test-harness (randomly sampled file of protein classes) and produce the percent amino acid composition from the proteins listed. In this case there are 3500 proteins.
Alt-500-Test-Harness.rmd
This program samples 500 polypeptides from each of the files and combines them into a test-harness file. This test harness file (4MB) is much smaller than if I were to run the full dataset of 27MB.
EDA-Part2-Oxygen-binding.rmd
A. Plot of Variances (x 10^5) Vs Amino Acid Type B. Plot of Means (% of Total) Vs Amino Acid Type
EDA-Part1-Oxygen-Binding-Proteins
One important question which I would like to answer with this exploratory data analysis, is this data set sufficient to successfully classify the seven classes of protein using a Random Forest machine learning approach?
is.palindrome
Fibonacci Numbers
What is Fib(37)?
The Monty Hall Problem
Who Wins?
cli_test_generator
Produces 5000 files in the FASTA format then compresses them into a bz2 archive. This program will be the basis for producing all CLI Tests for any number of students.
Bioconductor TM
See: https://rpubs.com/oaxacamatt/R-cran-TM This notebook is a continuation from part A which looks at the words from R-cran and does analysis on them. However this notebook looks at the words from Bioconductor packages. We will then take the output from these two sets of data and do a simple set comparison, looking A intersect B, A-B and B-A.
Text mining R-cran Descriptions - Part A
## Text mining R-cran Work with the text from ONLY the second column of 'R-cran' text found on [R-cran package short description](https://cran.r-project.org/) followed by 'Packages' then 'Table of available packages, sorted by name' For Part B see: https://rpubs.com/oaxacamatt/Bioconductor-TM
Microbenchmarking for a random number generator
The take home message is that 'ceiling' is better than 'as.integer' due to the fact that 'as.integer' returns NAs by coercion. Also using 'ceiling' is slightly faster. ;)
Generate Random RNA / FASTA
Plotting with Expressions in Labels
Brownian Motion
Stratified Means
Math Pix
Basic Histogram
Basic Histogram - you need to practice this, using GGplot.
Joy Division - Unknown Pleasures Album cover
I found this script on Rbloggers.com and I liked it and tweaked it a little bit.
Publish Document
title: "M.L. Using O2-Binding AA" subtitle: "1. Exploratory Investigation of Percent AA composition" author: "Matthew Curcio"
Use of Machine Learning wrt Oxygen-Binding Proteins
This module uses Random Forest against the %AA composition (using 20AA Not Dipeptides) of Oxygen binding proteins to predict classification
Random Forest Investigation
This is the first portion of my oxygen binding proteins machine learning project. This portion only displays the random forest generation using the percent amino acid composition.
O2 Binders
This is the start of my project to ?? first step produce data frames of o2 binders.
04_10_eQTL
This does not give the same values as the script produced by J Leek.
Global temperatures since 1880
JLeek_GLM 3_14
"GLM_3_14_pvalues_multiple_testing"
coursera_genomics_stats_wk2_quiz_ques5
What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?
coursera_genomics_stats_wk2_quiz_ques4
Linear Regression Modelling
coursera_genomics_stats_wk2_quiz_ques2
Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors.
coursera_genomics_stats_wk2_quiz_ques4
Linear Models with Indicator variable for M sex.
coursera_genomics_stats_wk2_quiz_ques3
Linear Models
coursera_genomics_stats_wk2_quiz_ques1
Singular Value Analysis and Percent Variance
Quiz Wk 1, Question #9
Statistics for Genomic Data Science
MCC Bioconductor Project 4
Test 3
Basic GRanges & AnnotationHub
Bioconductor Exam 2
Coursera Bioconductor Quiz 2
Epigenomics roadmap data
Bioconductor for Genomic Data Science, Quiz 1
Coursera author: "MCC" date: "September 30, 2016"
Model fitting for an interesting puzzle
See: https://youtu.be/i4VqXRRXi68
Example 1 of Sentiment Analysis Using Natural Language Processing
This example sentiment analysis uses Natural Language Processing (NLP) using R and the library TM (text mining package, https://cran.r-project.org/web/packages/tm/). The corpus uses 81 job postings using the key words, computation biology or bioinfromatics from http://www.biospace.com.
neuralnet
Joe's Code
Chi Sq simulation
Quantitative Genetics, HW #2, Question #3
This is an attempt to answer HW 2 question #3