Matt Curcio

Recently Published

Produce % AA Composition
From small test harness sample of O2 binding proteins
EM - Learning
Document test
Kmeans test
Multiple Caret Run: Naive Bayes, KNN, C5.0 with AA Test-harness dataset
Using `caret` and `caretEnsemble`
EM test
Model-based clustering: Expectation Maximization
KNN=modelknn, C50=modelc50, SVMR=modelSvm, NB=modelnb, CART=modelCart, LDA=modellda
Cubist Example
* Calculate the correlation matrix
* Find attributes that are highly correlated (ideally > 0.75)
* Learning Vector Quantization (LVQ) model
* Construct a Learning Vector Quantization (LVQ) model. varImp is then used to estimate the variable importance, which is printed and plotted.
* Estimate variable importance using filterVarImp
* Recursive feature selection
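The first two steps (correlation matrix, then flagging highly correlated attributes) can be illustrated with a minimal Python sketch. This is not the notebook's own code, which appears to be written in R; the column names and values are hypothetical toy data, and the 0.75 cutoff is taken from the description above.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant column would divide by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated(columns, cutoff=0.75):
    """Return (name_a, name_b, r) for every column pair with |r| > cutoff."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > cutoff:
            flagged.append((a, b, r))
    return flagged


# Hypothetical toy attributes: B is almost a multiple of A, C is unrelated.
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10.5],
    "C": [5, 1, 4, 2, 3],
}
pairs = highly_correlated(toy)
```

Here only the A/B pair should be flagged, which is the kind of redundancy a feature-selection step would then remove.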
Tuning RF Using ntree
Random forest using 7 x 500 test-harness with 5 x 10 fold cross-validation.
Take a test-harness (randomly sampled file of protein classes) and produce the percent amino acid composition from the proteins listed. In this case there are 3500 proteins.
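As a rough illustration of what "percent amino acid composition" means, here is a minimal Python sketch. It is not the notebook's code; the function name and the toy peptide are hypothetical, and it assumes one-letter codes for the 20 standard residues, ignoring any other characters.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def percent_composition(sequence):
    """Return {residue: percent of the sequence} for the 20 standard amino acids.

    Characters outside the 20-letter alphabet are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy peptide: 3 of its 8 residues are alanine.
example = percent_composition("MKVLAAGA")
```

Applying the function to each protein in the test-harness would give one 20-column row per protein, which is the shape a classifier expects.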
This program samples 500 polypeptides from each of the files and combines them into a test-harness file. The test-harness file (4MB) is much smaller than the full 27MB dataset.
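The sampling step can be sketched in Python as follows. This is an assumption-laden illustration rather than the original program: the function name, the in-memory dictionary of classes, and the fixed seed are all hypothetical, and file reading is left out.

```python
import random


def build_test_harness(classes, per_class=500, seed=0):
    """Sample up to `per_class` records from each class and combine them.

    `classes` maps a class label to a list of records (hypothetical layout).
    Returns a list of (label, record) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the harness is reproducible
    harness = []
    for label, records in classes.items():
        k = min(per_class, len(records))  # small classes contribute everything
        harness.extend((label, rec) for rec in rng.sample(records, k))
    return harness


# Hypothetical input: two classes of different sizes.
demo = build_test_harness({"a": list(range(1000)), "b": list(range(300))})
```

Sampling the same number from each class keeps the harness balanced, at the cost of discarding most records from the larger classes.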
A. Plot of Variances (x 10^5) vs. Amino Acid Type
B. Plot of Means (% of Total) vs. Amino Acid Type
One important question I would like to answer with this exploratory data analysis: is this data set sufficient to successfully classify the seven classes of protein using a Random Forest machine learning approach?
Fibonacci Numbers
What is Fib(37)?
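One way to answer the question is an iterative loop, sketched here in Python. The indexing convention Fib(1) = Fib(2) = 1 is an assumption; the notebook may use a different one.

```python
def fib(n):
    """Iterative Fibonacci, assuming fib(0) == 0 and fib(1) == fib(2) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(fib(37))  # 24157817
```

The loop runs in linear time, whereas the naive two-branch recursion recomputes the same values exponentially many times.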
The Monty Hall Problem
Who Wins?
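A short simulation is a common way to settle the question. The Python sketch below assumes the standard rules (three doors, the host always opens a goat door the player did not pick); the function names, trial count, and seed are illustrative choices, not taken from the notebook.

```python
import random


def play(switch, rng):
    """Play one round and return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=1):
    """Fraction of simulated rounds won with the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


switch_rate = win_rate(switch=True)
stay_rate = win_rate(switch=False)
```

Under these rules switching should win about two thirds of the time and staying about one third.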
Produces 5000 files in FASTA format, then compresses them into a bz2 archive. This program will be the basis for producing all CLI tests for any number of students.
Bioconductor TM
See: This notebook is a continuation of Part A, which analyses the words from R-cran. This notebook instead looks at the words from Bioconductor packages. We then take the output from these two sets of data and do a simple set comparison, looking at A intersect B, A-B and B-A.
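The set comparison (A intersect B, A-B, B-A) can be illustrated with Python's built-in sets. The word lists below are hypothetical placeholders, not the notebook's actual output.

```python
# Hypothetical word sets standing in for the two text-mining outputs.
cran_words = {"model", "data", "plot", "regression"}   # A
bioc_words = {"data", "genome", "plot", "annotation"}  # B

shared = cran_words & bioc_words     # A intersect B: words used by both
cran_only = cran_words - bioc_words  # A - B: words found only in A
bioc_only = bioc_words - cran_words  # B - A: words found only in B
```

The three results are disjoint and together cover every word in either set.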
Text mining R-cran Descriptions - Part A
Text mining R-cran: work with the text from only the second column of the 'R-cran' text found on the R-cran package short description page, followed by 'Packages' then 'Table of available packages, sorted by name'. For Part B see:
Microbenchmarking for a random number generator
The take-home message is that 'ceiling' is better than 'as.integer' because 'as.integer' returns NAs by coercion. Using 'ceiling' is also slightly faster. ;)
Generate Random RNA / FASTA
Plotting with Expressions in Labels
Brownian Motion
Stratified Means
Math Pix
Basic Histogram
Basic Histogram - you need to practice this, using GGplot.
Joy Division - Unknown Pleasures Album cover
I found this script on and I liked it and tweaked it a little bit.
Publish Document
title: "M.L. Using O2-Binding AA" subtitle: "1. Exploratory Investigation of Percent AA composition" author: "Matthew Curcio"
Use of Machine Learning wrt Oxygen-Binding Proteins
This module applies Random Forest to the %AA composition (20 single amino acids, not dipeptides) of oxygen-binding proteins to predict their classification.
Random Forest Investigation
This is the first portion of my oxygen binding proteins machine learning project. This portion only displays the random forest generation using the percent amino acid composition.
O2 Binders
This is the start of my project to ?? The first step is to produce data frames of O2 binders.
This does not give the same values as the script produced by J Leek.
Global temperatures since 1880
JLeek_GLM 3_14
What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?
Linear Regression Modelling
Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors.
Linear Models with Indicator variable for M sex.
Linear Models
Singular Value Analysis and Percent Variance
Quiz Wk 1, Question #9
Statistics for Genomic Data Science
MCC Bioconductor Project 4
Test 3
Basic GRanges & AnnotationHub
Bioconductor Exam 2
Coursera Bioconductor Quiz 2
Epigenomics roadmap data
Bioconductor for Genomic Data Science, Quiz 1
Coursera author: "MCC" date: "September 30, 2016"
Model fitting for an interesting puzzle
Example 1 of Sentiment Analysis Using Natural Language Processing
This example sentiment analysis uses Natural Language Processing (NLP) with R and the library TM (text mining package). The corpus consists of 81 job postings found using the keywords computational biology or bioinformatics from
Joe's Code
Chi Sq simulation
Quantitative Genetics, HW #2, Question #3
This is an attempt to answer HW 2 question #3