Guillaume Filion
### Guillaume Filion

I finally took the time to write this post. It explains how I misunderstood Principal Component Analysis and how a real-world example taught me something very important about unsupervised classification.

"The fact that experiments segregate by laboratory of origin does not mean that this dominates the signal. The differences between labs were tiny, but they were systematic, happening always on the same few loci. The differences between proteins were large, but unstructured, so they were not picked up by the PCA."

ENCODE data, Principal Components and racism | Filed under ENCODE, racism, series: genetics and racism, Principal Component Analysis.
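
The point made in the quote can be reproduced with a small simulation. The sketch below is not the analysis from the post: the sizes (200 experiments, 1,000 loci, two labs, a shift on 50 loci) are made-up values chosen only for illustration, and it assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_exp, n_loci, n_lab_loci = 200, 1000, 50
lab = np.repeat([0, 1], n_exp // 2)           # two labs, 100 experiments each

# Large but unstructured differences: independent noise at every locus.
X = rng.normal(0.0, 1.0, size=(n_exp, n_loci))
# Small but systematic difference: lab 1 is shifted on the same 50 loci.
X[lab == 1, :n_lab_loci] += 1.0

# PCA via SVD of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * s[0]                           # scores on the first component

# Share of the total variance that lies between the two lab means.
lab_means = np.array([Xc[lab == g].mean(axis=0) for g in (0, 1)])
lab_share = (lab_means ** 2).sum() / 2 / ((Xc ** 2).sum() / n_exp)

# How closely the first component tracks the lab label.
corr = abs(np.corrcoef(pc1, lab)[0, 1])
print(f"variance between labs: {lab_share:.1%}, |corr(PC1, lab)| = {corr:.2f}")
```

With these settings the lab effect should account for only a few percent of the total variance, yet the first component should still line up closely with the lab label, because the shift always falls on the same loci while the noise points in a different direction for every experiment.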
### Guillaume Filion

Great article about CPU caches by Ulrich Drepper.

"A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory takes 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%. (...) These are the numbers Intel lists for a Pentium M (actual access times measured in CPU cycles):

Register <= 1
L1d ~3
L2 ~14
Main Memory ~240"
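
The arithmetic in the quoted passage can be checked in a few lines. The sketch below assumes that "all data can be cached" means each element is fetched from main memory once and every later access hits the cache; that reading is an assumption, although it reproduces the quoted totals.

```python
# Numbers taken from the quoted passage.
n_elements = 100      # distinct data elements
n_uses = 100          # times each element is used
mem_cycles = 200      # cost of one main-memory access
cache_cycles = 15     # cost of one cache access

total_accesses = n_elements * n_uses

# No cache: every access goes to main memory.
no_cache = total_accesses * mem_cycles

# With a cache (assumption): one main-memory fetch per element,
# all remaining accesses are cache hits.
with_cache = (n_elements * mem_cycles
              + (total_accesses - n_elements) * cache_cycles)

improvement = 1 - with_cache / no_cache
print(no_cache, with_cache, f"{improvement:.3%}")
```

Under that assumption the totals come out at 2,000,000 and 168,500 cycles, a saving of about 91.575%, which the article reports as 91.5%.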

[Editor's note: This is the second installment in Ulrich Drepper's "What every programmer should know about memory" document. Those who have not read the first part will likely want to start there. This is good stuff, and we once again thank Ulrich for allowing us to publish it.
### Guillaume Filion

Great article by Ulrich Drepper about computer memory architecture.
[Editor's introduction: Ulrich Drepper recently approached us asking if we would be interested in publishing a lengthy document he had written on how memory and software interact. We did not have to look at the text for long to realize that it would be of interest to many LWN readers.
### Guillaume Filion

Very well put: "I want developer parity so that I can spend my time improving code rather than debugging differences between environments."
How I develop. Published Fri July 24 2015. I've been in the bioinformatics field for almost 10 years, originally coming from a molecular biology degree background. I decided to move into computing after struggling to find a job doing lab work. This post is a general outline of how I now develop ...
### Guillaume Filion

The distribution of the largest fragment of a broken stick was worked out a long time ago, but the result is surprisingly difficult to find on the Internet. With the help of the Cross Validated community, I found readable proofs for this distribution and its asymptotic limit.
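
For reference, the classical closed form for a unit stick broken into $n$ fragments by $n-1$ uniform cuts is $P(\max \le x) = \sum_{j \ge 0,\ jx < 1} (-1)^j \binom{n}{j} (1 - jx)^{n-1}$. The sketch below evaluates it and compares it with a simulation. The function names and the choice of $n = 5$, $x = 0.5$ are illustrative only, and the notation may differ from the proofs mentioned above.

```python
import random
from math import comb

def largest_fragment_cdf(x, n):
    """P(largest of n fragments <= x), unit stick broken at n - 1 uniform points."""
    total = 0.0
    for j in range(n + 1):
        rest = 1.0 - j * x
        if rest <= 0:
            break                      # all remaining terms are zero
        total += (-1) ** j * comb(n, j) * rest ** (n - 1)
    return total

def simulate_largest(n, rng):
    """Break a unit stick at n - 1 uniform points and return the longest piece."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    return max(b - a for a, b in zip(edges, edges[1:]))

rng = random.Random(1)
n, x, trials = 5, 0.5, 20000
empirical = sum(simulate_largest(n, rng) <= x for _ in range(trials)) / trials
print(largest_fragment_cdf(x, n), empirical)
```

For $n = 5$ and $x = 0.5$ the formula gives $1 - 5 \cdot 0.5^4 = 0.6875$, and the simulated frequency should land close to that value.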
### Guillaume Filion

"Time and energy spent on trying to increase internet clicks is time and energy we don't spend on the tedious administrative activities that are needed to actually affect change."
Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries.
### Guillaume Filion

Some distributions can produce "bad" samples for which usual estimators will fail. What to do in this case?
### Guillaume Filion

In this blog post I explain how to use the so-called "stick breaking" process in the DNA alignment problem.

"Inserting k mutations at random in a sequencing read will produce k+1 (possibly empty) subsequences without errors. The process is analogous to inserting k breaks at random in a stick of length 1, and we can approximate the distribution of the longest subsequence without error by that of the longest fragment when breaking the stick."
Stick breaking and DNA alignment | Filed under heuristic, sequence alignment, spacings, stick breaking, bioinformatics.
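
The analogy in the quote can be tried out numerically. The sketch below compares only the mean of the longest error-free stretch with the stick-breaking value, which is the read length times $H_{k+1}/(k+1)$, where $H_n$ is the $n$-th harmonic number. It assumes the mutations are substitutions at distinct, uniformly chosen positions; the read length of 100 and $k = 4$ are made-up values, and the function names are illustrative.

```python
import random

def longest_error_free(read_length, k, rng):
    """Longest run of unmutated positions after k substitutions at distinct sites."""
    sites = sorted(rng.sample(range(read_length), k))
    edges = [-1] + sites + [read_length]
    return max(b - a - 1 for a, b in zip(edges, edges[1:]))

def stick_approximation(read_length, k):
    """Read length times the mean longest fragment of a unit stick with k breaks."""
    n = k + 1
    return read_length * sum(1.0 / i for i in range(1, n + 1)) / n

rng = random.Random(2)
read_length, k, trials = 100, 4, 20000
simulated = sum(longest_error_free(read_length, k, rng)
                for _ in range(trials)) / trials
print(simulated, stick_approximation(read_length, k))
```

The two numbers should be close but not identical, since the read is discrete and the mutated positions themselves take up part of its length.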
### Guillaume Filion

A very simple Python example of using kd-trees, with a practical application: counting shootings near schools.
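
As a rough idea of what such a count involves, here is a minimal pure-Python sketch of a 2-d tree with a radius count. It is not the code from the linked example: the coordinates are synthetic points in a flat 10 by 10 square, the function names are made up, and real latitude/longitude data would need a projection or a great-circle distance. In practice a library implementation would likely be a better choice.

```python
import random

def build_kdtree(points, depth=0):
    """Build a 2-d tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % 2                       # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def count_within(node, center, radius, depth=0):
    """Count tree points at Euclidean distance <= radius from center."""
    if node is None:
        return 0
    point, left, right = node
    axis = depth % 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    count = 1 if dx * dx + dy * dy <= radius * radius else 0
    diff = center[axis] - point[axis]
    # Only descend into a side that the search disc can reach.
    if diff <= radius:
        count += count_within(left, center, radius, depth + 1)
    if diff >= -radius:
        count += count_within(right, center, radius, depth + 1)
    return count

# Synthetic data: 2,000 incident locations and 5 schools.
rng = random.Random(3)
incidents = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(2000)]
schools = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(5)]

tree = build_kdtree(incidents)
counts = [count_within(tree, school, 1.0) for school in schools]
print(counts)
```

The tree is built once and then queried for each school, so whole branches that lie outside the search radius are skipped instead of comparing every school with every incident.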
### Guillaume Filion

Updating the lab website with our publications. It's nice to keep it alive.
### Guillaume Filion

Ever wanted to compute eigenvalues of sparse matrices? ARPACK is the real deal. Surprisingly, it is not available in R by default, but the developers of the igraph package have written a nice port. So in case you are looking for it, there it is.

Arguments. func: The function to perform the matrix-vector multiplication. ARPACK requires to perform these by the user. The function gets the vector x as the first argument, and it should return Ax, where A is the “input matrix”. (The input matrix is never given explicitly.) ...
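
The matrix-free idea described in that excerpt, where the solver only ever sees a function returning Ax, can be sketched in Python. This is not the igraph interface from the post: it assumes SciPy, whose `eigsh` also wraps ARPACK, and uses a small tridiagonal test matrix (2 on the diagonal, -1 next to it) chosen because its eigenvalues are known in closed form.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 30  # matrix size, chosen only for illustration

def laplacian_matvec(x):
    """Return A @ x for the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    x = np.asarray(x, dtype=float).ravel()
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

# The matrix itself is never built; ARPACK only calls the function above.
A = LinearOperator((n, n), matvec=laplacian_matvec, dtype=np.float64)
largest = np.sort(eigsh(A, k=2, which="LA", return_eigenvectors=False))

# Closed form for this matrix: 2 - 2*cos(j*pi/(n+1)), j = 1..n.
exact = 2 - 2 * np.cos(np.array([n - 1, n]) * np.pi / (n + 1))
print(largest, exact)
```

The two largest eigenvalues returned by the solver should agree with the closed-form values to within numerical tolerance.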
### Guillaume Filion

Lior Pachter recently offered a cash prize to answer a scientific question. He got over a million views and got several answers from distinguished biologists. Very neat.

