I regret to inform that I will no longer be updating this blog. However, if you're interested in any further writing I might do on evolution, programming, and the like, I encourage you to follow me on Twitter. I'll post updates there.
The Computational Life
A computational biologist discusses the technical, biological, and philosophical aspects of evolutionary biology.
Wednesday, September 26, 2012
Sunday, September 23, 2012
An entitled derelict (me) simulates the economy
[Embedded video: The Daily Show with Jon Stewart, "An Elegant Message to the 47%", www.thedailyshow.com]

Political rant
At a fundraiser in May, Republican presidential candidate Mitt Romney (paraphrased above) characterized almost half of the American populace as hopeless plebeians desperately attached to the sore teat of the government. As a generally left-leaning graduate student dependent on federal research grants for income, I can only assume I'm included within this group. Maybe I should feel insulted, but I'd rather understand the worldview that allows a presidential candidate to become a Simpsons-style caricature of himself and still garner support from half the country. This could lead to a discussion of zero-sum games or the psychology of authoritarianism. For now, I want to focus on something more fundamental and (seemingly) nonpartisan: the notion of causality.
As far as I can tell from the rhetoric, the current Conservative narrative goes something like this:
Once upon a time there were two brothers, Lefty and Righty. Every day after school Righty would invest time into his lemonade stand, "Trickle-down Treats", which he built with his own hands using personally obtained venture capital. By age 10 he had successfully incorporated all neighborhood lemonade stands into his company, allowing him to create 50 jobs for area children. He eventually went on to additional fame and fortune as the CEO of Bootstrapped Bootstraps, the finest purveyor of self-replicating footwear accessories. Meanwhile, Lefty delved into Marxist literature, writing diatribes on dialectical materialism whilst demanding the occasional handout from Righty and thus depriving his brother of his grand ambition: opening the Grover Norquist bathtub emporium.

Essentially, the message is that hard work leads to success; therefore, success comes from hard work. Likewise, laziness leads to economic hardship and dependency, meaning that poor people are lazy, entitled fuckwads.
This notion of economic causality is actually a very comforting thought, perhaps increasing the appeal of the right beyond their legitimate economic base. After all, most people (rightfully) believe they work hard and deserve the American dream: a house, two cars, and plenty of useless Chinese-manufactured crap to regift to their 2.5 kids when they move to Florida in their old age.
Yet the world is not causal. As great sages have said before me, "shit happens". Kids happen. Unexpected bills happen. Market changes happen. Still, many hold onto the notion that hard work guarantees success. Fortunately, as someone who can program, I'm in a position to play armchair economist and ask what happens when the only factors distinguishing the economic success of one person from another are the random vicissitudes of life. Now let's play make-believe with the economy!
Simulating the economy
Here's my scenario: In the beginning there are 500 families, each with 80 dollars to their names. I then simulate their economic histories through the course of 50,000 days (about 137 years), so there are multiple generations here. On each day of the simulation, I guarantee every family an income of 80 dollars. However, each family also has daily expenses that are sampled from a normal distribution, where the mean of the distribution is given by the following, where $x$ is total savings up to that point: $$ f(x) = \frac{80}{1+(0.01)0.995^{-0.01x}} $$ Basically, this function means that families with very low savings are spending almost all of their incomes on expenses, while wealthier families devote a much smaller portion toward expenses. I also set the standard deviation of the expense distribution to one-fifth of the income. This allows families to lose or gain money with each passing day, though they do have a slight positive edge on average. Additionally, families with positive savings (those who aren't in debt) invest their savings, which yields returns (or losses) through a normal distribution with mean 0 and a standard deviation equal to 1% of the accumulated wealth of the family.
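To get a feel for the expense function, here is a quick sketch (Python 3, variable names my own) evaluating the mean daily expense at a few savings levels. Families near zero savings spend almost the full 80-dollar income, while very wealthy families spend only a small fraction of it:

```python
# Mean daily expense as a function of total savings x, from the formula above;
# the negative exponent is what makes expenses shrink as savings grow
def mean_expense(x, income=80.0):
    return income / (1.0 + 0.01 * 0.995 ** (-0.01 * x))

for savings in [0, 1_000, 10_000, 100_000, 1_000_000]:
    print(savings, round(mean_expense(savings), 2))
```

At zero savings the mean expense is 80/1.01, or about 79.21 dollars, leaving an expected daily surplus of well under a dollar.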
Now, I know the model is not taking into account many aspects of economic life: rare extreme disastrous events (black swans, to use Taleb's terminology), loan interest, the idiotic spending habits of certain individuals. However, the main point I want to address is what happens when everyone follows the same (not unrealistic) economic strategy. My Python code is below. It runs the simulation, plots the economic histories, and plots the fraction of wealth accumulated by the top 20% and top 5% of the population over time. I encourage you to try it for yourself and modify it if you want. If you think I'm being a total idiot, I encourage you to show me a better way!
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Run the simulation
histories = [np.zeros(50000) for x in xrange(500)]
income = 80.
expense_func = lambda x: 1./(1.+0.01*0.995**(-0.01*x))
for his in histories:
    his[0] = income
for i in xrange(1,50000):
    print i
    for his in histories:
        saved = income - stats.norm.rvs(loc=expense_func(his[i-1])*income,scale=income/5.)
        if his[i-1] > 0:
            his[i] = his[i-1] + saved + \
                stats.norm.rvs(scale=0.01*his[i-1])
        else:
            his[i] = his[i-1] + saved

# Plot money vs. time
for his in histories:
    plt.plot(np.arange(0,50000,50),his[np.arange(0,50000,50)])
plt.xlabel('time')
plt.ylabel('Dollars')
plt.show()

# Compute fraction owned by top 20% over time
wealth_frac = []
for i in xrange(50000):
    wealth = sorted([his[i] for his in histories])
    wealth_frac.append(np.sum(wealth[400:])/np.sum(wealth))
wealth_frac = np.array(wealth_frac)

# Plot out fraction owned by top 20% over time
plt.plot(np.arange(0,50000,50),wealth_frac[np.arange(0,50000,50)])
plt.xlabel('time')
plt.ylabel('Ownership by top 20%')

# Plot fraction owned by top 5% over time
wealth_frac = []
for i in xrange(50000):
    wealth = sorted([his[i] for his in histories])
    wealth_frac.append(np.sum(wealth[475:])/np.sum(wealth))
wealth_frac = np.array(wealth_frac)
plt.plot(np.arange(0,50000,50),wealth_frac[np.arange(0,50000,50)])
plt.xlabel('time')
plt.ylabel('Ownership by top 5%')
plt.show()
In the end the richest family ended up with 38,281,980 dollars, while the poorest had only 1,528, despite the fact that they pursued the same economic strategies. I can only assume the richest family (represented by the black line in the first figure) would be lauded for their success, a testament to the value of hard work. After all, Grandpa Richy McRicherton started with 80 dollars in his pocket and ended up a multimillionaire. Meanwhile, the poorer families might be decried for their use of public services and desire to fund them.
Obviously this wasn't totally realistic, since no one ended up in debt in the end, but the inequalities stand out over repeated simulations. A couple of issues: although the figures appear to show that the top 20% started with 50% of the wealth, this is actually due to a sharp upswing in inequality at the beginning of the simulation as the families depart from the equal starting values. Curiously, the top 20% reaches a high of owning approximately 80% of the wealth, which falls in line with the famous Pareto principle, but this may just be a coincidence. I would need to run the simulation for longer to see if the top 20% continues to account for approximately 65% of the wealth or if there is another longer-term trend. Either way, enough inequality occurs to allow the top 5% to control between 35% and 40% of the total wealth in the end, and a few different runs with varying parameters show this to be robust to changes in the values used in the simulation.
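For anyone who wants to poke at the output, the top-quintile (or any top-fraction) share discussed above can be computed from a single wealth snapshot like this. This is a small Python 3 sketch with a function name of my own choosing, not the exact code used for the figures:

```python
import numpy as np

def top_share(wealth, frac):
    """Fraction of total wealth held by the richest `frac` of families."""
    w = np.sort(np.asarray(wealth, dtype=float))
    cutoff = int(len(w) * (1.0 - frac))  # index of the first "rich" family
    return w[cutoff:].sum() / w.sum()

# A toy snapshot: ten families holding 1 through 10 dollars
snapshot = np.arange(1, 11)
print(top_share(snapshot, 0.20))  # top 2 of 10 families hold (9+10)/55
```

Passing `frac=0.05` to the same helper reproduces the top-5% curve's summary statistic for any day of the simulation.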
So what can we conclude from this exercise (assuming my simulation wasn't totally naive)? (1) Inequality is the natural tendency for unconstrained economic systems, and (2) the causal economic storyline is bullshit.
John Galt can kiss my ass
Saturday, September 15, 2012
The Origins of Bureaucratic Architecture OR How I Learned to Quit Complaining and Love the Paperwork
I highly recommend reading Michael Lynch's The Origins of Genome Architecture, if you haven't already. In short, it provides a bridge from understanding the fundamentals of population genetics (particularly genetic drift, mutation, selection, and recombination) to understanding how the quintessential features of eukaryotic genomes may have come about. Even if you don't completely buy into his non-adaptationist paradigm, I believe his thinking provides an excellent basis for forming null hypotheses when testing evolutionary scenarios. However, it has also shaped my thinking on some decidedly non-biological concepts; in particular, how bureaucracies become the mazes of paper-shuffling complexity we know and love. The analogy isn't perfect (no populations, non-random duplications), but I thought it was worth writing down. Here's how I see it, as told through the parable of Bob the accountant:
Meet Bob
Bob works at the University of HappyFunVille, where he is universally loved by graduate students and faculty alike. He keeps track of the cash flow in the Department of Totally Fundable Research. When members of the department come back from trips, they fill out a single small form that lists expenses and a funding string. They usually get their reimbursement within a week.
Neofunctionalization
The University implements a computerized system to keep track of accounts across the whole campus. They announce that this will herald a new golden age of efficiency, as promised by Interfacial Solutions, the contractor hired to implement the software. Bob now has a new job function. He must provide an entirely new set of information to the software for travel reimbursement on top of maintaining consistent records for the Department. Travelers must now fill out a preauthorization form with expected expenses a week before traveling in addition to the reimbursement form after coming back. Bob occasionally loses one of the forms, and reimbursement now takes an average of two weeks.
Duplication, Subfunctionalization, and Escape from Adaptive Conflict
On a whim, the Dean favors the Department of Totally Fundable Research and provides funding to hire a new staff person, Bob II. In fact, Bob II ends up with the same job title and description as Bob I. Bob I complains to the chair of the department that the new position is redundant, so the chair assigns Bob I to his old task of keeping the intra-department financial records while giving Bob II the task of dealing with the University's computerized reimbursement system. Bob I enjoys having less work to do, and Bob II excels at his specialized job. However, the Bobs don't coordinate their schedules, so no work gets done when either Bob goes on vacation. Mistakes in the reimbursement process, though rare from each Bob, now occur twice as often in total. Reimbursement now takes an average of three weeks.
Nonfunctionalization
The department switches all their records to the computerized system, eliminating Bob I's remaining function in the department. However, the University lacks a good mechanism for laying him off, so he just sits around accumulating the signs of old age until the day when he's eliminated in a massive budget cut.
Epilogue
The Dean diverts increasing resources to the computerized record center, allowing them to hire 3 new people in the reimbursement department, each of whom specializes in a separate task performed by their predecessor and each of whom has a nonzero probability of losing a form or making a typographical error. Reimbursements now take 6 months. The Dean proposes hiring new staff to solve the problem.

Friday, September 14, 2012
The Power of Randomness: Integration for the lazy
I had some great math teachers growing up, but they left out some of the techniques I use on a daily basis nowadays. I have a particular fondness for algorithms that rely on randomness to generate their solutions. True, they usually aren't the most efficient things in the world, but I think evolutionary biologists have an aesthetic predisposition to solutions with stochasticity at their cores. Also, they often work when nothing else will, including my will to perform symbolic integration.
A couple weeks ago I needed to calculate Bayes factors to evaluate whether the genotypes in a controlled cross were segregating according to Mendelian expectations. It was a bit more complicated than that, but the gist is that I wanted to evaluate the likelihood that the genotypes were segregating 1:2:1 relative to the marginal likelihood that the genotype probabilities were completely unknown.
Let's say that for a particular locus, I obtained perfect Mendelian segregation for double heterozygous parents in 100 offspring, meaning there were 25 AA, 50 Aa, and 25 aa individuals. I can obtain the likelihood for that particular data under the model of double heterozygous parents using the multinomial distribution. That is, I could calculate the likelihood as follows: $$ P(D \mid p=0.25, q=0.5)=\frac{100!}{25!\,50!\,25!}(0.25)^{25}(0.5)^{50}(0.25)^{25}=0.00893589 $$
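If you'd rather not trust my arithmetic, scipy can check that number directly. A quick Python 3 sanity check, not part of the original analysis:

```python
from scipy.stats import multinomial

# Likelihood of observing 25:50:25 offspring under Mendelian 1:2:1
# genotype probabilities for a double-heterozygote cross
lik = multinomial.pmf([25, 50, 25], 100, [0.25, 0.5, 0.25])
print(lik)  # ~0.00893589
```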
That's easy enough. However, the marginal likelihood for the multinomial distribution under the prior probability that the genotype probabilities are completely unknown (a uniform prior) would be calculated using the following integral: $$ \int_0^1 \int_0^{1-q} \frac{100!}{25!\,50!\,25!} p^{25} q^{50}(1-p-q)^{25}\, dp\, dq $$ I was not going to try and integrate that, so I decided to let randomness do the work for me using Monte Carlo integration. This is a fairly naive technique in which I let the computer randomly select points in the domain under consideration (in this case between 0 and 1 along both the $p$ and $q$ dimensions) and then average the function values at those points. This average, multiplied by the volume of the domain, approaches the integral as the number of sampled points approaches infinity. In this case my multidimensional volume ($V$) is 1, as both axes of integration are of size 1, but in general I would have to correct for the volume of the multidimensional domain of integration. In math-speak, I'm saying the following is true: $$ \int f(x)\,dx = \lim_{n \to \infty} V \frac{1}{n}\sum_{i=1}^n f(x_i) $$ where $x$ could be of any dimension.
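As a warm-up, here's the same trick applied to an integral with a known answer: estimating $\int_0^1 x^2\,dx = 1/3$ by averaging $x^2$ over uniformly sampled points. A minimal Python 3 sketch (variable names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000
xs = rng.uniform(0.0, 1.0, size=n)  # uniform samples over the domain [0, 1]
estimate = np.mean(xs ** 2)         # V = 1, so the plain average suffices
print(estimate)                     # close to 1/3
```

The standard error of this estimator shrinks like $1/\sqrt{n}$, which is why the two runs below with a million samples land close to (and bracket) the true value.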
Below, I use Python to calculate the marginal likelihood of the multinomial using Monte Carlo integration with 1,000,000 random samples. In practice, I generally settle on around 100,000 for efficiency's sake, but more iterations will give a more precise answer. The exact answer, which Wolfram Alpha can confirm, works out to $100!/102! = 1/10302 \approx 9.70685 \times 10^{-5}$. I estimated the answer twice, and my results closely bracketed the actual solution.
import numpy as np
from numpy import array, log, exp, random
from scipy.special import gammaln

## Integrates over the two free parameters in the multinomial formula

## Helper function to calculate the factorial in order to get the multinomial probability
# @param x The argument for the factorial
# @return The logarithm of the factorial
def log_factorial(x):
    return gammaln(array(x)+1)

## Helper function to get the multinomial probability
# @param n The total number of individuals
# @param xs A list of the numbers for each category
# @param ps A list of the probabilities for each category
# @return The multinomial probability
def multinomial(n, xs, ps):
    xs, ps = array(xs), array(ps)
    result = log_factorial(n) - sum(log_factorial(xs)) + sum(xs * log(ps))
    return exp(result)

## Monte Carlo integration over the multinomial formula in order to generate
## the marginal likelihood under a uniform prior
# @param n_genos The number of genotyped individuals at the locus
# @param n_h1 The number of individuals homozygous for allele 1
# @param n_he The number of individuals heterozygous
# @param mc_steps The number of steps to use for Monte Carlo integration
def mc_integrate_multinomial(n_genos,n_h1,n_he,mc_steps=10000):
    p_H1 = random.rand(mc_steps)
    p_He = random.rand(mc_steps)
    samps = []
    for i in xrange(mc_steps):
        if p_H1[i] + p_He[i] > 1.:
            samps.append(0.)
            continue
        probs = [p_H1[i],p_He[i],1.-p_H1[i]-p_He[i]]
        samps.append(multinomial(n_genos,[n_h1,n_he,n_genos-n_h1-n_he],probs))
    return np.mean(samps)

>>> mc_integrate_multinomial(100,25,50,mc_steps=1000000)
9.6417526480793174e-05
>>> mc_integrate_multinomial(100,25,50,mc_steps=1000000)
9.7326408809908908e-05
There are better ways to do this  namely through importance sampling or MCMC. However, if most of the function's volume is not too constrained, plain Monte Carlo integration works well enough. In future posts, I will discuss MCMC more thoroughly, especially with respect to Bayesian analysis.
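To make the importance-sampling point concrete, here is a sketch (Python 3, helper names my own, not code from the original analysis). Instead of sampling uniformly, we sample from a proposal density and divide each integrand evaluation by the proposal's density. In the extreme case where the proposal is shaped exactly like the integrand (here a Dirichlet with parameters 26, 51, 26), the importance weights come out constant and the estimator has zero variance:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def log_multinomial_pmf(xs, ps):
    xs, ps = np.asarray(xs), np.asarray(ps)
    return gammaln(xs.sum() + 1) - gammaln(xs + 1).sum() + (xs * np.log(ps)).sum()

def log_dirichlet_pdf(ps, alphas):
    ps, alphas = np.asarray(ps, dtype=float), np.asarray(alphas, dtype=float)
    return gammaln(alphas.sum()) - gammaln(alphas).sum() + ((alphas - 1) * np.log(ps)).sum()

counts = np.array([25, 50, 25])
alphas = counts + 1  # proposal proportional to the integrand itself

# Importance-sampling estimate: average of integrand / proposal density
samples = rng.dirichlet(alphas, size=1000)
weights = [np.exp(log_multinomial_pmf(counts, p) - log_dirichlet_pdf(p, alphas))
           for p in samples]
estimate = np.mean(weights)
print(estimate)  # 1/10302 from only 1,000 samples
```

In practice the integrand's shape is unknown (that's the whole problem), so real proposals are imperfect and the variance is merely reduced rather than eliminated; MCMC tackles the same difficulty adaptively.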
Saturday, September 8, 2012
The genome as Detroit  my reaction to ENCODE
Ten years ago we saw the culmination of two decades' worth of work in the publication of the human genome, a bloated mess of As, Cs, Gs, and Ts half-full of seemingly repetitive nonsense obtained at the price of a dollar per precious base pair. Then, the media fed the public's collective chagrin as they announced the paltry figure of 1% deemed "functional" at the time. That word, functional, only encompassed those portions of the genome known or predicted to be translated into proteins or transcribed into the limited suite of RNAs with well-defined catalytic roles at the time. Over the last decade, using a set of model systems from the weedy thale cress to the diminutive fruit fly, that word has evolved to encompass an alphabet soup of specialized RNAs, regulatory binding sites, and activity-modifying marks that have yielded ever-increasing insight into the dynamics of the eukaryotic genome. Of course each small step in that pursuit hardly merited the front cover of multiple high-profile journals. And so the ENCODE project bid the scientific community to hold its collective breath until this past Wednesday, when the fleet of ENCODE publications sailed forth into public view with a large "80% functional" above each masthead.
The wet dream of comparative genomics 
Yes, 80% of the human genome was found to either produce some RNA, or sometimes bind to a regulatory protein, or contain marks indicative of transcriptionally active regions. The authors also demonstrated that, on average, most of these regions are less variable than expected if selection were not acting to constrain them. This does imply that certain subsets of these elements do have biochemical roles with enough impact on reproductive success to overwhelm the stochastic fluctuation of variant frequencies from generation to generation and that the data sets were large enough to detect these subsets. Yet the 80% figure implies, at least to the average person, that the overwhelming majority of our genomes cannot sustain mutation without a non-negligible impact on fitness, which would be extraordinary, but for now remains disingenuous.
The New York Times used a Google Maps analogy for ENCODE, and I thought this could fit quite well. I could envision the human genome as a major metropolis, Detroit perhaps. The downtown still contains the bastions of yesterday's economic glory, without which the city might completely turn to shambles. As someone viewing a Google map of Detroit, I could easily posit that these buildings still serve valuable economic and governmental functions, similar to how the ENCODE elements conserved across class Mammalia likely encode important developmental and housekeeping functions within humans. However, as I pan over the extremities of the city, I would find houses in various states of disrepair. Certainly, from my vantage point many would have all the characteristics of functional domiciles (roofs, driveways, fences) to discriminate them from rural areas of the country without human presence. Yet if I were to explore those areas on the ground level, I would find many of the homes abandoned. Granted, this wouldn't stop the occasional transient homeless person from squatting for certain periods of time; but that hardly meets the usual definition of functional.
The bleaker reality of population genetics 
This is how I suspect much of the human genome works. Most of it is capable of binding the occasional transcription factor, transcribing the odd RNA, and accepting contextual epigenetic markings. On rare occasions the insertion of a duplicated gene or the local rearrangement of the DNA may provide opportunities for sections with previously transient but useless biochemical activity to take on regulatory roles that have non-negligible effects on fitness and eventually become conserved, much like how the introduction of a successful business into a dying community can drive the revitalization of existing infrastructure. However, like the successful business and its surrounding region, the conserved genomic elements wink in and out of existence on a longer timescale.
I am excited by the genomic "Google map" that the ENCODE project has provided us, and I am sure it will lend considerable insight into human disease when combined with the theoretical power of population genetics. I just don't think the authors should imply that everything with the characteristics of function is presently important.
Wednesday, September 5, 2012
The Stationary Distribution: A third way
Ian Carroll helpfully reminded me that I had neglected the easiest (and most computationally efficient) way of calculating the stationary distribution of the Markov model in the previous post, and I would be remiss if I didn't demonstrate it and the logic behind its use. This time we take advantage of the definition of an eigenvector. Recall that our goal is to find the probability distribution of states that, when multiplied against the transition probability matrix, results in the same distribution. In math-speak, this is the following: $$ {\bf v^TP = v^T} $$
Now, with that in mind, recall (or learn for the first time if you're one of today's lucky 10,000) that the eigenvector ${\bf u}$ and corresponding eigenvalue $\lambda$ for some square matrix ${\bf A}$ are defined as follows: $$ {\bf Au} = \lambda {\bf u} $$
More specifically, these are called the right eigenvectors/eigenvalues, since they are defined when the vector is multiplied on the right. We can also define the left eigenvectors/eigenvalues as the ${\bf u}$, $\lambda$ combos that satisfy the following: $$ {\bf u^TA} = \lambda {\bf u^T} $$ Now, if we plug in our stationary distribution vector ${\bf v}$ for ${\bf u}$, our transition matrix ${\bf P}$ for ${\bf A}$, and let $\lambda = 1$, we get the familiar ${\bf v^TP = v^T}$. Also, finding the left eigenvalues and eigenvectors for a matrix is the same as finding the right-side versions for the transpose of that matrix. This means that, computationally, we can take the eigen decomposition of ${\bf P^T}$, find the column in the ${\bf U}$ matrix corresponding to an eigenvalue of 1, normalize that vector (since it can be scaled by any arbitrary constant), and that will give the stationary distribution. I demonstrate this below:
import numpy as np
from numpy.linalg import eig

transition_mat = np.matrix([
    [.95,.05,0.,0.],\
    [0,0.9,0.09,0.01],\
    [0,0.05,0.9,0.05],\
    [0.8,0,0.05,0.15]])

S,U = eig(transition_mat.T)

# The stationary distribution is the normalized left eigenvector
# of the transition matrix corresponding to the eigenvalue 1
stationary = np.array(U[:,np.where(np.abs(S-1.) < 1e-8)[0][0]].flat)
stationary = stationary / np.sum(stationary)

>>> print stationary
[ 0.34782609  0.32608696  0.30434783  0.02173913]
We get the same stationary distribution as before without having to perform an extra exponentiation or matrix multiplication! The glories of the eigen decomposition never cease to amaze!
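For the skeptical, a quick sanity check (Python 3, variable names my own) confirms the eigenvector approach: the normalized eigenvector should be left unchanged by the transition matrix, and it should match the distribution printed above.

```python
import numpy as np

P = np.array([[.95, .05, 0., 0.],
              [0, 0.9, 0.09, 0.01],
              [0, 0.05, 0.9, 0.05],
              [0.8, 0, 0.05, 0.15]])

# Left eigenvectors of P are right eigenvectors of P^T
S, U = np.linalg.eig(P.T)
v = np.real(U[:, np.argmin(np.abs(S - 1.))])  # column for eigenvalue 1
v = v / v.sum()                               # normalize to a probability vector

# A stationary distribution must satisfy v^T P = v^T
print(np.allclose(v @ P, v))  # True
print(v)                      # matches the distribution above
```

Solving the balance equations by hand gives the exact answer $(16/46, 15/46, 14/46, 1/46)$, which agrees with the printed decimals.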
Monday, September 3, 2012
Faith, Science, and a line in the sand
Jerry Coyne's blog, Why Evolution is True, reported today on a free homeschool course offered by Cambridge University (edit: the Faraday Institute, which apparently only rents space from Cambridge) that purports to profess the inherent compatibility between science and Christianity. As a scientist with an interest in promoting scientific literacy among all people, my initial reaction is that anything wrapped up for fundamentalist consumption without the addition of incestuous clones riding dinosaurs around an ecologically unsustainable garden is probably a good thing. After all, such a course might provide the right frame for at least opening up a discussion with the scientific community that doesn't devolve into the tired tropes of the intelligent design community such as irreducible complexity and the inadequacy of "microevolution".
The problem I have with the course and with others who follow the same line of thinking is neatly encapsulated by this quote from theologian Alister McGrath, which appears in the course materials:
I think it's fair to say that nothing that we observe in nature, for example, its regularity, or indeed these remarkable anthropic phenomena, prove that there is a God, but the really important point is they are absolutely consistent with belief in God, and therefore I'd like to suggest that we don't think about nature proving that there is a God; that's how an earlier generation might have approached this. For me the really important thing is that the world as we observe it corresponds with what Christians would say the world ought to be like, that there's a correspondence between the theory and the observation.
He is right in saying that nothing in nature could prove there is a God, but in identifying a harmonious correspondence between Christian theory and observation, he says precisely nothing. Faith, by definition, exists in the absence of evidence, and without any way to falsify Christian doctrine based on observable phenomena, there is no basis for saying that it is ever consistent with observation. This is the fundamental divide between science and faith. Scientists make predictions that must be testable and falsifiable. If a scientist's predictions fail, that person will (hopefully) suck it up and go back to the metaphorical or literal drawing board. Religions, on the other hand, behave more like highly unethical scientists: celebrating whenever the stochasticity of the universe happens to occasionally fall in line with their vague predictions and chalking the rest up to God's mysterious ways.
This is not to say that scientists cannot have faith. There are scientists who genuinely believe their faith gives them strength, inner peace, etc. We are all capable of some irrationality (read: belief without evidence), and that is not always a bad thing. But injecting faith into science through "theistic evolution" and the like will never be science and, if it were, Faith would no longer be faith.