Wednesday, September 26, 2012


I regret to inform you that I will no longer be updating this blog. However, if you're interested in any further writing I might do on evolution, programming, and the like, I encourage you to follow me on Twitter. I'll post updates there.

Sunday, September 23, 2012

An entitled derelict (me) simulates the economy

Disclaimer: All characters depicted in the following post are entirely a work of the author's (sleep-deprived) imagination. All models are wrong and probably not useful.
[Embedded video: The Daily Show, "Exclusive - An Elegant Message to the 47%"]

Political rant

At a fundraiser in May, Republican presidential candidate Mitt Romney (paraphrased above) characterized almost half of the American populace as hopeless plebeians desperately attached to the sore teat of the government. As a generally left-leaning graduate student dependent on federal research grants for income, I can only assume I'm included within this group. Maybe I should feel insulted, but I'd rather understand the worldview that allows a presidential candidate to become a Simpsons-style caricature of himself and still garner support from half the country. This could lead to a discussion of zero-sum games or the psychology of authoritarianism. For now, I want to focus on something more fundamental and (seemingly) nonpartisan - the notion of causality.

As far as I can tell from the rhetoric, the current Conservative narrative goes something like this:

Once upon a time there were two brothers, Lefty and Righty. Every day after school Righty would invest time in his lemonade stand, "Trickle-down Treats", which he built with his own hands using personally obtained venture capital. By age 10 he had successfully incorporated all neighborhood lemonade stands into his company, allowing him to create 50 jobs for area children. He eventually went on to further fame and fortune as the CEO of Bootstrapped Bootstraps, the finest purveyor of self-replicating footwear accessories. Meanwhile, Lefty delved into Marxist literature, writing diatribes on dialectical materialism whilst demanding occasional handouts from Righty and thus depriving his brother of his grand ambition: opening the Grover Norquist bathtub emporium.
Essentially, the message is that hard work leads to success; therefore, success comes from hard work. Likewise, laziness leads to economic hardship and dependency, meaning that poor people are lazy, entitled fuckwads.

This notion of economic causality is actually a very comforting thought, perhaps increasing the appeal of the right beyond their legitimate economic base. After all, most people (rightfully) believe they work hard and deserve the American dream - a house, two cars, and plenty of useless Chinese-manufactured crap to regift to their 2.5 kids when they move to Florida in their old age.

Yet the world is not so neatly causal. As great sages have said before me, "shit happens". Kids happen. Unexpected bills happen. Market changes happen. Still, many hold onto the notion that hard work guarantees success. Fortunately, as someone who can program, I'm in the position to play armchair economist and ask what happens when the only factors distinguishing the economic success of one person from another are the random vicissitudes of life. Now let's play make-believe with the economy!

Simulating the economy

Here's my scenario: In the beginning there are 500 families, each with 80 dollars to their names. I then simulate their economic histories through the course of 50,000 days (about 137 years), so there are multiple generations here. On each day of the simulation, I guarantee every family an income of 80 dollars. However, each family also has daily expenses that are sampled from a normal distribution whose mean is given by the following function of total savings $x$ up to that point: $$ f(x) = \frac{80}{1+(0.01)0.995^{-0.01x}} $$ Basically, this function means that families with very low savings are spending almost all of their incomes on expenses, while wealthier families devote a much smaller portion toward expenses. I also set the standard deviation of the expense distribution to one-fifth of the income. This allows families to lose or gain money with each passing day, though they do have a slight positive edge on average. Additionally, families with positive savings (those who aren't in debt) invest their savings, which yields returns (or losses) through a normal distribution with mean 0 and a standard deviation equal to 1% of the accumulated wealth of the family.
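To get a feel for the expense function, here is a quick check (my own addition, using the same constants as the script below) of the mean daily expense at a few savings levels:

```python
# Mean daily expense at a few savings levels, using the simulation's constants
income = 80.0
expense_frac = lambda x: 1.0 / (1.0 + 0.01 * 0.995 ** (-0.01 * x))

for savings in (0.0, 10000.0, 100000.0):
    print(savings, round(expense_frac(savings) * income, 2))
# A family with no savings spends ~79.21 of its 80-dollar income on average,
# while a family with 100,000 in savings spends only ~31.96
```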

Now, I know the model is not taking into account many aspects of economic life - rare extreme disastrous events (black swans, to use Taleb's terminology), loan interest, the idiotic spending habits of certain individuals. However, the main point I want to address is what happens when everyone follows the same (not unrealistic) economic strategy. My Python code is below. It runs the simulation, plots the economic histories, and plots the fraction of wealth accumulated by the top 20% and top 5% of the population over time. I encourage you to try it for yourself and modify it if you want. If you think I'm being a total idiot, I encourage you to show me a better way!

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Run the simulation
histories = [np.zeros(50000) for x in xrange(500)]
income = 80.
expense_func = lambda x: 1./(1.+0.01*0.995**(-0.01*x))
for his in histories: his[0] = income
for i in xrange(1,50000):
    print i
    for his in histories:
        saved = income - stats.norm.rvs(loc=expense_func(his[i-1])*income,scale=income/5.)
        # Families with positive savings also earn investment returns
        # (mean 0, standard deviation 1% of accumulated wealth)
        if his[i-1] > 0:
            his[i] = his[i-1] + saved + stats.norm.rvs(loc=0., scale=0.01*his[i-1])
        else:
            his[i] = his[i-1] + saved

# Plot money vs. time (subsampled every 50 days for speed)
for his in histories: plt.plot(np.arange(0,50000,50),his[np.arange(0,50000,50)])
plt.xlabel('Day')
plt.ylabel('Savings ($)')
plt.show()


# Plot fraction owned by top 20% over time
wealth_frac = []
for i in xrange(50000):
    wealth = sorted([his[i] for his in histories])
    # Top 20% of 500 families = the 100 wealthiest
    wealth_frac.append(sum(wealth[-100:])/sum(wealth))

wealth_frac = np.array(wealth_frac)
plt.plot(np.arange(50000), wealth_frac)
plt.xlabel('Day')
plt.ylabel('Ownership by top 20%')
plt.show()

# Plot fraction owned by top 5% over time
wealth_frac = []
for i in xrange(50000):
    wealth = sorted([his[i] for his in histories])
    # Top 5% of 500 families = the 25 wealthiest
    wealth_frac.append(sum(wealth[-25:])/sum(wealth))

wealth_frac = np.array(wealth_frac)
plt.plot(np.arange(50000), wealth_frac)
plt.xlabel('Day')
plt.ylabel('Ownership by top 5%')
plt.show()

In the end the richest family ended up with 38,281,980 dollars, while the poorest had only 1,528, despite the fact that they pursued the same economic strategies. I can only assume the richest family - represented by the black line in the first figure - would be lauded for their success, a testament to the value of hard work. After all, Grandpa Richy McRicherton started with 80 dollars in his pocket and ended up a multi-millionaire. Meanwhile, the poorer families might be decried for their use of public services and desire to fund them.

Obviously this wasn't totally realistic, since no one ended up in debt in the end, but the inequalities stand out over repeated simulations. A couple of issues: although the figures appear to show that the top 20% started with 50% of the wealth, this is actually due to a sharp upswing in inequality at the beginning of the simulation as the families depart from their equal starting values. Curiously, the top 20% reaches a high of owning approximately 80% of the wealth, which falls in line with the famous Pareto principle, but this may just be a coincidence. I would need to run the simulation for longer to see whether the top 20% continues to account for the roughly 65% of the wealth it holds at the end, or whether there is another longer-term trend. Either way, enough inequality occurs to allow the top 5% to control between 35% and 40% of the total wealth in the end, and a few different runs with varying parameters show this result to be robust to changes in the values used in the simulation.
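For anyone replicating the wealth-share numbers above, the fraction held by the top slice of families can be computed with a small helper like this (my own addition, not part of the original script):

```python
def top_share(wealth, frac):
    """Fraction of total wealth held by the top `frac` of families."""
    w = sorted(wealth)                  # poorest to richest
    k = int(round(len(w) * frac))       # number of families in the top group
    return sum(w[-k:]) / float(sum(w))

print(top_share([80.0] * 500, 0.20))            # 0.2: equal wealth, top 20% holds 20%
print(top_share([0.0, 0.0, 0.0, 100.0], 0.25))  # 1.0: one family holds everything
```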

So what can we conclude from this exercise (assuming my simulation wasn't totally naive)? 1.) Inequality is the natural tendency of unconstrained economic systems, and 2.) the causal economic storyline is bullshit.

John Galt can kiss my ass

Saturday, September 15, 2012

The Origins of Bureaucratic Architecture OR How I Learned to Quit Complaining and Love the Paperwork

I highly recommend reading Michael Lynch's Origins of Genome Architecture, if you haven't already. In short, it provides a bridge from understanding the fundamentals of population genetics - particularly genetic drift, mutation, selection, and recombination - to understanding how the quintessential features of eukaryotic genomes may have come about. Even if you don't completely buy into his non-adaptationist paradigm, I believe his thinking provides an excellent basis for forming null hypotheses when testing evolutionary scenarios. However, it has also shaped my thinking on some decidedly non-biological concepts - in particular, how bureaucracies become the mazes of paper-shuffling complexity we know and love. The analogy isn't perfect (no populations, non-random duplications), but I thought it was worth writing down. Here's how I see it, as told through the parable of Bob the accountant:

Meet Bob

Bob works at the University of HappyFunVille, where he is universally loved by graduate students and faculty alike. He keeps track of the cash flow in the Department of Totally Fund-able Research. When members of the department come back from trips, they fill out a single small form that lists expenses and a funding string. They usually get their reimbursement within a week.


The University implements a computerized system to keep track of accounts across the whole campus. They announce that this will herald a new golden age of efficiency, as promised by Interfacial Solutions, the contractor hired to implement the software. Bob now has a new job function. He must provide an entirely new set of information to the software for travel reimbursement on top of maintaining consistent records for the Department. Travelers must now fill out a pre-authorization form with expected expenses a week before traveling in addition to the reimbursement form after coming back. Bob occasionally loses one of the forms, and reimbursement now takes an average of two weeks.

Duplication, Subfunctionalization, and Escape from Adaptive Conflict

On a whim, the Dean favors the Department of Totally Fund-able Research and provides funding to hire a new staff person, Bob II. In fact, Bob II ends up with the same job title and description as Bob I. Bob I complains to the chair of the department that Bob II is redundant, so the chair assigns Bob I to his old task of keeping the intra-department financial records while giving Bob II the task of dealing with the University's computerized reimbursement system. Bob I enjoys having less work to do, and Bob II excels at his specialized job. However, the Bobs don't coordinate their schedules, so no work gets done when either Bob goes on vacation. Mistakes in the reimbursement process, though rare from each Bob, now occur twice as often in total. Reimbursement now takes an average of three weeks.


The department switches all their records to the computerized system, eliminating Bob I's remaining function in the department. However, the University lacks a good mechanism for laying him off, so he just sits around accumulating the signs of old age until the day when he's eliminated in a massive budget cut.


The Dean diverts increasing resources to the computerized record center, allowing them to hire 3 new people in the reimbursement department, each of whom specializes in a separate task once performed by their predecessor and each of whom has a non-zero probability of losing a form or making a typographical error. Reimbursements now take 6 months. The Dean proposes hiring new staff to solve the problem.

Friday, September 14, 2012

The Power of Randomness: Integration for the lazy

I had some great math teachers growing up, but they left out some of the techniques I use on a daily basis nowadays. I have a particular fondness for algorithms that rely on randomness to generate their solutions. True, they usually aren't the most efficient things in the world, but I think evolutionary biologists have an aesthetic predisposition to solutions with stochasticity at their cores. Also, they often work when nothing else will, including my will to perform symbolic integration.

A couple weeks ago I needed to calculate Bayes factors to evaluate whether the genotypes in a controlled cross were segregating according to Mendelian expectations. It was a bit more complicated than that, but the gist is that I wanted to evaluate the likelihood that the genotypes were segregating 1:2:1 relative to the marginal likelihood that the genotype probabilities were completely unknown.

Let's say that for a particular locus, I obtained perfect Mendelian segregation for double heterozygous parents in 100 offspring, meaning there were 25 AA, 50 Aa, and 25 aa individuals. I can obtain the likelihood for that particular data under the model of double heterozygous parents using the multinomial distribution. That is, I could calculate the likelihood as follows: $$ P(D|p=0.25,q=0.5)=\frac{100!}{25!50!25!}(0.25)^{25}(0.5)^{50}(0.25)^{25}=0.00893589 $$
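As a sanity check of the likelihood quoted above, scipy can evaluate the multinomial PMF directly (this check is my addition; `scipy.stats.multinomial` requires a reasonably recent scipy):

```python
from scipy.stats import multinomial

# Likelihood of 25 AA : 50 Aa : 25 aa in 100 offspring under 1:2:1 expectations
lik = multinomial.pmf([25, 50, 25], n=100, p=[0.25, 0.5, 0.25])
print(lik)  # ≈ 0.00893589
```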

That's easy enough. However, the marginal likelihood for the multinomial distribution under the prior probability that the genotype probabilities are completely unknown (a uniform prior) would be calculated using the following integral: $$ \int_0^1 \int_0^{1-q} \frac{100!}{25!50!25!} p^{25} q^{50}(1-p-q)^{25} dp dq $$ I was not going to try and integrate that, so I decided to let randomness do the work for me using Monte Carlo integration. This is a fairly naive technique in which I let the computer randomly select points in the domain under consideration (in this case between 0 and 1 along both the $p$ and $q$ dimensions) and then average. This average then approaches the integral as the number of sampled points approaches infinity. Also, in this case my multidimensional volume ($V$) is 1, as both axes of integration are of size 1. However, in general I would have to correct for the volume of the multidimensional domain of integration. In math-speak, I'm saying the following is true: $$ \int f(x)dx = \lim_{n \to \infty} V \frac{1}{n}\sum_{i=1}^nf(x_i) $$ where x could be of any dimension.
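Before tackling the two-dimensional version, the technique is easy to see in one dimension (a toy example of my own): estimate $\int_0^1 x^2 dx = 1/3$ by averaging $x^2$ at uniformly sampled points.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.random(1_000_000)     # uniform samples on [0, 1]
estimate = np.mean(xs ** 2)    # V = 1 here, so the plain average is the estimate
print(estimate)                # ≈ 1/3
```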

Below, I use Python to calculate the marginal likelihood of the multinomial using Monte Carlo integration with 1,000,000 random samples. In practice, I generally settle on around 100,000 for efficiency's sake, but more iterations will give a more precise answer. For this particular integral the exact answer happens to be available analytically (it is a Dirichlet integral): $100!/102! = 1/(101 \cdot 102) \approx 9.71 \times 10^{-5}$, which Wolfram Alpha confirms. I estimated the answer twice, and my results closely bracketed the exact solution.

import numpy as np
from numpy import array, log, exp, random
from scipy.special import gammaln

## Integrates over the two free parameters in the multinomial formula
## Helper function to calculate the factorial in order to get the multinomial probability
# @param x The argument for the factorial
# @return The logarithm of the factorial
def log_factorial(x):
    return gammaln(array(x)+1)
## Helper function to get the multinomial probability
# @param n The total number of individuals
# @param xs A list of the numbers for each category
# @param ps A list of the probabilities for each category
# @return The multinomial probability
def multinomial(n, xs, ps):
    xs, ps = array(xs), array(ps)
    result = log_factorial(n) - sum(log_factorial(xs)) + sum(xs * log(ps))
    return exp(result)    

# @param n_genos The number of genotyped individuals at the locus
# @param n_h1 The number of individuals homozygous for allele 1
# @param n_he The number of individuals heterozygous
# @param mc_steps The number of steps to use for monte-carlo integration over the multinomial formula
# in order to generate the marginal likelihood under a uniform prior
def mc_integrate_multinomial(n_genos,n_h1,n_he,mc_steps=10000):
    p_H1 = random.rand(mc_steps)
    p_He = random.rand(mc_steps)
    samps = []
    for i in xrange(mc_steps):
        if p_H1[i] + p_He[i] > 1.:
            # Outside the simplex of valid probabilities; contributes zero
            samps.append(0.)
        else:
            probs = [p_H1[i], p_He[i], 1. - p_H1[i] - p_He[i]]
            samps.append(multinomial(n_genos, [n_h1, n_he, n_genos - n_h1 - n_he], probs))
    return np.mean(samps)

>>> mc_integrate_multinomial(100,25,50,mc_steps=1000000)
>>> mc_integrate_multinomial(100,25,50,mc_steps=1000000)

There are better ways to do this - namely through importance sampling or MCMC. However, if most of the function's volume is not too constrained, plain Monte Carlo integration works well enough. In future posts, I will discuss MCMC more thoroughly, especially with respect to Bayesian analysis.
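For completeness, here is a sketch of the importance-sampling alternative for the same integral (my own construction, not from the post): sample $(p,q,r)$ from a Dirichlet proposal concentrated where the likelihood is large, then reweight each likelihood evaluation by the proposal density.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
counts = np.array([25.0, 50.0, 25.0])
alpha = counts + 1.0                      # Dirichlet(26, 51, 26) proposal

# Sample from the Dirichlet proposal via normalized Gamma draws
gam = rng.gamma(alpha, size=(50_000, 3))
props = gam / gam.sum(axis=1, keepdims=True)

# Log multinomial likelihood at each sampled (p, q, r)
log_lik = (gammaln(counts.sum() + 1.0) - gammaln(counts + 1.0).sum()
           + np.sum(counts * np.log(props), axis=1))
# Log density of the Dirichlet proposal
log_g = (gammaln(alpha.sum()) - gammaln(alpha).sum()
         + np.sum((alpha - 1.0) * np.log(props), axis=1))

estimate = np.mean(np.exp(log_lik - log_g))
print(estimate)   # ≈ 9.71e-5, the exact value 1/(101*102)
```

With this particular proposal the importance weights are exactly constant (the proposal is proportional to the integrand), so the estimate has essentially zero variance; with less perfect proposals, the variance shrinks as the proposal approaches the shape of the integrand.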

Saturday, September 8, 2012

The genome as Detroit - my reaction to ENCODE

Ten years ago we saw the culmination of two decades' worth of work in the publication of the human genome, a bloated mess of As, Cs, Gs, and Ts, half-full of seemingly repetitive nonsense obtained at the price of a dollar per precious base pair. Then, the media fed the public's collective chagrin as they announced the paltry figure of 1% deemed "functional" at the time. That word - functional - only encompassed those portions of the genome known or predicted to be translated into proteins or transcribed into the limited suite of RNAs with well-defined catalytic roles at the time. Over the last decade - using a set of model systems from the weedy thale cress to the diminutive fruit fly - that word has evolved to encompass an alphabet soup of specialized RNAs, regulatory binding sites, and activity-modifying marks that have yielded ever-increasing insight into the dynamics of the eukaryotic genome. Of course each small step in that pursuit hardly merited the front cover of multiple high-profile journals. And so the ENCODE project bid the scientific community to hold its collective breath until this past Wednesday, when the fleet of ENCODE publications sailed forth into public view with a large "80% functional" above each masthead.

The wet dream of comparative genomics

Yes, 80% of the human genome was found to either produce some RNA, or sometimes bind to a regulatory protein, or contain marks indicative of transcriptionally active regions. The authors also demonstrated that, on average, most of these regions are less variable than expected if selection were not acting to constrain them. This does imply that certain subsets of these elements do have biochemical roles with enough impact on reproductive success to overwhelm the stochastic fluctuation of variant frequencies from generation to generation and that the data sets were large enough to detect these subsets. Yet the 80% figure implies, at least to the average person, that the overwhelming majority of our genomes cannot sustain mutation without a non-negligible impact on fitness - which would be an extraordinary claim, and for now remains a disingenuous one.

The New York Times used a Google Maps analogy for ENCODE, and I thought this could fit quite well. I could envision the human genome as a major metropolis, Detroit perhaps. The downtown still contains the bastions of yesterday's economic glory, without which the city might completely turn to shambles. As someone viewing a Google map of Detroit, I could easily posit that these buildings still serve valuable economic and governmental functions, similar to how the ENCODE elements conserved across class Mammalia likely encode important developmental and housekeeping functions within humans. However, as I pan over the extremities of the city, I would find houses in various states of disrepair. Certainly, from my vantage point many would have all the characteristics of functional domiciles - roofs, driveways, fences - to discriminate them from rural areas of the country without human presence. Yet if I were to explore those areas on the ground level, I would find many of the homes abandoned. Granted, this wouldn't stop the occasional transient homeless person from squatting for certain periods of time; but that hardly meets the usual definition of functional.

The bleaker reality of population genetics

This is how I suspect much of the human genome works. Most of it is capable of binding the occasional transcription factor, transcribing the odd RNA, and accepting contextual epigenetic markings. On rare occasions the insertion of a duplicated gene or the local rearrangement of the DNA may provide opportunities for sections with previously transient but useless biochemical activity to take on regulatory roles that have non-negligible effects on fitness and eventually become conserved, much like how the introduction of a successful business into a dying community can drive the revitalization of existing infrastructure. However, like the successful business and its surrounding region, the conserved genomic elements wink in and out of existence on a longer timescale.

I am excited by the genomic "Google map" that the ENCODE project has provided us, and I am sure it will lend considerable insight into human disease when combined with the theoretical power of population genetics. I just don't think the authors should imply that everything with the characteristics of function is presently important.

Wednesday, September 5, 2012

The Stationary Distribution: A third way


Ian Carroll helpfully reminded me that I had neglected the easiest (and most computationally efficient) way of calculating the stationary distribution of the Markov model in the previous post, and I would be remiss if I didn't demonstrate it and the logic behind its use. This time we take advantage of the definition of an eigenvector. Recall that our goal is to find the probability distribution of states that, when left-multiplied by the transition probability matrix, results in the same distribution. In math-speak, this is the following: $$ {\bf v^TP = v^T} $$

Now, with that in mind, recall (or learn for the first time if you're one of today's lucky 10,000) that the eigenvector ${\bf u}$ and corresponding eigenvalue $\lambda$ for some square matrix ${\bf A}$ are defined as follows: $$ {\bf Au} = \lambda {\bf u} $$

More specifically, these are called the right eigenvectors/eigenvalues, since they are defined when the vector is multiplied on the right. We can also define the left eigenvectors/eigenvalues as the ${\bf u}$, $\lambda$ combos that satisfy the following: $$ {\bf u^TA} = \lambda {\bf u^T} $$ Now, if we plug in our stationary distribution vector ${\bf v}$ for ${\bf u}$, our transition matrix ${\bf P}$ for ${\bf A}$, and let $\lambda = 1$, we get the familiar ${\bf v^TP = v^T}$. Also, finding the left eigenvalues and eigenvectors for a matrix is the same as finding the right-side versions for the transpose of that matrix. This means that, computationally, we can take the eigen decomposition of ${\bf P^T}$, find the column in the ${\bf U}$ matrix corresponding to an eigenvalue of 1, normalize that vector (since it can be scaled by any arbitrary constant), and that will give the stationary distribution. I demonstrate this below:

import numpy as np
from numpy.linalg import eig

# The transition probability matrix (entries as defined in the previous post)
transition_mat = np.matrix([
S,U = eig(transition_mat.T)
stationary = np.array(U[:,np.where(np.abs(S-1.) < 1e-8)[0][0]].flat)
stationary = stationary / np.sum(stationary)

>>> print stationary
[ 0.34782609  0.32608696  0.30434783  0.02173913]

We get the same stationary distribution as before without having to perform an extra exponentiation or matrix multiplication! The glories of the eigen decomposition never cease to amaze!
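Since the transition matrix itself lives in the previous post, here is a self-contained check of the same recipe with a made-up two-state chain (the matrix below is hypothetical, not the one from that post):

```python
import numpy as np
from numpy.linalg import eig

# A hypothetical 2-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

S, U = eig(P.T)                           # right eigenvectors of P^T = left eigenvectors of P
v = np.real(U[:, np.argmin(np.abs(S - 1.0))])
v = v / v.sum()                           # normalize to a probability distribution

print(v)                       # the stationary distribution [5/6, 1/6]
print(np.allclose(v @ P, v))   # True: v^T P = v^T
```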

Monday, September 3, 2012

Faith, Science, and a line in the sand

Jerry Coyne's blog, Why Evolution is True, reported today on a free homeschool course offered by Cambridge University (edit: the Faraday Institute, which apparently only rents space from Cambridge) that purports to demonstrate the inherent compatibility between science and Christianity. As a scientist with an interest in promoting scientific literacy among all people, my initial reaction is that anything wrapped up for fundamentalist consumption without the addition of incestuous clones riding dinosaurs around an ecologically unsustainable garden is probably a good thing. After all, such a course might provide the right frame for at least opening up a discussion with the scientific community that doesn't devolve into the tired tropes of the intelligent design community, such as irreducible complexity and the inadequacy of "microevolution".

The problem I have with the course and with others who follow the same line of thinking is neatly encapsulated by this quote from theologian Alister McGrath, which appears in the course materials:

I think it's fair to say that nothing that we observe in nature, for example, its regularity, or indeed these remarkable anthropic phenomena, prove that there is a God, but the really important point is they are absolutely consistent with belief in God, and therefore I'd like to suggest that we don't think about nature proving that there is a God; that's how an earlier generation might have approached this. For me the really important thing is that the world as we observe it corresponds with what Christians would say the world ought to be like, that there's a correspondence between the theory and the observation.

He is right in saying that nothing in nature could prove there is a God, but in identifying a harmonious correspondence between Christian theory and observation, he says precisely nothing. Faith, by definition, exists in the absence of evidence, and without any way to falsify Christian doctrine based on observable phenomena, there is no basis for saying that it is ever consistent with observation. This is the fundamental divide between science and faith. Scientists make predictions that must be testable and falsifiable. If a scientist's predictions fail, that person will (hopefully) suck it up and go back to the metaphorical or literal drawing board. Religions, on the other hand, behave more like highly unethical scientists - celebrating whenever the stochasticity of the universe happens to occasionally fall in line with their vague predictions and chalking the rest up to God's mysterious ways.

This is not to say that scientists cannot have faith. There are scientists who genuinely believe their faith gives them strength, inner peace, etc. We are all capable of some irrationality (read: belief without evidence), and that is not always a bad thing. But injecting faith into science through "theistic evolution" and the like will never be science and, if it were, faith would no longer be faith.