Exploring how non-coding DNA affects the ability of evolution to evolve around neutral gaps
Introduction:
One of the most common challenges to modern genetics and evolution is the claim that natural selection is incapable of evolving across large “neutral gaps” that hinder the evolution of genetic information. Indeed, a quick statistical look at the daunting task of evolving sequences hundreds of base pairs long can make the situation look hopeless. If a genetic sequence is composed of 100 elements, and environmental pressures require the sequence to find a genetic solution of at least 120 elements, the probability of random mutations alone causing this change seems vanishingly small. This experiment aims to show how evolution through natural selection and random mutation dramatically increases the odds of such a change occurring. In particular, the experiment focuses on how non-coding sequences, often referred to as “junk DNA”, greatly increase the likelihood that the evolutionary process succeeds.
Non-coding sequences are sections of genetic code that serve no obvious purpose or don't code for anything at all. These sections of DNA are particularly interesting because they make up the vast majority of genetic information in organisms that live on Earth. Roughly 98% of human DNA has been termed “junk DNA”. In some organisms the figure is even higher. The existence of non-coding DNA is something of a mystery. Some believe that this DNA is nothing more than a relic from previous generations, bits and pieces of genetic code that are no longer being used. Others insist that junk DNA plays a crucial role in the evolutionary process, providing the organism with large amounts of “test” DNA to toy around with. Since these sections of DNA are not currently in use, mutations to them do not always have an immediate impact on the organism. We will demonstrate how these sections of code can have a dramatic impact on the survival of a genetic code.
Methods:
In this experiment, we invent a fictitious genetic world as an analogy to our own. Instead of the 64 codons that make up the building blocks of real DNA, our system will have ten, represented by the integers zero through nine. The next step is determining which sequences of genetic code translate into living organisms. In reality, this is one of the biggest unknowns in genetics. There is no set formula for what sequences result in living organisms. All we really know is that the vast majority of sequences do not survive. Previous experiments have used letters as building blocks, and words as the organisms. If a sequence of letters results in a real word, the organism lives. If the sequence is gibberish, the organism is stillborn. Since the vast majority of letter-combinations result in gibberish, this approach is a fairly good one for analogy to real genetics. It has been shown that very large words can be evolved from very small words, countering claims of misleading statistics that would suggest otherwise.
But in our simulation, the approach is more mathematical. We will allow sequences to live only if they conform to a certain mathematical formula. In essence, only sequences that sum to a perfect square will live. A perfect square is a non-negative number that is the square of an integer, such as 4, 16, 25, 36, or 64. Except we're going to make one change to make it more difficult: we’re only going to let fourth powers survive. These are numbers such as 2^4 = 16, 3^4 = 81, or 4^4 = 256. This makes the task far more challenging, as far fewer solutions exist, and they become spread very far apart very quickly. We're trying to see if an organism can evolve from a very simple solution into a much harder solution. An example of a sequence that would survive would be
23146
since 2+3+1+4+6 = 16. But the probability of that exact sequence occurring randomly is quite small, about 1 in 100,000. Of course, many other sequences 5 elements long can also survive, such as 11824 or 53323. The order of the sequence does not matter right now, since we're simply taking a sum, but it will matter shortly; the code will have to be read from left to right.
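The survival rule described so far can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiment: the function names `is_fourth_power` and `survives` are made up for this sketch, and because the text lists 16, 81, and 256 as the first solutions, it assumes that sums of 0 and 1 do not count.

```python
def is_fourth_power(total: int) -> bool:
    """Return True if total equals k**4 for some integer k >= 2 (16, 81, 256, ...)."""
    k = 2
    while k ** 4 < total:
        k += 1
    return k ** 4 == total


def survives(sequence: str) -> bool:
    """A sequence of digit codons lives if its plain digit sum is a fourth power."""
    # Assumption: no start/stop rule yet, so every digit is simply added.
    return is_fourth_power(sum(int(codon) for codon in sequence))


# The three sequences from the text, plus one that sums to 15 for contrast.
for candidate in ["23146", "11824", "53323", "12345"]:
    print(candidate, survives(candidate))
```

The first three sequences come from the text and each sums to 16; the last one sums to 15 and is included only as a contrasting case.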
So let’s take a look at the task presented. The figure below is an estimate showing what percent of possible sequences would survive, for a given sequence length:

Figure 1: Potential solution percentages (y-axis) versus sequence length (x-axis)
So we can see that for a sequence of length 50, approximately 1% of all combinations will be solutions that survive and reproduce. But there is a troubling feature in this graph that is crucially important. Figure 2 shows a closer inspection:

Figure 2: Magnified view of Figure 1
We can see that between sequence lengths of about 30 and 40, there exist no survivable sequences. The reason for this is simple: There's virtually no way to add up 35 numbers that range from 0-9 to be 16, 81, or 256 (the first three possible solutions). The only exception here is if you use a lot of zeros, such as the sequence:
55600000000000000000000000000000000
which is 35 elements long and sums to 16, thus surviving. However, the probability of such a sequence occurring is so vanishingly small that for all intents and purposes, it is accurate to say that there are no solutions for sequences around length 35.
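The gap can be checked by counting exactly how many sequences of a given length sum to a fourth power. The sketch below does this with a small dynamic-programming table. It is one possible way to produce such numbers, not necessarily how the figures were generated, and the names `digit_sum_counts` and `surviving_fraction` are hypothetical. It assumes all ten digits are equally likely and that zero simply contributes nothing to the sum.

```python
def digit_sum_counts(length: int) -> list[int]:
    """counts[s] = number of digit sequences of this length whose digits sum to s."""
    counts = [1]  # the empty sequence sums to 0
    for _ in range(length):
        new_counts = [0] * (len(counts) + 9)
        for s, ways in enumerate(counts):
            for digit in range(10):
                new_counts[s + digit] += ways
        counts = new_counts
    return counts


def surviving_fraction(length: int) -> float:
    """Fraction of all 10**length sequences whose digit sum is 16, 81, 256, ..."""
    counts = digit_sum_counts(length)
    # Assumption: fourth powers start at 2**4, matching the solutions named in the text.
    targets = [k ** 4 for k in range(2, 10) if k ** 4 < len(counts)]
    return sum(counts[t] for t in targets) / 10 ** length


for n in (5, 20, 35, 50):
    print(n, surviving_fraction(n))
```

Comparing the value printed for length 35 with those for lengths 20 and 50 gives a sense of how deep the gap is.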
So the problem here is this: How does a sequence 30 elements long evolve into a sequence 45 units long, if there are no solutions in-between? If the sequence tried to add just one element at a time, it would die! It would seem that the code would have to jump 15 elements in one generation, and have those 15 elements help it sum to exactly the next solution. It can be shown that the probability of such a mutation event occurring is practically zero. It would be unrealistic to expect such a mutation to occur, even in millions of years of attempts, especially considering how rare a large mutation like that is.
So how does nature pull off these sorts of miracles?
To introduce the concept of non-coding or “junk” sequences, we will reserve zero as a special codon, much like the start and stop codons in real genetic codes. If a zero is the next element in a sequence, the summation halts until another zero reactivates it. So 90178607 is also a solution: the 1786 between the two zeros is skipped, leaving 9 + 7 = 16. Any mutation that occurs between two zeros (as long as it does not create a new zero) has no impact on the survival of the organism in that generation.
90178607 => 16
1230782760610879203 => 16
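The start/stop rule can be written as a short function. This is a sketch under two assumptions that the text implies but does not spell out: reading is switched on at the start of the sequence, and every zero toggles it. The name `coding_sum` is invented for this illustration.

```python
def coding_sum(sequence: str) -> int:
    """Sum the digits read left to right; each zero switches reading off or back on."""
    total = 0
    reading = True  # assumption: the sequence starts in the coding state
    for codon in sequence:
        if codon == "0":
            reading = not reading  # a zero acts as a stop codon, then a start codon
        elif reading:
            total += int(codon)
    return total


for seq in ["90178607", "1230782760610879203"]:
    print(seq, "=>", coding_sum(seq))
```

Both sequences from the text evaluate to 16 under this rule, because the digits between each pair of zeros are skipped.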
Here is where things get interesting, and start to turn the statistical tide. Imagine we start with a simple solution that could easily come together by random occurrence:
4552 =>16
Now subsequent mutations go as follows:
45520 => 16
455207 => 16
... and after many generations ...
45520712356253746562 =>16
Now a point mutation changes the zero into a one.
45521712356253746562 =>81
And thus, the organism survives in its newly evolved form. If the same mutation had occurred in any previous generation (which probably happened dozens of times), the organism would have died. But in organisms that reproduce in massive numbers (single-celled organisms, which have the hardest time crossing neutral gaps), such deaths have little impact on the survival of the population.
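To see which single point mutations of the long sequence are survivable, one can enumerate all of them and apply the start/stop rule. The sketch below is illustrative only; `coding_sum` and `point_mutants` are hypothetical helper names, and it assumes reading starts switched on and that every zero toggles it.

```python
FOURTH_POWERS = {k ** 4 for k in range(2, 10)}  # 16, 81, 256, ...


def coding_sum(sequence: str) -> int:
    """Sum the digits read left to right; each zero switches reading off or back on."""
    total, reading = 0, True
    for codon in sequence:
        if codon == "0":
            reading = not reading
        elif reading:
            total += int(codon)
    return total


def point_mutants(sequence: str):
    """Yield every sequence that differs from the input at exactly one position."""
    for i, old in enumerate(sequence):
        for new in "0123456789":
            if new != old:
                yield sequence[:i] + new + sequence[i + 1:]


parent = "45520712356253746562"
viable = {m: coding_sum(m) for m in point_mutants(parent)
          if coding_sum(m) in FOURTH_POWERS}

print("parent sum:", coding_sum(parent))
print("viable point mutants:", len(viable))
print("mutants that reach 81:", [m for m, s in viable.items() if s == 81])
```

Mutants that change a digit inside the non-coding tail to another non-zero digit keep the sum at 16, while replacing the stop codon with a 1 switches the whole tail on and reaches 81.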
Results:
Before we even begin simulating such an environment, we can take a mathematical look at how things will play out. Figure 1 showed how the situation looks without non-coding sequences. Large gaps exist that dramatically hinder the evolution of larger sequences from smaller sequences. Let's take a closer look at why these gaps occur where they do:

Figure 3: Breakdown of Figure 1 into individual solution components. Solution 1 (16) in red, Solution 2 (81) in green, and Solution 3 (256) in blue.
Now we can see why this happens. The first two solutions (16 and 81) only occur in sequences smaller than about 30 elements. The second solution begins at about 10 elements, right as the first solution ends. This border zone isn't a huge jump, and could probably be evolved easily, even without non-coding sequences getting involved. Then we can see where the second solution tails off (around x=30), but the third solution doesn't begin until around x=40. The gap between the third and fourth solutions, and fourth to fifth, is even larger.
But now let's take a look at the probabilities when junk DNA gets involved:

Figure 4: Solution probabilities when non-coding sequences are allowed. Curves are jagged as a result of numerical (as opposed to analytical) solution method. Percentages based on number of solutions drawn randomly out of 100,000 attempts over various sequence lengths.
We can see that the situation is now completely evolvable. The solutions overlap, and there is no gap. A sequence could easily evolve from the first solution into the second and then into the third. Another way of looking at it is that there is roughly a 1% chance that a sequence 20 elements long will sum to 16, and approximately 1% of those sequences will have enough junk DNA to potentially mutate in the very next generation and add up to 81 by simply altering a stop codon. So the seemingly impossible evolutionary situation turns out to be completely realistic due to the presence of junk DNA. Not so useless after all!
Interestingly enough, the overlapping of solutions never seems to become a problem, even for extremely large sequence differences. The gap between the fourth and fifth solutions is about 70 elements long in Figure 3, but Figure 4 clearly shows that the two solutions overlap. As the distance between solutions increases, so too does the workable range of sequence lengths for a given solution with non-coding sequences.
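A numerical estimate of the kind described in the Figure 4 caption can be sketched as a simple Monte Carlo loop. This is an assumed reconstruction, not the original code: it draws every codon uniformly from 0 through 9, treats each zero as a toggle with reading switched on at the start, and uses the made-up names `coding_sum` and `estimate_solution_rates`. The demo loop uses fewer draws than the 100,000 mentioned in the caption to keep the run short.

```python
import random

TARGETS = (16, 81, 256)  # the first three fourth-power solutions


def coding_sum(codons) -> int:
    """Sum integer codons left to right; each zero switches reading off or back on."""
    total, reading = 0, True
    for codon in codons:
        if codon == 0:
            reading = not reading
        elif reading:
            total += codon
    return total


def estimate_solution_rates(length: int, trials: int = 100_000, seed: int = 0) -> dict:
    """Estimate, per target, the fraction of random sequences that sum to it."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = {target: 0 for target in TARGETS}
    for _ in range(trials):
        total = coding_sum(rng.randrange(10) for _ in range(length))
        if total in hits:
            hits[total] += 1
    return {target: count / trials for target, count in hits.items()}


for length in (10, 20, 40, 60):
    print(length, estimate_solution_rates(length, trials=20_000))
```

Lengths at which more than one target shows a non-zero rate are the places where the solution curves overlap.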
Q & A
But your solutions still become less and less probable as sequence complexity increases!
That's simply because of the way I've chosen the fitness function. I could have just as easily designated a fitness function that becomes easier to satisfy as sequence length increases. I simply chose a simple function that increases the gap length between solutions, and has the unfortunate side effect of making solutions less common. No one has demonstrated that longer sequence solutions are inevitably less likely. In fact, many contend that as the amount of genetic material involved increases, so too does the number of ways it can be arranged to form a suitable protein/organ/organism.
Okay, but your solutions are becoming more spread apart. Eventually they'll get too far from each other for even junk DNA to help.
Again, only because I've chosen a fitness function to experiment with increasing sequence gaps. Most geneticists will tell you that no one has ever demonstrated the need for an organism to cross a gap more than a few sequences long. I'm simply looking at the "what if" case. What if an organism needed to cross a gap 20 elements long? Could it do it? The answer seems to be yes, but it depends highly on the nature of the solution space, which, of course, is unknown. It is important to note that my experiment suggests that as the length of genetic sequence solutions increases, so too does the effect of non-coding sequences. That is, the larger the sequence becomes, the more non-coding sequences help.
So how the heck does this model evolution?
Exactly! It's not meant to model evolution. No simulation (to date) can. There are simply too many things involved! The best we can do is handle specific Creationist claims one at a time, set up small simulations designed to test a specific claim, and see if it holds true, by analogy. Intelligent Design proponents continue to claim that a sequence gap of any real length is unevolvable. Well, ignore for the moment that we have no reason to believe such gaps need crossing in the first place. Is their claim accurate, mathematically speaking? Not even close! The reason is that they leave out concepts such as non-coding DNA and dynamic fitness functions in their probability "calculations". This experiment simply shows how such a claim is statistically flawed. We can argue till the cows come home over mutation rates and such, but it still won't change the fact that ID proponents grossly oversimplify the process of genetic evolution.
I noticed that between figures 3 & 4, the actual probability of a sequence being a solution decreases for a given sequence length. Doesn't this make things harder?
Not at all. Although any particular sequence length has a smaller chance of forming a specific solution, you have to realize that the overall probabilities are the same. They have simply been stretched over a larger span of sequence lengths. In calculus terms, the area under each curve remains the same. The bonus effect of stretching the solution space out is that it makes it much easier for solutions to overlap. And when solutions overlap, there is no gap to jump whatsoever.
Other Factors Involved
Several other factors are involved in either expediting or hindering this evolutionary progression. The first and most obvious factor is relative mutation rates. I say relative because actual mutation rates are simply scaled-down versions of what we will be using in this simulation. For example, if our simulated pool of organisms is 1,000 members strong and point mutations occur 0.1% of the time, this is computationally equivalent to a population of 10 million organisms (remember, bacteria here) in which point mutations occur 0.00001% of the time, at least when it comes to the average number of generations required to evolve something. So we try to keep the population size small enough that the computer can compute more generations faster without seriously hindering the accuracy of our results.
That said, the mutation rates greatly affect the average number of generations required to evolve something. We will test our model over various mutation rates for deletions, insertions, point mutations, etc., but it is almost pointless to do so, since our experiment is simply an analogy to real genetics. Some mutation rates will make it harder to evolve solutions, increasing the number of generations needed to mutate from one solution to another, but the graphs will simply be stretched out. The fact that the solutions overlap will not be affected by the mutation rates (unless they are drastically incorrect).
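The arithmetic behind this scaling can be checked directly. The sketch below assumes that "occurring X% of the time" means a per-organism, per-generation probability, so that the expected number of point mutations per generation is the population size times that rate. The function name is made up for the illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def expected_mutations_per_generation(population: int, rate_percent: str) -> Fraction:
    """Expected point mutations per generation = population x per-organism rate."""
    # Assumption: rate_percent is a per-organism, per-generation percentage.
    return population * Fraction(rate_percent) / 100


small = expected_mutations_per_generation(1_000, "0.1")
large = expected_mutations_per_generation(10_000_000, "0.00001")
print(small, large)
```

Both populations yield the same expected supply of mutations per generation, which is the sense in which the text treats them as equivalent. The sketch checks only this arithmetic and nothing else about the two populations.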
Another factor that affects the evolutionary method is how real systems involve dynamic fitness parameters. Static fitness parameters are nothing more than simplifications of real biological systems. This is vitally important to the debate, and will eventually be explored. One can easily imagine how if the fitness parameters vary with time, the location of the first three solution curves relative to each other will shift around, often accelerating the evolutionary process.
Simulation Results:
(In progress)