I am considerably more complex than just my genome. I learned to understand, speak, and read English, to live in my culture, and to smile or scowl. I then did a degree and worked for 50 years. Had children, earned and spent money, and all that. That stuff is not coded in my genome. The potential is, but all that nurture is part of me too. And it ain't coded in DNA. And there is an awful lot of it in any functioning adult.
While this article presents an intriguing analytical framework, it exemplifies a fundamental limitation in modern scientific discourse that reduces the profound complexity of life to purely mechanical and computational metaphors. This perspective is problematically reductionist.
First, the article's characterization of bacteria as strategic actors that 'decide' to optimize for simplicity reflects a fundamental misunderstanding. Bacteria aren't sentient strategic agents - they don't make decisions in the way the article implies, which stems from modern science's reductionist habit of imposing human-like strategic behavior on natural processes it doesn't fully understand.
They are part of an interconnected living system that operates according to quantum principles rather than the linear cause-and-effect relationships our minds tend to impose on them. The distinction between 'good' and 'bad' bacteria is similarly an artificial construct that reveals more about our linear thinking than the actual nature of these organisms.
Here's an inconvenient truth: the body doesn't know what's a toxin, a protein, enzyme and whatnot. The body operates on a fundamentally different paradigm than the mind. The body's innate intelligence doesn't recognize the dualistic concepts of good versus bad that dominate our mental framework. This mirrors Hippocrates' insight about medicine and poison being distinguished only by dosage. Ironically, it's often the mind's well-intentioned interventions - through its insistence on linear, reductionist approaches - that create harm by imposing its rigid interpretations onto the body's more nuanced systems.
Second, the article's treatment of DNA storage capacity is particularly revealing of this limited perspective. Recent research has already confirmed that DNA functions as a fractal antenna - a sophisticated multidimensional structure capable of receiving, transmitting, and processing information in ways that transcend simple linear sequences (https://pubmed.ncbi.nlm.nih.gov/21457072/). This property is completely overlooked when we reduce DNA to a one-dimensional string of nucleotides measured in megabytes.
This relates to a deeper issue: the fundamental mismatch between how our analytical mind processes information and how biological systems actually operate. The human mind works like a serial processor - it understands ONLY through juxtaposition, creating meaning by placing one thing against another (like plotting points on an X and Y axis). This is why our scientific models tend to be two-dimensional representations that break systems down into comparable components.
But biological systems operate more like quantum parallel processors. They can maintain complete understanding of phenomena without needing to break them down into constituent parts. This is why attempting to measure genetic complexity in terms of conventional data storage units is deeply misleading. What appears as 156 KB in our linear, particle physics model could actually represent thousands of terabytes of information when understood in terms of quantum fields - yet paradoxically show as zero size in quantum measurements because the information exists in a zero-point scalar field state.
The article's comparison of genome sizes to software programs reveals this blind spot. It's not that Microsoft Word is more complex than a living organism - it's that we're measuring complexity using tools designed for linear, sequential information processing while completely missing the quantum, multidimensional nature of biological information storage and processing.
What makes this current paradigm particularly dangerous is its complete disconnection from the true nature of being. Modern science, in its clinical rationality, has elevated this mechanical worldview to an absolute model, while simultaneously severing our connection to the deeper principles that govern life itself. The supreme irony is that this approach, in its very sophistication, has fostered a profound delusion: that we can outsmart nature through pure rational analysis.
This hubris is perfectly reflected in the inventions and aspirations of modern humans, who, in their vanity, style themselves as direct descendants of the gods - capable of improving upon nature's designs through sheer force of analytical intelligence. It's a mindset that mistakes technological sophistication for true wisdom, and computational complexity for genuine understanding.
Until we develop a more humble and holistic understanding that reconnects scientific insight with spiritual wisdom, our models will continue to miss not just the true complexity and elegance of living systems, but also the profound responsibility that comes with attempting to manipulate them.
A beautifully written and well-researched thought experiment. Although I disagree with some of your statements, the topics you raise and the questions you inspire are of utmost importance and should inspire further research by your readers.
Personally, I’m fascinated by the inner workings of a single cell, particularly its protein synthesis, energy production, and communication abilities. Studying the different organelles with their specialized functions, including self-replication, I find that a single living cell is far more complex than any human-made system.
Just search for “flagellar motor” and be in awe of the complexity of this single organelle.
And human beings that can create a Large Hadron Collider still cannot synthesize a single self-replicating cell.
I think that reducing human beings to a number of genes in the genome is hugely off -- it doesn't even take account of epigenetics: how genes can be switched on or off by a myriad of factors relating to environment and experience.
It's like saying all there is to be known about a house is the number of bricks, floorboards, rafters and roof tiles.
And what about the brain? Each one with more neural connections than stars in the known universe -- constantly forming new connections and having old ones pruned, all directed by environment and experience.
Its complexity is probably beyond measurement in millions of petabytes.
Correction: the brain has more neurons than stars in the Milky Way, with 100 trillion synaptic connections. I have no idea how many petabytes of storage that would require.
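For a rough sense of scale, here is a toy calculation; the bytes-per-synapse figures are assumptions for illustration only, not measurements:

```python
# Toy arithmetic only: storage implied by ~100 trillion synapses under
# assumed (not measured) bytes-per-synapse values.
synapses = 100e12
for bytes_per_synapse in (1, 4, 32):
    petabytes = synapses * bytes_per_synapse / 1e15
    print(f"{bytes_per_synapse:>2} byte(s)/synapse -> {petabytes:.1f} PB")
# Roughly 0.1 to 3 PB under these toy assumptions.
```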
The scale of software analogized to complexity is waaaaaay off. Microsoft Word could easily fit in a fraction of its current size. In the 90s the software development community totally lost its ability to worry about the size of its product. Speed of development and the notion of reuse became ALL. And apps like Word depended, generation to generation, on faster machines and more memory. Depended wholly, because the entire software industry lost the knowledge of how to build compact, concisely performing code. There is no logical justification for the present size of the Microsoft Word installation. None. It is NOT that much more complex than the version I still have on a single-sided floppy disk.
Great article. This points at one of the central difficulties for (human) engineers hoping to influence biology: everything has multiple functions. There is no gene for a single trait; most traits are influenced by hundreds or thousands of genes, when the entire genome is only tens of thousands of genes! Every possible drug target also has hundreds of other functions!
I am hopeful that the nonlinear gestalt thinking of modern AIs will be able to intuit the emergent function of these n-dimensional biological networks.
You assert that non-coding DNA is (to use the old term) ‘junk’. This may be what was assumed to be the case about 25 years ago, but significant evidence now supports the hypothesis that it plays a vital part in regulating transcription of coding DNA and how it is put to use. It clearly plays a major part in the development process. Perhaps you need to gain a more up to date understanding of biology / genomics.
Similarly, in your discussion of complexity, you fail to consider issues such as bloat and compression. Modern editions of Microsoft Word are a classic example of bloat, due to dubious coding choices. Observe that Windows, by your estimate, is of huge complexity, and yet Linux distributions of similar or greater capability can fit on a CD.
Issues around compression also plague your statements about genomes. The human genome is ‘small’ because it is very efficiently coded, with individual genes coding for many proteins (regulated, possibly, by non-coding DNA), so we are apparently less complex than less efficiently coded amoeba. The ultimate example of this has to be viruses, which fit code for complex physical structures into a few hundred bases, thanks to incredibly sophisticated coding, eg RNA that can be read in either direction, to give different proteins.
So, your basic contention that the length of the description equates to the complexity of the described is simplistic and misleading. As a final example: try to describe ‘gemütlich’ in English; it will take many words, whereas German can do it in one.
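To make the overlapping-coding point above concrete, here is a toy sketch (the example sequence is invented; the dictionary is the standard codon table): reading the same bases in a shifted frame yields a different protein, which is part of how compact viral genomes squeeze several products out of one stretch of sequence.

```python
# Toy illustration of overlapping reading frames ('*' marks a stop codon).
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str, frame: int = 0) -> str:
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

seq = "ATGGCTGACTGACTGATTACA"   # made-up sequence
print(translate(seq, 0))        # one product from frame 0
print(translate(seq, 1))        # a different product from the same bases, shifted by one
```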
This post garnered a lot of comments, often making some variation on the same point, so here's a general clarification:
My goal writing this was to encourage people to look at living organisms through the lens of "how complex can they afford to be, given the constraints of evolution?".
I know that the comparison between software and genomes is not mathematically rigorous (and people are correct to point it out).
But here is the thing: being mathematically rigorous, here, would basically amount to saying, "the true Kolmogorov complexity of living organisms is unknown, we can't calculate it rigorously, end of the story". But then, I couldn't make any of the points I wanted to make about living organisms and evolution!
On the other hand, if we accept a bit of approximate information theory and handwaving, I do think we can get to really interesting stuff.
What I'm saying is that the fact that the complete E. coli genome fits on a floppy disk is at least *a little bit meaningful* – it gives us *at least some intuition* of how complex the organism can afford to be, and it's a good way to intuitively understand that the constraints on a living organism's complexity are very different from the constraints on software size. With this intuition in hand, we can look at some real mechanisms that are found in nature and better appreciate why/how they differ from human-made technology.
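For what it's worth, the arithmetic behind the floppy-disk claim is short (assuming roughly 4.6 Mbp for E. coli K-12 and a naive 2 bits per base):

```python
# Back-of-envelope: does a bare E. coli genome fit on a 1.44 MB floppy?
genome_bases = 4_600_000          # approximate length of the E. coli K-12 genome
bits = 2 * genome_bases           # A/C/G/T encoded naively as 2 bits each
megabytes = bits / 8 / 1_000_000
print(f"{megabytes:.2f} MB")      # ~1.15 MB, under the 1.44 MB of a floppy disk
```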
Here's the sad truth: most biology is like that. Most papers in theoretical biology have some crazy approximations and assumptions under the hood. That doesn't prevent them from being meaningful – on the contrary, it's often a necessary step to extract some meaning from the messy living world.
Your post is provocative and vivid, and that can be great for inspiring curiosity about biology. That's much needed, especially in these times.
As an academic researcher studying epigenetics, let me say a little more about why I am pushing back against your argument in my other comment.
Briefly, "DNA has fewer bits than the source code of MS Word" should invite us to ask, "so where is the rest of the information stored?" and "What aspects of biology does this information encode?"
Don't just stop at "I guess that's all there is!"
I think it's overwhelmingly important to see the cell (and its DNA) in the context of the 4.5 billion years of evolution that gave rise to it. If we knew the DNA sequence of the last universal common ancestor (LUCA), and put it into a human cell, the cell would die. That's despite the fact that LUCA did manage, somehow, to evolve through a sequence of mutations into the human genome. But somehow, the biochemistry engendered by this series of changes diverged from the biochemistry that would be compatible with developing a LUCA cell.
Note that this is still classic Darwinian evolution. It's just that there are few intrinsically advantageous or deleterious mutations or structural variations, with the exception of extreme changes like the complete chemical digestion of the genome into small fragments. Advantage depends on the time scale, organism, and biochemical context of the mutation. The contexts that evolution has produced are quite diverse, and the result of an unfathomable number of stepwise mutations going back to the context of the first spontaneously self-replicating DNA-based structure.
Understanding the nature of these interactions, how information is divvied up between cell biochemistry and DNA, between a cell and its local environment, and between tissues, organs, bodies, and their environment (including in the womb and the microbiome!) is at the forefront of scientific research and our understanding of basic biology. It's also going to lead to another generation of medicines.
We can still use metrics like number or ratio of bits after compression as a lens on biology. A paper that does this for many taxa across the tree of life is "On the Approximation of the Kolmogorov Complexity for DNA Sequences." They don't argue in defence of its validity for a particular scientific question - they just compute the statistics and call it a day, which is fine.
Another interesting one is "Sequence complexity and DNA curvature," which showed a relationship between sequence complexity, DNA curvature, and noncoding regulatory elements.
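If anyone wants to play with the compression-as-complexity idea, here is a minimal sketch. Gzip is just a convenient stand-in for the purpose-built DNA compressors such papers tend to use, and the compressed size is only an upper bound on Kolmogorov complexity, but it shows the basic contrast:

```python
import gzip, random

def compressed_bits_per_base(seq: str) -> float:
    """Rough information-density estimate: gzip-compressed size, in bits per base."""
    return 8 * len(gzip.compress(seq.encode("ascii"))) / len(seq)

repetitive = "ACGT" * 10_000                              # satellite-like repeat
scrambled = "".join(random.choices("ACGT", k=40_000))     # no structure to exploit
print(compressed_bits_per_base(repetitive))   # far below 2 bits/base
print(compressed_bits_per_base(scrambled))    # close to 2 bits/base
```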
>If we knew the DNA sequence of the last universal common ancestor (LUCA), and put it into a human cell, the cell would die. That's despite the fact that LUCA did manage, somehow, to evolve through a sequence of mutations into the human genome.
A reader in another comment gave an analogy that I think is pretty good: imagine you intercept two encrypted messages from CIA agents. One is 12 gigabytes of encrypted binaries, indistinguishable from random bits. The second one is just "8". Even if you cannot interpret any of the messages (in the same way a LUCA cell couldn't be "interpreted" by a human body), you can still say that the first message contains more information than the second one.
But I agree that there's more complexity to the system than what's in the message itself. I guess the comparisons discussed in the post are more of a "banana for scale" type of situation.
Banana for scale is a great analogy here, if DNA is the banana and the overall complexity of the cell is what you're scaling with it. The challenge, of course, is that we don't have a great way to show the complexity of the cell or organism overall in relation to its DNA. We have the banana, but little ability to embed it in a picture of the object we want to photograph.
Your spy analogy is also good. What the receiver knows - the spy plans they worked out with their confidant, let's say - can be very complex. Thinking of those plans reductively as a set of pre-agreed alternatives, even a very short message can select among them. So a simple message, like "8", can increase the complexity of the receiver's information by an arbitrary amount that depends on their prior information state. But if we only see the message, we might mistakenly assume that the receiver has very little information at their disposal.
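A minimal sketch of that last point, with made-up numbers: the bits a short message conveys are set by the receiver's pre-agreed codebook, not by the length of the message itself.

```python
import math

def bits_conveyed(num_prearranged_plans: int) -> float:
    # Selecting one plan out of N equally likely, pre-agreed options
    # resolves log2(N) bits, however short the selecting message ("8") is.
    return math.log2(num_prearranged_plans)

print(bits_conveyed(10))      # ~3.3 bits if "8" picks one of ten plans
print(bits_conveyed(10**9))   # ~30 bits if the shared codebook is enormous
```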
Well-written article, and there is probably some truth in it. But you misunderstand Kolmogorov complexity. The surrounding machine that reads the code and instantiates it is *critical*. Is the shortest string describing Tetris maybe just a link on my PC? Why not? After all, the encoding on the floppy disk disregards all of the internal interpretations that are precoded into the computer.
In the same vein, the surrounding biology is *critical* to get from a DNA sequence to an animal or human. This includes not only the chemistry of ribosomes but all of physical reality. (The value might still be lower than one would naively assume, but much higher than the estimates given by DNA.)
It also disregards the fact that MS Word could be coded in a tiny fraction of the space if Microsoft had any economic interest in doing so.
Well put and completely correct. This whole essay rests on the idea that the length of the instructions is directly proportional to the complexity of the resulting structure, without understanding that structures and languages for interpreting instructions can differ in complexity by orders of magnitude. If the language didn't matter, you could take the genome of E. coli, convert it into binary, and run it as an .exe to "have" an E. coli cell.
I see what you mean, and you're certainly right on the computer program side – I don't know enough about low-level computing to know how much Tetris relies on external libraries, ready-made CPU instructions, or whatever computers do when they execute programs. I agree the size of a software package must vastly underestimate its total complexity. I kind of neglected this, because my point is that biological systems are often simpler than software packages – if the complexity of software turns out to be higher than its size suggests, it only makes biological systems look even simpler in comparison.
For biological systems, however, I still think almost all the information about how the system works has to be contained in nucleic acid sequences, and that includes the nature of nucleic acids themselves. That's what I was getting at in the parenthesis about DNA being the substrate of evolution: if there are any "structures for interpreting instructions" encoded in the chemistry of the ribosome, and if these structures have any non-trivial complexity beyond the baseline complexity of the primordial soup, then this additional complexity must have been selected for through evolution. And, with very few exceptions, evolution operates on nucleic acids. Any hidden information that isn't encoded in the DNA/RNA in some way (if anything, as a kind of redundancy) would have to evolve in a Lamarckian way, and that's just not very powerful compared to Darwinian evolution. From that, I conclude that pretty much all the information about how polymerases/ribosomes/etc. work must show up somewhere in the DNA in the form of a constraint on the sequence. Not *all* the information, but close enough.
Of course none of these estimates include information about the laws of physics etc., but since I'm making a comparison between two things from the same universe, it's just an offset and shouldn't matter for the comparison.
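(For what it's worth, the formal counterpart of "it's just an offset" is the invariance theorem of Kolmogorov complexity: for any two universal description languages $U$ and $V$,

$$K_U(x) \le K_V(x) + c_{U,V},$$

where the constant $c_{U,V}$ depends only on the pair of languages and not on the object $x$ being described. Whether our shared physics really behaves like such a bounded, organism-independent constant is, of course, an extra assumption.)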
> For biological systems, however, I still think almost all the information about how the system works has to be contained in nucleic acid sequences, and that includes the nature of nucleic acids themselves.
If that were true, then it would in principle be possible to physically transfer naked DNA from species A into the gamete or zygote of species B and produce a viable member of species A.
Alternatively, you should be able to implant a zygote from mammal A into the womb of mammal B.
The general reason why this is somewhere between biologically unlikely and impossible is that epigenetic mechanisms mediate the interaction between DNA and its chemical environment throughout the life cycle. I'm not referring to epigenetic mechanisms of inheritance per se, but to the simple fact that disruptions to the chemical environment in which cells or fetuses exist have a profound effect on their viability.
As you may know, DNA is densely and dynamically interacting with an extremely complex protein milieu that is cell-type and state dependent. This milieu is tightly regulated throughout mitosis and meiosis.
There is no point in any organism's life cycle at which naked DNA even exists, much less develops a cell or multicellular organism out of an arbitrary chemical soup. DNA is incapable of independently structuring a viable cell or organism outside the context of the biological soma with which it has co-evolved. The soma is arbitrarily complex.
> And, with very few exceptions, evolution operates on nucleic acids. Any hidden information that isn't encoded in the DNA/RNA in some way (if anything, as a kind of redundancy) would have to evolve in a Lamarckian way, and that's just not very powerful compared to Darwinian evolution.
Another way to look at this issue that you may find clarifying is that evolution operates on *ancestral series* of nucleic acids. There is a series of mutational steps that permitted LUCA to evolve into a blue whale, another that permitted it to evolve into pangolins, and another that permitted it to evolve into the Venus flytrap. One way we could potentially ignore the soma as a source of biological complexity is by focusing on ancestral series of genomes as a form of complexity. We can treat the soma as the product of the evolutionary path that gave rise to a particular individual, or even a particular cell.
I'm not even sure how to begin trying to make a theoretical estimate on how much complexity is possible, given average mutation rates, but it's probably much higher than what's possible if you treated a single individual's genome as the upper bound on their complexity.
This argument applies equally to MS Word. MS Word co-evolved with Windows, the programming language in which it's implemented, various hardware architectures, and so on. Its complexity is not bounded by the complexity of its source code.
I think the central point has to be that the genetic code which ended up creating the actual organism we see is probably much less complex than the code that would *guarantee* it, and this is because some aspects of the complexity of an organism are only realised through its development.
The code for “release chemical X in condition Y” can be very simple, but if it doesn’t fully determine the conditions you have a code for building something *a bit like the organism in front of you,* but not one that works in all contexts and all circumstances, and not one which is exact.
The complexity of the organism is not the complexity of the code that created it because the code will not produce it exactly. The code for TETRIS will, though?
So there is a difference between the instructions needed to create a complex emergent thing and the thing itself, and the simplicity results from a small number of processes turning on differently in different contexts. But that's effectively offshoring complexity from the code itself; the contexts are necessary for the organism, so these are two different forms of complexity accordingly.
It's like saying that a language is as complex as the number of letters in its alphabet, or that a book is only as complex as the number of words in it. It really is about what the DNA translates into. It's about the protein production and even those proteins' complex interactions. In terms of the code itself, the Shannon entropy of DNA and of binary could be compared. Not sure that much of importance could be gained from the comparison.
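If someone wanted to actually run that comparison, a minimal sketch would be the empirical per-symbol Shannon entropy (DNA tops out at 2 bits per base, a binary string at 1 bit per symbol); the example strings below are arbitrary:

```python
from collections import Counter
from math import log2

def shannon_entropy(seq: str) -> float:
    """Empirical per-symbol Shannon entropy of a string."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("ACGTACGTAATTCCGG"))   # at most 2 bits per base
print(shannon_entropy("0110100110010110"))   # at most 1 bit per symbol
```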
'only 10% of the human genome is actually useful, in the sense that it’s maintained by natural selection. The remaining 90% just seems to be randomly drifting with no noticeable consequences'
This is simply false, and the scientists who study DNA have known it to be false for decades:
https://magazine.washington.edu/feature/no-such-thing-as-junk-dna-researchers-say/
'No such thing as ‘junk’ DNA, researchers say'
'For decades most scientists thought the bulk of the material in the human genome—up to 95 percent—was “junk DNA.”
It now turns out much of this “junk,” far from an evolutionary byproduct, actually contains the vital instructions that switch genes on and off in all kinds of different cells. Changes in these instructions can affect everything from color vision to whether a person develops diabetes or cardiovascular disease or a host of other conditions.
“The junk DNA concept, as it has come to color our perception of the human genome, is somewhat bizarre,” says Stamatoyannopoulos. “If you picked up a Chinese newspaper and you could read only one or two percent of the characters, would you automatically assume the rest was junk?”
The Human Genome Project sequenced the 3 billion letters or DNA bases that make up the genome, and it provided a basic catalog of genes, which occupy only about 2 percent of the genome. But understanding how genes turn on and off is vital to figuring out basic biological processes, like development, or how genes contribute to normal health and disease. It turns out—contrary to expectation—that there are a modest number of genes (around 20,000) but these genes are controlled by millions of DNA “switches,” with the whole unit functioning as a kind of operating system for the cell. '
https://med.stanford.edu/news/all-news/2023/09/junk-dna-diseases.html
'For decades, scientists have known that, despite its name, “junk DNA” in fact plays a critical role: While the coding genes provide blueprints for building proteins, which direct most of the body's functions, some of the noncoding sections of the genome, including regions previously dismissed as “junk,” seem to turn up or down the expression of those genes"
https://biology.mit.edu/so-called-junk-dna-plays-a-key-role-in-speciation/
'So-called “junk” DNA plays a key role in speciation'
'More than 10 percent of our genome is made up of repetitive, seemingly nonsensical stretches of genetic material called satellite DNA that do not code for any proteins. In the past, some scientists have referred to this DNA as “genomic junk.”
Over a series of papers spanning several years, however, Whitehead Institute Member Yukiko Yamashita and colleagues have made the case that satellite DNA is not junk, but instead has an essential role in the cell: it works with cellular proteins to keep all of a cell’s individual chromosomes together in a single nucleus.'
******
When you make sweeping claims, it's best to have your basic facts straight at the outset.
Otherwise, someone might conclude you simply have no idea what you're talking about with regard to any subject.
I think there's a confusion between two separate questions:
1) About 1-2% of the human genome can be clearly identified as protein-coding genes. In the early days, some people referred to the remaining 98% as "junk". Are these 98% actually junk?
This first question is pretty much solved – of course it's *not* actually junk. A lot of it does important things, even if they don't fall into a simple protein-coding cassette framework. The articles you posted are all about this first question.
The second question is:
2) Now, we know that *more than 2%* of the human genome has an effect on phenotype. But we still don't know what the real percentage is. Is it 10%? 90%? 99%?
The article I linked in the main post (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004525), from 2014, attempts to answer the second question. They do that by looking at signatures of natural selection: basically, if a position appears to be mutating freely without any selection constraining the sequence, then that position must not have any evolutionarily-meaningful effect on phenotype. They find that this is the case for >90% of the genome. That's what I mean by "randomly drifting without consequences", and, as far as I know, this remains true. Note that this approach doesn't require any understanding of how these sequences actually work, so the analogy with reading a text in Chinese isn't relevant. A better analogy would be: "we find that if we replace 90% of this Chinese book with random gibberish, nobody notices, therefore these 90% must not contain any important information".
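For readers who want the flavor of that approach, here is a heavily simplified toy sketch (not the paper's actual pipeline; the lineage count, substitution counts, rate, and threshold are all invented for illustration): flag a site as constrained if it shows implausibly few substitutions compared with the neutral expectation.

```python
# Toy version of a selection-signature test: a site evolving neutrally across
# n lineages should accumulate substitutions at the neutral rate; sites with
# improbably few changes are flagged as constrained (i.e. doing something).
from scipy.stats import binom

def constrained_fraction(subs_per_site, n_lineages, neutral_rate, alpha=0.01):
    flagged = 0
    for k in subs_per_site:
        # probability of seeing <= k substitutions if the site were neutral
        if binom.cdf(k, n_lineages, neutral_rate) < alpha:
            flagged += 1
    return flagged / len(subs_per_site)

# Invented numbers: 30 lineages, neutral substitution probability 0.5 per lineage,
# most sites near the neutral expectation (~15), a few showing almost none.
counts = [14, 16, 15, 13, 17, 1, 0, 2, 15, 14]
print(constrained_fraction(counts, n_lineages=30, neutral_rate=0.5))  # -> 0.3
```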
So, to my knowledge, the current consensus is that ~2% of the genome codes for proteins, ~8% does some other illegible thing (including lncRNA, protogenes, weird satellite stuff and whatnot), and ~90% has no impact on fitness.
Of course there are a lot of caveats and technicalities (for example, if you need a "spacer" sequence between two genes to prevent these genes from interfering with each other, the actual sequence of the spacer doesn't matter, but the *existence* of the spacer is important. So, does it count as "junk"?). There's been a ton of writing about this topic – see the "ENCODE controversy" – but that's mostly beyond the scope of this post.
Has anyone mathed the information density of 'junk' DNA? It is certainly above zero and certainly less than genes, even those under weak selective pressure.
Yes, but pinning this down is (to put it mildly) challenging-
https://www.mdpi.com/1099-4300/21/7/662
'Entropy and Information within Intrinsically Disordered Protein Regions'
'Despite our abstract understanding of the conformational entropy as a defining characteristic of IDRs [6,97,98,99,100], it has proven tremendously difficult to quantify the full range of this thermodynamic component at IDRs’ disposal. Some of the underlying reasons are the limited availability of experimental data to characterize the vast degrees of freedom that contribute to the conformational entropy of IDRs, the lack of understanding of the contributions of solvent, and a still evolving synergy between theory, experiment and simulations (Box 4). Therefore, our understanding of conformational entropy, and its change in a functional context (Figure 4), is limited to the degrees of freedom that can be measured experimentally...
The conformational plasticity of IDRs is also exploited for the regulation of their biological activities through post-translational modifications (PTMs) [8,102,134,168] (Figure 4). IDRs are the prevalent sites of PTMs perhaps because the lack of stable structure enables easier access to modifying enzymes, as previously proposed [134,169,170,171,172]. The PTM sites represent high-information density regions in IDR sequences that offer vastly diverse options for regulating biological functions, such as modulation of subcellular localization, protein-protein interactions, and rates of protein-synthesis...
When separately considering the constituents of the overall entropy, e.g., conformational entropy, it is often not straightforward to experimentally evaluate (or to simulate) its changes in a functional context. Information in the form of a loss of conformational entropy upon functional interactions is intuitively expected, and, indeed, numerous reports exist of IDRs that acquire a partial or complete fold in a complex or stabilize their pre-existing transient structural propensities.
However, there are also accumulating reports of biomolecular complexes in which IDRs remain highly disordered, or, where a loss of disorder in one part of an IDR is compensated by a gain of disorder in another (Figure 4A). Hence, functional context need not always result in a reduced conformational entropy of an IDR. Therefore, like their primary amino acid sequences, the conformational ensembles of IDRs demand additional metrics of functional information that are not strictly determined by structural propensities.'
That's proteins, not DNA. And the disorder they are talking about is conformational entropy, i.e. protein chains of defined sequence that are folding (ordered), or disordered and waving about in the breeze. Chemistry, not information theory.
On the contrary, the two sentences you quote at the top of your reply are quite true. ~90% of the human genome is not conserved and/or not under purifying selection. This means that the genetic sequence isn't important for this 90%; i.e. you can change it by mutation and it still "works". Any regulatory or other functions buried in this 90% cannot depend on the specific genetic sequence. This greatly limits the possible functions of the 90%.
No knowledgeable scientist ever denied the existence of the "vital instructions that switch genes on and off". Importantly, any regulatory stuff for which the gene sequence matters, must be located in the ~10%. This includes the "millions of DNA switches". Otherwise, mutations would break them.
Knowledgeable scientists understood all this going back to the 60's. Susumu Ohno's 1972 paper that popularised the term "junk DNA" lays it all out simply. (note the contemporary estimate of ~6% genes in place of the modern ~2% genes and ~10% under selective pressure):
"Aside from conventional structural and regulatory genes, this 6% should include the promotor and operator region which are situated adjacent to each structural gene, for these regions can certainly sustain deleterious mutations. More than 90% degeneracy contained within our genome should be kept in mind when we consider evolutional changes in genome sizes. What is the reason behind this degeneracy?
Certain untranscribable and/or untranslatable DNA base sequences appear to be useful in a negative way (the importance of doing nothing). If functional genes customarily occupied the region around the centromere, evolutional changes of chromosome complements would not have occurred as often as has been observed.
[...]
The same can be said of those DNA base sequences which are used as partitions between the genes. It may be of selective advantage to space adjacent genes far enough apart [...]"
It's all there. They estimated 30,000 genes in 1972; the human genome project showed there were only 20,000, or 2% of the genome. Ohno only knew a few examples of "vital instructions that switch genes on and off" such as promotor and operator, but he did know that if their function depended on their genetic sequence, then they would be found in the ~10%. But importantly, he gives examples of possible functions for the 90% that don't depend on a specific genetic sequence.
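For anyone who wants the arithmetic behind the "20,000 genes ≈ 2% of the genome" figure, here's a rough back-of-envelope. The average coding length per gene below is an assumed round number, and whether you count UTRs and other exonic sequence shifts the answer:

```python
# Back-of-envelope: what fraction of the genome is protein-coding?
genome_bp = 3.1e9        # haploid human genome
n_genes = 20_000         # protein-coding genes
avg_cds_bp = 1_300       # assumed average coding sequence per gene

coding_fraction = n_genes * avg_cds_bp / genome_bp
print(f"coding sequence: ~{coding_fraction:.1%} of the genome")
# ~0.8% for coding sequence proper; counting UTRs and other exonic bases
# gets you to the commonly cited ~2%, and adding conserved regulatory
# elements takes the fraction under selection to roughly ~10%.
```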
Of your provided examples, any whose function depends on a specific gene sequence will be found in the ~10%. The rest, with "seemingly nonsensical" sequences, are in Ohno's "junk" category, because you can break them with mutations and they still work. Knowledgeable scientists know that this has remained true since Ohno's time.
Ohno (1972) is hard to find online, but Wayback Machine has a copy:
https://web.archive.org/web/20120218215829/http://junkdna.com/ohno.html
On this note, would anyone (knowledgeable on this subject) like to update the wikipedia article?
https://en.wikipedia.org/wiki/Junk_DNA?wprov=sfti1
Or this could be a good starting point for a non-biologist. I don't think you would be so harsh on someone 200 years ago (caveat: not sure _exactly_ how long ago) saying the earth is the center of the solar system. You cite an article from two years ago, saying it was decades ago. Sounds like a relatively new concept... but tell me about telomeres.
Here you go: https://x-chromosome.xyz
Wow, that's really cool. Thanks for making this work. Hopefully it will produce an interesting new tweet, or if it never does I'll be all the more impressed that life can exist at all.
For me, it’s word choice. Humans are more elegant than Microsoft Word, not simpler, in the same sense in which Pascal (and others since) said “I have made this [letter] longer than usual because I have not had time to make it shorter.”
You hit exactly the right balance of big thoughts and amusing for a Sunday morning read.
Looking forward to the next time you boil down a vast amount of Greek characters and log scale plots into a tweet analogy.
And I’m correctly offended you’re under some false assumption I can’t produce a proper insect. How dare you.
This might be of interest: a piece by Anastassia Makarieva, a physicist and biotic pump researcher, about natural systems' capacity to compute:
https://bioticregulation.substack.com/p/information-processing-by-natural
I am considerably more complex than just my genome. I learned to understand, speak, and read English, to live in my culture, and to smile or scowl. I then did a degree and worked for 50 years. Had children, earned and spent money and all that. That stuff is not coded in my genome. The potential is, but all that nurture is part of me too. And it ain't coded in DNA. And there is an awful lot of it in any functioning adult.
While this article presents an intriguing analytical framework, it exemplifies a fundamental limitation in modern scientific discourse that reduces the profound complexity of life to purely mechanical and computational metaphors. This perspective is problematically reductionist.
First, the article's characterization of bacteria as strategic actors that 'decide' to optimize for simplicity reflects a fundamental misunderstanding. Bacteria aren't sentient strategic agents - they don't make decisions in the way the article implies, which stems from modern science's reductionist habit of imposing human-like strategic behavior on natural processes it doesn't fully understand.
They are part of an interconnected living system that operates according to quantum principles rather than the linear cause-and-effect relationships our minds tend to impose on them. The distinction between 'good' and 'bad' bacteria is similarly an artificial construct that reveals more about our linear thinking than the actual nature of these organisms.
Here's an inconvenient truth: the body doesn't know what's a toxin, a protein, enzyme and whatnot. The body operates on a fundamentally different paradigm than the mind. The body's innate intelligence doesn't recognize the dualistic concepts of good versus bad that dominate our mental framework. This mirrors Hippocrates' insight about medicine and poison being distinguished only by dosage. Ironically, it's often the mind's well-intentioned interventions - through its insistence on linear, reductionist approaches - that create harm by imposing its rigid interpretations onto the body's more nuanced systems.
Second, the article's treatment of DNA storage capacity is particularly revealing of this limited perspective. Recent research has already confirmed that DNA functions as a fractal antenna - a sophisticated multidimensional structure capable of receiving, transmitting, and processing information in ways that transcend simple linear sequences (https://pubmed.ncbi.nlm.nih.gov/21457072/). This property is completely overlooked when we reduce DNA to a one-dimensional string of nucleotides measured in megabytes.
This relates to a deeper issue: the fundamental mismatch between how our analytical mind processes information and how biological systems actually operate. The human mind works like a serial processor - it understands ONLY through juxtaposition, creating meaning by placing one thing against another (like plotting points on an X and Y axis). This is why our scientific models tend to be two-dimensional representations that break systems down into comparable components.
But biological systems operate more like quantum parallel processors. They can maintain complete understanding of phenomena without needing to break them down into constituent parts. This is why attempting to measure genetic complexity in terms of conventional data storage units is deeply misleading. What appears as 156 KB in our linear, particle physics model could actually represent thousands of terabytes of information when understood in terms of quantum fields - yet paradoxically show as zero size in quantum measurements because the information exists in a zero-point scalar field state.
The article's comparison of genome sizes to software programs reveals this blind spot. It's not that Microsoft Word is more complex than a living organism - it's that we're measuring complexity using tools designed for linear, sequential information processing while completely missing the quantum, multidimensional nature of biological information storage and processing.
What makes this current paradigm particularly dangerous is its complete disconnection from the true nature of being. Modern science, in its clinical rationality, has elevated this mechanical worldview to an absolute model, while simultaneously severing our connection to the deeper principles that govern life itself. The supreme irony is that this approach, in its very sophistication, has fostered a profound delusion: that we can outsmart nature through pure rational analysis.
This hubris is perfectly reflected in the inventions and aspirations of modern humans, who, in their vanity, style themselves as direct descendants of the gods - capable of improving upon nature's designs through sheer force of analytical intelligence. It's a mindset that mistakes technological sophistication for true wisdom, and computational complexity for genuine understanding.
Until we develop a more humble and holistic understanding that reconnects scientific insight with spiritual wisdom, our models will continue to miss not just the true complexity and elegance of living systems, but also the profound responsibility that comes with attempting to manipulate them.
A beautifully written and well-researched thought experiment. Although I disagree with some of your statements, the topics you raise and the questions you inspire are of utmost importance and should inspire further research by your readers.
Personally, I’m fascinated by the inner workings of a single cell, particularly its protein synthesis, energy production, and communication abilities. Studying the different organelles with their specialized functions, including self-replication, I find that a single living cell is far more complex than any human-made system.
Just search for “flagellar motor” and be in awe of the complexity of this single organelle.
And human beings that can create a Large Hadron Collider still cannot synthesize a single self-replicating cell.
I think that reducing human beings to a number of genes in the genome is hugely off -- it doesn't even take account of epigenetics: how genes can be switched on or off by a myriad of factors relating to environment and experience.
It's like saying all there is to be known about a house is the number of bricks, floorboards, rafters and roof tiles.
And what about the brain? Each one with more neural connections than there are stars in the known universe -- constantly forming new connections and having old ones pruned, also directed by environment and experience.
Its complexity is probably beyond measurement in millions of petabytes.
Correction: the brain has more neurons than there are stars in the Milky Way, with 100 trillion synaptic connections. I have no idea how many petabytes of storage that would require.
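For what it's worth, here's a deliberately crude back-of-envelope under the (big) assumption that a synapse stores only a few bits. It says nothing about what the brain actually does (synapses are analog, dynamic, and not addressable like RAM), but it shows how strongly the answer depends on that assumption:

```python
# Deliberately crude back-of-envelope; every number below is an assumption.
neurons = 8.6e10          # ~86 billion neurons, for context
synapses = 1e14           # ~100 trillion synaptic connections
bits_per_synapse = 4      # assumed; real synapses are analog and dynamic

total_bytes = synapses * bits_per_synapse / 8
print(f"~{total_bytes / 1e12:.0f} TB")   # ~50 TB under these assumptions
```

Under a few bits per synapse you get tens of terabytes, not "millions of petabytes"; assume richer per-synapse state and the number climbs, but it stays assumption-driven either way.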
The scale of software analogized to complexity is waaaaaay off. Microsoft Word could easily be a fraction of its size. In the 90s the software development community totally lost its ability to worry about the size of its product. Speed of development and the notion of reuse became ALL. Apps like Word depended, generation to generation, on faster machines and more memory. Depended wholly, because the entire software industry lost the knowledge of how to build compact, concisely performing code. There is no logical justification for the present size of the Microsoft Word installation. None. It is NOT that much more complex than the version I still have on a single-sided floppy disk.
Great article. This points at one of the central difficulties for (human) engineers hoping to influence biology: everything has multiple functions. There is no single gene for a single trait; most traits are influenced by hundreds or thousands of genes, when the entire genome is only tens of thousands of genes! Every possible drug target also has hundreds of other functions!
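To make the "many genes, tiny effects" picture concrete, here's a toy additive model with entirely made-up numbers (real genetics adds interactions, pleiotropy, and environment-dependence on top of this):

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 1_000
n_variants = 5_000                   # many loci, each with a tiny effect (assumed)

genotypes = rng.binomial(2, 0.3, size=(n_people, n_variants))  # 0/1/2 allele counts
effects = rng.normal(0, 0.01, size=n_variants)                 # small per-variant effects
environment = rng.normal(0, 1.0, size=n_people)                # everything non-genetic

trait = genotypes @ effects + environment

# No single variant "explains" the trait; only the aggregate does.
r = np.corrcoef(genotypes[:, 0], trait)[0, 1]
print(f"correlation of the trait with any one variant: ~{r:.2f}")
```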
I am hopeful that the nonlinear gestalt thinking of modern AIs will be able to intuit the emergent function of these n-dimensional biological networks.
You assert that non-coding DNA is (to use the old term) ‘junk’. This may be what was assumed to be the case about 25 years ago, but significant evidence now supports the hypothesis that it plays a vital part in regulating transcription of coding DNA and how it is put to use. It clearly plays a major part in the development process. Perhaps you need to gain a more up-to-date understanding of biology / genomics.
Similarly, in your discussion of complexity, you fail to consider issues such as bloat and compression. Modern editions of Microsoft Word are a classic example of bloat, due to dubious coding choices. Observe that Windows, by your estimate, is of huge complexity, and yet Linux distributions of similar or greater capability can fit on a CD.
Issues around compression also plague your statements about genomes. The human genome is ‘small’ because it is very efficiently coded, with individual genes coding for many proteins (regulated, possibly, by non-coding DNA), so we are apparently less complex than the less efficiently coded amoeba. The ultimate example of this has to be viruses, which fit the code for complex physical structures into a few thousand bases, thanks to incredibly sophisticated coding, e.g. RNA that can be read in either direction, to give different proteins.
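A toy illustration of that overlapping-readings trick (the sequence below is made up, not a real viral gene): one and the same stretch of bases yields different peptides depending on the reading frame and strand, which is part of how tiny genomes pack in so much.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: codons in the order TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def revcomp(seq):
    """Reverse complement (the opposite strand, read 5'->3')."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq, frame=0):
    """Translate one reading frame into a peptide string."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

dna = "ATGGCTAGCATCGATCGTAGCTAGGATCCA"   # made-up sequence, not a real gene
for frame in range(3):
    print(f"+{frame}: {translate(dna, frame)}")
    print(f"-{frame}: {translate(revcomp(dna), frame)}")
```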
So, your basic contention that the length of the description equates to the complexity of the described is simplistic and misleading. As a final example: try to describe ‘gemütlich’ in English; it will take many words, whereas German can do it in one.
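One crude way to see why raw length is a poor proxy for complexity is compressed size. It's only an upper-bound-ish stand-in for description length (true Kolmogorov complexity isn't computable), but it already shows that a long repetitive sequence "says" far less than its base count suggests, while a random sequence resists compression:

```python
import random
import zlib

def compressed_size(s: str) -> int:
    """Crude stand-in for 'description length': bytes after zlib compression."""
    return len(zlib.compress(s.encode(), 9))

random.seed(0)
repetitive = "ATGC" * 10_000                            # 40,000 bases of pure repeat
random_seq = "".join(random.choices("ATGC", k=40_000))  # 40,000 bases of "noise"

print(compressed_size(repetitive))   # tiny: the repeat compresses almost to nothing
print(compressed_size(random_seq))   # roughly 2 bits per base, on the order of 10 kB
```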
This is a very simplistic view of the information content in the DNA, and the comparison with Microsoft Word is nonsensical.