Introduction:
The accumulation of genomic and brain data opens new opportunities for resource friendly, data driven brain exploration. A key challenge is to develop versatile and accessible strategies that integrate and mine multimodal datasets for novel neuroscientific insights. Here, we optimized an integrated workflow for mapping multigenic evolutionary traits in the human brain across cognitive, cellular, and molecular levels.
Methods:
At the input stage, the workflow fuses an evolutionary genetic dataset with searchable synthetic functional magnetic resonance imaging (fMRI) databases that are pre clustered into concise psychological domains for improved interpretability. At its core, a Genetic Algorithm for Generalized Biclustering (GABi) mines gene sets under evolutionary selection that also show high expression correlation with fMRI networks.
Results:
Applying this workflow, we identified evolutionary patterns spanning cognitive traits, brain cell types, and molecular mechanisms. Focusing on socio affective traits, the algorithm highlighted peaks in adaptive selection in networks for social interaction (language) and social concepts (theory of mind) across hominid, early hominin, and anatomically modern human (AMH) ancestry. These traits emerge from a broad spectrum of excitatory (glutamatergic) and inhibitory (GABAergic) neuronal, as well as non neuronal, cell types. The associated Gene Ontology (GO) terms were enriched for cell signaling, synaptic organization, and neuronal morphology.
Discussion:
Together, these findings demonstrate an integrated workflow for molecular to systems level exploration of the brain and provide new perspectives on the evolutionary history of human socio affective functions. This approach can be adapted to screen for functional traits in the context of mental disorders or applied to the brains of other phylogenies in a similar manner.
1 Introduction1.1 Multimodal brain data: a new frontier for neuroscientific discoveryAn increasing amount of brain data, along with neurogenetic information on brain development, cell types, and diseases, offers unprecedented opportunities for neuroscientific exploration and hypothesis generation. Moreover, these datasets provide footholds with problems that are costly or difficult to address through conventional experimental approaches.
At the neurogenetic level, multiple large-scale neurogenetic databases contain gene expression and genomic information across brain regions, developmental stages, and species, which are essential resources for understanding brain function. The Allen Brain Atlas offers genome-wide 3D maps of gene expression across mouse, human, and non-human primate brains, whereas the BrainSpan Atlas provides multidimensional genomic data across developing and adult human brain regions. Projects such as the Genotype-Tissue Expression (GTEx) project catalog genetic effects on gene expression across 49 human tissues, including multiple brain regions, providing cis- and trans-expression quantitative trait loci (eQTLs) and splicing QTLs that link genetic variation to functional outcomes. Initiatives such as BrainBase and databases such as OpenTargets and DisGeNE curate disease-gene associations and disorder-specific profiles.
At the system level, neuroimaging repositories span from primary data sources, such as the Human Connectome Project (HCP), which provides task-based and resting-state functional magnetic resonance imaging (fMRI) data from large cohorts, to synthetic databases, such as Neurosynth and NeuroQuery. These databases aggregate published neuroimaging findings to create probabilistic brain maps linked to cognitive functions.
With the growing neurogenetic and imaging datasets, the scientific community and funding systems have invested significantly in providing access to such multimodal resources. Both privately (e.g., Allen Brain Institute) and publicly (e.g., BRAIN Initiative, Human Brain Project) funded multimodal brain mapping initiatives provide regional and local (cellular) gene expression data, connectomics, and neuronal activity data, integrated by emerging spatial omics approaches (Bressan et al., 2023). The next frontier is the establishment of low-threshold, resource-friendly workflows with variable statistical search thresholds and approaches that mine these resources for insights into brain function (i.e., encompassing cognition and behavior in the context of this study) (Pfaff et al., 2019).
1.2 Integrating genetics with brain functionAmong such approaches, pipelines linking genetics to brain function are gaining popularity, as they can be used to annotate and deconstruct the functional organization of the brain (Ganglberger et al., 2018; Richiardi et al., 2015; Dear et al., 2024; Fornito et al., 2019). One such prime example is a study linking genome-wide association studies (GWAS) to Diffusion Weighted Imaging (DWI) data, allowing for filtering significant single-nucleotide polymorphism (SNP)-connectome pairs to systematically identify associations between genetic variants and structural connectivity phenotypes. Finally, such associations can be then linked to cognitive traits (Chen et al., 2022). Moreover, advances in single-cell sequencing allow for the mapping of brain region-specific features down to individual cell types and the identification of gene markers that define circuit architecture, supporting trait mapping at the cellular resolution (Yao et al., 2025; Lee et al., 2025). These strategies collectively help pinpoint the brain regions and genetic mechanisms that underlie the trait-relevant network architecture (Arkhipov et al., 2025). Likewise, a cross-species resting-state fMRI imaging study using several mouse models of autism has successfully identified four etiologically relevant functional connectivity subtypes for this disorder (Zerbi et al., 2021). This demonstrates the power of computational strategies to integrate resting-state fMRI connectivity data across various mouse genetic models to identify specific brain networks underlying functional traits and psychiatric disorders.
Importantly, an increasing number of tools are available, allowing the mining of such complex data. These tools can be grouped into three approaches. Neuromaps (Markello et al., 2022) and abagene (Markello et al., 2021) integrate, transform and compare molecular, structural and functional maps of the brain by combining classical statistics (t-test, correlation, permutation test) and machine learning methods (regression, clustering). Neurosynth, Neuroquery and NiMARE (Salo et al., 2023) apply literature and data meta-analysis approaches to reconstruct brain-wide networks underlying specific functions. Finally, tools such as CellWhisperer (Schaefer et al., 2025) or CellTransformer (Lee et al., 2025), which are artificial intelligence (AI)-based workflows, currently allow the integration of single-cell expression brain data and deep data mining.
1.3 Evolutionary approaches: genetics, archeology, and comparative dataThe longstanding quest to reconstruct the cognitive history of the human self is driven by our intrinsic curiosity about the origin of what makes us human and what sets us apart from our closest relatives, whether alive or extinct. Ultimately, our neurocognitive capacities have driven tool use and technologies, the emergence of complex societies, creative expression, human culture, and even the evolution and prevalence of psychiatric disorders. However, deeper insights into the evolution of specific cognitive traits are limited by the fact that ancestral brains are inaccessible to experimental or even archeological exploration.
Over the last decade, evolutionary anthropology has adopted a bottom-up approach, typically based on studies of archeological records, animal and organoid models, and genetic variants of extant and ancient organisms. Integrating evolutionary genetics with archeological evidence, particularly through the analysis of events spanning over 60 million years, has provided molecular timelines illuminating the origins of humanity as well as nuanced understanding of human evolution (Piszczek et al., 2023). This approach is especially powerful when combined with paleogenetic data from extinct human relatives, such as Neanderthals (Zeberg et al., 2024) and Denisovans (Green et al., 2010; Peyrégne et al., 2024). By comparing genome sequences across different organisms and hominin lineages, we can understand their molecular relationships (The Chimpanzee Sequencing and Analysis Consortium, 2005) and identify genetic variants and mutations that influence gene expression through copy number variations or cis-regulatory elements (Pollen et al., 2023). These analyses have revealed complex patterns of separation and interbreeding among hominin lineages, which are associated with their migration patterns, habitats, and archeological footprints (Skoglund and Mathieson, 2018).
Comparative genomic approaches have pinpointed specific genetic modifications associated with significant developments in brain structure and function, revealing the evolutionary pressures that have shaped the human brain (Dorus et al., 2004). Computational methods, such as PrediXcan, enable the identification of regulatory differences in primate brains using regression models applied to genotype-tissue expression (GTEx) project data (Colbran et al., 2019). Recent advanced computational approaches offer increasingly integrated views: comparative neuroscience datasets allow for the imputation of ancestral cortical structures, modeling of cortical evolution, and computational retracing of evolutionary adaptation to different environmental demands (Schwartz et al., 2023).
Together, archeology, genetics, organoid and animal models offer strong mechanistic insights and specific snapshots of human brain evolution, such as the mechanisms underlying brain size and cellular composition. However, a comprehensive reconstruction of the human mental past has proven to be a challenging task using either comparative neuroscience approaches alone (through analysis of endocasts or functional neuroanatomical extrapolations) or neurogenetics alone (by examining single-gene mutations influencing human traits, such as brain expansion and language capacity) (Piszczek et al., 2023). Thus, the field faces a critical gap: current methods excel at revealing isolated evolutionary changes but struggle to provide an integrated, systems-level understanding of how multiple genetic and brain network changes collectively shape human cognitive evolution.
1.4 From brain networks to ancestral selectionWe explored a complementary strategy enabling the holistic reconstruction of ancestral functional brain states in a top-down manner, drawing from multigenic effects across cognitive traits. A recent pilot study (Kaczanowska et al., 2022) laid the groundwork by fusing publicly available multimodal brain data. This approach operates on two levels: first, a data-driven computational level operating on present-day human data, integrating genetic and functional information with evolutionary features; second, an evolutionary interpretation level relating the identified patterns to potential ancestral selection processes. This study used substitution ratio (ω) values and likelihood estimates of non-synonymous versus synonymous coding sequence changes as gene-based proxies of ancestral selection within a phylogenetic framework (Yang, 1998). These ω-values are computed from protein-coding DNA sequence (CDS) information from databases such as Ensembl, OMA, and ancient genome repositories, using computational tools such as codeml packages (Yang, 2007). The selection pressure values were projected gene-wise onto brain space according to expression sites from the Allen Human Brain Atlas (AHBA) (Hawrylycz et al., 2012), yielding brain-wide maps of ancestral selection pressure. This approach, used in imaging transcriptomics (Mandal et al., 2022), enables spatial mapping of genetic attributes onto the brain architecture. This spatially explicit selection dataset was subsequently fused with gene-wise brain-wide correlations to task-related functional brain networks (FNs) derived from the HCP data, with registration performed in the AHBA template space. This essentially bimodal data matrix, consisting of temporal genetic (evolution) and the correlation of the spatio-functional (FNs) domains to the gene expression, is mined in a straightforward manner using biclustering algorithms. Therefore, at a more abstract level, we correlated the genetic information G(g,s) for every gene g and biopsy site s to the functional information F(n,s) for every network n and biopsy site s, and concatenated it with information on genetic pressure ω(g,b) for gene g and key mammalian species b. On this matrix, we find biclusters as subsets of genes, branches, and networks. Although not used in brain research, this class of unsupervised learning methods is advantageous for uncovering patterns within complex, two-dimensional datasets (Cheng and Church, 2000). We specifically chose a customizable Genetic Algorithm for Biclustering (GABi) for our approach (Curry, 2014). In our context, the resulting biclusters represent gene sets that co-evolve, share similar ω selection profiles across evolutionary branches, and are co-expressed in spatial patterns that align with a specific functional brain network. By applying this algorithm, traces of ancestral selection pressures acting on cognitive functions can be systematically uncovered. Each identified bicluster links a cognitive function (via its associated FN) to a particular evolutionary time point and gene set, enabling the computational reconstruction of when and how specific traits, such as language capacity or strategic cognition, emerged within human ancestry.
1.4.1 Proposed workflow and enhancementIn this context, we refined the aforementioned method for data mining and explored both the evolutionary temporal aspect and the functional spatial network dimension. Here, we present several key aspects to increase the applicability of this workflow across a broader spectrum of research fields and enhance its usability for the neuroscience community:
i) We adapted multiple aspects of the pipeline (Figure 1) to improve the overall accessibility and usability. To this end, we developed an integrated computational workflow with clearly defined instructions and a comprehensible code to jointly handle genetic data, brain gene expression maps, and fMRI connectivity data as inputs.
ii) We modified the handling of genetic data to enable the integration of a larger, more comprehensive gene set into the analysis pipeline, particularly by employing less conservative assumptions regarding incomplete ω data.
iii) We adapted FN handling to accommodate larger, user-defined task sets derived from synthetic imaging databases (such as neurosynth.org, Yarkoni et al., 2011), enabling greater flexibility in selecting cognitive domains for analysis.
iv) We enhanced the mining of the multimodal data space via the GABi algorithm through parameter optimization to ensure the robust identification of biologically and evolutionarily meaningful biclusters.
v) Finally, we added a specific structured output for adaptive evolution and purifying selection at the FN, cellular, and molecular levels, facilitating interpretation across biological scales.

Workflow for mapping multigenic evolution across the cognitive, cellular, and molecular levels. The integrated pipeline processes three types of input data: (A) functional networks (FNs) derived from selected neuroimaging data sources (Neurosynth, HCP, Neurovault), (B) spatially resolved brain gene expression maps (utilizing ABA), and (C) branch-specific ω-values for selected gene sets. The workflow comprises six sequential steps that can be customized: (1) correlating FNs with regional gene expression; (2) employing GABi biclustering to identify coordinated gene-FN-evolution associations; (3) clustering FNs into groups (domains) concurrently; and (4) selecting significant biclusters through cluster-thresholding across phylogenetic branches. The resulting biclusters can then be analyzed in terms of: (5) cell-type enrichment, and (6) gene ontology term enrichment. Key workflow steps (1–6) and data sets (A–D) can be adapted to address specific scientific questions, for example, (A) integrating various fMRI databases (Neurosynth, HCP, Neurovault) for broader FNs coverage, (1) employing different correlation methods between FN profiles and gene expression patterns for specific network focus; (C) applying various methods and constraints for estimating gene evolutionary pressure for diverse evolutionary coverage; and (D) thresholding such estimates for tailored gene set selection, (2) exploring various GABi parameters for enhanced bicluster discovery, and (4) determining the bicluster selection criteria. Steps 5. and 6. can be adapted by using different sources for cell type and GO term enrichment analyses. Importantly, step 3 could be omitted, and the biclusters could be analyzed at a single FN level in the downstream analysis (steps 5–6).
Applying this optimized workflow to the evolution of socio-affective traits, we identified specific peaks in adaptive selection affecting a subset of higher-order socio-affective trait networks in anatomically modern humans (AMH). In summary, we demonstrate the versatility of our approach by targeting socio-affective brain networks and offering novel computational means for jointly assigning genetic and brain imaging data within the context of psychiatric and population genetics research. This framework provides a generalizable strategy for not only reconstructing the evolutionary forces that have shaped human brain organization and cognition, but could be expanded to other neuroscientific fields.
2 Materials and methods2.1 Task-evoked functional brain activity data F(n,s)In our pilot study (Kaczanowska et al., 2022), functional magnetic resonance imaging (fMRI) data were obtained from the Human Connectome Project (HCP), including gambling, emotional processing, and working memory. Here, we took advantage of the synthetic database, NeuroSynth (neurosynth.org, (Yarkoni et al., 2011)) as a source for task-related functional network representation (Figure 1A). The Neurosynth database allows for the straightforward assembly and retrieval of user keyword-defined task-related functional network information. Note that any other reference database (e.g., HCP, OpenfMRI) would work similarly. To focus on socio-affective functions, e obtained “uniformity test maps” for 45 a priori selected terms (see Supplementary Table 1) being the set of functional networks serving as input for the following steps. Uniformity test maps were provided as z-scores sampled in a discrete Montreal Neurological Institute (MNI) 152 2 mm space. These statistical inference maps scores indicate whether activation patterns in a given voxel were more or less frequent than expected by chance, based on the overall distribution of activations in the brain (see (Yarkoni et al., 2011) for details). Activation pattern visualization was performed using the plot_glass_brain function from the Nilearn package (Nilearn Contributors et al., 2025) using Python version 3.9.12.
The original method from (Kaczanowska et al., 2022) was upgraded to improve the way the network values F(n,s) were derived at the 3702 biopsy sites s of the gene expression dataset from the Allen Human Brain Atlas (see below)for each of the 45 functional networks n. Specifically, the radius for mapping the FN signal onto each biopsy site was contingent on the density of the biopsy sites in its region: if the distance for a given biopsy site to its nearest neighbors was equal to or below 5 of squared Euclidean distance, the direct z-value from the closest MNI point was taken; otherwise, a Gaussian-weighted normalization was applied (with μ and σ set to 0 and 20, respectively) to the z-value. This ensured that in areas with closely spaced biopsy sites, a smaller radius was used to prevent overlapping effects, whereas in regions with fewer biopsy sites, a larger radius was employed. This improvement over the previous approach allowed the biopsy site space to be filled with representative FN signals, particularly in areas where the biopsy sites were widely spaced.
2.2 Spatial gene expression data G(g,s)Spatial gene expression data (oligo microarrays) for human protein-coding brain-expressed genes were reused from our pilot study (Kaczanowska et al., 2022), specifically Supplementary Data S6 (all_genes_expression_Human.mat). These data (Figure 1B) include expression information G(g,s) for the 8977 genes g at the 3,702 biopsy sites in the brain from the Allen Human Brain Atlas (AHBA) (Hawrylycz et al., 2012). These values were used in Section 2.4.1.
2.3 Substitution rate ratio data ω(g,b)To explore the evolutionary path of brain-expressed protein-coding genes, we employed a dataset previously curated and calculated by Kaczanowska et al. (2022), specifically in Supplemental Table S2 and Supplemental Data S6. This dataset includes one-to-one orthologs for 8,977 genes g across key mammalian species b within the AHM phylogeny, as well as raw ω-values (Figure 1C). These ω-values ω(g,b) represent the maximum likelihood estimates for the ratios of nonsynonymous (dN) to synonymous substitutions (dS) in a gene’s coding sequence along a specific phylogenetic branch. In brief, Kaczanowska et al. obtained homologous coding sequences from the OMA orthology database or ancient genome repositories, filtering for the presence of gene expression in AHBA. They reconstructed phylogenetic trees and timelines from mitochondrial DNA (mt-DNA), and calculated raw ω-values using the codeml routine implemented in PAML v4.9i (Yang, 2007), see, e.g., https://github.com/abacus-gene/paml/wiki/Substitution-models#nucleotide-substitution-models for a detailed description. In the previous study by (Kaczanowska et al., 2022), these raw ω-values were conservatively filtered (Table 1, 2nd Column). To enhance the discovery potential of our approach, we relaxed three of the constraints to include a broader range of genes, particularly in recent branches (Table 1, 3rd Column; Supplementary Figure 2B).
PAML outcomeConservative constraints (Kaczanowska et al., 2022)Relaxed constraints (this study)Evolutionary interpretationStable ω > 1Value keptValue keptAdaptive selectionStable ω 0–1Value keptValue keptNeutral evolutionStable ω ~ 0Value keptValue keptPurifying selectionUnstable ω in 5 runsFiltered outFiltered outFiltered outShort edge, unstableFiltered outFiltered outFiltered outdN/dS = 0, (dN = 0)NASet to 0Purifying selectiondN/dS = ∞, (dS = 0)NASet to maximum value of the species/branchAdaptive selectionIdentical sequencesNASet to 0Neutral evolutionHandling constraints for ω-values.
PAML predictions yield different outcomes (first column), which can be filtered and clipped pending conservative (second column) or more relaxed (third column) constraints. High, intermediate, or low ω-values can then be interpreted as signatures for adaptive selection, neutral evolution, or purifying selection, respectively (Jeffares et al., 2015). Importantly, these constraints operate on the original PAML output (here taken from Kaczanowska et al., 2022), and can be set by the researcher pending their specific use cases (e.g., when operating with different phylogenies, gene (sub)sets, etc.…).
The underlying phylogenetic tree and timeline were obtained from (Kaczanowska et al., 2022). In brief, we previously used an AMH-chimpanzee mean divergence of 7.8 Mya (SD 1.2 Mya), an Old World monkey-ape divergence of 28 Mya (SD 3 Mya), and a 60 Mya mean (SD 2.8 Mya) for the coalescence of the primate lineages. A maximum clade credibility tree was generated using the software Figtree 1.4 (Rambaut, 2018). The mt-derived phylogeny outcomes aligned with the primary gene tree topology and were used to annotate the timelines. The mt-DNA-derived phylogeny and main gene tree topology position Neanderthals closer to AMH (Meyer et al., 2014) and support the Chimp-to-Denisovan-Neanderthal-AMH species sequence used here. Note that the workflow (Figure 1) can be used with any other phylogeny and genetic feature other than the ω-value and the specific ω-table presented here.
2.4 Pattern mining through biclustering2.4.1 Correlating gene expression and functional networksTo identify genes linked to specific tasks or FNs, we mined co-evolving genes with high spatial correlations with these networks. As described above, we retrieved anatomically modern human (AMH) spatial gene expression data G(g,s) from 3,702 biopsy sites in the Allen Human Brain Atlas (ABA) for each of the 8,977 genes (see Section 2.2. Spatial gene expression data). Next, we set the network data F(n,s) of the 45 socio-affective networks (see 2.1. Task-evoked functional brain activity data sets) in context to the gene expression data G(g,s): we did this by computing the Spearman’s rank correlation coefficients rij: = r(gi,nj) = ρ (rank(G(gi,s)), rank(F(nj,s)) over the 3,702 biopsy sites s between every gene gi and network nj (Figure 1, Step 1) using the function “cor” from the “stats” packages in R with option “pairwise.complete.obs” (computation for all complete pairs) and the method “spearman.”
Next, to organize the FNs into meaningful socio-affective domains, a hierarchical clustering analysis was conducted on rows of the r(g,n) = ((rij))i=1,...,45,j=1,...,8977) matrix, using the R implementation of Ward’s minimum variance method using squared Euclidean distance as the distance measure algorithm (Figures 1,2, Step 3; Supplementary Figure 1). This method was selected because of its effectiveness in minimizing within-cluster variance, which ensures that the resulting clusters are as cohesive and distinct as possible. This allowed grouping different individual FNs into overarching social and emotional domains for initial gross interpretation. The optimal number of clusters was selected using the silhouette average width method using silhouette function from cluster package in R. The overlap between the individual FNs and socio-affective domains was determined using both the Pearson correlation coefficient (using the pearsonr function from the scipy package (Virtanen et al., 2020) in Python version 3.9.12) and the Dice coefficient (using a self-written function using the numpy package (Harris et al., 2020) in Python 3.9.12, where the Dice coefficient 2*|A∩B|/(|A| + |B|), for A and B FNs signal patterns). The latter is a measure of the extent of overlap between the thresholded activation maps (Wilson et al., 2017). A Dice coefficient value of 0 implies no overlap at all between the given FNs activations, whereas a value of 1 implies perfect overlap. This gross interpretation can be extended to individual FNs within these clusters for a more detailed dissection. The activation patterns for the overlaps, differences, and mean signal in the socio-affective domains were visualized using the plot_glass_brain function from the Nilearn package (Nilearn Contributors et al., 2025) using Python version 3.9.12.

Data-driven classification of Neurosynth terms into socio-affective domains. (A) Hierarchical clustering of Neurosynth socio-affective FNs and ABA gene expression correlations across biopsy sites revealed six distinct socio-affective domains. (B) Brain-wide representation of the six socio-affective domains showing their spatial distinction. For each domain, a mean uniformity test z-score task-fMRI signal across the domain-specific Neurosynth terms was calculated and plotted using nilearn. Note that such data-driven clustering can be adapted to various combinations of Neurosynth terms or other task-fMRI datasets.
2.4.2 Data preparation for biclusteringTo calculate the evolutionary signatures within the AMH lineage, we used the aforementioned table ω(g,b) comprising 8,977 genes g (rows) and 11 branches b (columns); see (Kaczanowska et al., 2022), with a similar data processing workflow. In the first step (Figure 1C), the ω-values were rank-normalized for each branch (rank/number of genes), excluding undefined ω (dS = 0) values (Villanueva-Cañas et al., 2013) using the function “rank” in R with ties method “min”: let its rank normalization Specifically, the -values approaching 1 correspond to the largest ω for a given branch, whereas rank-normalized -values near zero indicate the smallest values. Next, the full correlation matrix (g,n) was filtered for genes whose correlation with at least one network exceeded the selected threshold > 0.75. Similarly to the previous workflow, we pre-selected genes for each branch (Branch 1 to 7, Chimp, Denisovan, Neanderthal, and AMH) in the evolutionary tree from either high (>1) or low ω (~0) categories and harmonized the pool sizes of these high-ω and low-ω gene sets across branches.
Concurrently, and in accord with the previous approach, the fMRI-to-gene expression correlation matrix r(g,n) (here 8,978 genes as rows and 45 FNs as columns, Figure 1 Step 1) was rank-normalized to , with a range between 0 and 1 with gene-to-network values with the lowest correlations set to zero and those with the highest correlations set to 1 using the function rank in R with ties method “max”: . Subsequently, we set the minimum value per gene and the values for genes with a low overall correlation with all networks (i.e., <0.1) to 0.
Finally, we concatenated the normalized fMRI-to-gene-expression correlation matrix (g,n) with the table of normalized ω-ranked genes (g,b) for each evolutionary branch under investigation to the matrix [(g,b)(g,n)].
2.4.3 Subspace pattern mining for network evolution via biclusteringThe center of our proposed workflow is the use of the customizable biclustering algorithm GABi (Curry, 2014) to mine the ω-fMRI-to-gene-expression data table [(g,b)(g,n)]. Figure 1 Step 2, which was concatenated in the previous steps (2.4.2 Data preparation for biclustering). Here, we updated the previously used genetic algorithm pipeline (Kaczanowska et al., 2022) on several levels with respect to (1) input data, (2) adding an option for branch-specific ω-rank thresholds (Figure 1 Step 2; Supplementary Figure 2A), (3) improving the optimization strategy for finding biclusters, and (4) adapting the post-processing pipeline.
In terms of data input, we relaxed three constraints in the previously calculated ω-values, thereby enabling the processing of a broader range of genes (Table 1; Supplementary Figure 2B). To address this issue, a modification was introduced to the GABi pipeline, wherein thresholds were iteratively computed for each branch independently, setting branch-specific minimum and maximum ω -rank thresholds (Figure 2; Supplementary Figure 2A).
We improved the optimization strategy as follows: GABi uses a genetic algorithm to search for optimal biclusters. We define a bicluster as B = (gsub, [bsub nsub]) with a sublist gsub of all 8,977 genes g (gsub ⊆ g), a branch bsub of all 11 key mammalian species b (bsub ⊆ b) and a list of networks nsub of all 45 networks n (nsub ⊆ n). The algorithm has several customizable parameters related to the exploration of the fitness space (Supplementary Figures 2C,D); for a more detailed description, see (Curry, 2014). Genetic algorithms incorporate randomness into the initial part of the bicluster finding. Therefore, iteration steps are generally beneficial for increasing the possibility of finding a global optimum. We adapted the relevant parameters such as the amount of iterated bicluster searches “amountOfBiclusterSearches,” the number of isolated sub-populations “nsubpops” and the number of candidate solutions “popsize.” A higher value for the parameter “amountOfBiclusterSearches,” which defines the number of iterations of GABi, will increase the computation time but also lead to a higher number of found biclusters. This also holds for the parameter “popsize,” which defines the number of candidate solutions. The parameter “nsubpops” defines the split of the search space into subpopulations and leads to higher computational performance but less optimal solutions. These parameters were adapted to 20, 4, and 250 for (high-ω/1,200 for low-ω) gene sets, respectively. We also have the parameter “do_weighting,” which defines whether ω and the network table are weighted differently (can be True or False), and the parameter “variableClassNotOnTabuList,” indicating whether already found biclusters are set to zero by adding them to the tabu list, which can lead to more diverse solutions. To increase the chance of finding interesting solutions, we added the option “global_optim” (Supplementary Figures 2C,D), which looped for a given number of iterations (ideally set to a multiple of 4) through several combinations of these parameters “do_weighting” (set to True or False) and “variableClassNotOnTabuList” (set to (True, True) or (True, False) for (ω, n)). Generally, all parameter values were obtained heuristically, based on the parameter description given in (Curry, 2014) and iterative experiments, aiming to obtain a diverse set of clusters while considering the computational time. Finally, during each iterative run, several biclusters can be found with a substantial overlap between the runs. Therefore, we merged similar biclusters and recomputed the scores (=size) of the merged clusters. Biclusters were defined as similar, if their genes overlapped by more than 50% (normalized by smaller size) or had the same networks similar to the published approach (Kaczanowska et al., 2022).
2.4.4 Data post-processingFor further analysis, we chose either the top 40 overall (Figure 3) or the top 20 biclusters (Figure 4) per branch according to their bicluster scores. Next, each bicluster was assigned to the respective socio-affective domain (as shown in Figure 1, Step 4), depending on the FNs composition of the bicluster. If a given bicluster FNs composition spanned more than one domain, it was assigned to all the belonging domains.

Tracing genetic selection in human phylogeny. (A) Spatial filtering. Full branch x gene ω table, filtered for spatially informative genes whose correlation to the fMRI networks exceeds a selected threshold at least once (specifically, R > 0.1). This left 45.1% of genes for tracing evolution in fMRI networks. (B)Top: Human phylogeny. Black lines indicate ancestral branches in human (AMH, Denisovan, and Neanderthal) ancestry used for computational predictions. Gray lines indicate branches not considered in this study (adapted from Kaczanowska et al., 2022). Bottom: Distribution of ω-values across branches of the selected brain-expressed genes. Dotted lines indicate the criteria for high (light green) or low (dark green) ω gene sets. (C) Evolutionary ranking. Filtered genes (A,B) that are either in the high and low ω gene sets per branch (i) and the number of times a gene was present in the high or low ω groups across the branches (ii). Note that the choice of gene selection is highly customizable, as step A can be omitted or set at different R thresholds. In addition, the branch-wise ω thresholds (step B) can be set more stringently or relaxed, and/or include additional splits (i.e., middle ranged ω genes) for further contrasts.

Mining the human socio-affective task space for evolutionary selection. Top 40 biclusters from GABi for either high ω (A) or low ω (B) gene input. Top, Identity and branch-wise number of unique genes within the biclusters of the main socio-affective domains. Bottom, Neurosynth network composition for each bicluster. For an extended representation, refer to Supplementary Figure 3. The biclusters assigned to Affective processing domain did not reach the selection threshold in either gene input case.
Comments (0)