Skip to content

Conversation

@Waschina
Copy link
Collaborator

@Waschina Waschina commented Apr 2, 2025

TODOs before merge

  • When the genomic nucleotide genome is provided as input, predict ORFs and translate them to a protein fasta before calculating alignments. Use Prodigal for this.
  • Specify new optional dependencies in the installation instructions:
    1. pyrodigal
    2. diamond
    3. mmseqs2
  • Adjust the gapseq test to look for optional dependencies.
  • Rewrite gapseq find-transport to allow other aligners and to improve speed.
  • Honor "-v"
  • Decide: Should reaction/pathway prediction be logged (stdout) as in the previous gapseq version?
  • Test if the complete workflow (find -> find-transport -> draft - > medium -> fill) and doall still works.
  • Test if pan-draft still works.
  • Write tutorial for pan-draft
  • Test/Modify gapseq adapt (i.e., use R implementation of getDBhit function) instead of bash script
  • Calculation of genome coverage (we actually calculate the percentage of ORFs associated with at least one metabolic reaction).
  • Are all options in find still used / needed? Cleaning up?
  • Pre-final step: run gapseq for toy data (perhaps write a script that re-generates all toy data? Incl. pan-draft results?)
  • Check if all columns in ".tbl" output tables are really needed. If not, drop before export
  • Check if option -g Exhaustive search is still required, and if yes, how should the alignment command be adjusted?
  • Gram-staining prediction in gapseq find only if -p (all)?
  • Include rxnWeights und Genes RDS into model RDS?
  • legacy branch from master before merge
  • bump version to 2.0.0 (first pre-release 2.0.0-beta)
  • Update reference sequence data base -> Zenodo repo (It's stored as draft version)
  • Publish new reference sequence database.
  • writeable database directories (see Offline-mode, writeable database directories and shared installation on cluster #275)
  • Option for offline mode in doall (see Offline-mode, writeable database directories and shared installation on cluster #275)
  • Can the following files be removed?
    • src/gapseq_quickstart.sh - I think the functionality is covered by gapseq doall
    • all files in src/graveyard/
    • src/coverage.R—The script won't work anymore since gapseq now performs the alignments on protein fasta genomes. If a nucleotide fasta is provided, it is translated to proteins before the alignments. Also, the coverage (in terms of the percentage of ORFs associated with metabolic reactions) is always calculated in gapseq find and reported in the header of the output ...-Reactions.tbl and -Pathways.tbl tables.
    • src/download.sh - Script is not used anywhere.
    • src/seed_pwy.R
    • src/seedcorrect.py
    • any more?

What's new?

This PR mainly includes a re-implementation of parts in gapseq find and gapseq find-transport. Major changes:

  • Run time is greatly improved by performing only one large multiple sequence alignment rather than many smaller ones.

  • Users can now choose between three different sequence alignment algorithms: blast, diamond, mmseqs2. The users can choose the algorithm using the option -A <algorithm> in gapseq find/gapseq find-transport.

  • A few bug fixes (see below where new gapseq pathway predictions were compared to old predictions).

  • The output table <query>-Pathways.tbl now includes additional columns that fully document how the completion percent was calculated and why the pathways were predicted to be present or absent. Also, a FAQ and its answer concerning completeness calculations was added to the documentation.

  • When a genomic nucleotide FASTA file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required.
    gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options:

    • Table 4: "Mycoplasma/Spiroplasma (Mollicutes)"
    • Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools)
    • Table 25: "Candidate Division SR1 and Gracilibacteria"

    The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted:

    • In Table 11, UGA is a stop codon.
    • In Table 4, UGA codes for Tryptophan.
    • In Table 25, UGA codes for Glycine.

    Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected.

    Admittedly, this approach relies on heuristic thresholds, but it works well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.

  • There are fewer dependencies on other software libraries. Specifically, the dependencies on 'exonerate', 'barrnap', 'bedtools', 'perl', and 'parallel' were dropped.

  • Users can now specify a custom directory for the reference sequence database (option -D), and which version to use (not only the latest) (option -Z). If gapseq is installed in a directory that is not writable by the user, the reference sequence database location is automatically set to $HOME/.gapseq/seq.

Comparison of new to old gapseq pathway predictions

I compared the pathway prediction from gapseq version 1.4 (commit 11c3011) to the new prediction made with batch alignments. Tests were made with genome toy/myb71.faa.gz. A few important things I noticed.

Discrepancy: Old prediction FALSE, new prediction TRUE, same completeness

Example case:

gapseq find -p PWY-5973 myb71.faa.gz

In this case, the old prediction is FALSE in he old version, because a key reaction stated in the file dat/meta_pwy.tbl is not part of the reaction IDs listed for this pathway in the same table. Thus, gapseq tested if the key reaction was predicted, but since it was not tested, it was always FALSE. When checking all pathways, this is relevant for a total of 26 pathways:

library(data.table)
mpw <- fread("dat/meta_pwy.tbl")
mpw <- mpw[keyRea != ""]
mpw[, keyReaRE := gsub(",","|",keyRea)]
mpw[, keyReaRE := gsub("\\.","\\\\.",keyReaRE)]
mpw[, mk := grepl(keyReaRE, reaId), by = id]
mpw[mk == FALSE, .(id,reaId, keyRea)]

Result:

              id                                                                                                                                                                               reaId                            keyRea
          <char>                                                                                                                                                                              <char>                            <char>
 1:   |PWY-1042|                    6PFRUCTPHOS-RXN,2.7.1.90-RXN,F16ALDOLASE-RXN,TRIOSEPISOMERIZATION-RXN,GAPOXNPHOSPHN-RXN,PHOSGLYPHOS-RXN,3PGAREARR-RXN,2PGADEHYDRAT-RXN,PEPDEPHOS-RXN,1.2.1.9-RXN                                OR
 2:  |PWY2B4Q-4|                                                                                                                   RXN2B4Q-48,RXN2B4Q-49,RXN2B4Q-50,RXN2B4Q-51,RXN2B4Q-52,RXN2B4Q-53                         RXN-22424
 3: |PWY2DNV-11|                                                                                                                                           H2PTEROATESYNTH-RXN,RXN2DNV-42,RXN2DNV-43                          RXN-8850
 4:   |PWY-4202| RXN-7002,RXN-7001,RXN-21019,2.1.1.138-RXN,RXN-22325,RXN-22332,RXN-22333,RXN-22334,RXN-22336,RXN-22335,RXN-22329,RXN-22337,1.20.4.2-RXN,TRANS-RXN0-551,RXN-22324,RXN-22323,RXN-22403 RXN-21028,2.1.1.137-RXN,RXN-22330
 5:   |PWY-5115|                                                                                                                                                                            RXN-7772                          RXN-7771
 6:   |PWY-5278|                                                                                                                                       SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217     ADENYLYLSULFATE-REDUCTASE-RXN
 7:   |PWY-5288|                                                                                           RXN-8184,RXN-8185,RXN-8186,RXN-8187,RXN-8189,RXN-8190,RXN-8026,RXN-8025,RXN-8214,RXN-8215                                OR
 8:   |PWY-5306|                                                                                                          RXN-8202,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,RXN-17803,RXN-17804     ADENYLYLSULFATE-REDUCTASE-RXN
 9:   |PWY-5308|                                                                                           R164-RXN,RXN-8202,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,SULFITE-DEHYDROGENASE-RXN     ADENYLYLSULFATE-REDUCTASE-RXN
10:   |PWY-5690|                         FUMHYDR-RXN,SUCCCOASYN-RXN,2OXOGLUTARATEDEH-RXN,ACONITATEHYDR-RXN,CITSYN-RXN,ACONITATEDEHYDR-RXN,ISOCITRATE-DEHYDROGENASE-NAD+-RXN,MALATE-DEH-RXN,RXN-14971 ISOCITRATE-DEHYDROGENASE-NAD+-RXN
11:   |PWY-5973|                                                                                                                                   RXN-9555,2.3.1.179-RXN,RXN-9556,RXN-9557,RXN-9558                          RXN-8389
12:   |PWY-6676|                                               R17-RXN,RXN-11942,RXN-11941,RXN-11940,RXN-16434,RXN-16435,RXN-17803,RXN-21586,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,RXN-15349     ADENYLYLSULFATE-REDUCTASE-RXN
13:   |PWY-6893|                                                                                                                                                             THI-P-KIN-RXN,RXN-12610                         RXN-12609
14:   |PWY-7663|                                                                                                                                             RXN-16629,RXN-16630,RXN-16631,RXN-16632                         RXN-16614
15:   |PWY-7966|                                                                                                                                                       RXN-19305,RXN-19301,RXN-19322                         RXN-17124
16:   |PWY-7967|                                                                                                                                                       RXN-19323,RXN-19304,RXN-19300                         RXN-17123
17:   |PWY-7969|                                                                                                                                                       RXN-19320,RXN-19303,RXN-19299                         RXN-17122
18:   |PWY-8195|                                                                                                                   RXN-21844,RXN-21843,2.7.8.6-RXN,RXN-21851,TRANS-RXN-433,RXN-22033                          RXN-9160
19:   |PWY-8198|                                                                                                                   RXN-21844,RXN-21843,2.7.8.6-RXN,RXN-21854,TRANS-RXN-430,RXN-22030                          RXN-9159
20:   |PWY-8246|                                                                                                                                                                 RXN-22248,RXN-22249               RXN-22246,RXN-22245
21:   |PWY-8247|                                                                                                                         RXN-22250,RXN-22251,RXN-22252,RXN-22253,RXN-22262,RXN-22273               RXN-22245,RXN-22246
22:   |PWY-8266|                                                                                                               RXN-21019,RXN-22332,RXN-22333,RXN-22334,RXN-22336,RXN-22335,RXN-22329 RXN-21028,2.1.1.137-RXN,RXN-22330
23:   |PWY-8321|                                                                                                                                                  ISOCHORSYN-RXN,RXN-22684,RXN-22683                          RXN-1981
24:   |PWY-8352|                                                                    NICONUCADENYLYLTRAN-RXN,NAD-SYNTH-GLN-RXN,NAD-SYNTH-NH3-RXN,QUINOPRIBOTRANS-RXN,QUINOLINATE-SYNTHA-RXN,RXN-22981                      1.4.1.21-RXN
25:   |PWY-8391|                                                                                                                                                 URONATE-DEHYDROGENASE-RXN,RXN-23374                         RXN-11407
26:   |PWY-8393|                                                                                                                                                                 RXN-23378,RXN-23376          D-XYLULOSE-REDUCTASE-RXN
              id                                                                                                                                                                               reaId                            keyRea

The new gapseq version ignores defined key reactions, that are not part of the respective pathway.

Discrepancy: Old prediction TRUE, new prediction FALSE, different completeness

Example run:

gapseq find -p SERDEG-PWY myb71.faa.gz

These discrepancies were in all cases I analysed in depth due to a small bug in the previous version when counting found subunits. The responsible code line in the previous version:

https://github.com/jotech/gapseq/blob/11c30119b8aa1407b5c83c1c11b92e983c1efefc/src/gapseq_find.sh#L917C17-L917C84

The problem is that in cases where an undefined subunit was found, it also adds 1 to the counter in the variable $subnits_found. However, these should not be considered in the specific code line, as the $subunit_fraction variable is later compared to the $subunit_cutoff (which is by default 0.5). Thus in cases of two real subunits (with no hits) and an hit to an undefined subunit it predicts the reaction to be present because it calculates 1/2 (1 undef subunit / 2 total real subunits), which is 0.5. This value is equal to $subunit_cutoff , and the undefined hit causes the complex prediction to be TRUE/present. The actual code line should have been:

subunit_fraction=$(echo "100*($subunits_found-$subunits_undefined_found)/$subunits_count" | bc)

This bug is fixed in the new version (within the R code of file analyse_alignments.R).

This problem causes discrepancies in > 25 cases for myb71.faa.gz.

Discrepancy: Less transporters identified in new gapseq version

While I am still investigating this, I noticed a potential bug in the previous gapseq version. E.g. when a hit to a transport with the TC 2.A.1.1.4 is found, it connects this hit also to substrates of transporters with a TC that start with 2.A.1.1.4, e.g. 2.A.1.1.403. The responsible line in the previous gapseq code (src/transporter.sh) was:

substrates=$(grep -F $tc tcdb_all | cut -f2)

With tc=2.A.1.1.4, this line yielded the hits:

2.A.1.1.45 CHEBI:5418;glucose 
2.A.1.1.41 CHEBI:10085;xylose 
2.A.1.1.4 CHEBI:5418;glucose 
2.A.1.1.49 CHEBI:9840;UDP-sugar 
2.A.1.1.40 CHEBI:10085;xylose 
2.A.1.1.43 CHEBI:5584;hydron|CHEBI:5418;glucose|CHEBI:5256;galactose|CHEBI:14575;mannose|CHEBI:5172;fructose 2.A.1.1.47 CHEBI:9885;7,9-dihydro-1H-purine-2,6,8(3H)-trione 
2.A.1.1.404 CHEBI:5418;glucose 2.A.1.1.403 CHEBI:23062;cellodextrin|CHEBI:3522;cellobiose|CHEBI:6353;alpha-lactose|CHEBI:3528;cellotriose 
2.A.1.1.46 CHEBI:5418;glucose 
2.A.1.1.48 CHEBI:9840;UDP-sugar 
2.A.1.1.44 CHEBI:5709;hexose 
2.A.1.1.42 CHEBI:5584;hydron|CHEBI:4167;D-glucopyranose

The new gapseq version is more strict and reports only the substrates of transporters with the exact TC.

Other small changes in the new gapseq version

Complex detection

In the old and the new gapseq version, complexed are detected by analysing the fasta sequence headers for key terms such as "chain" or "subunit". In rare cases, where there were several sequences but only very few that indicated a subunit association, gapseq always needed hits to those sequences in order to say that the complex is there. However, in most organisms this enzyme might not be a complex/heteromer.

New approach: If 20% or less of the sequences are predicted to be a specific subunit, the reaction is not tested as a complex; i.e., no subunit hits are required for the reaction prediction to be TRUE. This is implemented in src/complex_prediction.R

Gram prediction

Gram prediction is used to decide which biomass reaction is added to a bacterial metabolic model. In the previous version, the prediction was made within the gapseq draft, where the biomass reaction was also added to the model. Now, the Gram-staining prediction is moved to gapseq find. The rationale behind this decision is that gapseq find has the genome sequence already as input; performing the HMM-based Gram prediction here makes sense, as this also needs the genome. The predicted Gram-staining is added as info to the headers of the output tables "...-Reactions.tbl" and "...-Pathways.tbl".

Updating reference sequence databases

gapseq now has a new module to update the reference sequence database. Two examples:

gapseq update-sequences -t Bacteria # Update Reference sequences for Bacteria
gapseq update-sequences -t Bacteria -D ~/gapseqDB/ # Update Reference sequences for Archaea and save the database in a user-defined directory

Waschina and others added 24 commits March 3, 2025 08:41
- allows to use annotations from eggnog-mapper to be used in gapseq
- only creates reaction table and no pathway table at the moment
Get latest commits from master branch
Next step, perform the actual batch blast!
- still need to be tested
- comparison between aligners
- comparison with previous tables still built with sequentional
  alignments
- is the new output compatible with gapseq draft?
Still some errors...
Also: Search for custom EC number(s) was tested and should now work.
Also for cases were a single EC does results in no blast or no
sequences. Multiple ECs(comma-sep.) were also tested.
When a **genomic nucleotide FASTA** file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required.
  gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options:

  - Table 4: "Mycoplasma/Spiroplasma (Mollicutes)"
  - Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools)
  - Table 25: "Candidate Division SR1 and Gracilibacteria"

  The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted:

  - In Table 11, UGA is a stop codon.
  - In Table 4, UGA codes for Tryptophan.
  - In Table 25, UGA codes for Glycine.

  Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected.

  Admittedly, this approach relies on heuristic thresholds, but it seems to work well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.
@Waschina Waschina requested a review from jotech April 2, 2025 15:01
Waschina added 11 commits June 30, 2025 17:25
New reactions and metabolites were added for some additional human milk
oligosaccharides.

The reaction+metabolite additions were suggested by Coen Berns (SILS UvA)
Quotes in reaction names caused errors when attempting to download the
reference sequences from uniprot.

Bug was fixed by escapting the quotes in reaction names when calling
"src/uniprot.sh"
Separate files for reaction weights and gene associations are not
required anymore, as these are now include in the RDS-files. (cobrar
modelorg class)
- Updated Tutorial "Joghurt"
- New Tutorial "Pan-Draft"
- Rewritten usage docs for pan-draft
Reaction is relevant for e.g. Archaea of the genus Sulfolobus.

Three reactions were added:
- NADPH sulfur oxidoreductase (EC 1.8.1.18)
- NADH sulfur oxidoreductase (EC 1.8.1.18)
- cross-membrane elemental sulfur diffusion
The script involves the identification of potential resource
metabolites. This part requires that the final table of all corrected
reactions is already exported to the file
"dat/seed_reactions_corrected.tsv".
Updates involve mainly corrected mappings of metacyc reactions to
gapseq's reaction DB.

Pathways affected:
- PWY-7400 (L-arginine biosynthesis IV (archaebacteria)
- PWY-3081 (L-lysine biosynthesis V)
- NAD(P)H repair
- NPGLUCAT-PWY (Entner-Doudoroff pathway II (non-phosphorylative))
The GTP-dependent dephospho-CoA kinase was not yet in the biochemistry
database.

EC 2.7.1.237
Pathways include:
- L-glutamate biosynthesis I
- coenzyme B biosynthesis
- reductive acetyl coenzyme A pathway II (autotrophic methanogens)
- coenzyme A biosynthesis III (archaea)
@Waschina Waschina changed the title Batch alignment Version 2 (o.a., Batch alignment) Aug 25, 2025
- Added reactions and EC mappings for PWY-5254 (Metanofuran
  biosynthesis) pathway.
- The pathway with the ID "ASPARAGINE-BIOSYNTHESIS" was defined in
  metacyc only for Eukaryota and Bacteria. Yet, the enzyme (EC 6.3.5.4)
  is also present in Archaea. The pathway was therefore transferred to
  "custom_pwy.tbl" with updated taxonomy range.
- For the medium prediction, proline is not added if the enzyme EC
  4.3.1.12 is part of the draft network.
- Also removed user sequences for the old EC. Correct seqs are retrieved
  from uniprot with the new EC.
- Now, users can specify a directory, where the reference sequence data
  base is stored.
- If no directory is specified, the directory "dat/seq" within the
  gapseq installation directory is used (as before this commit)
- If default directory within gapseq installation dir is not writable,
  the fallback directory is: $HOME/./gapseq/seq
- Important: For reproducibility purposes the users can now also specify
  a reference database version, which might not be the latest version
(see option "-Z" in gapseq find|doall|update-sequences)
- The command to update the sequence database is now a gapseq module:
  `gapseq update-sequences` and not as before calling
  "src/./update_sequences.sh" from within the installation directory.

Refers to #275.
New commands to update the reference sequence database.
Now, only one Saccharide is added to gapfill-medium if several were
predicted to be potential carbon/energy sources.
EC 7.1.1
Hydron translocation or charge separation linked to oxidoreductase
reactions.
It is planned to add a new tutorial some time after the next release.
The original pathway is tagged as superpathway, which caused that the
pathway is not test when running `gapseq find -p all ...` with default
parameters. However, this is the arginine biosynthesis pathway for e.g.
E. coli.

Otherwise, no changes to the pathways.
- EC 1.1.2.3 - Cytochrom-dep. lactate dehydrogenase
- EC 1.1.3.2 - lactate oxidase

Sequences of enzyme encoding genes can be very similar.
Now requires r-cobrar>=0.2.2 . The new cobrar version includes a bug fix
related to the metabolite charges in exported SBML files.

See more here: Waschina/cobrar@cf75c5c
The "...-all-Reactions.tbl" output table should list all subunits of
protein complexes, even if they are not found by sequence alignments to
reference proteins. When implementing the batch alignment approach, we
accidently lost this behavious, which let to missing 'no_blast' entries
in the output table.

Listing all subunits in the "...-all-Reactions.tbl" is important,
because the info on subunits with "no_blast" status inform the calculations of
reactions scores (during `gapseq draft`) used for later gapfilling.

The bug had the effect, that we overpredicted reactions associated to
protein complex enzymes, especially in those cases, where most subunits
were not found but only one or very few with bitscores > 100.

I noticed the bug with the enzyme EC 1.1.1.436 (electron confurcating
LDH).
When using the option "-U" in `gapseq find`, no download of the
reference sequence database is triggered/checked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants