Description
Hello,
First of all, thank you for your work and for contributing DIPS-Plus to the community. I think this is a very valuable dataset, and the documentation is awesome.
I saw in your paper that you ran FoldSeek to create a fair split between the DB5 test complexes and the DIPS-Plus training and validation sets, in order to reduce cross-contamination:
"DIPS-Plus offers a standardized train-validation partitioning of the dataset’s complexes using FoldSeek [42], such that there are no complexes (i.e., chains) within the dataset’s original sequence-based training or validation splits that are structurally similar to any chain within the DB5 dataset’s test split. To create such structure-based splits, FoldSeek was run using DIPS-Plus’ data partitioning script in exhaustive search mode between all chains in the training and validation dataset and the chains in the DB5 test dataset. In this context, a chain was considered structurally similar to another chain if FoldSeek assigned a 50% or greater probability to the two chains belonging to the same SCOPe superfamily, while permitting E-values up to 0.1. Note that the default E-value upper limit for FoldSeek is 0.001, which means that the use of an increased upper limit on E-values in DIPS-Plus’ FoldSeek searches modified the similarity searches for all chain pairs to report more distant potential homologs compared to FoldSeek’s default search settings. After performing this exhaustive search, FoldSeek removed 3,727 of the 33,159 original sequence-filtered chain pairs within the DIPS-Plus training split, resulting in 29,432 chain pairs remaining for training. Similarly, FoldSeek removed 890 of the 8,290 original sequence-filtered chain pairs within the DIPS-Plus validation split, resulting in 7,400 chain pairs remaining for validation."
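To check that I understand the criterion, here is a minimal sketch of how I read the filtering step. It is not your partitioning script: the `Hit` record, the function names, and the chain identifiers are all hypothetical, and I am assuming that a chain pair is dropped as soon as either of its chains has a qualifying hit against DB5 test.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One FoldSeek hit (hypothetical record; field names are my own)."""
    query_chain: str   # chain from a DIPS-Plus training/validation pair
    target_chain: str  # chain from the DB5 test split
    prob: float        # probability of sharing a SCOPe superfamily
    evalue: float      # E-value reported for the hit


def similar_chains(hits: Iterable[Hit],
                   min_prob: float = 0.5,
                   max_evalue: float = 0.1) -> Set[str]:
    """Return query chains considered structurally similar to a DB5 chain.

    Thresholds follow the quoted text: probability >= 50%, E-value <= 0.1.
    """
    return {
        h.query_chain
        for h in hits
        if h.prob >= min_prob and h.evalue <= max_evalue
    }


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 flagged: Set[str]) -> List[Tuple[str, str]]:
    """Keep only pairs in which neither chain was flagged.

    Assumption: one flagged chain is enough to remove the whole pair.
    """
    return [p for p in pairs if p[0] not in flagged and p[1] not in flagged]
```

With something like this, `filter_pairs(train_pairs, similar_chains(hits))` would give the reduced training list, and the same call on the validation pairs would give the reduced validation list.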
Would you mind sharing the reduced datasets on Zenodo as well? I can see "dips/data-set-postprocessed-train.txt", but that is the non-filtered set of 33,159 chain pairs.
Thanks a ton
Olivia