Conversation

@stschiff
Member

@stschiff stschiff commented Sep 5, 2025

As discussed, and perhaps still to be discussed further, here is my suggestion: a mandatory and unique Individual_ID column. No lists; multiple values are not allowed. What do you think?

@nevrome
Member

nevrome commented Sep 8, 2025

Ok - I think that is OK.
We should imho also apply the ASCII-only constraint we introduced for Poseidon_IDs and Group_Names.
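
A minimal sketch of the per-row constraints proposed so far (a value is present, it is a single value rather than a list, and it is ASCII-only), assuming the .janno rows have already been read into dictionaries. The function name `check_individual_ids` and the use of `;` as the list separator are illustrative assumptions, not existing trident behaviour.

```python
def check_individual_ids(rows):
    """Return human-readable problems found in the Individual_ID column.

    `rows` is assumed to be a list of dicts, one per .janno row.
    """
    problems = []
    for row_no, row in enumerate(rows, start=1):
        value = (row.get("Individual_ID") or "").strip()
        if not value:
            # Mandatory, as proposed at this point in the discussion.
            problems.append(f"row {row_no}: Individual_ID is missing")
            continue
        if ";" in value:
            # Assumption: ';' is the separator that would indicate a list.
            problems.append(f"row {row_no}: Individual_ID must be a single value")
        if not value.isascii():
            problems.append(f"row {row_no}: Individual_ID must be ASCII-only")
    return problems
```

Note that the sketch does not reject repeated values across rows, since several samples may derive from the same individual.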

@nevrome
Member

nevrome commented Sep 8, 2025

OK - I thought about this a bit longer:

  1. Relation_To should from now on feature Individual_IDs (and potentially Alternative_IDs) not Poseidon_IDs.
  2. Some .janno columns could now refer to the sampled individual in their description, not the sample itself, e.g. Species, Collection_ID, Date_*, Chromosomal_Anomalies, MT_Haplogroup, Y_Haplogroup. In Genetic_Sex we use the phrase "of the individual derived from this sample", and in Alternative_IDs "the sampled individual". Maybe we should use these more often now.
  3. Unrelated to the schema: Having a mandatory column that is NOT in the genotype data will require us to rewire some behaviour of trident. Probably the easiest would be to always use the Poseidon_ID in case the Individual_ID is missing, with a warning that this is happening.
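
The fallback in point 3 could look roughly like the following sketch. The helper name `effective_individual_id` and the dictionary-based row are hypothetical; this is not how trident is implemented, only an illustration of "use the Poseidon_ID and warn".

```python
import warnings


def effective_individual_id(row):
    """Return the Individual_ID, or the Poseidon_ID (with a warning) if it is absent."""
    individual_id = (row.get("Individual_ID") or "").strip()
    if individual_id:
        return individual_id
    poseidon_id = row["Poseidon_ID"]
    warnings.warn(
        f"Individual_ID missing for {poseidon_id}; using the Poseidon_ID instead"
    )
    return poseidon_id
```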

@stschiff
Member Author

stschiff commented Sep 9, 2025

OK, we just made Individual_ID non-mandatory, after some discussion among the core team. Readers should feel free to comment. Our reasoning was that the schema should also be usable for work-in-progress projects, where Individual_ID might be an analysis result that only comes in later. Also, in some edge cases, there might not be Individual_IDs, for example with some sedimentary or residue samples.

Agree with 2)

Any validation of Relation_To should by default only happen within a given package.
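
A rough sketch of what package-local validation could mean: every Relation_To entry is checked only against the Individual_IDs present in the same package. The function name `unresolved_relations`, the dictionary-based rows, and the `;` separator for multiple entries are assumptions made for illustration.

```python
def unresolved_relations(rows):
    """Map each Poseidon_ID to its Relation_To entries not found in this package.

    `rows` is assumed to hold the .janno rows of a single package as dicts.
    """
    # Individual_ID is optional, so rows without one simply contribute nothing.
    known = {row["Individual_ID"] for row in rows if row.get("Individual_ID")}
    unresolved = {}
    for row in rows:
        raw = row.get("Relation_To") or ""
        entries = [entry.strip() for entry in raw.split(";") if entry.strip()]
        missing = [entry for entry in entries if entry not in known]
        if missing:
            unresolved[row["Poseidon_ID"]] = missing
    return unresolved
```

Entries that point to individuals in other packages would show up as unresolved here, which is why such a check is probably better reported as a warning than as an error.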

@stschiff
Member Author

OK, important update to @nevrome's comments, after having looked more into it. I have now only changed the Relation_To column, as I agree with @nevrome that this should refer to Individual_ID. I don't think Alternative_IDs should be permitted in Relation_To, though. I am also more careful about sample vs. individual. Here is the new description for Relation_To:

other individuals (by Individual_ID) that are related to the individual this sample derived from, multiple entries separated by ;

Regarding the other fields you suggested, @nevrome, I do not think we should change them:

  • Species: No, this is the species that the sample derives from. Although it would not be wrong to refer to the Individual here, I think it's unnecessary. As Janno lists Samples, not Individuals, we should not add abstraction if it's not needed.
  • Collection_ID: Same
  • Date_*: No, this is an analysis result for the sample (at least in case of C14), not for the individual. In fact, there could be multiple dates for different samples of the same individual.
  • Chromosomal_Anomalies: This is an analysis result of the genetic data of the sample, so it should be the sample
  • MT_Haplogroup and Y_Haplogroup: Same, this is an analysis result so it refers to the entity of analysis, which is the sample.

One more important update: I propose to actually change the wording in Alternative_IDs to refer to the sample, not the individual. I think this makes more sense: Since in Poseidon our principal entity of analysis is the sample, not the individual, Alternative_IDs should refer to other sample IDs.

@nevrome
Member

nevrome commented Sep 13, 2025

Ok - thanks for thinking all of this through and engaging thoroughly with the individual points. I think you convinced me and I agree with everything -- except for the very last change:

In my opinion Alternative_IDs must operate on the individual level. There we have a well-established need to document multiple different identifiers. An alternative name for a Poseidon_ID, on the other hand, is something that should occur very rarely if you think back to its exact definition (as written up in the paper - we should add this to the schema now!). Calling it a "sample" is a simplification that we only adopted for convenience. In fact it is much more specific.

@stschiff
Member Author

OK, I am not quite following, but certainly open to the suggestion. Let's discuss in person.
