Skip to content

Conversation

@simonhollis
Copy link
Contributor

Summary:

Context

Ownership propagation for incremental databases is costly in our biggest repositories.

In general, ownership sets are small and accessing them is quick. However, the root predicate (Name = "") owns all files and has a huge ownership set. Accessing this set takes significant time.

This change

This change removes the assignment of an ownership set from the root predicate. Prior to change, this set includes all units.

Accessing ownership with change

Now, getOwner(empty_name_fact_id) returns INVALID_USET rather than a large ownership set.

Net effect

Removing the large root ownership set leads to faster ownership set calculations for incremental DBs and faster incremental DB ownership set calculations.

Differential Revision: D78815949

Summary:
# Context
Ownership propagation for incremental databases is costly in our biggest repositories. 

In general, ownership sets are small and accessing them is quick. However, the root predicate (`Name = ""`) owns all files and has a huge ownership set. Accessing this set takes significant time. 

# This change
This change removes the assignment of an ownership set from the root predicate. Prior to change, this set includes all units.

## Accessing ownership with change
Now, `getOwner(empty_name_fact_id)` returns `INVALID_USET` rather than a large ownership set.

# Net effect
Removing the large root ownership set leads to faster ownership set calculations for incremental DBs and faster incremental DB ownership set calculations.

Differential Revision: D78815949
@netlify
Copy link

netlify bot commented Aug 26, 2025

Deploy Preview for fb-oss-glean canceled.

Name Link
🔨 Latest commit 20a2826
🔍 Latest deploy log https://app.netlify.com/projects/fb-oss-glean/deploys/68adbcfb7dd67b000878a28e

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 26, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D78815949

@simonmar
Copy link
Collaborator

@simonhollis I'd really like to discuss this, at face value the change looks very dodgy, we definitely shouldn't be doing anything that depends on predicate names at this level.

Could you explain the problem you're seeing in more detail? I'm not sure what you mean by a "root predicate". My strong suspicion is that whatever problem you're having should be solved at the schema or the indexer level, and not in the core ownership processing.

@simonhollis
Copy link
Contributor Author

Sure. What we're trying to solve here is the processing time for calculating incremental db ownerships by removing the only fact that has a large ownership set. "Root predicate" is badly named - it's what I'm calling the fact that has an empty unit name, i.e. Name = "" - I call it 'root' since its equivalent to a root node in a name prefix graph.

What's the scale? In our larger Glean DBs, the root can reference millions of files in its ownership set, whereas all except for a couple of other facts have ownership set sizes <= 3. The net effect is that performing getOwner on this single fact can take minutes, whereas the other facts return near instantaneously.

The PR proposes just removing this large ownership set, to cause an instant return with INVALID_USET since as far as I can see, the information from the associated fact is not needed for correctly propagating ownership information for incrementals.

@simonmar
Copy link
Collaborator

@simonhollis I'm still really confused. Firstly, regardless of set sizes, getOwner should always be fast, because it just returns a UsetId. It doesn't depend on the size of the set at all.

Secondly:

the fact that has an empty unit name, i.e. Name = ""

Do you mean there is a Unit that is the empty string? If so, we shouldn't be generating that in the first place. The unit names come from the indexer, so you could modify the indexer to not produce empty unit names.

But I suspect I'm misunderstanding the problem. The code change seems to be explicitly looking for predicate names (not unit names) that contain .Name. Why? We definitely shouldn't be adding predicate-specific magic into the core ownership processing here. I'm also concerned that this change could lead to invariants being violated and invalid DBs, which would have some very strange consequences.

@simonmar
Copy link
Collaborator

@simonhollis here's my guess at what's going on.

  1. In some language (Python?) the indexer produces facts of python.Name with the key ""
  2. Since this fact appears in every module, its ownership set will include every unit (as it should)
  3. Something in the implementation of getOwner is slow (?)

It's not unreasonable for some facts to appear in every unit - indeed I'm fairly sure there are lots of examples of this.

I'd start by investigating why getOwner is slow.

@simonhollis
Copy link
Contributor Author

Do you mean there is a Unit that is the empty string? If so, we shouldn't be generating that in the first place. The unit names come from the indexer, so you could modify the indexer to not produce empty unit names.

Yes, there is an empty string Unit in all Python databases. My understanding is that this is a necessary default value to ensure that a) the database is never completely empty; b) catch any facts for which ownership cannot be determined. But, with your response, maybe that assumption needs revisiting. I'll generate an example tomorrow to add to the context.

But I suspect I'm misunderstanding the problem. The code change seems to be explicitly looking for predicate names (not unit names) that contain .Name. Why? We definitely shouldn't be adding predicate-specific magic into the core ownership processing here. I'm also concerned that this change could lead to invariants being violated and invalid DBs, which would have some very strange consequences.

OK, concern noted. We don't need to rush this change and can take time to get it right.

@simonhollis
Copy link
Contributor Author

  1. Something in the implementation of getOwner is slow (?)

I'd start by investigating why getOwner is slow.

I did investigate this part and I realise I mis-stated earlier. It's factOwnership that is slow - the call to this to return the OwnerExpr takes time proportional to the size of the ownership set.

@simonmar
Copy link
Collaborator

@simonhollis factOwnership is only used by the shell's !owner command, so it shouldn't be affecting any production use cases. It's a debugging API really. It has to serialise the whole set, so it's not unexpected for it to perform badly for very large sets.

Let me distinguish 3 separate things, because I think we're getting a bit confused here.

  1. Unit = "". That is, an indexer is producing some facts with the unit set to the empty string. There's no good reason to do this that I'm aware of, so if it's happening I suggest the indexer should be fixed to use a proper unit name instead of the empty string.
  2. No unit at all. A fact is unowned, if it was produced with no unit and is only referenced by other unowned facts. This is quite normal. Unowned facts are visible in the base DB but disappear in a stacked incremental DB, because the stacked DB can only view owned facts in the base DB.
  3. The fact python.Name "" I don't know why the Python indexer needs to produce this fact or what it means - that's probably worth investigating, maybe there's a good reason for it, but it definitely looks suspicious. I mean, the empty string is not a valid name in Python.

@simonhollis
Copy link
Contributor Author

@simonmar Thanks for the further clarifications. I think what we're seeing manifested in latency then is the effect of calling the serialisation of the whole set of owned files by the erroneously generated empty string values. If these don't need to be there for indexing-related reasons, then I agree we wouldn't need the functionality in this patch.

I will dig around the Python indexer side as well as checking the databases from a few more languages to see if the pattern repeats there and report back and we can decide to close or continue. There will be quite a bit of latency on that work though, since I'm taking some time off from tomorrow.

@simonhollis
Copy link
Contributor Author

Looping back on this one: as Simon M pointed out, the Python indexer is outputting facts like python.Name "", which is unexpected. For now, the focus is on working out why this is happening and potentially changing the indexer to prevent this king of fact being generated in the first place, rather than changing the general ownership logic proposed here. I will therefore close this PR.

@simonhollis simonhollis closed this Oct 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants