-
Notifications
You must be signed in to change notification settings - Fork 75
Exclude .Name predicates with key = "" from ownership #579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary: # Context Ownership propagation for incremental databases is costly in our biggest repositories. In general, ownership sets are small and accessing them is quick. However, the root predicate (`Name = ""`) owns all files and has a huge ownership set. Accessing this set takes significant time. # This change This change removes the assignment of an ownership set from the root predicate. Prior to change, this set includes all units. ## Accessing ownership with change Now, `getOwner(empty_name_fact_id)` returns `INVALID_USET` rather than a large ownership set. # Net effect Removing the large root ownership set leads to faster ownership set calculations for incremental DBs and faster incremental DB ownership set calculations. Differential Revision: D78815949
✅ Deploy Preview for fb-oss-glean canceled.
|
|
This pull request was exported from Phabricator. Differential Revision: D78815949 |
|
@simonhollis I'd really like to discuss this, at face value the change looks very dodgy, we definitely shouldn't be doing anything that depends on predicate names at this level. Could you explain the problem you're seeing in more detail? I'm not sure what you mean by a "root predicate". My strong suspicion is that whatever problem you're having should be solved at the schema or the indexer level, and not in the core ownership processing. |
|
Sure. What we're trying to solve here is the processing time for calculating incremental db ownerships by removing the only fact that has a large ownership set. "Root predicate" is badly named - it's what I'm calling the fact that has an empty unit name, i.e. What's the scale? In our larger Glean DBs, the root can reference millions of files in its ownership set, whereas all except for a couple of other facts have ownership set sizes <= 3. The net effect is that performing The PR proposes just removing this large ownership set, to cause an instant return with INVALID_USET since as far as I can see, the information from the associated fact is not needed for correctly propagating ownership information for incrementals. |
|
@simonhollis I'm still really confused. Firstly, regardless of set sizes, Secondly:
Do you mean there is a But I suspect I'm misunderstanding the problem. The code change seems to be explicitly looking for predicate names (not unit names) that contain |
|
@simonhollis here's my guess at what's going on.
It's not unreasonable for some facts to appear in every unit - indeed I'm fairly sure there are lots of examples of this. I'd start by investigating why |
Yes, there is an empty string Unit in all Python databases. My understanding is that this is a necessary default value to ensure that a) the database is never completely empty; b) catch any facts for which ownership cannot be determined. But, with your response, maybe that assumption needs revisiting. I'll generate an example tomorrow to add to the context.
OK, concern noted. We don't need to rush this change and can take time to get it right. |
I did investigate this part and I realise I mis-stated earlier. It's |
|
@simonhollis Let me distinguish 3 separate things, because I think we're getting a bit confused here.
|
|
@simonmar Thanks for the further clarifications. I think what we're seeing manifested in latency then is the effect of calling the serialisation of the whole set of owned files by the erroneously generated empty string values. If these don't need to be there for indexing-related reasons, then I agree we wouldn't need the functionality in this patch. I will dig around the Python indexer side as well as checking the databases from a few more languages to see if the pattern repeats there and report back and we can decide to close or continue. There will be quite a bit of latency on that work though, since I'm taking some time off from tomorrow. |
|
Looping back on this one: as Simon M pointed out, the Python indexer is outputting facts like |
Summary:
Context
Ownership propagation for incremental databases is costly in our biggest repositories.
In general, ownership sets are small and accessing them is quick. However, the root predicate (
Name = "") owns all files and has a huge ownership set. Accessing this set takes significant time.This change
This change removes the assignment of an ownership set from the root predicate. Prior to change, this set includes all units.
Accessing ownership with change
Now,
getOwner(empty_name_fact_id)returnsINVALID_USETrather than a large ownership set.Net effect
Removing the large root ownership set leads to faster ownership set calculations for incremental DBs and faster incremental DB ownership set calculations.
Differential Revision: D78815949