Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch the loading of metadata during entity list rendering (Trac #4290) #4290

Closed
elgg-gitbot opened this issue Feb 16, 2013 · 21 comments
Closed

Comments

@elgg-gitbot
Copy link

Original ticket http://trac.elgg.org/ticket/4290 on 42030340-11-12 by trac user mrclay, assigned to trac user mrclay.

Elgg version:

When displaying a list of entities, particularly of the same type/subtype, it would be wise to attempt to autoload a metadata key for all entities in the list whenever an unloaded key is requested.

E.g. elgg_list_entities is asked to render some blog posts. In the first view rendering, $entity->foo was already pre-loaded (by some other mechanism) but $entity->bar has not been. This, I think, makes it extremely likely that that bar is going to be needed for all entities in the list, so they all should be loaded at that time in as few queries as possible.

In practice, something like the context stack would be needed, but holding lists of entities being currently rendered (and a comma-separated list of their GUIDs for use in queries).

What I imagine complicates batch-loading metadata is that some entities could have bar as an int, some others as an array of strings, etc. but I imagine there's still big opportunity for reducing queries.

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42066506-06-29

I like the idea of bulk loading metadata but I'd vote we just batch load all metadata always when fetching a list of entities (respecting access controls of course). Memory consumption should not be prohibitive here, and this cuts the maximum number of queries out. It's also far simpler to implement.

We could try doing smart things like if an entity's metadata is already loaded we just use that instead of pulling it in twice, but I'd say that's a different ticket.

@elgg-gitbot
Copy link
Author

Milestone changed to Near Term Future Release by ewinslow on 42365333-01-03

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42419165-09-17

In terms of implementation, we would need a way to set metadata on an entity without triggering a db request (i.e. disable auto-save). Either that or we load the metadata into a cache and then reconfigure ElggEntity's metadata getter to hit the cache first, then try the DB.

I'm partial to the second option because we could do it without changing the API (i.e. in 1.8.x), and such a cache stands to get us some big performance improvements across the board.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42419430-06-21

2nd option sounds far better. I think the MetadataCache should sit on the entity (so it gets auto dumped if the entity is removed from the entity cache) and should also cache failed lookups (basically if a key is requested and false/null is returned, that should be cached), because these often occur multiple times within a request. I assume that, on save, the saved metadata value(s) will be put back in the read cache?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42419986-11-22

Just spent a few hours on [https://github.com/mrclay/Elgg-leaf/compare/Elgg:1.8...mrclay:4290-md-cache this commit], just to realize that, since the global metadata fetch and delete functions are in the public API and can be called directly, there's no way to invalidate/replace cached values if I store metadata on the entity object. Another case where the global API forces other things to be global :(.

But at least the mission is clearer, while elgg_delete_metadata and friends are around and don't interact directly with entities, every piece of the metadata cache must be accessible from everywhere.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42420029-12-15

My new work branch on this is [https://github.com/mrclay/Elgg-leaf/compare/Elgg:1.8...mrclay:4290-md-cache-2 here]. Just need to plug it into all public API points for MD read/write...ugh.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42420982-12-30

Made significant progress on [https://github.com/mrclay/Elgg-leaf/compare/Elgg:1.8...mrclay:4290-md-cache-2 my branch] tonight. All fetched metadata is now stored, and delete/disable invalidate as little of the cache as possible. When there's any uncertainty about future DB reads, the cached values are cleared. All (current) tests passing!

The next step is prefetching all the metadata from inside elgg_get_entities, but I'd appreciate some review of the work so far. Known missing are statements to clear the MD cache wherever $ENTITY_CACHE is set to array() (generally in upgrade scripts), but I'm not sure if these are really necessary. I'm sure there will be small bugs creeping in that the current tests won't catch.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42425733-08-12

I think this is as coupled as the metadata cache should get to system. I.e. I think elgg_get_entities should not call cache methods, but rather should populate the cache by calling a hook (e.g. query:entities:after in #4453) and a separate handler should accept the entities and populate the cache.

Part of the reasoning is to stop shoehorning functionality into elgg_get_entities, part is to allow site owners to change the way or the extent to which the MD cache is populated. I'm concerned, after seeing comments in the code mentioning some sites having very large amounts of metadata, that always populating all metadata may not be reasonable.

When querying to pull all the MD, it will be important not to use a LIMIT clause. If you do , you never know if there are MD values that were not loaded, so you cannot assert that the cache is accurate for any key. It may be wise, before pulling data, to first issue a query to determine how many values exist for each (guid|name) combination and character counts on TEXT values (not sure if this is practical). This may let us decide to leave some data out of the auto-populate cache to save memory. Just being paranoid here.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42426260-09-10

Latest [https://github.com/mrclay/Elgg-leaf/commit/e43dee3ebed42b502488c99851e82b6ccff0ffed commit] has function to populate cache based on GUIDs. Only needs two queries, but neither has a LIMIT and the first uses a GROUP BY to count the bytes of all the values in each entity to ensure it isn't over (arbitrary for now) 1MB. Tests included.

if this looks good we can figure out what in what situations this function should be called and whether directly/by hook.

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42426395-05-15

Thanks for the work. Left some comments on the commit. Did I send a big long message to you explaining my thoughts on the issues you raised? I don't see it in this thread...

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42431651-12-29

The tl;dr is that the memory usage here is almost certainly trivial, not something we should worry about until we have hard evidence there is a problem.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42433622-02-10

Great, I'll refactor this a bit based on your suggestions. Thoughts on where/how to call what will be elgg_get_metadata_cache()->populateFromEntities()? In the short term it could go in elgg_get_entities and get_entity (if not served by entity cache).

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42434942-05-12

Unless I'm missing something, [https://github.com/mrclay/Elgg-leaf/compare/Elgg:1.8...mrclay:4290-md-cache-2 this branch] completes this ticket. Cache has its own test case, is run with the core units (all passing), global helper functions moved into cache class, some tiny bugs fixed. Cache is integrated in elgg_get_entities. In ElggEntity::getMetadata on a cache miss I go ahead and populate the cache for the whole entity and try again. This should nicely cover any entities that aren't loaded via elgg_get_entities and only when metadata is actually needed.

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42435926-04-19

Took a look at your branch. Made a few extremely minor comments. I assume you've also poked around on a test site just as a sanity check? If everything looks like it's good to go, I'll be very pleased to pull this into 1.9.

Also, just for the record, I found that post I was referring to:


tl;dr: I'm pretty strongly convinced that the extra memory usage here will be trivial.

I had this same concern at first but after thinking through the numbers, I'm not really convinced this is an issue for 99% of Elgg sites. The entire Elgg community has a metastrings table that is ~50MB. The average size of a metastring is ~214 bytes. But I'm betting you that's from long forum posts and comments which should be entities. So the average is going to plummet once fewer metadata fields are actually freetext. But let's just say 5 pieces of metadata take up ~1KB. That means you could load 5000 of those on a page and still be only using ~1MB of memory. That would require a page of 250 entities with 20 metadata each. In my experience, a bare-bones Elgg page uses about 12MB. For the potential performance gains, this seems like a no-brainer tradeoff.

We can continue to refine our implementation to make it as memory-efficient as possible, but my thought is that if it provides more responsiveness for end users, then everyone will be praising us for making the right choice and we can deal with any edge-case too-much-memory-usage problems later. Note that nothing stops us from implementing this naive solution now and refining it when/if it actually becomes a problem.

Finally, for those few sites that might actually be negatively impacted, we can always provide an opt-out configuration option in settings.php or something.

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.9.0 by ewinslow on 42435926-04-19

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42491530-02-23

Rebased to 1.9 #285

@elgg-gitbot
Copy link
Author

trac user Steve Clay wrote on 42595471-01-05

Fixes #4290: adds volatile metadata cache, unit tests, and pre-loading for fetched entities
Changeset: 0727083

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42678062-02-06

Fresh PR for 1.8: #377

Should I reopen?

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.9 by trac user mrclay on 42678062-02-06

@elgg-gitbot
Copy link
Author

trac user Steve Clay wrote on 42776742-07-07

Fixes #4290: adds volatile metadata cache, unit tests, and pre-loading for fetched entities
Changeset: 61274b8

@elgg-gitbot
Copy link
Author

trac user Evan Winslow wrote on 42776742-07-19

Merge pull request #377 from mrclay/4290-1.8

Fixes #4290: adds preloading volatile metadata cache and unit tests
Changeset: 7239da3

brettp pushed a commit to brettp/Elgg that referenced this issue Feb 21, 2013
brettp pushed a commit to brettp/Elgg that referenced this issue Feb 21, 2013
Fixes Elgg#4290: adds preloading volatile metadata cache and unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant