We have moved to Github. Please open tickets there.

Opened 11 months ago

Last modified 4 months ago

#4594 new Enhancement

Remove DISTINCT modificator from default scenario

Reported by: srokap Owned by:
Priority: normal Milestone: Needs Review
Component: Core Version: 1.8
Severity: major Keywords:
Cc: brett@… Difficulty:

Description

Prevously discuissed here: https://docs.google.com/document/d/1NrxIj4YOTjNbeXDGW3tpz2lNvaRwL2NDPBNd7TgRfFk/edit?disco=AAAAAEr6svk

For properly written queries we shouldn't have to use DISTINCT. Currently we have some stupid joins (see elgg_get_entity_metadata_where_sql n_table join) which make it necessary to use. The least we could do is to parametrize appliance of DISTINCT to allow optimizing queries. I'd love to see NOT using DISTINCT as default policy.

Change History (15)

comment:1 Changed 7 months ago by srokap

This pull request introduces 'distinct' parameter with default policy as true. Additionally it applies several uses in places where I can't imagine breaking anything and where 'Using temporary' can be avoided.

comment:3 Changed 6 months ago by mrclay

I really want this in internals but I'd not like to expose this option in the API docs.

Needing private options for public API functions probably means we need to refactor, but in the short term I'd like to see the key named something like __distinct and a note about what it does in a docblock annotation that doesn't get output in the docs.

@brettp thoughts?

comment:4 Changed 6 months ago by srokap

  • Severity changed from minor to major

@mrclay I don't see why not to expose it. If we allow group_by parameter already, it's not very different. Of course you can hurt yourself by not using distinct when necessary, but

  • these situations should be occasional
  • using distinct with join in 90% cases will create temporary table, if we create duplicates by accident (bad join condition), this temp table may become huge and yet we have no apparent feedback that something is wrong
  • as above, if we allow group_by, why not distinct?

As I said, I think that usage of sql DISTINCT modifier should be something exceptional and not used by default (meaning 'distinct' => false as default). I'm changing severity since IMHO it's one of big Elgg scaling problems when dealing with large DB.

comment:5 follow-up: Changed 6 months ago by mrclay

Agreed on severity. My opinion is that adding a public API option for this would break BC with 1.8.0, hence this could not go in 1.8.10, but I'm open to input from other core devs.

Are there any cases where a plugin would WANT to receive a list of entities with duplicates? If not, we could just remove the DUPLICATE and filter out duplicates in the query result set (in PHP).

comment:6 in reply to: ↑ 5 ; follow-up: Changed 6 months ago by srokap

Replying to mrclay:

Agreed on severity. My opinion is that adding a public API option for this would break BC with 1.8.0, hence this could not go in 1.8.10, but I'm open to input from other core devs.

Are there any cases where a plugin would WANT to receive a list of entities with duplicates? If not, we could just remove the DUPLICATE and filter out duplicates in the query result set (in PHP).

I think want to filter duplicates, we may do it in PHP as BC and issue warnig. In 90% duplicates for entities, metadata, relationships and annotations doesn't make much sense. But imagine that we want to list downloads cout for premium users and normal users separately and possibly display 2 positions on listing for the same entity with different counters. Question is if we want to support such cases? This might complicate lots of things.

Very customized SQLs using get_data but not returning core objects are not a concern here.

I think version "default not distinct" most likely has to go to 1.9 due to BC concerns, but version "default distinct" is IMHO completely fine for 1.8.x.

I'm also worried if this duplicates filtering in PHP won't create significant load. We shouldn't introduce it without issuing any warning on unexpected duplicates being found.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 6 months ago by mrclay

Replying to srokap:

Very customized SQLs using get_data but not returning core objects are not a concern here.

Agreed. For 1.8 only bring new behavior if user is expecting entities out (callback is entity_row_to_elggstar).

I'm also worried if this duplicates filtering in PHP won't create significant load.

Even result sets of 100s of entities are pretty small memory structures. Using arrays as hash maps (GUID as key) and isset() makes detecting duplicates easy (not evan a function call).

comment:8 in reply to: ↑ 7 ; follow-up: Changed 6 months ago by srokap

Replying to mrclay:

I'm also worried if this duplicates filtering in PHP won't create significant load.

Even result sets of 100s of entities are pretty small memory structures. Using arrays as hash maps (GUID as key) and isset() makes detecting duplicates easy (not evan a function call).

From my experience we may expect 20-30 times more elements for bad joins (i.e. join entity to metadata without additional filtering). Making proper pagination and entities counting may be huge problem to overcome. How to fetch 100th page of result sets without taking previous duplicates into account? I don't see how we can do it without falling back to distinct after some smart guess that we may have duplicates...

comment:9 in reply to: ↑ 8 Changed 6 months ago by mrclay

Replying to srokap:

From my experience we may expect 20-30 times more elements for bad joins (i.e. join entity to metadata without additional filtering).

Yikes. Yes, that would be completely unworkable. I see why warnings would be required. Need more input on this before we proceed.

comment:10 Changed 6 months ago by srokap

Anyway introducing "distinct by default" seem to be safe, since you explicitly remove distinct cast, so you're expected to know what you're doing. I would treat it as plan minimum.

comment:11 Changed 6 months ago by mrclay

Can you give a specific list of what you think we should do for 1.8.x? That would help other core devs digest this ticket.

comment:12 Changed 6 months ago by brettp

We can't change the API for 1.8.X. Adding an option 'distinct' to ege*() means you could write a plugin that's compatible with 1.8.10 but not 1.8.9. We don't want that. This also applies to making a change and then issuing warnings. I should be able to run the same code in 1.8.9 and .10 and not receive warnings.

With all of that, this sounds like 1.9 work to me.

comment:13 Changed 6 months ago by mrclay

I'd be open to this in 1.8 in a couple ways:

  1. An undocumented option named something like __dontuse_distinct with default true.
  2. A non-API function _elgg_set_no_distinct() which turned on a flag to remove DISTINCT for the next ege() call.

And these would only be used by core lib functions that don't allow arbitrary $options arguments (we could guarantee no JOINs).

What about choosing to use DISTINCT based purely on $options passed in?

comment:14 Changed 6 months ago by mrclay

Re: the 2nd option above, I think in a procedural API this behind-the-scenes info passing is OK. Indeed if this were an OO API we'd have ways to pass in info via private/protected vars, etc.

comment:15 Changed 4 months ago by mrclay

Assuming we can move this in 1.9...

  • Do we absolutely need to expose this as an API option?
  • If so, how do we adequately document it so devs know when to use it?
Note: See TracTickets for help on using tickets.