[Solved] MySQL query that uses HAVING is limiting results incorrectly

EverSQL Database Performance Knowledge Base

MySQL query that uses HAVING is limiting results incorrectly

Database type:

I have an advanced search form that offers a bunch of ways to filter your search. Here's a simplified idea (doesn't include keyword text input or date range searches or other select menus):

Topic: 
<select><option>any</option><option>all</option></select>
[] Aging
[] Environment
[] Health
[] Hunger
[] Poverty

Document type: 
<select><option>any</option><option>all</option></select>
[] Case Study
[] Policy Brief
[] Whitepaper

If someone selects "any" when they choose more than one topic or document type, the query needs to include, eg., topic = "Aging" OR topic = "Health".

If someone selects "all" when they choose more than one topic or document type, the query needs to include, eg., topic = "Aging" AND topic = "Health".

We default to "AND" between these different filters. So when you search for all documents categorized under Aging and all documents categorized as a whitepaper, the query is: topic = "Aging" AND doctype = "whitepaper".

The Problem: We have a query that is working when the search is for "any". But when the search is for "all", according to MySQL's "EXPLAIN" command, we have an "impossible WHERE". :(

Here is the query that works when someone selects "any" for both topic and document type:

SELECT 
DISTINCT * 
FROM research 
JOIN link_resource_doctype ON link_resource_doctype.resource_id = research.research_id 
JOIN doctype ON doctype.id = link_resource_doctype.doctype_id 
JOIN link_resource_issue_area ON link_resource_issue_area.resource_id = research.research_id 
JOIN issue_area ON issue_area.id = link_resource_issue_area.issue_area_id 
WHERE approved = '1' 
AND (doctype.identifier = 'case_study' OR doctype.identifier = 'whitepaper') 
AND (issue_area.identifier = 'aging' OR issue_area.identifier = 'health')

And here's the same query which does not work when someone selects "all" for both topic and document type (this also doesn't work if someone selects just topic or just document type):

SELECT 
DISTINCT * 
FROM research 
JOIN link_resource_doctype ON link_resource_doctype.resource_id = research.research_id 
JOIN doctype ON doctype.id = link_resource_doctype.doctype_id 
JOIN link_resource_issue_area ON link_resource_issue_area.resource_id = research.research_id 
JOIN issue_area ON issue_area.id = link_resource_issue_area.issue_area_id 
WHERE approved = '1' 
AND (doctype.identifier = 'case_study' AND doctype.identifier = 'whitepaper') 
AND (issue_area.identifier = 'aging' AND issue_area.identifier = 'health')

Possible solution but there's a problem: I came across this post on Stackoverflow -- Select row belonging to multiple categories -- which contains a query that I think might solve our problem when someone selects "all". Here it is:

SELECT 
DISTINCT * 
FROM research 
JOIN link_issue_area ON link_issue_area.resource_id = research.research_id 
JOIN link_doctype ON link_doctype.resource_id = research.research_id 
WHERE issue_area.identifier IN ('aging', 'health')
AND 
doctype_id.identifier IN ('case_study', 'whitepaper')
GROUP BY research.research_id
HAVING COUNT(DISTINCT issue_area.identifier) = 2 
AND 
COUNT(DISTINCT doctype.identifier) = 2

The Problem: This query does seem to work for either "any" or "all", except for one problem. Say that a document is categorized under Aging, Health, and Poverty but the person searching only checked off Aging and Health. The document that is categorized under the two topics that are checked off and also Poverty, which wasn't checked, will not appear in the search result list. I think this is because of the HAVING COUNT (DISTINCT issue_area.identifier) = 2 -- the 2 excludes any document that actually has a COUNT that is more than 2. Is there a work-around for this? Or a better query to use here?

Any insight, ideas, assistance much appreciated! Thanks!

Here's an SQLfiddle that gets at all of this too: http://sqlfiddle.com/#!2/847362/1

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Selecting Unnecessary Columns (query line: 2): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
  2. Prefer IN Clause Over OR Conditions (modified query below): Using an IN clause is far more efficient than OR conditions, when comparing a column to more than one optional values. When using an IN clause, the database sorts the list of values and uses a quick binary search.
  3. Use Numeric Column Types For Numeric Values (query line: 18): Referencing a numeric value (e.g. 1) as a string in a WHERE clause might result in poor performance. Possible impacts of storing numbers as varchars: more space will be used, you won't be able to perform arithmetic operations, the data won't be self-validated, aggregation functions like SUM won't work, the output may sort incorrectly and more. If the column is numeric, remove the quotes from the constant value, to make sure a numeric comparison is done.
The optimized query:
SELECT
        DISTINCT * 
    FROM
        research 
    JOIN
        link_resource_doctype 
            ON link_resource_doctype.resource_id = research.research_id 
    JOIN
        doctype 
            ON doctype.id = link_resource_doctype.doctype_id 
    JOIN
        link_resource_issue_area 
            ON link_resource_issue_area.resource_id = research.research_id 
    JOIN
        issue_area 
            ON issue_area.id = link_resource_issue_area.issue_area_id 
    WHERE
        approved = '1' 
        AND (
            doctype.identifier IN (
                'case_study', 'whitepaper'
            )
        ) 
        AND (
            issue_area.identifier IN (
                'aging', 'health'
            )
        )

Related Articles



* original question posted on StackOverflow here.