[Solved] Join information from one table to another without joining twice cross apply T-SQL

EverSQL Database Performance Knowledge Base

Join information from one table to another without joining twice cross apply T-SQL

Situation:

I have two tables. #t1 has logins and emails. #t2 has country associated to each emails. I would like to join the information from #t2 to #t1 without having to join it twice. Joining it only once either in the inner or outer query would break the cross apply logic. My current query uses a cross apply to get rolling information as such (fiddle data below):

SELECT DISTINCT CAST(logins AS DATE) AS Dates,
    count(distinct d.email) AS DAU,
    count(distinct m.MAU) AS MAU
FROM #t1 d
CROSS APPLY (
            SELECT CAST(m.logins as date) as dates, m.email AS MAU
            FROM #t1 m
            WHERE m.logins BETWEEN d.logins and DATEADD(dd, 30, d.logins) 
            ) m
group by CAST(logins as date)

The only way i found to join the two tables without having to break my cross apply was to inner join it in both the outer and inner query which is probably wrong but at least the output is correct. I do that so i can add my second condition in my where statement in the inner query. when i apply the logic to my actual table, the performance is dreadful(fiddle data below):

SELECT distinct CASt(logins AS DATE) AS Dates,
    #t2.country,
    count(distinct d.email) AS DAU,
    count(distinct m.MAU) AS MAU
FROM #t1 d
inner join #t2 on d.email=#t2.email 
CROSS APPLY (
            SELECT cast(m.logins as date) as dates, m.email AS MAU, country.country AS country
            FROM #t1 m
            inner join #t2 country on m.email=country.email         
            WHERE m.logins BETWEEN d.logins and DATEADD(dd, 30, d.logins)  
            and #t2.country = country.country
            ) m
group by cast(logins as date), #t2.country


+-------------+---------+-----+-----+
|    Dates    | country | DAU | MAU |
+-------------+---------+-----+-----+
|  2019-04-01 | france  |   1 |   2 |
|  2019-04-02 | france  |   1 |   2 |
|  2019-04-03 | france  |   1 |   2 |
|  2019-04-10 | france  |   1 |   1 |
|  2019-04-03 | italie  |   2 |   2 |
+-------------+---------+-----+-----+

Objective:

How can i find a way to join information from one table to another without having to join it twice. (fiddle data below)

The result should look like this (output from the second query above):

Fiddle:

create table #t1 (email varchar(max), logins datetime)
insert into #t1 values 
('[email protected]', '2019-04-01 00:00:00.000'),
('[email protected]', '2019-04-02 00:00:00.000'), 
('[email protected]', '2019-04-03 00:00:00.000'), 
('[email protected]', '2019-04-10 00:00:00.000'), 

('[email protected]', '2019-04-03 00:00:00.000'), 
('[email protected]', '2019-04-03 00:00:00.000'), 
('[email protected]', '2019-04-03 00:00:00.000')  

create table #t2 (country varchar(max), email varchar(max))
insert into #t2 values 
('france', '[email protected]'),
('france', '[email protected]'),
('italie', '[email protected]'),
('italie', '[email protected]')

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Calling Functions With Indexed Columns (query line: 12): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `dd` is indexed, the index won’t be used as it’s wrapped with the function `DATEADD`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  2. Avoid Calling Functions With Indexed Columns (query line: 12): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `logins` is indexed, the index won’t be used as it’s wrapped with the function `DATEADD`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  3. Avoid Correlated Subqueries (query line: 6): A correlated subquery is a subquery that contains a reference (column: logins) to a table that also appears in the outer query. Usually correlated queries can be rewritten with a join clause, which is the best practice. The database optimizer handles joins much better than correlated subqueries. Therefore, rephrasing the query with a join will allow the optimizer to use the most efficient execution plan for the query.
  4. Explicitly ORDER BY After GROUP BY (modified query below): By default, the database sorts all 'GROUP BY col1, col2, ...' queries as if you specified 'ORDER BY col1, col2, ...' in the query as well. If a query includes a GROUP BY clause but you want to avoid the overhead of sorting the result, you can suppress sorting by specifying 'ORDER BY NULL'.
The optimized query:
SELECT
        DISTINCT CAST(logins AS DATE) AS Dates,
        count(DISTINCT d.email) AS DAU,
        count(DISTINCT m.MAU) AS MAU 
    FROM
        #t1 d CROSS APPLY (SELECT
            CAST(m.logins AS date) AS dates,
            m.email AS MAU 
        FROM
            #t1 m 
        WHERE
            m.logins BETWEEN d.logins AND DATEADD(m.dd, 30, d.logins)) m 
    GROUP BY
        CAST(logins AS date) 
    ORDER BY
        NULL

Related Articles



* original question posted on StackOverflow here.