SQL Query Conundrum…

I have a brain teaser for ya.. I’m looking for a way to solve a SQL problem efficiently, specifically using MySQL. The goal is to get a count of the number of unique rows returned for a complex query. It’s actually for a pagination system so I can determine the limits necessary to efficiently query the database for the right amount of data rather than return everything and try to brute force it.

Let’s say I have three tables as follows :

mysql> describe person;
+——-+——————+——+—–+———+——-+
| Field | Type             | Null | Key | Default | Extra |
+——-+——————+——+—–+———+——-+
| id    | int(10) unsigned | NO   | PRI | NULL    |       |
| first | char(15)         | YES  |     | NULL    |       |
| last  | char(15)         | YES  |     | NULL    |       |
+——-+——————+——+—–+———+——-+
3 rows in set (0.02 sec)

mysql> describe interests;
+———-+——————+——+—–+———+——-+
| Field    | Type             | Null | Key | Default | Extra |
+———-+——————+——+—–+———+——-+
| id       | int(10) unsigned | NO   | PRI | NULL    |       |
| interest | char(15)         | YES  |     | NULL    |       |
+———-+——————+——+—–+———+——-+
2 rows in set (0.00 sec)

mysql> describe interest_link;
+————-+——————+——+—–+———+——-+
| Field       | Type             | Null | Key | Default | Extra |
+————-+——————+——+—–+———+——-+
| person_id   | int(10) unsigned | NO   |     | NULL    |       |
| interest_id | int(10) unsigned | NO   |     | NULL    |       |
+————-+——————+——+—–+———+——-+
2 rows in set (0.00 sec)

Simple enough. I’m mapping interests to people. I’ve entered data into these tables as follows :

mysql> select * from person;
+—-+——-+———-+
| id | first | last     |
+—-+——-+———-+
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
+—-+——-+———-+
3 rows in set (0.00 sec)

mysql> select * from interests;
+—-+———–+
| id | interest  |
+—-+———–+
|  1 | Computers |
|  2 | Music     |
|  3 | Food      |
|  4 | Beer      |
|  5 | Gaming    |
+—-+———–+
5 rows in set (0.00 sec)

mysql> select * from interest_link;
+———–+————-+
| person_id | interest_id |
+———–+————-+
|         1 |           1 |
|         1 |           2 |
|         1 |           4 |
|         2 |           1 |
|         2 |           5 |
|         2 |           4 |
|         3 |           3 |
|         3 |           2 |
|         3 |           4 |
+———–+————-+
9 rows in set (0.00 sec)

So far, so good. Now, I want to do a search to find which users are interested in music. Simple enough search, I’d do this with a simple select statement as follows :

mysql> select * from person as p left join interest_link as il on il.person_id = p.id where interest_id = 2;
+—-+——-+——–+———–+————-+
| id | first | last   | person_id | interest_id |
+—-+——-+——–+———–+————-+
|  1 | John  | Doe    |         1 |           2 |
|  3 | Joe   | Smith  |         3 |           2 |
+—-+——-+——–+———–+————-+
2 rows in set (0.00 sec)

But what if I want to find out who’s interested in music *and* beer?

mysql> select * from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+—-+——-+———-+———–+————-+
| id | first | last     | person_id | interest_id |
+—-+——-+———-+———–+————-+
|  1 | John  | Doe      |         1 |           2 |
|  1 | John  | Doe      |         1 |           4 |
|  2 | Bob   | Jones    |         2 |           4 |
|  3 | Joe   | Smith    |         3 |           2 |
|  3 | Joe   | Smith    |         3 |           4 |
+—-+——-+———-+———–+————-+
5 rows in set (0.00 sec)

That’s a problem, now I have 5 rows.. How do I make this a unique list? Well, I’m merely interested in names and ids, so I can do this :

mysql> select p.id, p.first, p.last from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+—-+——-+———-+
| id | first | last     |
+—-+——-+———-+
|  1 | John  | Doe      |
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
|  3 | Joe   | Smith    |
+—-+——-+———-+
5 rows in set (0.00 sec)

but that’s still 5 rows.. so what now?

mysql> select distinct p.id, p.first, p.last from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+—-+——-+———-+
| id | first | last     |
+—-+——-+———-+
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
+—-+——-+———-+
3 rows in set (0.00 sec)

Aha! perfect. That’s what I need.. almost. For this particular application, I want to paginate, so I need a total number of matching rows so I can properly identify the limits as well as the upper bound on page numbers. So, I’ll just replace the specific field names with a count(*) :

mysql> select distinct count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+———-+
| count(*) |
+———-+
|        5 |
+———-+
1 row in set (0.00 sec)

And here is where I’m stuck. I need the total count of DISTINCT names, not the total number of rows returned. I tried a GROUP BY, but that didn’t help much :

mysql> select count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4) group by p.id;
+———-+
| count(*) |
+———-+
|        2 |
|        1 |
|        2 |
+———-+
3 rows in set (0.00 sec)

Sure, I get 3 rows, but what I’m looking for here is a single row with the total number of items. … So, what if I count the number of returned rows! :

mysql> select count(*) from (select count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4) group by p.id) as foo;
+———-+
| count(*) |
+———-+
|        3 |
+———-+
1 row in set (0.00 sec)

BUT… at what cost? This seems like a rather complex query that might break down, significantly, when there’s a lot of data.. And the examples above are rather simplistic. In reality, we’re talking about more fields and more tables, so the simpler query gets a little complex to begin with. I’m open to ideas on how to do this properly via SQL. Yes, I am aware of indexing and how that speeds things up. I use indexing, I just eliminated it from the above example to simplify things. I’m open to ideas on how to do this properly via SQL.

I can simply return all the rows with the distinct clause, count them programmatically, and then proceed with the rest of the program, but depending on the selections made by the user, there could be a significant amount of data returned. I’m worried about both memory exhaustion on the part of the scripting language, as well as the processing and transmission time required to pass all of that data back to the program from the SQL database. Besides, this is the sort of problem that SQL was designed to solve.

I don’t think this is a unique problem, so someone out there has a solution. Perhaps the subselect *is* the better solution, but I don’t think so. I’m open to ideas. You can leave a comment here, or hit me up on twitter.

 

Leave a Reply

Your email address will not be published. Required fields are marked *