I am running a MySQL server on a Mac Pro with 64 GB of RAM and 6 cores. Table1 in my schema has 330 million rows; Table2 has 65,000 rows. (I also have several other tables with a combined total of about 1.5 billion rows, but they are not used in the operation I am attempting, so I don't think they are relevant.)
I am trying to do what I would have thought was a relatively simple update to bring some data from Table2 into Table1. However, MySQL blows through my system RAM, forces the machine into swap, and eventually freezes the whole system so that MySQL becomes unresponsive and I need to restart the computer. My update statement is as follows:
UPDATE Table1, Table2
SET
Table1.Column1 = Table2.Column1,
Table1.Column2 = Table2.Column2,
Table1.Column3 = Table2.Column3,
Table1.Column4 = Table2.Column4
WHERE
(Table1.Column5 = Table2.Column5) AND
(Table1.Column6 = Table2.Column6) AND
(Table1.Column7 = Table2.Column7) AND
(Table1.id between 0 AND 5000000);
Ultimately, I want to perform this update for all 330 million rows in Table1. I decided to break it up into batches of 5 million rows each (roughly along the lines of the loop sketched below), because
- (a) I was getting errors about exceeding the lock table size, and
- (b) I thought it might help with the problem of blowing through RAM.
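In case it is useful, the batching I am doing is roughly equivalent to the loop below (written here as a stored procedure with made-up names purely for illustration; in practice I have been issuing the batches by hand):
-- Minimal sketch of the batching, with placeholder names (batch_update, lo, batch).
-- Each batch is the UPDATE above with the id window shifted, committed before moving on.
DELIMITER //
CREATE PROCEDURE batch_update()
BEGIN
    DECLARE lo BIGINT DEFAULT 0;
    DECLARE batch BIGINT DEFAULT 5000000;
    WHILE lo < 330000000 DO
        UPDATE Table1, Table2
        SET
            Table1.Column1 = Table2.Column1,
            Table1.Column2 = Table2.Column2,
            Table1.Column3 = Table2.Column3,
            Table1.Column4 = Table2.Column4
        WHERE
            (Table1.Column5 = Table2.Column5) AND
            (Table1.Column6 = Table2.Column6) AND
            (Table1.Column7 = Table2.Column7) AND
            (Table1.id BETWEEN lo AND lo + batch - 1);
        COMMIT;  -- keep each batch in its own transaction
        SET lo = lo + batch;
    END WHILE;
END //
DELIMITER ;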
Here are some more relevant details about the situation:
- I have created indexes on both Table1 and Table2 over the combination of Column5, Column6, and Column7 (the columns whose values I am matching on); the definitions are shown after this list.
- Table1 has 50 columns and is about 60 GB total.
- Table2 has 8 columns and is 3.5 MB total.
- I know that some people might recommend foreign keys in this situation, rather than updating Table1 with info from Table2, but (a) I have plenty of disk space and don't really care about using it with maximum efficiency, (b) none of the values in any of these tables will change over time, and (c) I am most concerned about the speed of queries run on Table1, and if it takes this long to get info from Table2 into Table1, I certainly don't want to repeat the process for every query I run on Table1.
- In response to the problem of exceeding the maximum lock table size, I have experimented with increasing innodb_buffer_pool_size, trying a number of values. Even at something as low as 8 GB (i.e. 1/8 of the machine's RAM, while I run almost nothing else on it), the mysqld process still ends up using basically all of the RAM available on the system and then starts pulling memory from the operating system (kernel_task shows up using around 30 GB of RAM, whereas it usually sits around 2 GB).
- The maximum-lock-table-size problem seems to have been largely resolved; I no longer get that error, though maybe that's just because memory is now exhausted and the system crashes before I can get there.
- I've experimented with smaller batch sizes (1 million rows, 100,000 rows). These seem to work a bit better than the 5 million row batches, but they still run into the same problems, just a bit more slowly. Performance also seems terrible: at the rate the 100,000-row batches were going, the full update would have taken about 7 days.
- The tables both use InnoDB.
- I generally start sessions with SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; although I don't know whether it actually helps (I am the only user accessing this DB in any way, so I don't really care about locking and would do away with it entirely if I could).
- I see a lot of variability in how long batches take. For instance, the 1-million-row batches took anywhere from 45 seconds to 20 minutes.
- When I tried something that just found the matching rows and wrote only two column values for them into a new table, I got much more consistent times (about 2.5 minutes per million rows); that query is sketched after this list. So it seems my problems may stem from the fact that I am updating the same table I am matching on, even though the columns I am updating are different from the ones I am matching on.
- The columns that I am matching on and updating contain only INT and CHAR types, none longer than 7 characters.
- I ran a CHECK TABLE diagnostic and it came back ok.
- Overall, I am tremendously perplexed as to why this is so difficult. I am new to MySQL and to databases in general. Since Table2 is so small, I believe I could accomplish the same task much faster in Python with a dictionary lookup. I would have thought, though, that databases would handle this better, since handling and updating big datasets is what they are designed for.
- I ran some diagnostics on the queries using MySQL Workbench and confirmed that no full table scans are being performed (the check I looked at is shown after this list).
- It really seems something must be going wrong here. The system has 64 GB of RAM, which is more than the size of the two tables combined (though counting indexes the total is a bit over 64 GB), and the operation only touches 5 million of the 330 million rows at a time, so it just doesn't make sense that it should blow out the RAM.
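For concreteness, the indexes mentioned above were created along these lines (the index names here are made up):
-- Composite indexes over the three matching columns, one per table.
CREATE INDEX idx_t1_match ON Table1 (Column5, Column6, Column7);
CREATE INDEX idx_t2_match ON Table2 (Column5, Column6, Column7);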
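The new-table comparison I mentioned was along these lines (NewTable, its column types, and the choice of columns are just illustrative):
-- Copy two matched values into a separate table instead of updating Table1 in place.
-- Table name and column types here are guesses, for illustration only.
CREATE TABLE NewTable (
    id      BIGINT NOT NULL PRIMARY KEY,
    Column1 INT
) ENGINE=InnoDB;

INSERT INTO NewTable (id, Column1)
SELECT Table1.id, Table2.Column1
FROM Table1
JOIN Table2
  ON  (Table1.Column5 = Table2.Column5)
  AND (Table1.Column6 = Table2.Column6)
  AND (Table1.Column7 = Table2.Column7)
WHERE Table1.id BETWEEN 0 AND 1000000;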
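As for the full-table-scan check, what I looked at in Workbench was the explain output for the equivalent SELECT, roughly:
-- EXPLAIN on the SELECT form of the match (easy to inspect in Workbench);
-- the plan showed the composite indexes being used rather than full scans.
EXPLAIN
SELECT Table1.id
FROM Table1
JOIN Table2
  ON  (Table1.Column5 = Table2.Column5)
  AND (Table1.Column6 = Table2.Column6)
  AND (Table1.Column7 = Table2.Column7)
WHERE Table1.id BETWEEN 0 AND 5000000;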
Therefore, I am wondering:
- Is the syntax of my update statement somehow so bad and inefficient that it explains the horrible performance and problems?
- Are there parameters beyond innodb_buffer_pool_size that I should be configuring, either to put a firmer cap on the RAM MySQL uses or to get it to use resources more effectively?
- Are there other sorts of diagnostics I should be running to try to detect problems with my tables, schema, etc.?
- What is a "reasonable" amount of time to expect an update like this to take?