I thought InnoDB page size was always 16k ..

Let me just start off this post saying if you’re not interested in InnoDB tablespace internals, then this might not be the post for you. :)

At any rate, whilst examining individual tablespace files (the .ibd file created when running with innodb_file_per_table option enabled) in their binary format, I noticed the initial page size for a individual tablespace did not appear to be 16k (as the default InnoDB page size is 16k).

Upon examining the actual binary data stored in the data file (just create a basic table and insert 10 rows), you can see data is written at the very beginning of the file (position 0).

One can also see data appearing at several (4) other places in this file.

I was trying to sync up what I was seeing with these InnoDB internals:


Which is an excellent source of information on this topic, btw. It’s just that everything was not quite matching up, though some portions were proving accurate.

At any rate, and to make a long story shorter, I noticed there was a pattern to the 5 positions where data was located, which happened to be:

0  3FF8  7FF8  BFF8  FFF8 (and the last data byte was written to space FFFF)

Pretty “convenient” values, I’d say ..

I knew the initial data at position 0 was related to some overall info, checksums, offsets, tablespace id (I identified this quickly due to previous work done with the ibd files – and see new work on this topic by Aleksandr Kuzminsky for those interested), etc.

I knew the FIL (file) header, page header, and data rows themselves began at BFF8 (as I could see actual row data using a binary reader, and then the FIL and PAGE headers matched up according to the InnoDB internals page).

And I could deduce the value from FFF8 to FFFF was the FIL trailer (again based on InnoDB internals, and the fact the last 4 bytes matched what was in the FIL header) and there are 6 bytes (3 2-byte values) directly preceding it, which is the Page Directory (based on the value of PAGE_N_DIR_SLOTS, also originating from the header field). So this was the end of this initial page.

Thus, this initial page, by all accounts, was 64k and not 16k.

Well, I got to searching the source, and found this in dict0crea.c …

if (srv_file_per_table) {
	/* We create a new single-table tablespace for the table.
	We initially let it be 4 pages:
	- page 0 is the fsp header and an extent descriptor page,
	- page 1 is an ibuf bitmap page,	
	- page 2 is the first inode page,
	- page 3 will contain the root of the clustered index of the
	table we create here. */

Aha! That explains it.

Although, I suppose it wouldn’t have hurt to have just started looking there in the first. ;)

Btw, and fwiw, note that this ‘4’ pages is configurable in the code (just change the definition in fil0fil.h):


Well, that answers that, and now you see how I like to spend my free time … ;)

And for those interested, in additon to the InnoDB internals page listed above, I also found the following sources of information to be helpful with various bits of tablespace details:


And don’t forget the source code (enough for a great start anyways):


IOPS, innodb_io_capacity, and the InnoDB Plugin

In the InnoDB plugin, a new variable was added named innodb_io_capacity, which controls the maximum number of I/O operations per second that InnoDB will perform (which includes the flushing rate of dirty pages as well as the insert buffer (ibuf) batch size).

First off, let me just say this is a welcome addition (an addition provided by the Google Team, fwiw).

However, before this was configurable, the internal hard-coded value for this was 100. But when this became configurable, the default was increased to 200.

For many systems, this is not an issue (i.e., the overall system can perform 200 IOPS).

However, there are still many disks (which is often the bottleneck for IOPS) out there that are not quite at 200 IOPS. Here is a rough estimate based on some external resources which have already documented IOPS for various disk drives (*see links at end for further reference):

15k rpm: 180-210 IOPS
10k rpm: 130-150 IOPS
7200 rpm: 80-100 IOPS
5400 rpm: 50-80 IOPS

So if you are running with a single disk, no raid, and it’s a slower disk (say 5,400 rpm – 10,000 rpm range), then this default value of 200 might be too high for you.

In which case, reverting back to 100 should help (note you cannot reduce this lower than 100). The manual mentions reverting back to 100 to achieve the “old behavior”. However, it neglects to explicitly mention why this might be necessary.

Also, since this option controls the insert buffer batch size, the number of merges per background I/O loop also doubles (5% of innodb_io_capacity, or 5% of 100 in non-plugin).

What this means is that double the writes could be occurring, leaving you with a 1:1 insert to merge ratio, hence much more pages could be written to the secondary indexes, thus hurting performance (in this specific case obvious due to slave lag) and undermining the purpose of the insert buffer in the first place. See bug #48341 for more details on this specific case, as well as this following post from the Facebook Team.

And for those interested in even more details concerning innodb_io_capacity, James and Dimitri both have some excellent comments at the following 2 links, respectively:


*Basis for IOPS general measurements (only to be used as general guidelines):