- ChatGPT really sucks at SQL - August 9, 2024
- We’ve launched our newsletter again - July 24, 2024
- Full Hello Elementor and Elementor Pro FSE compatibility added for Super Speedy Filters - May 30, 2024
If you have a massive website and you’re using any sitemap plugin, you’ve probably found that generating the sitemap kills the performance of your website.
In this article, I compare and optimise 3 different XML sitemap plugins. The plugins optimised in this article include:
- Yoast SEO (the XML sitemap functionality part)
- Google XML Sitemaps
- Better WordPress Google XML Sitemaps
I didn’t bother with the others, largely because they look like they don’t even attempt to be usable on larger websites.
Table of Contents
- What a difference an optimisation makes
- Yoast SEO XML Sitemaps performance
- Yoast XML sitemaps functionality is fundamentally difficult to optimise
- Summary for Yoast
- Google XML Sitemaps plugin performance
- Summary for Google XML Sitemaps
- Better Wordpress Google XML Sitemaps performance
- Additional useful options in the Better WP Google XML Sitemaps plugin
- Summary
What a difference an optimisation makes
These optimisations were performed for a client as part of our Performance Analysis service. This client has about 5 million posts in their wp_posts table. About 1.4 million products and about 4.5 million product variations.
As way of an example of the kind of boosts you can expect, or the kind of slowness you can expect, depending on how you look at it, the Yoast SEO plugin, in order to grab just 100 items for the sitemap (it builds it in pieces) takes 4 minutes and 13 seconds on this site.
4 minutes 13 seconds can be described in seconds as 253 seconds. Once my optimisations were implemented, this time dropped to 0.89 seconds which is a 284-fold speed boost.
You might notice above a missing statement – part 1 of my optimisation involves removing the != (not equals) check against the post_date.
The fact is, the query checks the post is published. If a post is published, it has a date. The only posts which have a date of 00-00-00 are drafts. Using a negative check like not equals prevents the full use of indexes. Indexes are required for large sites to avoid table scans. Table scans involve reading all data in the table or index and are the reason why 250 seconds becomes < 1 second for the same exact query. That’s a table scan versus an index seek.
Yoast SEO XML Sitemaps performance
The Yoast SEO plugin is probably the best SEO plugin available, but the XML sitemap functionality is severely lacking both in terms of performance and in terms of configuration options. Some examples of slowness are above which I’ve optimised below.
The first part of the optimisation involved finding this check for post_date != ’00-00-00 00:00:00′. I logged into the clients site and grepped inside the Yoast plugin to find that the file I needed to fix was wp-content/plugins/wordpress-seo-premium/inc/sitemaps/class-post-type-sitemap-provider.php
I was really hoping that Yoast might have added a filter for the where clause, but even though there is an inkling of hope at the top of the get_posts function inside that same file, it doesn’t let you filter out this post_date clause.
That leaves no option other than to either delete line 580 manually to remove this line:
AND {$wpdb->posts}.post_date != '0000-00-00 00:00:00'
Or, alternatively, a filter can be written on WPDB->query which I’ve done and added directly into our Scalability Pro plugin. If you understand SQL indexes, you will probably be able to figure out a suitable index to support the resulting query yourself, but here’s the one I added:
create index wpi_scalability_pro_sitemaps on wp_posts (post_status, post_password, post_type, post_modified)
The index above is also added to the latest version of Scalability Pro. Now that I’ve optimised the SQL to fetch the items for use in the sitemap, I can switch the sitemap functionality back on.
Where previously the sitemap was stuck, working through 100 items every 5 minutes and killing performance and using up all the RAM, now it’s powering through the sitemap generation:
You can see in the image above that the OFFSET is 909600, then when I ran the SHOW FULL PROCESSLIST command a minute later, I see the OFFSET has increased to 1287000.
Yoast XML sitemaps functionality is fundamentally difficult to optimise
Even after optimising the previous queries, we still have 3 types of slow queries, none of which are trivial to optimise.
Summary for Yoast
I’ve managed to make Yoast faster through the use of a single index and a query filter (both added to my Scalability Pro plugin), but I still wouldn’t recommend the Yoast plugin for your XML sitemaps simply because the other queries are impossible to optimise without fundamentally altering how the plugin code works and they provide no customisation or configuration options for your sitemaps.
Note: I *do* recommend Yoast SEO for the other SEO functionality but not for their sitemap functionality, although with my index and query rewrite it DOES become fast enough to be usable on medium-sized websites (but not large > 100,000 item websites).
Google XML Sitemaps plugin performance
The Google XML Sitemaps plugin is worse performance-wise than Yoast.
But these queries are actually easier to optimise than the Yoast ones since they’re not using != or other index-breaking statements. No alterations to code are required, so I don’t need to hunt for hooks and filters here, I can just add some indexes. This index should help BOTH the slow queries in the Google XML Sitemaps plugin:
create index wpi_google_xml_sitemaps1 on wp_posts (post_status, post_type, post_modified_gmt desc)
The first query drops from 5 minutes 44 seconds (344 seconds) to 5 seconds which is a 68-fold speed boost.
The 2nd query becomes so fast that the query speed can’t be measured with only 2 decimal places. So it gets reported as 0.00 seconds. If we call it 0.01 seconds, this is a 19500-fold speed-increase. That’s no joke. Indexes, and hitting those indexes, is a really important thing if you want to scale.
That leaves a final 3rd query for the Google XML Sitemaps plugin that is slow.
Unfortunately, since that query is counting posts per month, it needs to read the entire table. There’s nothing I can do to optimise this without thoroughly re-writing the code. On top of this, once a user (or googlebot) clicks on the link for December 2017 above, this plugin will try to display 1.1 million URLs on a single sitemap page. That’s craziness.
As a result, I cannot really recommend the Google XML Sitemaps plugin. Many of my clients will have 100s of thousands, if not millions of posts inside one month and this plugin will simply crash – even if you increase your timeout, it will run out of RAM.
Summary for Google XML Sitemaps
Whilst some of the underlying queries are easier to optimise than those in Yoast, the fact that they do not provide pagination for sitemaps is a killer. I cannot recommend this plugin at all for large sites because of this. The idea of loading 1 million+ URLs on a single sitemap is stupid. Clearly the authors did not consider large websites at all when they wrote this code.
Better Wordpress Google XML Sitemaps performance
The Better WordPress Google XML Sitemaps plugin has not been updated in 3 years, but sitemap functionality has not changed at all in that time, so this is not necessarily a red flag.
The promising aspect is that this plugin promises to work on sites of unlimited size, so at least the plugin developers have considered those of us with large websites. It seems the main approach they have is to break down the sitemap into a maximum number of URLs per sitemap.
Firstly, when I install & activate this plugin, then view the sitemap, it’s the ONLY plugin that actually manages to generate the sitemap without any optimisations by myself. That’s very promising. Still – the sitemap index took 205 seconds, or 3 minutes 20 seconds.
The way the Better XML Sitemaps plugin works is it has an index page which links to all the other pages in the sitemap. It needs to know the count of items per post_type because it will pagination these sub-sitemaps. Here’s the slow query to generate this index page:
Since this query is counting everything in wp_posts, you might think adding an index will do nothing to help. However, if we create a covering index, we can make mysql read from the index rather than from the full table. Since this index will be smaller, it will involve far fewer disk reads and should also wipe less cache from memory. Here’s the ideal covering index for this query (note ID is the primary key, so is automatically included in every index, thus it’s not needed in the index):
create index wpi_better_wordpress_google_xml_sitemaps on wp_posts(post_status, post_password, post_type);
After adding this index, the same query completes in 8.6 seconds rather than 200 seconds, so that’s a decent 25-fold speed increase just from this tiny covering index. That makes it possible to load this index page in a decent amount of time (given the size of this site) without having to cache the index page. It also means that the disk is in use for only 8 seconds instead of 200 seconds which means other users won’t notice the impact of your sitemap generation plugin.
The rest of the pages in the Better WordPress Google XML Sitemaps use these simple type of SQL queries. These are trivial to optimise:
SELECT p.*
FROM wp_posts p
WHERE p.post_status = 'publish'
AND p.post_password = ''
AND p.post_type = 'product'
ORDER BY p.post_modified DESC LIMIT 0,1000
There are no IN statements, no OR statements and no != statements. So the above can be made to be blindingly fast, regardless of how large your website ever becomes. That’s no understatement. This type of query, with a supporting index, will be FAST even if you have TRILLIONS of rows in your wp_posts table. There is no supporting index included in the plugin, but the index we need is very, very close to the one we already added – we just need one extra column (post_modified) to be added.
Here is the raw performance without a supporting index:
Instead of creating an additional index, I’ll replace the index I previously created:
drop index wpi_better_wordpress_google_xml_sitemaps on wp_posts;
create index wpi_better_wordpress_google_xml_sitemaps on wp_posts(post_status, post_password, post_type, post_modified);
Because this index will be larger, I’d expect the index query to be slightly slower, so I tested that first – it came back in 8.7 seconds, so pretty much identical. This is to be expected since a datetime field only adds 8 bytes per row in the index. Here is the massive performance boost for the actual sitemap pages:
0.69 seconds compared to the previous speed of 2 minutes 54 seconds is a 254-fold speed increase. The speed boost you see on your own site will depend on how large your wp_posts table is. If your wp_posts table is larger than this reference site then you will see a bigger speed boost whereas if you have fewer rows in wp_posts, you’ll see less impact.
Still – the use of the index means that regardless of how large your site becomes, it will not slow down due to your sitemap generation.
There is, however, one final issue – even with the Better WP Google XML Sitemaps plugin. For some dumb reason, the MySQL query optimizer cannot figure out that it does not need to read all of the data when it’s grabbing rows 600,000 -> 650,000. This causes queries for pages higher than page 1 to become slower and slower. This is fixable by rewriting the query. Instead of:
SELECT p.*
FROM wp_posts p
WHERE p.post_status = 'publish'
AND p.post_password = ''
AND p.post_type = 'product'
ORDER BY p.post_modified DESC LIMIT 600000,50000
We can rewrite the above as the following, which fools the stupid MySQL query optimiser into using the index to get the 50,000 items and only reading all the data at the end.
select p.*
from
(
SELECT p.id
FROM wp_posts p
WHERE p.post_status = 'publish'
AND p.post_password = ''
AND p.post_type = 'product'
ORDER BY p.post_modified DESC LIMIT 600000,50000
) smallset
join wp_posts p
on smallset.id = p.id
The performance difference, as always, is staggering. When grabbing 5,000 rows with an offset of 600,000, the speed changes from 312 seconds to 4.97 seconds which is a 62-fold speed boost.
To fix this and get this better performance, I need to filter the query using the filter available inside wpdb->query. Query is called for get_results which is what BWP Google XML Sitemaps is using.
I’ve coded up this optimisation as a filter in the updated version of our Scalability Pro plugin. Scalability Pro includes many other optimisations too to help my customers and clients who have decent sized websites (> 5000 posts/products, often hundreds of thousands and even millions).
If you’d like to code this up yourself, you can simply modify the BWP Google XML sitemaps plugin and change the SQL code it generates. The function that generates the SQL is called get_results and is inside the class-bwp-gxs-module.php file.
With these optimisations in place, no matter which sitemap page I load, it loads in about 10 seconds. That’s not superfast, but remember it is grabbing 5,000 items, so there’s a network latency effect. Also, the other plugins fail to load sitemaps AT ALL on this large website, even when optimised as far as possible.
Additional useful options in the Better WP Google XML Sitemaps plugin
In addition to being easy to optimise, this sitemaps plugin also has some great performance-related features for large site owners.
- You can cache the XML sitemaps for whatever duration you like. That means they can be generated once then re-used by any search bots without hurting your server or other traffic using your server
- You can prime the cache – that means that the sitemaps will always be quick for the robots and robots won’t abandon indexing your site just because of slow XML sitemaps
- You can alter the size of the index pages. If doing this, you should also change the Advanced Options to allow the same number of rows to be returned as pages in each sitemap index.
Summary
In conclusion, do not use the Google XML Sitemaps plugin because its queries are not optimisable, and their code loads ALL posts from a specific month. That’s sloppy coding because in one month you might perform an import of millions of products and then your sitemaps are broken.
You *can* use the Yoast SEO plugin, but only with the addition of the index specified above and a code alteration (or use our Scalability Pro plugin which includes these edits), and only then on medium-sized websites.
Overall, for performance and functionality, the Better WP Google XML Sitemaps plugin wins our performance and scalability award. There’s still an index required to make it fast, as well as a code alteration – both of these are also included in the latest version of our Scalability Pro plugin.
It may be old, it might not have been updated in 3 years, but it’s the only sitemap plugin that can be made fast enough for large and huge websites.
Don’t hammer your server just to generate your XML sitemaps. Regardless of the size of your WordPress website, it’s possible and fairly easy to keep everything running fast and lightweight.
If you’re interested in having these optimisations yourself, you can create the indexes and alter the code as described above, or you’ll find them conveniently inside our Scalability Pro plugin.
Hi Dave,
Your Scalability Pro plugin saved me a lot of headache! It allowed me to import more than 1 million posts with ease. It just worked!
Now I am trying to build sitemap for this site but Better WordPress Google XML Sitemaps is not available anymore. Do you have any suggestions for a sitemap plugin that will work with Scalability Pro?
Thanks.
Come onto our Discord. I’ve literally just taken over that plugin. It had an XSS scripting vulnerability in it which I’ve fixed and performance on it is looking great. It’s not properly released to the public yet, but come ask about it in Discord and I’ll send you a copy.
Hi Dave, thanks for the artiicle. I wanted to ask whether it is helpful or harmful to have two xml sitemaps for one site…say both yoast and Google xml sitemaps together?
Did you need to change anything in the .conf file to make this work? I get a 404 error when I try to generate a sitemap.
Can you visit Settings -> Permalinks and hit save and try again?
The query you claim to be slow for Yoast SEO (SELECT wp_posts.ID…) is not the actual SQL the plugin executes.
It’s a part of larger query, optimized with join to work around natural MySQL limitations with large offsets. I am not sure why would you cut it out and run separately, that’s not what plugin does.
I had personally worked on refactoring Yoast SEO sitemaps module in the past and closely monitored feedback from people with 100Ks of posts scale sites (it’s in the open on their issue tracker). It _should_ be able handle it fine and, importantly, with native WP database indexes.
If you experience performance issues with it I would recommend to report it on tracker.
“It _should_ be able to handle it fine” – Yes, it should, but it doesn’t, not on large sites.
You are correct, the part I optimised is the sub-query of a larger query. Here is the larger query from directly inside their code:
You can also see this full query including the sub-query with late row lookups in the screenshots I make after optimisation.
From the above code, you can see they have implemented an old optimisation which I also implement for the BWP plugin as part of my optimisations in this article – namely, late row lookup. Here are two reference URLs – the first, is specifically from someone in Germany to get Yoast to optimise their previous query (when they just did select p.* … order by limit…):
https://wordpress.org/support/topic/plugin-wordpress-seo-by-yoast-performance-suggestion/
And here’s a more in-depth article explaining why this optimisation helps:
https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
BUT – since Yoast originally wrote that optimisation, someone else on their team has modified their code to include this redundant check:
Now – if the index was only required for seeking against post_type, post_password, post_status and post_date then this would be the end of the matter. But there is an ORDER BY statement inside the sub-query on post_modified. This post_date != part of the query, even if we have an index on post_type, post_password, post_status, post_date, post_modified prevents use of the index PAST the post_date. Because we did a != instead of =, we cannot guarantee that the remainder of the items in the index are in order. So the query optimiser is forced to perform a sort on the entire remainder of the index. In this case, that’s 1.4 million rows.
By removing the post_date != ’00-00-00 00:00:00′ I allow this index to be used and I eliminate the sorting operation. That’s where this speed boost comes from.
As for why I cut out the sub-query to analyse it separately, I did that because this is the part of the overall query that was slow. This is the part that needed optimisation. Like I said, you can see the full query in the screenshots with the optimised part in the subquery, and the outer query of select p.* from (select p.id …) x join wp_posts p on x.id = p.id …
If you doubt that this index-breaking post_date clause is part of Yoast, you can check it here:
https://github.com/Yoast/wordpress-seo/blob/a1628d1570a8e8cca896312509ac986182836c30/inc/sitemaps/class-post-type-sitemap-provider.php
In the above, you’ll find this line:
AND {$wpdb->posts}.post_date != '0000-00-00 00:00:00'
As for WHY this code got altered, I cannot find the answer. Here’s the blame link:
https://github.com/Yoast/wordpress-seo/blame/a1628d1570a8e8cca896312509ac986182836c30/inc/sitemaps/class-post-type-sitemap-provider.php
In there, there’s one red herring, which is really just someone mostly refactoring, but is not when this != was added:
https://github.com/Yoast/wordpress-seo/commit/0aaf711796d62656ff78d14ec53309014742d077
Anyway, as per your suggestion, I’ve submitted a bug to Yoast’s bug tracker. This is the 2nd bug I’ve submitted to them (although the first was over twitter) and the last time they were very quick at implementing the suggestions.
https://github.com/Yoast/wordpress-seo/issues/12161
Great article, Dave.
I feel like one of the biggest reasons to use the same website for SEO and XML Sitemaps is that they will be in sync when it comes to noindexed URLs.
When using different plugins, noindexing URLs (or entire post types, taxonomies, etc.) requires changing the settings twice: once in the SEO plugin, the other in the XML sitemaps plugin. It’s easy to make mistakes and forget noindexed URLs in the sitemaps or having URLs which should be indexable that are missing from the sitemap.
I would recommend you leave your XML sitemaps to include all your URLs, and use your SEO plugin to control noindex.
Imagine you have a page which is currently in your sitemap and is currently set to indexable. Image you change it to noindex. Because it is in your sitemap, googlebot will find it more quickly and will spot the meta noindex tag and will thus quickly remove it from their index.
I’m not sure how that would impact the crawl budget. I don’t think search engines are too interested to crawl URLs which are not meant to be indexed in the first place. Often times, these URLs are not linked from anywhere else besides the sitemap.
Many plugins (such as page builders) create post types for their templates which are not actually public. XML Sitemaps plugins will usually pick them up and list them in the sitemap.
In the end, there is a solution to each problem. I prefer using the same plugin for convenience purposes.
Sure – when you set up your sitemap at first, you should ensure that you don’t include anything in it that you don’t want indexed, simply because it will use up your crawl budget. But – if there IS an item in your sitemap index which is subsequently set in your normal SEO plugin to noindex, then having this in the sitemap will help the bots index the fact that this page is now no-indexed immediately and can help you remove the item very quickly from the google/search engine index. If you don’t have the item in your sitemap, how can google then read the fact that this page is now set to noindex? Especially if it’s not linked elsewhere on your site anymore?
So, what I’m saying, is you should be able to set up your sitemaps nicely once, and maybe revisit if you add further custom post types, but other than that, you can have your seo plugin manage the noindex, and leave these noindex items in the sitemap. They have a ‘lastmod’ property on them so when they get set to noindex, they will notify bots, they’ll see the changed date, they’ll read noindex, they’ll remove the item from the cache, they won’t index it again until the lastmod property changes again.
Hello Dave, thanks for the insights. Would be a good idea for you to look into the Seopress (seopress.org)?. Thanks, Luigi
Thanks for the suggestion. I didn’t check it, because they don’t have anywhere close to the options in BWP and because of this I presume they don’t paginate their results, or optimise their queries. I might be wrong however. Are you using it on large sites?