Speed up your catalog entries indexing

Even with the rising popularity of FindCommerce, I suspect that many Commerce customers are still using the search provider system, as it comes with an undeniable benefit: it can be used within Commerce Manager. And while I suspect the majority of you have eventual indexing turned on (i.e. an entry is indexed as soon as it's changed), many still index entries on demand: either manually, or periodically via a scheduled job.

We recently received a report from a customer whose indexing timed out while indexing a large number of entries (about 250.000 entries across 8 catalogs). When we looked into the problem, we discovered a possible improvement which is almost free for you.

Unlike eventual indexing, or an index rebuild (where the entire index is rebuilt), manual indexing must care about which entries have changed. It should of course not index unchanged entries, because that would be a waste of resources, but it also must not miss any changed entries, otherwise the search results might be incorrect. Manual indexing is done in 3 steps:

  • Identify the changed entries (usually from the last index build until now)
  • Get the changed entries
  • Send them to the search provider to index.

The one we are going to speed up today is the first step: identifying the changed entries. How are they identified? Well, an entry is considered "changed" and in need of re-indexing if any of the following applies:

  • If its metadata is changed
  • If one of its node-relations is changed
  • If its CatalogEntryDto information is changed.

The first two are quite easy to determine: if an entry's metadata is changed, the Modified field of CatalogContentEx is updated, and if its node relation is updated, the Modified field of NodeEntryRelation is updated. Some simple SQL queries will get you that. But the last one is trickier: there is no Modified column in CatalogEntry. So how does Commerce pull that information? Magic? No, it comes from ApplicationLog. Every time an entry is changed (added, updated, or deleted), a log entry is added to ApplicationLog, and that is used for several purposes, such as auditing, or, in this case, knowing which entries were edited.
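The three checks described above can be sketched roughly as the following queries. This is only an illustration: the table names come from the post, but the exact column names (such as ObjectKey, Source, ObjectType, and Created on ApplicationLog) are assumptions and may differ in your Commerce database version.

```sql
-- Sketch only: column names on ApplicationLog are assumptions.
-- @LastBuildDate = the time of the previous index build.
DECLARE @LastBuildDate DATETIME = '2017-01-01';

-- 1. Entries whose metadata changed
SELECT ObjectId
FROM CatalogContentEx
WHERE Modified > @LastBuildDate;

-- 2. Entries whose node relations changed
SELECT CatalogEntryId
FROM NodeEntryRelation
WHERE Modified > @LastBuildDate;

-- 3. Entries whose CatalogEntryDto data changed, recovered from the log
SELECT DISTINCT ObjectKey
FROM ApplicationLog
WHERE Source = 'catalog'
  AND ObjectType = 'entry'
  AND Created > @LastBuildDate;
```

The third query is the expensive one, since ApplicationLog only grows over time and the filter touches several columns.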

(You might question the design, but I honestly don't know the reasoning. It has been like that since long, long ago, even before I joined Episerver.)

This is where we might run into a performance problem, because ApplicationLog is an ever-growing table, and it gets bigger and bigger over time. If it is big enough (the one I tested was 40 million rows strong), pulling out the entries we need can take a long time. And this is where we can improve performance by adding a covered index:
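A covered index along these lines would let SQL Server answer the "changed entries" query from the index alone, without touching the base table. The index name and the exact key/INCLUDE columns below are assumptions; match them to the columns the query actually filters on and selects.

```sql
-- Sketch of a covering index for the "changed entries" lookup.
-- Key columns: what the query filters/sorts on (assumed).
-- INCLUDE columns: what the query selects (assumed).
CREATE NONCLUSTERED INDEX IX_ApplicationLog_Source_Operation_Created
ON dbo.ApplicationLog (Source, Operation, Created)
INCLUDE (ObjectKey, ObjectType);
```

On a large table, consider running this with `ONLINE = ON` (Enterprise edition) during a maintenance window.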

Beware: this can take a pretty long time to run. On the database above, it took 11 minutes on my machine. But in my opinion it's worth it, because the query to identify the changed entries, which previously took more than 2 minutes, now runs in less than 1 second. More than 100x faster!

[Screenshots: query execution before and after adding the index]

Wait, will Episerver integrate this change in upcoming releases? I don't know. Adding the index to a newly created table is easy enough, but it can be troublesome if you are upgrading a database with a very big ApplicationLog table. The query to create the index might need to be run separately, possibly from SQL Server Management Studio, to avoid issues such as timeouts.

And is it possible to truncate the ApplicationLog table? Well, after you do a full reindex, it's only good for auditing purposes, so we know who did what and when. You could simply move the rows into an archive table and truncate the original one. Note that you do this at your own risk!
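The archive-and-truncate idea could look something like the sketch below. The archive table name is hypothetical, and again: only do this after a full reindex, at your own risk, ideally in a maintenance window.

```sql
-- At your own risk, and only after a full reindex.
-- Copy all audit rows into a hypothetical archive table,
-- then empty the original (TRUNCATE is minimally logged and fast).
SELECT *
INTO dbo.ApplicationLog_Archive   -- hypothetical name
FROM dbo.ApplicationLog;

TRUNCATE TABLE dbo.ApplicationLog;
```

TRUNCATE will fail if the table is referenced by a foreign key, in which case a batched DELETE is the fallback.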

11 thoughts on “Speed up your catalog entries indexing”

  1. Great post!
    Boosting the speed of indexing sounds intriguing.
    I am currently looking into just that with our solution.
    A boost from 2 minutes to less than a second is great, but does this happen more than once when building the index? If not, the 100x speed gain will fade a bit.
    – Is it more than a one-time gain when building the index?

    I’ve had the idea of parallel processes (threads) for our solution.
    Currently, we have three catalogs (soon four), and the index builder just takes one catalog at a time, chewing through the data, which is pretty time-consuming. We’ve hooked into the OnCatalogEntryIndex method by override, enriching the entry data, which also takes a little extra time.
    – Is it possible to split the index service into parallel threads, each handling a catalog of its own?

    The index process is handling 100 entries at a time. Is there a way to alter this value, in order to grab 500 for instance?

    I really look forward to hearing from you, and you are more than welcome to contact me should you need additional information or the like.

  2. Hi Henrik,
    The index process handles 100 entries at a time for a reason: a bigger batch size can actually slow things down, or eventually cause errors like out of memory or connection timeouts. We have to balance performance against reliability.
    However, I don’t provide support via this blog (it’s pretty personal). If you need further assistance, you are welcome to contact the Episerver developer support service; they will assist you in pinpointing the problem and finding a solution (this might come down to the development team, but that’s the process).

    1. Hi Quan Mai

      Thank you for your answer. Regarding the entries, I’ll just leave it at that and stop looking for a way to alter the value.

      My first question was focused on the content of your post: whether you saved only the 2 minutes by indexing the table, or a lot more. I haven’t been able to decipher whether the pull query is called more than once.

      Regarding my question about parallel threading, I understand your choice on the work/private domain completely, and I’m glad you’re straightforward about it.
      I merely found the link, posted by yourself, on world.episerver.com (http://world.episerver.com/Modules/Forum/Pages/Thread.aspx?id=174634), which is the only reason I brought it up 🙂

      Still a nice blog, and I’ll look into your other posts as well for sure.
      I’ve browsed a bit in between work, and my interest has definitely been piqued. A lot of great topics.

      Best regards, Henrik

      1. Hi,
        Regarding your questions:
        – The 2 minutes is once per index build, because the indexer gets the “entries to index” once, and then starts indexing those.
        – Running multiple threads to index in parallel is an interesting idea, but it’s quite hard to guarantee thread safety. I don’t think the system supports that by default, so you are on your own there. But I might add some more comments if I know your catalog size (how many entries in total?), your index size (how many do you index in one go, on average?), and the time (how long does it take?). If it’s too slow then Episerver can help somehow.

        1. Hi Quan Mai

          Thank you for your answers.

          – Currently we’re building our index from scratch every time.
          I have been thinking about incremental changes, but a steady and quick complete build is better for us at this very moment.

          – I appreciate your sharing thoughts on my off-topic subject, and additional comments are most welcome of course, but just remember your own choice about no support on the blog 🙂
          That being said, thread safety is indeed one of my primary goals. A fast build isn’t worth much if data is incomplete or partly missing.
          The build yesterday went through approximately:
          Catalog1 – 40.000
          Catalog2 – 70.000
          Catalog3 – 54.000
          This took around 3½ hours, which is way too long in my opinion. As mentioned in my very first post, we have hooked into the OnCatalogEntryIndex event, further enriching the entry data. Even taking that into consideration, I still find the index job slow-paced.

          Have a nice day.

          Best regards, Henrik

            Three and a half hours for that number of entries is a bit too long, in my opinion. However, how much do you “enrich” the data? Was there a reason for not using an incremental index build instead of a full rebuild?
            Yes, I don’t provide “support” here. If there is a question I can understand and answer relatively quickly (< 5 minutes; 1 minute is my favorite), I can answer it here. Otherwise I always recommend going to the Episerver developer support service.

  3. Hi Quan

    I hope you’re doing great 🙂

    Sorry for my late reply.
    I had some private life stuff to attend to, and a release to plan and handle.

    We’ve been told by our EPiServer partner that it is more difficult to implement the incremental build. I’m not sure, though, whether they base that on our already working (but slow) index build versus an entirely new base.

    About the enrichment of data: we have a list of metaFields which marketing has marked as searchable. These fields are added for each product. Product variants and campaigns are enriched as well, but on a minor scale.
    I am not sure whether all of these fields are relevant, but I do know that all data for the website is stored in the index. Prices and images are fetched on the fly, though.

    Lately I’ve noticed that our build sometimes indexes less than 10% of a catalog. The first two catalogs might be indexed completely, but then our last catalog will report an entry count between 4.000 and 8.000, even though the catalog has well above 50.000.
    Is this an issue you’ve come across in previous versions of EPiServer? If so, any quick fix?

    I really appreciate your help and guidance. Five minutes for a thought is a long time, though. Stick to three minutes as a max, and get some additional time for gaming 😉
    Are you only playing PS, or are you a PC gamer as well?

    I’m looking forward to hearing from you.

    Best regards, Henrik

    1. No worries. But I haven’t seen anything like what you describe; the third catalog should report the correct number regardless.
      It might help if you can run a profiler (I can personally recommend dotTrace) to see if there is any bottleneck during indexing. If you can identify the problem, it would definitely help in finding a solution.
      Can you ask your partner about the reason for not using incremental indexing? I’m not seeing the big picture here (I don’t know much about your implementation), but that sounds quite strange to me. Incremental indexing should just be one parameter (a true/false flag) on the index rebuild.
      All in all, this sounds like a support case to me. The performance is not great, but there is no obvious reason for that. We (as Episerver) would need access to your code/database to see what is wrong.

      1. Hi Quan

        Hope you’ve had a nice weekend. Have you had some time for gaming? 🙂

        dotTrace is only licensed as part of ReSharper Ultimate, and I don’t know if you can upgrade ReSharper to Ultimate for just the difference in cost; otherwise it might be an issue, as my boss only paid for the normal version for me. But I will look into it.
        It would also be nice to monitor performance and timings in general 🙂

        From what I’ve heard, the incremental index brought up some issues earlier. I believe the index is updated incrementally throughout the day, when changes happen,
        but a fresh index is made every morning to avoid errors. I don’t know what kind of errors, though.

        I’ll try to set up some timings, to see if I can find the bottleneck.

        Will you be attending the “EPiServer Update 2017” event in Copenhagen September 28th?

        Best regards
        Henrik C. Jensen

        1. Hi Henrik,
          dotTrace has a trial license which you can use for 10 days, so that should be enough for you to pinpoint the problem. And no, incremental indexing should not cause any issues, not that I know of. If you find any problems with it, let us know and we will work on them ASAP.
          I unfortunately haven’t heard anything regarding Episerver Update 2017; we (the Commerce development team) have not been asked to join. But I guess Episerver has planned to bring some other people to talk about what’s new in Commerce, so it’s in good hands. We might have a chance to meet in person some other time 🙂

          1. Hi Quan

            The trial version isn’t necessary anymore. My ReSharper license was updated to Ultimate yesterday, which includes dotTrace.
            I have a release tomorrow, but after that I might try setting it up, if I’m given the time to do so. I am the only developer in our company, which leaves a lot to do.

            I will also look into the incremental build, to speed up our processes and lower the server load.

            I am looking forward to the event.
            Meeting in person could have been fun, but I guess it will be some other time then.

            Best regards,
            Henrik
