Sitecore on Docker: logging intro and how to avoid losing log entries

Sitecore 10 on containers uses the Windows Container Tools for logging by default. This is configured through a JSON file located at C:\LogMonitor\logmonitorconfig.json, which looks something like this:

{
  "LogConfig": {
    "sources": [
      {
        "type": "EventLog",
        "startAtOldestRecord": false,
        "eventFormatMultiLine": false,
        "channels": [
          {
            "name": "system",
            "level": "Error"
          }
        ]
      },
      {
        "type": "File",
        "directory": "c:\\inetpub\\logs",
        "filter": "*.log",
        "includeSubdirectories": true
      },
      {
        "type": "File",
        "directory": "c:\\inetpub\\wwwroot\\App_data\\logs",
        "filter": "log.*",
        "includeSubdirectories": false
      }
    ]
  }
}

This configuration ensures the logs are sent to STDOUT, where the logging driver can pick them up. Docker supports several logging drivers; a full list can be found on their site.
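Before relying on this setup, it can help to confirm that the folder Sitecore writes to is actually covered by one of the File sources. The following is a minimal sketch in Python, not part of the Windows Container Tools. The helper name is_directory_monitored is made up for illustration, and the sample is a trimmed copy of the two File sources from the configuration above.

```python
import json
import ntpath

# Trimmed copy of the two File sources from the configuration shown earlier.
SAMPLE_CONFIG = r"""
{
  "LogConfig": {
    "sources": [
      {"type": "File", "directory": "c:\\inetpub\\logs",
       "filter": "*.log", "includeSubdirectories": true},
      {"type": "File", "directory": "c:\\inetpub\\wwwroot\\App_data\\logs",
       "filter": "log.*", "includeSubdirectories": false}
    ]
  }
}
"""


def _normalize(path):
    # Windows paths are case-insensitive; ntpath applies Windows rules on any OS.
    return ntpath.normcase(ntpath.normpath(path))


def is_directory_monitored(config, directory):
    """Return True if a File source covers the given directory (hypothetical helper)."""
    wanted = _normalize(directory)
    for source in config["LogConfig"]["sources"]:
        if source.get("type") != "File":
            continue
        root = _normalize(source["directory"])
        if wanted == root:
            return True
        # Subfolders only count when the source opts in to them.
        if source.get("includeSubdirectories") and wanted.startswith(root + "\\"):
            return True
    return False


if __name__ == "__main__":
    config = json.loads(SAMPLE_CONFIG)
    print(is_directory_monitored(config, r"C:\inetpub\wwwroot\App_Data\logs"))
```

The check only compares directories. It does not evaluate the filter pattern, so a log file that does not match log.* would still go unnoticed.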

Logs getting lost intermittently

This works well initially, and logs will be collected properly by the logging solution. At scale, however, logs might get lost intermittently, and in some cases a significant percentage of them will be lost. Sitecore uses a rolling style of logging, which is not properly supported by the Windows Container Tools. There is an issue for this on their GitHub repo here, but it has received little attention so far.

One potential workaround is to configure a large maximumFileSize in Sitecore’s logging configuration. In some cases this can prevent the issue completely, because the log never needs to roll over. At scale this will at best mitigate the issue; it does not address the root cause.

Solution: mount logs directly on the host

The only fix that I’m aware of is to mount the log folder on the host, which prevents the issue described above. If anyone has found a better solution, please let me know in the comments section below. Docker provides good documentation on how a volume should be mounted, which can be found here. Below is a sample command that mounts the volume. The paths need to be specified using Windows filename semantics without a leading slash, and the destination log directory needs to be empty:

docker run -v c:\logs:c:\inetpub\wwwroot\App_data\logs ...

One thing to keep in mind is that the host can potentially run multiple containers, which would then share the same log folder on the host. One potential workaround is to use a separate folder for each container, but the best solution will depend on many factors.
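As an illustration of the folder-per-container idea, the sketch below builds the -v argument so that every container gets its own subfolder under a shared root on the host. It is only a sketch under assumptions: the helper name volume_argument, the c:\logs root and the container names cm and cd are made up, and the container path is the one from the docker run sample above.

```python
import ntpath
import re

# Log folder inside the container, taken from the docker run sample above.
CONTAINER_LOG_DIR = r"c:\inetpub\wwwroot\App_data\logs"

# Character set Docker accepts for container names.
_VALID_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")


def volume_argument(container_name, host_root=r"c:\logs"):
    """Build the value for 'docker run -v' with one host folder per container."""
    if not _VALID_NAME.fullmatch(container_name):
        raise ValueError(f"unsuitable container name: {container_name!r}")
    # ntpath joins with Windows separators regardless of the OS running this script.
    host_dir = ntpath.join(host_root, container_name)
    return f"{host_dir}:{CONTAINER_LOG_DIR}"


if __name__ == "__main__":
    # Hypothetical container names, used only to show the resulting commands.
    for name in ("cm", "cd"):
        print(f"docker run --name {name} -v {volume_argument(name)} ...")
```

Using the container name as the folder name keeps the mapping between log files and containers obvious, at the cost of leaving stale folders behind when containers are renamed or removed.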

Adjust dependencies in Readiness probe

It is important to understand the liveness and readiness probes when you run Sitecore on Kubernetes. The Kubernetes documentation provides more detail on these two probes:

  • Liveness: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy.
  • Readiness: Indicates whether the container is ready to respond to requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.

In Sitecore these probes can be found under /healthz/live and /healthz/ready. The rest of this post will focus on the readiness probe. There is a great article by Vitalii Tylyk which discusses these probes at a high level. By default, a Sitecore XP install checks a variety of xDB services as well as Solr as part of this probe. If these services are not all healthy, the probe will fail and no requests will be sent to this pod.

It is important to understand this default behavior in the context of your Sitecore solution. If having xDB and Solr up is critical to a solution, this default behavior does not need to be changed. The risk with this setup is that all pods can be pulled from the load balancer if there is an issue with Solr or xDB, and the site will then be completely down. If these services are not critical to the Sitecore site, they can be removed from the readiness probe.
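To see how the probe reacts in your own environment, for example while Solr is stopped, the endpoint can be polled from a small script. This is a minimal sketch that assumes a healthy probe answers with HTTP 200 and that any other status or connection error means not ready. The function name is_ready, the opener parameter and the host name are made up for illustration.

```python
import urllib.error
import urllib.request


def is_ready(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True when <base_url>/healthz/ready answers with HTTP 200.

    Assumption: a healthy Sitecore instance answers 200; an error status,
    a refused connection or a timeout is treated as not ready.
    """
    url = base_url.rstrip("/") + "/healthz/ready"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError) for 4xx/5xx responses.
        return False


if __name__ == "__main__":
    # Hypothetical host name; replace it with the address of a CM or CD instance.
    print(is_ready("https://cd.example.test"))
```

The opener parameter exists only so the function can be exercised without a running Sitecore instance.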

The patch file below shows how to remove all these checks so that the readiness probe still returns healthy even if Solr and xDB are completely down. In many real-world scenarios only a subset of these will have to be removed, for example removing the xDB services but leaving Solr in place because the solution has a critical dependency on it.

<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <services>
      <configurator type="Sitecore.ContentSearch.SolrProvider.DependencyInjection.ContentSearchServicesConfigurator, Sitecore.ContentSearch.SolrProvider">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.XConnect.Client.Configuration.HealthCheckServicesConfigurators.XConnectCollectionHealthCheckServicesConfigurator, Sitecore.XConnect.Client.Configuration">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.XConnect.Client.Configuration.HealthCheckServicesConfigurators.XConnectConfigurationHealthCheckServicesConfigurator, Sitecore.XConnect.Client.Configuration">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.XConnect.Client.Configuration.HealthCheckServicesConfigurators.XConnectSearchHealthCheckServicesConfigurator, Sitecore.XConnect.Client.Configuration">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.Xdb.Common.Web.Xmgmt.XdbEnabledHealthCheckServicesConfigurator`2[[Sitecore.Reporting.Service.Http.XConnectClient.XdbReportingWebClient, Sitecore.Reporting.Service.Http.XConnectClient],[Sitecore.Reporting.Service.Http.Abstractions.Routes, Sitecore.Reporting.Service.Http.Abstractions]], Sitecore.Xdb.Common.Web.Xmgmt">
        <patch:delete />
      </configurator>      
      <configurator type="Sitecore.Xdb.Common.Web.Xmgmt.XdbEnabledHealthCheckServicesConfigurator`2[[Sitecore.Xdb.ReferenceData.Client.ReferenceDataHttpClient, Sitecore.Xdb.ReferenceData.Client],[Sitecore.Xdb.ReferenceData.Client.Routes, Sitecore.Xdb.ReferenceData.Client]], Sitecore.Xdb.Common.Web.Xmgmt">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.Xdb.Common.Web.Xmgmt.XdbEnabledHealthCheckServicesConfigurator`2[[Sitecore.Xdb.ReferenceData.Client.ReadOnlyReferenceDataHttpClient, Sitecore.Xdb.ReferenceData.Client],[Sitecore.Xdb.ReferenceData.Client.Routes, Sitecore.Xdb.ReferenceData.Client]], Sitecore.Xdb.Common.Web.Xmgmt">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.Xdb.Common.Web.Xmgmt.XdbEnabledHealthCheckServicesConfigurator`2[[Sitecore.Xdb.MarketingAutomation.ReportingClient.AutomationReportingClient, Sitecore.Xdb.MarketingAutomation.ReportingClient],[Sitecore.Xdb.MarketingAutomation.ReportingClient.ReportingRoutes, Sitecore.Xdb.MarketingAutomation.ReportingClient]], Sitecore.Xdb.Common.Web.Xmgmt">
        <patch:delete />
      </configurator>
      <configurator type="Sitecore.Xdb.Common.Web.Xmgmt.XdbEnabledHealthCheckServicesConfigurator`2[[Sitecore.Xdb.MarketingAutomation.OperationsClient.AutomationOperationsClient, Sitecore.Xdb.MarketingAutomation.OperationsClient],[Sitecore.Xdb.MarketingAutomation.OperationsClient.OperationRoutes, Sitecore.Xdb.MarketingAutomation.OperationsClient]], Sitecore.Xdb.Common.Web.Xmgmt">
        <patch:delete />
      </configurator>      
    </services>
  </sitecore>
</configuration>

Remove items from Resource File

Sitecore recently introduced Resource Files, which contain Sitecore items. This is a great improvement and is especially beneficial when upgrading Sitecore on containers. There are some good resources out there to learn more about them, for example this post by Martin Miles.

One limitation of this feature is that it is not possible to delete Sitecore items included this way. This makes sense in most cases: items included through this file are Sitecore system items and should generally not be deleted. However, there are a few valid reasons why you would want to delete such an item.

The rest of this post describes how you can get rid of these items without actually modifying the resource file itself. Modifying the resource file is not a good approach, as you would get a new Resource File during an upgrade and any deleted items would be back.

The Resource Files are read by the ProtobufDataProvider, which stores the items in a few dictionaries in memory. The solution below removes the items from these dictionaries. The code below inherits from the ProtobufDataProvider and adds the functionality to remove the items:

using Microsoft.Extensions.DependencyInjection;
using Sitecore.Abstractions;
using Sitecore.Configuration;
using Sitecore.Data.DataProviders.ReadOnly.Protobuf;
using Sitecore.DependencyInjection;
using System;
using System.Linq;

namespace Foundation.Providers
{
    public class RemoveItemsProtobufDataProvider : ProtobufDataProvider
    {
        public void RemoveItem(string item)
        {
            Guid itemGuid;

            if (Guid.TryParse(item, out itemGuid))
            {
                bool foundParent = false;

                if (base.DataSet.Definitions.ContainsKey(itemGuid))
                {
                    var parentID = base.DataSet.Definitions[itemGuid].ParentID;
                    foundParent = true;

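                    // The item is also listed as a child of its parent, so remove it there as well.
                    // Guard against a parent that has no entry in the Children dictionary.
                    if (base.DataSet.Children.ContainsKey(parentID))
                    {
                        base.DataSet.Children[parentID] = base.DataSet.Children[parentID].Where(x => x.ID != itemGuid).ToArray();
                    }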
                    base.DataSet.Definitions.Remove(itemGuid);
                }

                if (base.DataSet.Children.ContainsKey(itemGuid) && foundParent)
                {
                    base.DataSet.Children.Remove(itemGuid);
                }

                if (base.DataSet.ItemsByTemplate.ContainsKey(itemGuid) && foundParent)
                {
                    base.DataSet.ItemsByTemplate.Remove(itemGuid);
                }

                if (base.DataSet.LanguageData.ContainsKey(itemGuid) && foundParent)
                {
                    base.DataSet.LanguageData.Remove(itemGuid);
                }

                if (base.DataSet.SharedData.ContainsKey(itemGuid) && foundParent)
                {
                    base.DataSet.SharedData.Remove(itemGuid);
                }
            }
        }

        public RemoveItemsProtobufDataProvider(ObjectList filePaths) : base(filePaths.List.OfType<string>().Where(s => !string.IsNullOrEmpty(s)), ServiceLocator.ServiceProvider.GetRequiredService<BaseLog>())
        {
        }
    }
}

This data provider needs to be patched into Sitecore; the patch file below does this for the master database. The removeItems list in the XML below contains the items which need to be removed:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:set="http://www.sitecore.net/xmlconfig/set/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
	<sitecore>
		<databases >
			<database id="master" role:require="ContentManagement or StandAlone">
				<dataProviders >
					<dataProvider>
						<param>
							<protobufItems>
								<patch:attribute name="type">Foundation.Providers.RemoveItemsProtobufDataProvider, Foundation</patch:attribute>
								<filePaths>
									<patch:delete/>
								</filePaths>
								<param desc="filePaths" hint="list">
									<filePath>$(dataFolder)/items/$(id)</filePath>
								</param>
								<!-- Add items which need to be removed below -->
								<removeItems hint="list:removeItem">									
									<item>{22222222-2222-2222-2222-222222222222}</item>
									<item>{33333333-3333-3333-3333-333333333333}</item>
								</removeItems>
							</protobufItems>
						</param>
					</dataProvider>
				</dataProviders>
			</database>
		</databases>
	</sitecore>
</configuration>

Fix NullReferenceException in CompositeDataProvider

Sitecore 10 comes with a new data provider which merges items from disk and the database. Jeremy Davis already wrote a great article about this. Sitecore uses this new approach itself by deploying all its items through an item resource file.

TDS now also supports creating item resource files, which makes it convenient to deploy your own items this way too. There are a few uncommon scenarios where this will result in an uncaught NullReferenceException, for example when there is an item without a version. If this happens, many areas of Sitecore and your custom solution will most likely be broken. The stack trace will look something like this:

FATAL Uncaught application error 
Exception: System.NullReferenceException 
Message: Object reference not set to an instance of an object. 
Source: Sitecore.Kernel 
   at Sitecore.Data.DataProviders.CompositeDataProvider.GetItemVersions(ItemDefinition itemDefinition, CallContext context) 
   at Sitecore.Data.DataProviders.DataProvider.GetItemVersions(ItemDefinition item, CallContext context, DataProviderCollection providers) 
   at Sitecore.Data.DataSource.LoadVersions(ItemDefinition definition, Language language) 
   at Sitecore.Data.DataSource.GetVersions(ItemInformation itemInformation, Language language) 
   at Sitecore.Data.DataSource.GetLatestVersion(ItemInformation itemInformation, Language language) 
   at Sitecore.Data.DataSource.GetItemData(ID itemID, Language language, Version version) 
   at Sitecore.Nexus.Data.DataCommands.GetItemCommand.GetItem(ID itemId, Language language, Version version, Database database) 

One way to solve this is to identify all the items that cause the issue and fix each of them. Another way is to override the CompositeDataProvider and wrap a try/catch block around this logic. The advantage of the second approach is that the exception is handled in one place, so the issue will not reoccur when another problematic item is deployed. The following code can be used to catch the exception:

using Sitecore.Collections;
using Sitecore.Configuration;
using Sitecore.Data;
using Sitecore.Data.DataProviders;
using Sitecore.Data.DataProviders.ReadOnly;
using Sitecore.Diagnostics;
using System;
using System.Collections.Generic;

namespace Foundation.Providers
{
    public class SafeCompositeDataProvider : CompositeDataProvider
    {
        public SafeCompositeDataProvider(IEnumerable<ReadOnlyDataProvider> readOnlyDataProviders, DataProvider headProvider) : base(readOnlyDataProviders, headProvider) { }

        public SafeCompositeDataProvider(ObjectList readOnlyDataProviders, DataProvider headProvider) : base(readOnlyDataProviders, headProvider) { }

        public override VersionUriList GetItemVersions(ItemDefinition itemDefinition, CallContext context)
        {
            try
            {
                return base.GetItemVersions(itemDefinition, context);
            }
            catch (Exception ex)
            {
                // Null-conditional access so that logging cannot throw a second NullReferenceException.
                Log.Error($"SafeCompositeDataProvider: Caught exception for item {itemDefinition?.ID} {itemDefinition?.Name} {ex}", this);
                return null;
            }
        }
    }
}

This new provider can be patched in through a configuration file like this:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:set="http://www.sitecore.net/xmlconfig/set/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore>
    <databases >
      <!-- target to CM or standalone, no master database in CD-->
      <database id="master" role:require="ContentManagement or StandAlone">
        <dataProviders >
          <dataProvider >
            <patch:attribute name="type">Foundation.Providers.SafeCompositeDataProvider, Foundation.Providers</patch:attribute>
          </dataProvider>
        </dataProviders>
      </database>
    </databases>
  </sitecore>
</configuration>

Set up Sitecore Databases in AWS RDS

Last year Sitecore announced that it supports AWS RDS. I’ve previously blogged about how to set up Sitecore databases in AWS using SIF. Containers are now the preferred deployment model in Sitecore 10. A popular option when deploying Sitecore on AWS is to use RDS for the SQL databases instead of running them in containers. In that case it can be a bit challenging to figure out how to get the databases into RDS. This post walks through a sample solution.

Step 1: Take a backup of the databases

Take a backup of the databases, for example from a development environment VM which has the databases installed. Sitecore 10 comes with a graphical setup package, which is a great way to quickly set up a new Sitecore 10 environment including databases.

Step 2: Upload to S3

At this point you should have a backup of each database in .bak format. Upload all these backups to AWS S3. Make sure the S3 bucket is in the same region as your RDS database instance.

Step 3: Restore Databases through RDS’ SP

Databases need to be restored through RDS’ stored procedures. This page has a detailed overview, but in short a database can be restored through the rds_restore_database SP. It is recommended not to change the database name during the restore, as this can cause issues; see the troubleshooting section for more details. The syntax is shown below, where square brackets mark optional parameters and | separates alternative values:

exec msdb.dbo.rds_restore_database 
	@restore_db_name='database_name', 
	@s3_arn_to_restore_from='arn:aws:s3:::bucket_name/file_name.extension',
	@with_norecovery=0|1,
	[@kms_master_key_arn='arn:aws:kms:region:account-id:key/key-id'],
	[@type='DIFFERENTIAL|FULL'];
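Because a Sitecore install consists of several databases, it can be convenient to generate one restore statement per backup file and keep the database names unchanged. The Python sketch below only builds the T-SQL text from the two required parameters shown above; it does not connect to RDS. The function names, the bucket name, the database names and the convention that each backup is stored as <database name>.bak are all assumptions made for illustration.

```python
def sql_literal(value):
    """Quote a string as a T-SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def restore_statement(database_name, bucket_name, backup_file=None):
    """Build an rds_restore_database call that keeps the original database name.

    Assumption: the backup was uploaded as '<database_name>.bak' unless
    another file name is passed in.
    """
    backup_file = backup_file or f"{database_name}.bak"
    arn = f"arn:aws:s3:::{bucket_name}/{backup_file}"
    return (
        "exec msdb.dbo.rds_restore_database "
        f"@restore_db_name={sql_literal(database_name)}, "
        f"@s3_arn_to_restore_from={sql_literal(arn)};"
    )


if __name__ == "__main__":
    # Hypothetical bucket and database names; use the names from your own backups.
    for name in ("Sitecore.Core", "Sitecore.Master", "Sitecore.Web"):
        print(restore_statement(name, "my-sitecore-backups"))
```

The generated statements can then be run one by one against the RDS instance with the SQL client of your choice.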

Troubleshooting

A few common issues can occur during this process:

  • Contained database issues: in RDS you cannot enable contained database authentication through the sp_configure stored procedure. Instead it needs to be set on the parameter group; see my other blog post for more details.
  • Issues with xDB containers not getting healthy: if the databases were renamed during the restore, there is a good chance some of the xDB roles will not get healthy. The Xdb.Collection.ShardMapManager database has some tables whose rows contain connection info, including the names of the other databases which contain the shards. If the databases were renamed during the restore, this connection info will be incorrect. This can be fixed manually or through the SQL Sharding Deployment Tool.