This week, the replication of the subscription of one of our most important databases stopped working. The distribution agent job stopped. In the job history, it had logged the following error:
The distribution agent failed to create temporary files in ‘C:\Program Files\Microsoft SQL Server\130\COM’ directory. System returned errorcode 5.
The same error was logged in the table [MSdistribution_agents] in the distribution database. Granting Modify permission on this folder to the replication agent account solved the issue (note: in our case, the replication agent account is not the same as the SQL Server Agent account). But we’ve had this setup for some years now, and all other subscriptions in production use the same replication agent account, and the same folder on the same server, and all those subscriptions continued working fine.
My initial Google search pointed to this Microsoft article:
https://docs.microsoft.com/en-us/troubleshoot/sql/replication/error-run-distribution-agent
Even though this article has the correct solution, it does not describe our situation accurately. It applies to the agent profile Distribution Profile for OLEDB streaming, but we use the default agent profile.
Recently, we migrated all replicated databases from Windows 2012 and SQL Server 2014 to Windows 2016 and SQL Server 2016. However, I doubted that that had anything to do with it; on the internet, I found various reports of this error for different versions of SQL (just with a different version number in the path, as 130 stands for version SQL 2016).
So, I wanted to find out exactly why this error suddenly occurred, and why the solution was to set permissions that were never needed in the years before, or on the other subscriptions.
There are two subscriptions on this database, with different articles (and therefore a different distribution job). Looking at the subscription I noticed two new tables. These tables contain an XML column. Since none of the other tables have XML columns, this might just be the difference that caused the error.
Luckily, these tables also have a timestamp column, so I could determine that 4 records had been added to one of these tables between the crash and the last successful run of the distribution agent. So in our test environment, I tried to reproduce the issue. Since our test environment runs on different servers with a different replication agent account, the permissions on the COM folder were still the default.
The first attempt to reproduce the issue failed. The XML I used for the first attempt was too small: it had a data length of 2900. The largest XML value in production had a data length of roughly 5900, so I tried that. With this bigger record, I was able to reproduce the issue. Immediately after inserting this record on the publisher, the distribution agent job on the distributor failed with the same error. And after adding the Modify permission to the replication agent account, I was able to restart the distribution agent job, and the replication continued as usual.
So if you need to replicate XML data, and the XML data gets above a certain threshold, the distribution agent needs some temporary files in the COM folder to buffer this data; with the default permissions, it is unable to do this.
To further analyze what temporary files are created, I used the tool Process Monitor to see exactly what actions are triggered by the replication process inside this folder.
Process monitor is part of the Sysinternals suite. So first, I downloaded the latest version of the Sysinternals Suite, and started procmon on the distributor. I added a filter to include only events on the COM folder:
After that, I cleared the procmon screen and performed another update on the XML column. This triggered a lot of events in the COM folder. In particular, the replication process created a LOB file, and performed several writes to this file.
LOB stands for Large OBject. Therefore, this same issue might occur with other data types. There are several large object data types: text, ntext, image, varbinary(max), and xml. For more info on these data types, see:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-types-transact-sql?view=sql-server-ver15
After demonstrating that temporary files are indeed created in the COM folder, I removed the extra permissions for the replication agent account on this folder, and updated the XML value once more. And as expected, the replication failed, and Process Monitor captured the error:
After granting the additional permissions for the replication agent account on the COM folder, replication started working properly once more.