One of the most interesting things you can get to experience when you're working with development is bug reports from a customer that impacts their business, but no one is really sure what is the underlying cause. Another thing that we developers love is war stories from the trenches, and this is one of those war stories.

We got a report stating that operations people were starting to see the following TransactionInDoubtException exception. At first intermittently, then more frequently.

====> System.ComponentModel.Win32Exception
The wait operation timed out
====> System.Data.SqlClient.SqlException
Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
--data--> HelpLink.ProdName = Microsoft SQL Server
--data--> HelpLink.EvtSrc = MSSQLServer
--data--> HelpLink.EvtID = -2
--data--> HelpLink.BaseHelpUrl =
--data--> HelpLink.LinkId = 20476
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand
cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler,
TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlDelegatedTransaction.SinglePhaseCommit(SinglePhaseEnlistment enlistment)
====> System.Transactions.TransactionInDoubtException
The transaction is in doubt.
   at System.Transactions.TransactionStateInDoubt.EndCommit(InternalTransaction tx)
   at System.Transactions.CommittableTransaction.Commit()

So, a natural place to turn when you get such an error would be to Microsoft's documentation for TransactionInDoubtException. This helpfully tells us this:

This exception is thrown when an action is attempted on a transaction that is in doubt. A transaction is in doubt when the state of the transaction cannot be determined. Specifically, the final outcome of the transaction, whether it commits or aborts, is never known for this transaction.

Okay, so that was not particularly helpful. Meanwhile we did, of course, question the usual suspects: "have you deployed a new version of the system recently" and "did you change the production environment setup". Eliminating these does, however, still leave us with TransactionInDoubtException. The next question you could ask yourself could be: "what is a transaction, and what states can it be in?" Hopefully most developers know that a transaction is a synchronization primitive that ensures atomicity of the actions that it guards, but people tend to get a bit more hand-wavey when it comes to what states it can be in. Also, there are different types of transactions that you can work with, typically "local" transactions and distributed transactions. For now, let us take a look at the states that a local transaction can be in and ignore the distributed transactions for a while (they were not in use in the production environment, but understanding them is important nevertheless). A local transaction is a transaction that your program controls issuing transaction statements to a third party without it participating in the exact transaction. That is the transaction is only present in your program and it controls a "shadow transaction" on the database server, but the two are not directly connected apart from through ADO.NET's inner workings. If ADO.NET passed the transaction to the server, it would instead be a distributed transaction.

For simplicity's sake, let us say there are four states that a local transaction can be in. Here we define local as a transaction in which two parties participate, the program and a database. For our purposes this is a Microsoft SQL Server instance. The four states are: active, committed, rolled back, and in-doubt.

  • A transaction is active when both the program and the database agrees that the transaction has not altered state, nor been timed out.
  • A transaction is committed once the program has requested a commit and the database has agreed to the commit.
  • A transaction is rolled back if the program requests to roll back, or if the database decides that it cannot commit when the program requests it to commit.
  • A transaction is in-doubt if the program has no idea what the database has decided to do after it has requested either a roll back, a commit, or the transaction timed out and the database did not acknowledge this.

So how are we getting to a state where the program doesn't know whether the database has decided to commit or not?

Meanwhile, the operations people were investigating transaction status on the database side, and all the transactions there were always either committed or rolled back; none of them were in-doubt, so this was strictly a problem on the program side.

When we are using a local transaction, all the transaction handling is performed inside the .NET framework code that the program is built on top of, so it is probably time to revisit the .NET framework code and see what it is doing. If we revisit the stack trace at the top of this post, most of the latest stack frames seem to be internal book-keeping and protocol handling stuff, so if we step outward a few frames, the two most interesting methods to look inside are probably System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon, and System.Data.SqlClient.SqlDelegatedTransaction.SinglePhaseCommit. Let us read through these in reverse order in the reference source, i.e. from outside in to where the exception is triggered.

  • SinglePhaseCommit executes:
    connection.ExecuteTransaction(SqlInternalConnection.TransactionRequest.Commit, null, IsolationLevel.Unspecified, _internalTransaction, true);
    and if an exception occurs and the internal transaction is neither committed or aborted (rolled back), then it will cause a TransactionInDoubtException.
  • ExecuteTransactionYukon executes:
    _parser.TdsExecuteTransactionManagerRequest(null, requestType, transactionName, isoLevel, ConnectionOptions.ConnectTimeout, internalTransaction, stateObj, isDelegateControlRequest);
    Where the 5th argument, ConnectionOptions.ConnectTimeout is the timeout parameter for the underlying code.

Wait, what? Why is the .NET framework using ConnectTimeout as a timeout for the transaction operation? That sounds bizarre, let us see what it is documented to be in the SqlConnection.ConnectionString documentation:

The length of time (in seconds) to wait for a connection to the server before terminating the attempt and generating an error.

The default value for this is 15 seconds. Looking at our configuration files, we have set the connect timeout to 10 as we generally want to fail earlier than build up congestion in the system. However, this is for getting a connection to the database; when we are doing transaction operations, we are decidedly not establishing a connection, we already have one and we're just passing values over the wire using the TDS protocol. So what gives?

To me, this seems to indicate that someone at Microsoft has decided to use a wholly unrelated timeout to also control the amount of time that individual transaction actions are permitted to last before they are forcibly timed out. Any time this happens, a TransactionInDoubtException is thrown.

Okay, great, so now we have uncovered what causes TransactionInDoubtException, but how can we use this knowledge to eliminate it? There are several potential root causes to look at here: the network (the program and database are on separate systems, and as such, the network throughput and bandwidth has an influence on this), as well as the server load on both the program server and the database server, as high CPU load may cause either to be too slow at processing the response.

CPU load has the benefit of being fairly easy to monitor on servers, so we turned our attention to this first. Neither the program server, nor the database server were experiencing markedly high loads and both kept a CPU load well under 50% utilisation, so they should in no way be the primary cause of the in-doubt transactions. This left the network as the likely culprit. After measuring it, it was identified as the problem source, but since replacing a production setup's network is not something that happens overnight, we started looking at what we could do to mitigate the network congestion.

So, a quick recap before we start looking at mitigation techniques: we're getting TransactionInDoubtException in the production system between our product and the database server, it is triggered by a timeout that Microsoft has strangely tied to ConnectTimeout, and the primary part of the system that is to blame is the network.

Previously, I indicated that there were two types of transactions that were interesting for our purposes: "local" and distributed. We now turn our attention to distributed transactions. While most people have a fairly good understanding of "local" transactions, things tend to get a bit more fuzzy when it comes to distributed transactions, so allow me a brief digression to bring everyone up to speed. If you already know everything there is to know about distributed transactions, you can jump ahead past all the explanations.

A distributed transaction is a transaction in which one or more resource managers participate, typically across multiple hosts. The transactions are controlled through transaction managers, typically one per host in the participating hosts. A resource manager is most often a database server, but it can be many other things, e.g. message queues, distributed caches, etc. The role of transaction manager is played by Microsoft's Distributed Transaction Coordinator (MSDTC). Life as a transaction becomes somewhat more complex in this new-fangled distributed world. Now, many consider operating through MSDTC to be tantamount to signing your soul over to the devil, and that's on a good day. I hope that the following will demystify its role and workings a bit (even if you still think you have to sell your soul later).

A transaction starts out its life as a local transaction in a program. If this program, or one of the third parties it communicates with, figures out that it would be a nice idea to promote the transaction to a distributed state, then, and only then, will it become a distributed transaction. This raises two questions: why would either party choose to promote the transaction, and what happens when a transaction gets promoted?

A transaction must be promoted to a transaction manager when more than one resource manager wishes to participate in a single transaction, or when a program instructs more than one resource manager to do so. A typical case is when you have your date in several databases in SQL Server and you access both of them in a single transaction. This requires a promotion of the transaction as SQL Server implements transactions per database. This can happen either when you invoke a statement on one database and it is linked to another (e.g. through a view, stored procedure, or function) or if you explicitly execute one or more statements in several databases.

With the "why" clearer, let us turn to the "how". When working with MSDTC, the transaction manager supports two different ways to propagate a transaction: pushing it to a new participant, or pulling it from the originating MSDTC instance from a resource manager. In order to push a transaction, the program needs to know the address of the MSDTC for the target resource manager. The semantics of the two different models are ever so slightly different, but let us focus on the one that the system uses for now (when the program wishes to span multiple databases, for instance). When we are using ADO.NET's SQL Server provider, it will be using a push model as can be seen in SqlInternalConnection.GetTransactionCookie. This means that when ADO.NET wishes to join a transaction to multiple resource managers, it will query the resource manager for its MSDTC address (called whereabouts in the transaction parameters in .NET), and it will instruct its local MSDTC to generate a cookie for the resource manager's MSDTC that it can then communicate with. When the resource manager receives this cookie, it will enlist in the transaction through its local MSDTC.

Okay, so a brief recap: a program starts a transaction, the program decides it needs to communicate with several databases (resource managers), so it generates a cookie for each of these systems after obtaining their MSDTC's address, and each of the databases asks its own MSDTC to communicate with the program's MSDTC indicating that they are now enlisted in the transaction. Great, how does that help us? This might be the moment that you are inquiring someone, anyone, why we need all this to make a transaction work across multiple databases.

When we are working with a "local" transaction, the program instructs the database to commit, the database checks everything is okay, proceeds to commit and everyone is happy. When we are working with multiple databases, imagine that we did the same thing, we first tell the first database to commit, it checks everything is okay, and proceeds to commit. We then tell the second database to commit, but it decides that everything is not okay, and proceeds to roll back. Uh oh, we are now in an inconsistent system state.

Now, distributed transactions would be fairly dull if they did not solve this exact case. The way that most modern distributed transaction systems solve this conundrum is by way of the two-phase commit protocol (2PC). First, every participant (resource manager) is asked to prepare for commit. A participant can either respond that it is prepared, or request to abort. If a single participant wishes to abort, all participants must abort. If everyone is prepared, the transaction can then be committed. There is an implicit guarantee that if a participant respond that it is prepared, it must be able to commit. Preparation is also called phase one, and commit (or roll back) is phase two.

That is all very well, but how does any of this help us with TransactionInDoubtException occurrences in the system? What if MSDTC in general does not cause TransactionInDoubtException to be triggered? Let us take a look at the ways in which this could happen...

In order to investigate the different ways that a program interacting with MSDTC could cause a TransactionInDoubtException let us write a resource manager that we can control to act in interesting ways. To do that in .NET we need two things: a transaction, and enlisting in this transaction as a durable (e.g. database) resource. We will just pretend to be durable, we will make no effort to actually be durable.

A durable enlistment can fail in different methods: prepare, commit, rollback, and in-doubt (since we only have one resource in our test, the resource manager's in-doubt callback will never be invoked, so we can ignore that). Prepare, commit, and rollback can fail (or succeed) in the following spectacular ways: the resource manager could crash, it could do its work as appropriate, it could throw an exception (this is treated the same as a crash), or it could be a reaaaaaaally long time about completing stuff correctly (if anyone is still keeping tabs on what we're trying to accomplish, this is the cause of our TransactionInDoubtException in the "local" use). To test all these scenarios, let us create a simple, "durable" resource manager.

To start out with, let us define the actions that we can take. For simplicity's sake, let us keep them all in a single enum:

enum Action { Prepare, Abort, Commit, Sleep, Crash }

Now it is a bit ambiguous what Action.Prepare means in a commit or roll back scenario, so let us just define what we mean in each of the scenarios here:

Prepare Commit Roll back
Prepare Prepared Done Done
Abort Force rollback Done Done
Commit Prepared Done Done
Sleep Sleep 15 minutes, then prepared Sleep 15 minutes, then done Sleep 15 minutes, then done
Crash Terminate Terminate Terminate

The 15 minutes of sleeping is chosen a bit arbitrarily to really indicate when we are waiting for so long. Termination will be simulated using Environment.Exit(). Enough with the thinking, let's write some code! We define a simple WCF service that allows us to set each of the actions to use. To simulate our use-case we want to use push propagation of the transaction with the service, so we also need a function giving us the service's whereabouts. Note that WCF already supports distributed transactions using transaction flows, but to make things really explicit, we'll construct our own transaction propagation logic. In order to add a bit of logic, we'll be creating a ledger service, imagine a simple service for a bank account that holds a number of money transactions (not to be confused with distributed transactions), and it can tell us the current sum on the ledger. To be able to call methods on the same instance from a client, we'll use a session based approach to the service. Without further ado:

[ServiceContract(Name = "LedgerService", SessionMode = SessionMode.Required)] [ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Single, InstanceContextMode = InstanceContextMode.PerSession)] public class LedgerService : IEnlistmentNotification { private Transaction _transaction; private Ledger _ledger; private readonly List _transactions = new List(); private Action _prepareAction = Action.Prepare; private Action _commitAction = Action.Commit; private Action _rollbackAction = Action.Abort; const string ResoureGuid = "{40C37652-00E0-4E4B-9E31-54B5631116B6}"; [OperationContract] public byte[] GetDtcAddress() { return TransactionInterop.GetWhereabouts(); } [OperationContract] public void LoadLedger(Guid identifier, byte[] transactionCookie) { _transaction = TransactionInterop.GetTransactionFromExportCookie(transactionCookie); _transaction.EnlistDurable(new Guid(ResoureGuid), this, EnlistmentOptions.None); _ledger = new LedgerRepository().GetLedger(identifier); } [OperationContract] public int Sum() { if (_transaction == null || _ledger == null) throw new InvalidOperationException(); return _ledger.Balance + _transactions.Sum(); } [OperationContract] public void AddTransactions(IEnumerable transactions) { if (_transaction == null || _ledger == null) throw new InvalidOperationException(); _transactions.AddRange(transactions); } [OperationContract] public void SetPrepareAction(Action action) { _prepareAction = action; } [OperationContract] public void SetCommitAction(Action action) { _commitAction = action; } [OperationContract] public void SetRollbackAction(Action action) { _rollbackAction = action; } }

Before we can work on a ledger, we need to load it through LoadLedger. This method is given an exported cookie, and we get a transaction from it, and then we enlist on this transaction with ourselves. In order to implement a durable resource manager, and pass this to EnlistDurable, we also need to implement IEnlistmentNotification. To do this, we add the required methods: Prepare, Commit, Rollback, and InDoubt:

public void Prepare(PreparingEnlistment preparingEnlistment) { switch (_prepareAction) { case Action.Abort: preparingEnlistment.ForceRollback(); break; case Action.Commit: case Action.Prepare: // if we were really durable, also save recovery token here preparingEnlistment.Prepared(); break; case Action.Crash: Environment.Exit(1); break; case Action.Sleep: Thread.Sleep(TimeSpan.FromMinutes(15)); preparingEnlistment.Done(); break; } } public void Commit(Enlistment enlistment) { switch (_commitAction) { case Action.Abort: case Action.Commit: case Action.Prepare: enlistment.Done(); break; case Action.Crash: Environment.Exit(2); break; case Action.Sleep: Thread.Sleep(TimeSpan.FromMinutes(15)); enlistment.Done(); break; } } public void Rollback(Enlistment enlistment) { switch (_rollbackAction) { case Action.Abort: case Action.Commit: case Action.Prepare: enlistment.Done(); break; case Action.Crash: Environment.Exit(3); break; case Action.Sleep: Thread.Sleep(TimeSpan.FromMinutes(15)); enlistment.Done(); break; } } public void InDoubt(Enlistment enlistment) { enlistment.Done(); }

With a bit of luck, the code mimics what we've indicated that we are going to do in the table. Now it is just a matter of creating a simple client that starts the service if it is not already running, trying each of the combinations of prepare, commit, and rollback, and recording the results. Rather than present all the individual results of that interesting endeavour, as there are a lot of results, let us do a quick montage and go directly to the interesting bit (note that you can also reason out the results if you understand distributed transactions, but theory and practice tends to differ, so always test your theoretical understanding in practice).

While we have a bunch of different actions, the interesting two, really, are Crash and Sleep as they signify common error scenarios. If we sleep in any of the scenarios, the program will be held up for the amount of time we sleep, nothing adverse will happen, apart from the fact that things are going reaaaaally sloooooow all of a sudden. Yes, the program will wait for the transaction to finish processing for the full 15 minutes, if it starts doing it prior to the transaction's timeout. We'll get back to why that is okay a bit later. If the resource manager crashes in either commit or rollback, we continue on our merry way and the program will get the commit or rollback signal back appropriately. MSDTC will record the outcome of the transaction, and until the resource manager has acknowledged the commit or rollback, it will keep the information around. Since our resource manager is durable and has recorded recovery information for each prepared transaction, when it becomes alive again, it can contact its MSDTC and do recovery. Our resource manager and its durable enlistment isn't really durable because we do not do this, as we are just trying to verify our understanding, but if you ever were to write a durable resource manager, you should definitely follow the recovery protocol appropriately!

This leaves what happens if we crash during the prepare phase: the program gets a TransactionInDoubtException! We are in doubt because one of the durable resource managers that is in play in the transaction is neither telling us whether it is prepared, or whether it succeeded or wants to abort. This means that every other participant in the transaction is now in-doubt about what to do, i.e. in-doubt is the only sensible answer that can be given back. Now, this happens only if a resource manager crashes during the prepare. If it is slow, we just wait for it.

A quick aside before we try to conclude anything on this: if you actually run all combinations of the service you will be left with a lot of abandoned transactions in your MSDTC that you manually need to clear, and you can only clear them one at a time, so proceed with caution.

If we return to "local" transactions to remind ourselves what is at stake, we will get a TransactionInDoubtException if we at any time exceed ConnectTimeout in any transactional step (begin, commit, rollback). If we use a distributed transaction manager, we will get a TransactionInDoubtException if and only if a resource manager participating in the transaction crashes during the prepare phase. Okay, so this means that using MSDTC is better in regards to avoiding this exception, right, and we can stop reading now, please? Well... yes and no (it had to be complicated).

If you're still reading, then I applaud your perseverance, but we'll be here for a while yet. So the short and slightly incorrect answer is: yes, using MSDTC here will be better in so far as you will not get as many TransactionInDoubtException instances in our program. However, there is no such thing as a free lunch in this business of ours, so this affirmative answer comes with some caveats. The first, and simpler, of these is of course: "what about my performance, man?" The second, and more annoying one, is: "so, this is supported in all of my usage scenarios, right?"

The first question is fairly easy to answer: We can do a simulation of running a large number of queries using "local" transactions, and again using distributed transactions. To do this we create a table, tbl1 in a database, db1 with a single integer column as primary key and insert the values 1 through 7 in it. In the local case we will connect, and select a random row from the table. In the distributed case, we will query the MSDTC for its whereabouts, create an export cookie, and then do the same as in the local case, just with a distributed transaction. We'll do this 200,000 times as that's a nice, big, round number, and we'll measure the individual actions and do some simple statistical analysis on the numbers, then we'll do the same for the distributed case. Furthermore, we'll do this on a fairly idle local machine hosting both the program and the SQL Server instance to avoid any network variance affecting the test.

We'll measure the average, mean, mode, variance, standard deviation, min, max, and range and try to say something incredibly intellectual about all of that. We'll be using Windows' high performance timers through Stopwatch and measure in ticks, as that is the highest resolution we are going to get, and high resolution is neat. We'll skip the first 1,000 or so measurements to filter away any noise caused by starting the MSDTC service (it is started on demand), and getting SQL Server up and running, etc.

Mean Median Mode Sample variance Sample std. deviation Min Max Range
Local 402.09 394 392 4265.20 65.31 384 10003 9619
Distributed 1379.26 1369 1364 16399.18 128.06 1256 9121 7865

This is, of course, the numbers generated from running the test on my machine, but they are very repeatable. What is interesting is, of course, in the relation between local and distributed transactions. So what do all these numbers mean? A very brief mathematics introduction for those needing it: the mean is the normal average that we are taught in public school, the median is the middle element in the ordered list of measurements, mode is the most often occuring number, the sample variance is the average less one of the squared difference from the mean, the sample standard deviation is the square root of the sample variance, the minimum and maximum should be fairly selfevident, and range is max subtracted min. We're using sample deviation and variance to correct for not measuring the entire population here. So the standard deviation tells us something about what is "normal" within our measurements, i.e. a +/- change of 65 ticks is normal with local transactions, and +/- change of 128 ticks is normal with distributed transactions.

If we take the mean's difference, then we have a difference of approximately 977 ticks. That's about 2.86 milliseconds of overhead to use distributed transactions, on average. If we add a standard deviation to the mean of both the local and distributed transactions, we get a difference of 1040 ticks, or about 3.04 milliseconds within a standard deviation in a "worst-case" scenario (of course, sometimes the actual measurements will be off by more than a standard deviation, but it gives us a guideline of what we should probably expect). Considering an average overhead of 2.86, every about 350 or so transactions will add a second of overhead when using distributed transactions. Unless all of your transactions are extremely simple, there's a good chance that the 2.86 milliseconds will be a mere trifle compared to how long the transaction usually takes.

Okay, so there is a small cost, but if that removes the TransactionInDoubtException instances from our system, we probably prefer paying that overhead rather than having to do manual analysis of whether an in-doubt transaction is actually committed or rolled back in the overall system.

Wait, we almost forgot the "so, this is supported in all of my usage scenarios, right?" question...

To understand everything that follows, let us turn our attention back to a topic that we skipped quickly over: recovery of durable enlistments. For every transaction that our enlistment's Prepare method is invoked for, the resource manager is supposed to save the recovery token and save it until it has had its Commit or Rollback method invoked. After the Prepare and until it receives the final call, the resource manager is in-doubt. So before we tell the transaction manager that we are prepared, we make damned sure that our recovery token is safely written to permanent store in case we crash, then we indicate that we are prepared, and then the resource manager crashes.

When our resource manager restarts, it is its responsibility to take all the recovery tokens that it is in-doubt about (i.e. all the ones that aren't committed or rolled back), contact its transaction manager and call the recovery call on the transaction manager for a given token, then the transaction manager will call the resource manager back with the final state call (or in-doubt if other parties were unable to resolve the transaction as well). Finally, the resource manager tells the transaction manager that it is done recovering the transactions it knows about; this allows the transaction manager to clean up any pending transactions that it knows about that the resource manager was supposed to receive, but that it never responded to.

So, recovery: resource manager contacts its transaction manager and goes through the recovery steps as indicated above.

Let us divert our attention from the resource and transaction managers' roles and take a look at how you can deploy SQL Server. You can, obviously, host it on a single machine, do backups once in a while, and if the server ever goes down you can panic, flail your arms, and run around like a headless chicken. This is, of course, not a particularly constructive solution. To accomodate the wishes of enterprise customers who want to be able to continue working if a server suddenly fails, SQL Server supports a number of "high availability" features (let's just imagine we're using the latest version; sadly, this is rarely the case, but for simplicity, if we can still find that, let's pretend):

  • AlwaysOn Failover Cluster Instances (FCI): Failover at the server-level, inside a Windows Server Failover Clustering (WSFC)
  • AlwaysOn Availability Groups: Failover at the database level, used in conjunction with WSFC.
  • Database mirroring: Single mirrored database (deprecated, to be removed).
  • Log shipping: Database-level duplication of transaction logs to maintain one or more "warm" stand-by databases.

Pretty much all of these are in use with some of our customers, so we need to consider whether distributed transactions can be used with each of them. To skip a bit ahead, we can consider FCI on its own, and the three others together as a single issue (you can see where this is going, right?). Since this piece is already fairly long, we will not go into detail about the different high availability solutions that SQL Server supports, but merely outline how MSDTC affects them. Let's start with everything but FCI:

When a distributed transaction is in progress, a transaction manager records knowledge about it. Say we are in the preparation phase of a transaction and the database has just committed to prepare to commit, and is as such in-doubt, when a database fails (say, the primary server loses power). In either of the AlwaysOn Availability Group, database mirroring, or log shipping case, we are moving the logs from one SQL Server instance to another. Since we lost power after being prepared, SQL Server has recorded the transaction recovery token in the transaction log. The log has been shipped off to the secondary SQL Server instance and it is now completing the missing transaction log statements in order to become the primary SQL Server (or database) in use. When it gets to our distributed transaction, it tries to contact its local MSDTC, but this transaction manager knows nothing of the transaction (only the MSDTC on the original SQL Server instance machine does so), and it will fail recovery and the database will be in-doubt until a human can come in and fix it (note that it is possible to tell SQL Server to automatically assume either committed or rolled back in case of in-doubt, but it is highly dangerous to do so unless you really understand all the systems and their dependencies that participate in your distributed transactions, so please do not do this).

Thus, the short answer is that for the three latter high availability options that SQL Server supports, we can not use distributed transactions with them. One could theoretically use a shared MSDTC between the machines, but that would just create a different single point of failure, and we really do not like single points of failure in high availability situations. In fact, Microsoft pretty much says that none of these solutions support distributed transactions: Cross-Database Transactions Not Supported For Database Mirroring or AlwaysOn Availability Groups. Note that log shipping is not mentioned here, but it suffers from the same problem, unless you choose to only apply the transaction log to the latest time you know that you have a consistent state (this may cause inconsistency with other participants in the transactions, so I would not advise this approach).

This leaves AlwaysOn Failover Cluster Instances where you failover entire servers and not individual databases. Now, the good news is that it is possible to support distributed transactions and have high availability here, but configuring it just right is a fun journey. Allan Hirt has the de facto guide to how to setup MSDTC with FCI for SQL Server high availability in How to Properly Configure DTC for Clustered Instances of SQL Server (Revised).

Microsoft recommends that you use FCI if using a shared disk solution (e.g. a SAN), or AlwaysOn Availability Groups for data protection through SQL Server, and you can easily use both at once. As such it is highly likely that you will encounter customers running your program in either (or both) of these configurations and you need to be able to outline these concerns for them to make an educated choice on how to best operate your program in their production environment.

Okay, so this is probably quite far enough down the rabbit hole to venture, so let's return to the very beginning: We're seeing TransactionInDoubtException with local transactions, due to an overloaded network, and we want to know if using distributed transactions will help us. So, if you've come all the way through the text, or if you've just jumped to the very ending in the hopes that someone would please summarise all those words, my very succint recommendations for handling this thing:

  • If you do not use high availability, or you use FCI as a high availability solution (let's pretend your operations guys have configured it correctly):
    • Use distributed transactions.
    • Improve the network performance.
    • Increase the ConnectTimeout value (this is not recommended as it can hide issues in your production environment and cause the program to behave sluggishly and possibly below SLA requirements).
  • If you use any of the other high availability solutions for SQL Server in your production environment:
    • Make the network perform better, or
    • Increase the ConnectTimeout value (same comment as in the parenthesis above).

I hope this helps someone understand the TransactionInDoubtException occurrences in .NET programs when using ADO.NET for SQL Server without distributed transactions and how it can be circumvented, mitigated, or solved.

If all of this has made you go from "the MSDTC will devour my first-born and curse my descendants until end of days" to having your interest whetted, you can read a lot more about how it works in [MS-DTCO]: MSDTC Connection Manager: OleTx Transaction Protocol.