IRL #1: Non-yielding IOCP Listener

IRL – internet lingo for “In Real Life”

This is first in what I hope will be a regular series of posts that highlight a problem my team encountered on a SQL Server, what the cause was, and how we fixed it.

We will of course anonymize the details.  Part of this is to tell interesting stories, but a bigger part is to help future Googlers find how we resolved it.

IRL #1 – the Non-Yielding IOCP Listener

Problem: SQL Server running a major ERP system suddenly becomes non-responsive, effectively shutting down an entire retail company.  Online sales, Distribution, Customer Service…all of it.  2 weeks before Christmas.

Background:

  • SQL 2012, SP3 – Enterprise
  • 2-node Failover Cluster Instance
  • Replication Publisher
  • Other databases on this instance as well
  • Beefy hardware
  • No issues like this in at least a year that I have known them
  • No changes to the SQL code, SQL Instance or any application deployments

Investigation:

ERRORLOG (always go to the SQL Errorlog early in the process):

IO Completion Listener (0x1d8c) Worker 0x00000000056AC160 appears to be non-yielding on Node 1. Approx CPU Used: kernel 0 ms, user 15000 ms, Interval: 15002.
Using 'dbghelp.dll' version '4.0.5'
**Dump thread - spid = 0, EC = 0x0000000000000000
***Stack Dump being sent to F:\MSSQL11.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
*   12/14/18 12:43:38 spid 7328
*
* Non-yielding IOCP Listener
*
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump
Stack Signature for the dump is 0x000000000000027C
External dump process return code 0x20000001.
External dump process returned no errors.

IO Completion Listener (0x335c) Worker 0x00000000036AC160 appears to be non-yielding on Node 0. Approx CPU Used: kernel 0 ms, user 15000 ms, Interval: 15002.
IO Completion Listener (0x1d8c) Worker 0x00000000056AC160 appears to be non-yielding on Node 1. Approx CPU Used: kernel 0 ms, user 23890 ms, Interval: 25005.

***Unable to get thread context for spid 0
* *******************************************************************************
*
* BEGIN STACK DUMP:
*   12/14/18 12:44:33 spid 13940
*
* Non-yielding Scheduler
*
* *******************************************************************************
Stack Signature for the dump is 0x000000000000027C
External dump process return code 0x20000001.

Note there are 2 different Non-yielding messages – Schedulers and IOCP Listener.

Also, stack dumps are being generated.  I do not have the tools or skills to analyze a stack dump.  I rely heavily on Microsoft for this when needed.

The stack dump .txt and .log files contained nothing other than what was in the ERRORLOG.

Resolution Attempts:

  • Failover to the B node.  Because its a cluster, and this might be a one-time glitch in the Matrix, right?
    • Post-failover the system was up for 30 minutes as the Application servers were restarted one by one.
    • When app server “4” was started we almost immediately got a repeat of the above behavior.
  • Try again, with fewer app servers – nope.
  • I suggested sending the stack dump to MS early on…but we kept trying failovers and restarts, thinking the App server load might be the issue.
  • We upgraded to the current SP, as this error in SQL 2008 was resolved in service packs, and in CUs for 2016/2017.  This did not change anything.
  • Lather, rinse, repeat in various ways for <redacted> hours until we decided to get Microsoft Involved.  We opened a Premier CritSit case, which triggers a LOT of attention and internal alarm bells on the MS side.  This has not changed since my last contract there in 2006.
  • Inside of 2 hours, MS had analyzed the dump and sent back this:
    • NonYielding Call Stack
      # Call Site
      00 <non-SQL Server .dll>
      01 0x0
      02 0x0
    • “We see the third party DLL located in the SQL Server address space: C:\Windows\System32\<non-SQL Server>.dll.”
    • A really good article that does reference the situation
    • “We see <non-SQL Server>.dll being called and it is interfering with the IOCP listener which is the thread that accepts incoming connections/TDS packets therefore preventing additional connections (the behavior we see when we cannot connect via SSMS, for example). We see the thread stalled for 15 seconds and 15 threads in SQL that have that on top of their stack. What is the <non-SQL Server>.dll?  It is calling a jmp detour in ntdll.dll.”

Solution:

  • Identified the .dll and verified with the vendor what it was doing
  • Uninstalled the application that put it there (both nodes)
  • Restarted everything with no further issues.

Lessons you can benefit from:

  • Stay current on Service Packs – many of the issues are fixed in SPs and CUs
  • If you get recurring stack dumps, call MS sooner rather than later…even on a credit card, $500 is cheaper than Prod being down
  • Karthik PK from MS has some good posts here and here on this

I am specifically leaving out the name of the offending .dll for two reasons:

  • Client anonymity
  • We don’t shame vendors around here…was probably a small bug in a product update

If you enjoyed this post and want to see more like it, please leave a comment, or ping me on Twitter

Other posts in the IRL – In Real Life series:

https://dallasdbas.com/category/sql/irl/

Thanks for reading!

Kevin3NF

Leave a Comment

Sign up for our Newsletter

%d bloggers like this: