Scenario:
Ever since we implemented our new Exchange infrastructure, users had been
having trouble receiving emails with large attachments from external senders.
The puzzling factor was that the delayed emails eventually came through, and the
delay seemed to be completely random, ranging from 1 day to 4 days for each
message. In the meantime, senders were getting “Delivery is delayed to these
recipients or groups” notifications.
Environment:
- Windows Server 2012 R2
- Exchange 2013 with CU3
- Sophos PureMessage spam filtering/AV on Mailbox role servers
- Cisco infrastructure
Troubleshooting steps:
Header analysis:
When these emails eventually came through, header analysis revealed that the
bottleneck was between the last Internet email relay and the organization’s
edge Client Access server.
Two good online header analyzer tools can be found here
In the analyzed headers, hop 8 took 3 days to complete. Internal emails and
emails sent to outside parties were delivered immediately.
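The per-hop delay that a header analyzer reports can also be worked out by hand from the Received headers. The following is a minimal sketch, assuming each Received header ends with a semicolon followed by an RFC 5322 timestamp; the function name and the sample host names and dates are made up for illustration and are not taken from the affected messages.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def hop_delays(raw_headers):
    """Return (hop_number, seconds_since_previous_hop) for hops 2..n."""
    msg = message_from_string(raw_headers)
    received = msg.get_all("Received") or []
    stamps = []
    # The newest hop is listed first, so reverse to get chronological order.
    for value in reversed(received):
        # The timestamp follows the last semicolon of a Received header.
        date_part = value.rsplit(";", 1)[-1].strip()
        stamps.append(parsedate_to_datetime(date_part))
    delays = []
    for i in range(1, len(stamps)):
        delta = (stamps[i] - stamps[i - 1]).total_seconds()
        delays.append((i + 1, delta))
    return delays


# Hypothetical headers: the last hop (relay to edge server) takes 3 days.
SAMPLE = (
    "Received: from relay.example.net by cas.example.org;"
    " Thu, 06 Mar 2014 09:00:05 +0000\n"
    "Received: from mail.example.com by relay.example.net;"
    " Mon, 03 Mar 2014 09:00:05 +0000\n"
    "Received: from client.example.com by mail.example.com;"
    " Mon, 03 Mar 2014 09:00:00 +0000\n"
    "Subject: test\n\n"
)

for hop, seconds in hop_delays(SAMPLE):
    print(f"hop {hop}: {seconds / 3600:.2f} hours")
```

A hop whose delay is measured in days rather than seconds is the one to investigate. Clock skew between servers can distort small values, so only large gaps are meaningful.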
Exchange Logs:
I enabled Exchange protocol logging to see what was causing this bottleneck.
What I wanted to check was the SMTP transactions between the external hosts and
the organization’s edge CAS server. For this you need to analyse the Exchange
protocol logs.
Notice that the external relay makes several delivery attempts, but only one of
them gets through. Also notice the SMTP “451 4.7.0 Timeout waiting for client
input” message that gets logged before each failed delivery session times out.
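Scanning the protocol logs for these timeouts by eye gets tedious, so a small script can help. This is a rough sketch, assuming the usual comma-separated SMTP receive protocol log layout with a `#Fields:` header line; the function name, server name, addresses, and session IDs in the sample are invented for illustration.

```python
import csv

# Column order as listed in the "#Fields:" line of the protocol log (assumed).
FIELDS = [
    "date-time", "connector-id", "session-id", "sequence-number",
    "local-endpoint", "remote-endpoint", "event", "data", "context",
]


def timed_out_sessions(lines):
    """Return (session_id, remote_endpoint) for sessions that logged 451 4.7.0."""
    # Skip the comment lines that start with '#', then parse the rest as CSV.
    rows = csv.reader(line for line in lines if not line.startswith("#"))
    hits = []
    for row in rows:
        if len(row) < len(FIELDS):
            continue  # ignore blank or truncated lines
        record = dict(zip(FIELDS, row))
        if record["data"].startswith("451 4.7.0"):
            hits.append((record["session-id"], record["remote-endpoint"]))
    return hits


# Hypothetical log excerpt: one session times out, one is accepted.
SAMPLE_LOG = [
    "#Fields: " + ",".join(FIELDS),
    "2014-03-03T09:00:00.000Z,CAS01\\Default Frontend CAS01,08D10000000000A1,"
    "0,192.0.2.10:25,203.0.113.5:40112,+,,",
    "2014-03-03T09:10:00.000Z,CAS01\\Default Frontend CAS01,08D10000000000A1,"
    "14,192.0.2.10:25,203.0.113.5:40112,>,"
    "451 4.7.0 Timeout waiting for client input,",
    "2014-03-06T09:00:00.000Z,CAS01\\Default Frontend CAS01,08D10000000000A2,"
    "20,192.0.2.10:25,203.0.113.5:40990,>,"
    "250 2.6.0 Queued mail for delivery,",
]

for session_id, remote in timed_out_sessions(SAMPLE_LOG):
    print(f"session {session_id} from {remote} timed out")
```

In practice you would pass the lines of a real log file to `timed_out_sessions` and compare the remote endpoints against the relay seen in the message headers.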
So what is causing these timeouts? These are the areas I thought needed to be checked:
1. Timeout setting on the Exchange receive connector
2. Sophos PureMessage/AV scan hindering the email flow
3. Packet MTU size settings on the Client Access servers and the edge router
You can check the timeout settings for the Exchange receive connectors by using
the Get-ReceiveConnector cmdlet in the Exchange PowerShell, and change them to
any desired value by using the Set-ReceiveConnector cmdlet.
Make sure you restart the Exchange Transport service or restart the server for
this change to take effect.
In my case I changed the timeout from the default 10 minutes to 20 minutes. It
didn’t fix the problem.
In response to this type of timeout, most people suggest that it can be caused
by a spam filtering/AV solution on the server. In my case I disabled all AV/spam
filtering services installed on the CAS and Mailbox servers and restarted the
servers. No luck. Emails were still getting delayed.
In a few places, people have suggested that MTU packet size issues can cause
mails with large attachments to fail. See this Microsoft article, How to
Troubleshoot Black Hole Router Issues, for more details and the steps for
troubleshooting MTU-related issues.
This can be a real issue if your firewall is blocking incoming/outgoing ICMP
traffic. Path MTU discovery relies on ICMP messages sent back by the routers
along the way, and blocking them will hinder this discovery process.
So, make sure that you can ping the edge CAS server using large packets. Apart
from that, you don’t have to muck around with MTU settings on the routers or
Windows hosts as some people are suggesting.
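For the large-packet ping test, it helps to know how big the ICMP payload can be before a packet exceeds the path MTU. The sketch below assumes IPv4 with a 20-byte header (no options) and an 8-byte ICMP header, and uses the Windows ping options -f (set Don’t Fragment) and -l (payload size); the function names and the host name are placeholders, and the script only builds the command rather than running it.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def df_ping_command(host, mtu=1500):
    """Build a Windows ping command with Don't Fragment set and a full-size payload."""
    return ["ping", "-f", "-l", str(max_icmp_payload(mtu)), host]


# Hypothetical host name; replace with the edge CAS server's address.
print(" ".join(df_ping_command("cas.example.org")))
```

If the ping at this size fails while smaller sizes succeed, lowering the payload step by step shows the largest packet the path carries without fragmentation.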
So, what was ultimately causing this email delay?
In the end, someone on one forum suggested that any type of SMTP
analysis/filtering done on a firewall can cause SMTP timeouts.
“Is there a firewall? I'm suspicious that
you have a firewall that's interfering with the SMTP traffic.”
Analyzing the zone-based Cisco IOS firewall on the edge router revealed that it
was configured to “Inspect” SMTP traffic from outside to the CAS server in the
DMZ.
Instead of inspecting SMTP traffic specifically, I changed it to generic TCP/UDP
traffic inspection, and suddenly all the mails with large attachments that had
been trying to get through for days started pouring in.