Monday, 2 February 2026

SD-WAN issue

I recently handled a tricky network issue on the SD-WAN team, and the root cause wasn't what we expected!


Step-by-step isolation narrowed down the root cause, and deeper analysis of the switches' packet captures later confirmed it, allowing us to resolve the issue.

Problem description: after a recent site migration to two SD-WAN cEdge routers, during which the access and core switches were also replaced, users in VLAN A could not join MS Teams meetings or send screenshots, while all other applications and traffic were intact. Users in VLAN B, however, could join MS Teams meetings and send screenshots without issue!

Affected path:
user--VLAN A--(access switch stack)--trunk--(core switch)--SVI VLAN A--LAN--TenGig0/0/X.10--cEdge--NAT DIA--Internet

Working path:
user--VLAN B--(access switch stack)--trunk--(core switch)--LAN--TenGig0/0/X.B--cEdge--NAT DIA--Internet

Troubleshooting steps:

* We compared the working and non-working scenarios from the SD-WAN cEdge perspective. The affected path uses VRF C, whose data policy has a default action of drop, and its NAT DIA sequence uses PCG. The working path uses VRF D, which has no PCG in NAT DIA and whose data policy default action is accept.

* We tried deleting the PCG and using only the nat-use vpn 0 DIA action, and we also changed the default action to accept, but the issue persisted!

* We checked the IP MTU and TCP MSS values and found them identical in both scenarios. There was also no indication of ISP drops: if the ISP were dropping traffic, VLAN B would have been affected as well, which was not the case.
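As a side note, the relationship between IP MTU and TCP MSS is fixed, so the two values can be cross-checked quickly. The sketch below (the interface names and MTU values are illustrative, not from this case) computes the MSS that should accompany a given MTU:

```python
# For IPv4 without options, TCP MSS = IP MTU - 20 (IP header) - 20 (TCP header).
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options

def expected_mss(ip_mtu: int) -> int:
    """Return the TCP MSS that matches a given IP MTU."""
    return ip_mtu - IP_HEADER - TCP_HEADER

# Compare the two paths: a mismatch here would hint at an MTU/MSS problem.
paths = {"vlan_A_subif": 1500, "vlan_B_subif": 1500}  # assumed example MTUs
mss = {name: expected_mss(mtu) for name, mtu in paths.items()}
print(mss)  # a standard 1500-byte MTU gives an MSS of 1460
```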

* We took FIA packet traces on the cEdge router and a packet capture on the cEdge LAN interface. No drops were seen on the cEdge: all MS Teams TCP signaling and UDP media-streaming traffic was allowed and forwarded through NAT DIA.

From the pcap analysis we also found that the TCP and SSL handshakes completed successfully, application data was exchanged, and the UDP media-stream traffic looked fine. We were filtering on specific MS Teams traffic, and everything looked good!
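A rough first-pass filter for Teams traffic in a capture can key on the Teams media UDP port range (3478-3481, per Microsoft's published Microsoft 365 network requirements) plus TCP 443 for signaling. This is a simplified sketch of that idea, not the actual filter used in the case; a real filter should also match Microsoft's published IP ranges:

```python
# Teams media commonly uses UDP 3478-3481; signaling rides over TCP 443.
TEAMS_MEDIA_UDP = range(3478, 3482)  # UDP 3478-3481 inclusive

def looks_like_teams(proto: str, dst_port: int) -> bool:
    """Rough classifier for MS Teams traffic seen in a capture."""
    if proto == "udp":
        return dst_port in TEAMS_MEDIA_UDP  # media stream
    if proto == "tcp":
        return dst_port == 443  # HTTPS signaling
    return False

print(looks_like_teams("udp", 3479))  # media port -> True
print(looks_like_teams("tcp", 80))   # plain HTTP -> False
```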

* We noticed in the user's pcap that some returning TCP packets were 1506-byte reassembled IP fragments, so we lowered the IP MTU to 1300 and the TCP MSS to 1250 bytes on the LAN sub-interface. This let the cEdge reduce the MTU and MSS for return packets in the WAN-to-LAN direction, but it didn't resolve the issue either, which also ruled out drops on the switch side!
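For context on why a 1506-byte packet shows up as fragments: it exceeds a standard 1500-byte MTU, so an IPv4 router splits it, with each non-final fragment carrying a payload that is a multiple of 8 bytes. A small sketch (the sizes are illustrative, matching the numbers above rather than taken from the capture):

```python
# Split one IPv4 packet into fragment sizes for a given MTU.
IP_HEADER = 20  # bytes, IPv4 header without options

def fragment_sizes(packet_len: int, mtu: int) -> list[int]:
    """Return the on-wire sizes of the IPv4 fragments of one packet."""
    if packet_len <= mtu:
        return [packet_len]  # fits, no fragmentation
    payload = packet_len - IP_HEADER
    per_frag = (mtu - IP_HEADER) // 8 * 8  # fragment payload, 8-byte aligned
    sizes = []
    while payload > 0:
        chunk = min(per_frag, payload)
        sizes.append(chunk + IP_HEADER)  # each fragment gets its own IP header
        payload -= chunk
    return sizes

print(fragment_sizes(1506, 1500))  # a 1506-byte packet becomes two fragments
```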

* Later, from the switches' pcaps, we saw that the core switch was not sending some MS Teams traffic to the cEdge LAN interface: the destination MAC of those packets wasn't the VRRP virtual MAC address. (Note: the VRRP virtual MAC has the form 00-00-5E-00-01-{VRID}.) We concluded that some MS Teams traffic wasn't being forwarded to the cEdge LAN. The core switch's routing table had two MS Teams routes (other MS public subnets) pointing to the MPLS router IP instead of the cEdge LAN IP, so we resolved the issue with PBR on the core switch's interface VLAN A, forcing all MS Teams traffic to be routed to the cEdge LAN.
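The VRRP virtual MAC is deterministic (00-00-5E-00-01-{VRID} for IPv4 VRRP, per RFC 3768), so destination MACs in a capture can be checked mechanically. The sketch below uses a made-up VRID and sample MACs, not values from this case:

```python
# Flag captured packets whose destination MAC isn't the VRRP virtual MAC,
# i.e. traffic that bypassed the VRRP gateway.
def vrrp_virtual_mac(vrid: int) -> str:
    """Return the IPv4 VRRP virtual MAC for a given VRID (0-255)."""
    return f"00:00:5e:00:01:{vrid:02x}"

vrid = 10  # assumed VRID for the example
expected = vrrp_virtual_mac(vrid)

captured_dst_macs = ["00:00:5e:00:01:0a", "aa:bb:cc:dd:ee:ff"]  # sample data
for mac in captured_dst_macs:
    status = "OK (VRRP gateway)" if mac.lower() == expected else "bypassed VRRP"
    print(mac, "->", status)
```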
