Unreliable notifications in 10.2??

I have some reports from users that they are failing to get notifications (mostly live alerts) since we upgraded to 10.2. I'm not sure where to start on this, because as far as I can see notifications are working. I'm aware of user settings, and that browser settings might interfere, but these are existing users who were getting notifications before the upgrade.

We did previously have one or two reports of odd behaviours, such as notifications that could not be dismissed - so a user always had one or two notifications, irrespective of any action to view the relevant thread, dismiss the notification, etc. However, the reports for 10.2 are new.

Historically, we've had similar reports regarding emails. Some users report notifications not arriving for weeks, and then starting again, and then stopping. I've always put those down to hyper-active anti-spam systems blocking emails, rather than anything within Telligent, but it's hard to be certain.

Parents
  • To chip this apart a bit, as these seem possibly unrelated:

    Regarding live alerts:

    Being offline would prevent a live alert from being received. The notification itself would still exist and would show up in the notification queue on the next refresh. But the live/pushed alert might not pop up. The notification emails would also still be delivered (as per users' settings). Essentially, the live alert is just one possible distribution method of a notification and it does not prevent the others.
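
    To illustrate the point that the live alert is only one delivery channel, here is a minimal conceptual sketch. It is not the platform's actual code: `Notification`, `distribute`, and the two channel functions are hypothetical names made up for this example. It only shows that a failed live alert leaves the stored notification and the other channels untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Notification:
    user: str
    message: str
    delivered_via: List[str] = field(default_factory=list)


def distribute(notification, channels, queue):
    """Store the notification, then attempt each channel independently."""
    queue.append(notification)  # the notification exists regardless of delivery
    for name, send in channels.items():
        try:
            send(notification)
            notification.delivered_via.append(name)
        except ConnectionError:
            # A failed channel (e.g. the user is offline) is skipped; others still run.
            continue
    return notification


def live_alert_offline(notification):
    # Hypothetical stand-in for a push to a user who has no open connection.
    raise ConnectionError("user has no open socket connection")


def email_ok(notification):
    # Hypothetical stand-in for handing the message to an email job.
    pass


queue = []
result = distribute(
    Notification("alice", "New reply in your thread"),
    {"live_alert": live_alert_offline, "email": email_ok},
    queue,
)
print(result.delivered_via)  # ['email']
print(len(queue))            # 1
```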

    As for why the live alert might not show:

    Live alerts depend on (1) the user being logged in (not holding a stale authentication token), (2) the user being connected to the socket endpoint and not seeing one of those "you've gone offline" messages, and (3) a message bus running successfully.

    For 1 (being logged in) - it's possible the user is technically no longer authenticated. We sometimes see this in web farm environments where either there are no sticky sessions keeping a user attached to a single server, or the machine keys are not set to be the same across web app instances (which is what ensures that a user redirected to a different server is still authenticated). Either of these would break the authentication token and result in a "you've gone offline" message, no live alerts, a forced re-login on the next navigation, and other broken functionality such as being unable to post content. This would be my first place to look.
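
    As a rough way to check the machine key point, the sketch below compares the `<machineKey>` element of the web.config from each web node. It assumes you have copied each node's web.config text somewhere you can read it; the node names and the key values ("AAAA", "BBBB") are placeholders, and `machine_key_settings` and `keys_are_shared` are hypothetical helper names. A node with no explicit `<machineKey>`, or with auto-generated keys, is treated as not shared.

```python
import xml.etree.ElementTree as ET

KEY_ATTRS = ("validationKey", "decryptionKey", "validation", "decryption")


def machine_key_settings(web_config_xml):
    """Return the machineKey attributes from web.config text, or None if absent."""
    root = ET.fromstring(web_config_xml)
    element = next(root.iter("machineKey"), None)
    if element is None:
        return None
    return tuple(element.get(attr) for attr in KEY_ATTRS)


def keys_are_shared(configs):
    """True only if every node has an explicit, identical machineKey."""
    settings = [machine_key_settings(xml) for xml in configs.values()]
    if not settings or any(s is None for s in settings):
        return False
    # Auto-generated keys differ per server, so they cannot be shared.
    if any("AutoGenerate" in (value or "") for s in settings for value in s):
        return False
    return len(set(settings)) == 1


# Placeholder configs; real keys are long hex strings.
NODE_A = """<configuration><system.web>
  <machineKey validationKey="AAAA" decryptionKey="BBBB"
              validation="HMACSHA256" decryption="AES" />
</system.web></configuration>"""
NODE_B = NODE_A
NODE_C = NODE_A.replace("AAAA", "CCCC")
NODE_D = "<configuration><system.web /></configuration>"

print(keys_are_shared({"web1": NODE_A, "web2": NODE_B}))  # True
print(keys_are_shared({"web1": NODE_A, "web2": NODE_C}))  # False
print(keys_are_shared({"web1": NODE_A, "web2": NODE_D}))  # False
```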

    For 2 (connection to the socket endpoint) - this also depends on authentication working. It also depends on there being no load balancer or proxy that breaks these connections. The connections auto-recover when possible, but that can only go so far if infrastructure is preventing them. To be specific, the connection is held open to the URL "socket.ashx".

    For 3 (message bus) - what kind of setup do you have? How many web nodes? Is the job server running? Are you running the socket message bus service? It's important to ensure that either the Socket Message Bus Service or Database Message Bus is properly configured. You can check this in Administration > Site > Message Buses. If neither is active, that needs to be resolved. If one of them is active, review its Diagnostics tab to ensure that all web nodes and the job server appear green and connected to the bus.

    So to review:

    1. How is authentication configured?
    2. Are sticky sessions enabled?
    3. Are machine keys shared among web app servers?
    4. Is a message bus properly configured? (ideally the Socket Message Bus Service, but could also be the fallback Database Message Bus)
    5. It could be helpful to review a Fiddler trace capturing the time span of one of these "you've gone offline" messages, as well as an attempt to navigate the site after receiving the message. This could show details about how/if authentication is being lost or whether anything is preventing the socket.ashx connection.
    6. Are any exceptions logged during the times when you would expect a notification?
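
    For item 5, one way to skim a trace is to export it from Fiddler in HTTP Archive (HAR) format and filter it with a short script. The sketch below is only a starting point: `summarize_trace` is a hypothetical helper, the sample trace and the community.example.com host are made up, and it assumes the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.status`). It lists every request to socket.ashx plus any other request answered with 401 or 403. Depending on how authentication is configured, a lost login may instead show up as a redirect to the login page, so redirects can be worth a look too.

```python
import json


def summarize_trace(har_text, needle="socket.ashx"):
    """Return (time, status, url) for socket requests and auth failures."""
    har = json.loads(har_text)
    findings = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        status = entry["response"]["status"]
        if needle in url or status in (401, 403):
            findings.append((entry.get("startedDateTime", ""), status, url))
    return findings


# Made-up sample trace with three requests.
SAMPLE_HAR = json.dumps({"log": {"entries": [
    {"startedDateTime": "2018-01-01T10:00:00Z",
     "request": {"url": "https://community.example.com/"},
     "response": {"status": 200}},
    {"startedDateTime": "2018-01-01T10:00:05Z",
     "request": {"url": "https://community.example.com/socket.ashx"},
     "response": {"status": 502}},
    {"startedDateTime": "2018-01-01T10:00:09Z",
     "request": {"url": "https://community.example.com/forums/post"},
     "response": {"status": 401}},
]}})

findings = summarize_trace(SAMPLE_HAR)
for started, status, url in findings:
    print(started, status, url)
```

    A socket.ashx request that fails or is cut short at the load balancer or proxy points at item 2 above, while 401/403 responses on ordinary pages right after the "you've gone offline" message point at item 1.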

    Regarding inability to dismiss notifications:

    "We did previously have one or two reports of odd behaviours, such as notifications that could not be dismissed - so a user always had one or two notifications, irrespective of any action to view the relevant thread, dismiss the notification, etc."

    This has not been reported before. Are you still seeing this? If so, can you share any server and client side exceptions when they occur?

    Regarding email notifications:

    "Historically, we've had similar reports regarding emails. Some users report notifications not arriving for weeks, and then starting again, and then stopping."

    Are you still seeing these too? This would be unrelated to live alerts. It could be user configuration, site configuration, or even issues with the job server delivering the emails. If you're seeing these, can you provide more detail?
