question

DavidBeavon-2754 asked KranthiPakala-MSFT edited

Does self-hosted IR need to be rebooted periodically?

This past week one of our self-hosted integration runtime (SHIR) services started taking longer and longer to run some pipelines that copy data to ADLS. This happened over the course of a few days. See the image below: the pipelines run longer each day. The durations grew rapidly between May 5 and May 7, and by the end they reached about 10x their normal durations.


There wasn’t any obvious problem with CPU or RAM on the machine, but when I restarted the IR service, things went back to normal. Daily pipelines went back to completing in a reasonable amount of time.


In the past when I've reported issues, the ADF support team has been quick to ask me to reinstall the IR. Is it also necessary to restart the IR regularly, on a daily or weekly basis? Would that avoid this type of behavior? Is that what others are doing? I suspect ADF is more profitable when pipelines take all day to complete. But that isn't the desired behavior from a customer perspective.


95623-image.png


azure-data-factory

Hi @DavidBeavon-2754,

Thanks for reaching out. It is not necessary to restart the IR on a regular basis.
When pipelines run for a long time or stay queued, restarting the IR is the recommended first step to see whether that resolves the issue.
Because the SHIR is hosted on a machine, many other factors, such as system resources, network issues, or Windows updates, can lead to long-running pipelines. Resource usage also depends heavily on the amount of data that is moved. When multiple copy jobs are in progress, resource usage goes up during peak times.
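For anyone who wants to script the "restart the IR as a first step" advice, here is a minimal sketch in Python. It assumes the SHIR Windows service is registered under the name DIAHostService (please verify the name in services.msc on your own machine) and that the script runs from an elevated prompt. The function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess
from typing import List

# Assumption: "DIAHostService" is the Windows service name of the
# self-hosted IR on this machine. Check services.msc before relying on it.
SHIR_SERVICE_NAME = "DIAHostService"


def restart_commands(service_name: str = SHIR_SERVICE_NAME) -> List[List[str]]:
    """Return the stop/start commands for the given Windows service."""
    return [
        ["net", "stop", service_name],
        ["net", "start", service_name],
    ]


def restart_service(service_name: str = SHIR_SERVICE_NAME, dry_run: bool = True) -> None:
    """Stop and start the service; with dry_run=True only print the commands."""
    for cmd in restart_commands(service_name):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError if the command fails,
            # for example when the prompt is not elevated.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    restart_service(dry_run=True)
```

This is meant for a one-off restart while troubleshooting, not as a scheduled workaround, since a regular restart should not be necessary.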


Could you please tell us how frequently you are seeing this issue? How many times has the IR been restarted in the last few weeks to resolve such issues? Did you notice this behavior before May 4th? We ask because we haven't seen other users report this issue in the last few weeks. If it were related to SHIR itself, I would expect many other users to have experienced the same behavior; otherwise it is more likely specific to the machine that hosts the IR.
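One way to answer the "how often, and since when" questions is to compare each day's pipeline duration against a baseline. The sketch below is only an illustration: the function names, the thresholds, and the sample durations are hypothetical and are not taken from the screenshot. It treats the median of the first few days as normal and reports the first day on which a sustained slowdown begins.

```python
from statistics import median
from typing import List, Optional


def slowdown_factors(durations_min: List[float], baseline_days: int = 7) -> List[float]:
    """Express each duration as a multiple of the median of the first baseline_days."""
    if len(durations_min) < baseline_days:
        raise ValueError("not enough history to compute a baseline")
    baseline = median(durations_min[:baseline_days])
    if baseline <= 0:
        raise ValueError("baseline duration must be positive")
    return [d / baseline for d in durations_min]


def first_sustained_slowdown(
    durations_min: List[float],
    baseline_days: int = 7,
    threshold: float = 2.0,
    run_length: int = 2,
) -> Optional[int]:
    """Return the index where run_length consecutive slow days begin, or None."""
    streak = 0
    for i, factor in enumerate(slowdown_factors(durations_min, baseline_days)):
        streak = streak + 1 if factor >= threshold else 0
        if streak == run_length:
            return i - run_length + 1
    return None


# Hypothetical daily durations in minutes: a week of normal runs, then growth.
history = [10, 11, 9, 10, 10, 12, 10, 11, 25, 40, 100]
print(first_sustained_slowdown(history))  # index of the first slow day
```

Requiring two consecutive slow days is a design choice to avoid flagging a single large copy job; the threshold and run length would need tuning to the actual workload.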

Here are some related resources; see if they help:

  1. https://docs.microsoft.com/azure/data-factory/self-hosted-integration-runtime-troubleshoot-guide#concurrent-jobs-limit-issue

  2. https://docs.microsoft.com/azure/data-factory/create-self-hosted-integration-runtime#prerequisites

  3. https://docs.microsoft.com/azure/data-factory/pipeline-trigger-troubleshoot-guide#pipeline-status-is-queued-or-stuck-for-a-long-time





@KranthiPakala-MSFT I know there are complex issues that can cause pipelines to behave differently at times. My primary question is how people generally troubleshoot, i.e. is there a series of common steps that people use to "kick" things and make them go back to normal again (e.g. step 1: reboot as needed or reboot regularly; step 2: reinstall SHIR; step 3: investigate any legitimate sources of delay that are caused by misbehaving customer code rather than by Microsoft code in the IR)?

I suspected networking issues but that wouldn't explain why things went back to normal right after restarting these services.

One thing to point out is that we use the new VNET-enabled IR. This may make our SHIR unique...

Another thing to point out is that Microsoft may already be aware of a generalized issue, which is why I'm asking. It is likely that they could chart the problem based on customer billings alone. eg...

96099-image.png



Hi @DavidBeavon-2754 ,

Thanks for your response and the details. Here is a self-help troubleshooting document for IR-related issues; please see if it helps: Troubleshoot copy activity on Self-hosted IR

"I suspected networking issues but that wouldn't explain why things went back to normal right after restarting these services."

  • To find the actual reason for this, I would recommend filing a support case so that a support engineer can involve the relevant product team members and dig into the SHIR logs. You also mentioned that you have raised support tickets in the past for SHIR issues. Could you please share those SR#'s with us so that I can pass the feedback on internally?

Thank you and sorry for the inconvenience because of this issue.






Hi @DavidBeavon-2754, following up to see whether you are still interested in the root cause of the issue with your SHIR. If so, could you please share the additional details requested in my previous comment? We would also need the data factory name, the SHIR details, and the SHIR logs. It would be great if you could file a support ticket so that a support engineer can set up a screen-share call with you and involve the relevant product team to take a deeper look.


0 Answers