Verint Job keeps leaving "Running since ..." ghost job entries after successful execution

This issue is related to a private support ticket #1160782

Introduction

- We have implemented multiple custom Verint Jobs for an extension called Video Manager. This extension is intended for managing and processing videos on a Verint instance

- This extension has two primary jobs that handle video processing on the server: VideoManagerContentCheckJob and VideoManagerJob

- VideoManagerContentCheckJob - runs once a day, plus once at server restart; it re-checks all content on the server for unprocessed videos and schedules them for processing

- VideoManagerJob - runs every minute to check whether any videos are scheduled for processing; if there are none, it exits immediately

Server info:

- Verint version 11.1.8.16788

Issue

1. On server launch, VideoManagerContentCheckJob runs as scheduled and checks all videos on the server

2. We have very detailed event logging showing that the main method of this job, Execute(...), exits successfully

3. After the job finishes, we very frequently still see VideoManagerContentCheckJob in the list of active jobs, where it stays marked as active forever

4. A server restart does not get rid of these entries, and new fake active entries keep appearing

4.1. We can be certain that these active jobs are "ghosts" because our detailed log shows the main job method exiting, and every job run takes a named mutex lock that prevents multiple job instances from running in parallel (a hypothetical sketch of this guard follows below)

Screenshot from the Job administration panel: all of these jobs are actually not active and finished successfully a long time ago
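
For context, a hypothetical reconstruction of the named-mutex guard mentioned in 4.1 (the actual implementation is not shown in this thread; the mutex name and class body are illustrative):

```csharp
using System;
using System.Threading;

public class VideoManagerContentCheckJob
{
    public void Execute()
    {
        // Named, machine-wide mutex: if another instance of this job already
        // holds it, skip this run instead of executing in parallel.
        using (var mutex = new Mutex(initiallyOwned: false, name: "Global\\VideoManagerContentCheckJob"))
        {
            if (!mutex.WaitOne(TimeSpan.Zero))
                return; // another instance is running; exit immediately

            try
            {
                // ... re-check all content for unprocessed videos ...
            }
            finally
            {
                mutex.ReleaseMutex();
            }
        }
    }
}
```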

  • A few notes from taking a look at your VideoManagerContentCheckJob code:

    1. Your Initialize() method attempts to Schedule the job manually - you do not need to do this. As long as the job is enabled and has a valid schedule configured, it will automatically follow that schedule. Scheduling manually is causing it to run as a dynamic job, resulting in the multiple instances you are seeing. Test this first and see if it resolves the issue before trying the other recommendations.
    2. We recommend that asynchronous patterns not be used. The job service is already managing background process usage according to the server resources available, and even with Cancellation and Task.WaitAll there is too much possibility of deadlock when blocking in a synchronous context. This is especially true when accessing an external resource like a SQL database, as you do. Two alternatives:
      1. Revert to synchronous/sequential method calls
      2. Separate this job into individual jobs for each "section" of content (MediaValuesChecker, ForumsChecker, etc) and let the job server manage scheduling and running them all individually. This could also assist in pinpointing any specific issues with particular content types. You could use a base class to avoid code repetition and only re-implement key components like JobTypeId, Name, and Checker logic (a sketch of this approach follows after this list).
    3. Similarly, we recommend that Mutex locking should not be used as it may have adverse effects on the job service's management of job execution and scheduling.
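
    As a hedged sketch of alternative 2.2 (the actual Verint plugin interface is omitted here; only JobTypeId and Name come from the notes above, and the rest of the names are hypothetical):

    ```csharp
    using System;

    // Shared base class: derived jobs re-implement only identity and checker logic.
    public abstract class ContentCheckJobBase
    {
        public abstract Guid JobTypeId { get; }
        public abstract string Name { get; }

        // Per-content-type checker logic supplied by each derived job.
        protected abstract void CheckContent();

        public void Execute()
        {
            // Synchronous, sequential body: no Task fan-out, no Mutex; the job
            // service blocks re-execution until this instance finishes.
            CheckContent();
        }
    }

    public class MediaValuesCheckerJob : ContentCheckJobBase
    {
        public override Guid JobTypeId { get; } = new Guid("11111111-1111-1111-1111-111111111111");
        public override string Name => "Video Manager - Media Values Checker";
        protected override void CheckContent() { /* scan media content for unprocessed videos */ }
    }

    public class ForumsCheckerJob : ContentCheckJobBase
    {
        public override Guid JobTypeId { get; } = new Guid("22222222-2222-2222-2222-222222222222");
        public override string Name => "Video Manager - Forums Checker";
        protected override void CheckContent() { /* scan forum content for unprocessed videos */ }
    }
    ```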
  • Hi,

    We applied recommendation #1, but in doing so we lost the intended behavior of starting this job right after the job server restarts.

    Is there another recommended way to do this?

  • > We recommend that asynchronous patterns not be used. The job service is already managing background process usage according to the server resources available,

    I also have a question about this.

    We have many places in the code that can be I/O-blocked waiting on DB/HTTP/etc. operations, which would be quite inefficient written as regular non-async code.

    Does the Job server run as many jobs as there are logical cores available?

    Couldn't we rely on the ThreadPool/TaskScheduler and limit the number of concurrent tasks based on some sane factor (logical cores, free threads, etc.)? Something like the sketch below:
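
    (A minimal illustration of this throttling idea only, since the reply below recommends against async patterns inside jobs; CheckVideoAsync and the id type are hypothetical.)

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    static class ThrottledChecker
    {
        public static async Task CheckAllAsync(IEnumerable<string> videoIds)
        {
            // Cap concurrently running I/O-bound tasks at the logical core count.
            using var throttle = new SemaphoreSlim(Environment.ProcessorCount);

            var tasks = videoIds.Select(async id =>
            {
                await throttle.WaitAsync();
                try { await CheckVideoAsync(id); } // DB/HTTP-bound work
                finally { throttle.Release(); }
            }).ToList();

            await Task.WhenAll(tasks);
        }

        private static Task CheckVideoAsync(string id) => Task.Delay(100); // stub
    }
    ```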

  • The Job Scheduler manages retrieving and running jobs simultaneously based on the number of available cores (4 jobs per core). In normal operation, it will continue to process long-running jobs as necessary and block re-execution of the same job until the prior instance is completed/cancelled. One way to take advantage of the job service's resource management is to break up monolithic jobs into smaller discrete jobs.

    The additional requirement of firing the job immediately on restart is the reason multiple instances are being scheduled. When a job is scheduled manually through the job scheduler API, it is run as a dynamic job vs a scheduled job, which does not limit the number of simultaneously running jobs (similar to EmailNotificationSendJob and others that have many instances fired as needed to complete out-of-process tasks). Is the goal to ensure that newly distributed versions of the job are scheduled to run as quickly as possible?

    One alternative way to handle this is to store the last execution time and a version number for the job in a separate data table, and have the job run on a very short schedule (5 min) but cancel execution early if the version matches and the last execution time is less than 24 hours ago. Then increment the version number of the job when a new version is pushed.
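
    A minimal sketch of that guard, assuming a hypothetical JobStateStore helper backed by such a data table (an in-memory stand-in is shown here):

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical persisted state: one row per job in a custom data table.
    public class JobState
    {
        public int Version { get; set; }
        public DateTime LastRunUtc { get; set; }
    }

    // Stand-in for the real data-table-backed store (hypothetical).
    public static class JobStateStore
    {
        private static readonly Dictionary<string, JobState> Rows = new Dictionary<string, JobState>();
        public static JobState Load(string key) => Rows.TryGetValue(key, out var s) ? s : null;
        public static void Save(string key, JobState state) => Rows[key] = state;
    }

    public class VideoManagerContentCheckJob
    {
        private const int CurrentVersion = 3; // increment when a new version is pushed

        public void Execute()
        {
            JobState state = JobStateStore.Load("VideoManagerContentCheck");

            // Scheduled every ~5 minutes, but exits early unless a new version
            // was deployed or the last full check is more than 24 hours old.
            if (state != null &&
                state.Version == CurrentVersion &&
                DateTime.UtcNow - state.LastRunUtc < TimeSpan.FromHours(24))
                return;

            CheckAllContent();

            JobStateStore.Save("VideoManagerContentCheck",
                new JobState { Version = CurrentVersion, LastRunUtc = DateTime.UtcNow });
        }

        private void CheckAllContent() { /* ... full content re-check ... */ }
    }
    ```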

  • > Is the goal to ensure that newly distributed versions of the job are scheduled to run as quickly as possible?

    No, the goal is to immediately process any video items that weren't processed before, e.g. if video processing was terminated during a recent server restart, or if the job wasn't enabled to process videos earlier.

    So it's probably better to say that we need to run this job immediately as soon as the plugin is loaded/enabled.

  • Your backend process should just track what was not finished or still needs doing, and let that get picked up on the next scheduled execution. Plugins can reload and initialize as the result of someone clicking save on ANY plugin, so you are just going to create the same issue.

  • Our other job does its processing exactly that way; it processes new content as it is created.

    But as I said, this approach doesn't handle cases where the plugin was turned off for some period of time.

    That's why we need to do a full content re-check from time to time.

    Also, multiple instances of the job were never the issue for us; we were already ensuring that only one job instance started. The issue was that the job scheduler doesn't mark exited jobs as done.

  • Do the Job Server logs contain both "[STARTING] ..." and "[FINISHED] ..." entries for the job?

    Does the issue (not marking the exited job as done) still occur if the job is coded synchronously?

    You may also be running into issues with your implementation of tasks. Task.WaitAll combined with Task.Factory.StartNew will produce unexpected results if those Tasks also spawn Tasks. While we still recommend the synchronous pattern, you could test with Task.Run instead.
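
    To illustrate the pitfall (a minimal, self-contained example; not your actual job code):

    ```csharp
    using System;
    using System.Threading.Tasks;

    class StartNewPitfall
    {
        static async Task DoWorkAsync()
        {
            await Task.Delay(1000);
            Console.WriteLine("inner work finished");
        }

        static void Main()
        {
            // StartNew does not unwrap async delegates: it returns a Task<Task>,
            // and WaitAll only waits for the outer task, which completes as soon
            // as DoWorkAsync returns its (still running) inner Task.
            Task outer = Task.Factory.StartNew(() => DoWorkAsync());
            Task.WaitAll(outer);
            Console.WriteLine("WaitAll returned; inner work may still be running");

            // Task.Run unwraps the inner Task, so this wait really waits.
            Task unwrapped = Task.Run(() => DoWorkAsync());
            Task.WaitAll(unwrapped);
            Console.WriteLine("WaitAll returned after inner work completed");
        }
    }
    ```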

  • > Does the issue (not marking the exited job as done) still occur if the job is coded synchronously?

    After watching how the Verint Job server has behaved over the last 3 months, I can conclude that this problem is most likely related to the use of Task and async methods. We rewrote this job plugin to parallelize computation without using Task or async, and now it works OK.
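
    (The final implementation is not shown in this thread, but one way to parallelize without writing Task or async code is plain worker threads draining a shared queue, as in this hedged sketch with hypothetical names:)

    ```csharp
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;

    static class SyncParallelChecker
    {
        public static void CheckAll(IEnumerable<string> videoIds)
        {
            var queue = new ConcurrentQueue<string>(videoIds);
            int workerCount = Environment.ProcessorCount;
            var workers = new Thread[workerCount];

            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = new Thread(() =>
                {
                    // Each worker pulls items until the queue is empty.
                    while (queue.TryDequeue(out var id))
                        CheckVideo(id); // blocking, synchronous per-item work
                });
                workers[i].Start();
            }

            // Block the job thread until every worker has drained the queue,
            // so Execute() only returns once all work is really finished.
            foreach (var worker in workers)
                worker.Join();
        }

        private static void CheckVideo(string id) { /* ... DB/HTTP checks ... */ }
    }
    ```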