Today, if you want to do background work, the general expectation among Civi devs is that you need to create a cron job which queries some bespoke data-structure - and, consequently, the general default is for people to do work synchronously. The aim of this ticket is to reduce the barrier to doing asynchronous work.
Queues are particularly useful for doing expensive background work and deferred work. The CRM_Queue drivers were based on Drupal's interface, and they were originally added to support certain upgrade/frontend tasks. However, the docs and original built-in helpers were focused on blocking, single-threaded, foreground usage - they need revision to be pleasant for multiprocess background work.
General sketch of usage
// Send an email in the background...
Civi::queue()->createItem(new CRM_Queue_Task(
  ['CRM_Core_BAO_MessageTemplates', 'sendTemplate'],
  [['groupName' => 'foo', 'messageTemplateID' => 123, ...]]
));

// Process imports in the background
for ($offset = 0; $offset < $importRows; $offset += $pageSize) {
  Civi::queue()->createItem(new CRM_Queue_Task(
    ['CRM_Import_Processor', 'importRows'],
    [$offset, $pageSize, ...]
  ));
}
A call like Civi::queue($name = 'default?open,multi') would use the $name expression to find or create the queue. The name includes a brief type expression indicating how one expects the queue to operate.
import/1234abcd and import/4567efgh would be different queues for different tasks.
?open: The task-list is open-ended. You may add more tasks to the queue at any time. There is no "last" task.
?closed: The task-list is pre-planned at the start. Some task will be "last". This means that you may have an 'on-complete' type item.
?single: The runner(s) may only execute one task from this queue at any given time.
?multi: The runner(s) may execute multiple tasks from this queue at any given time.
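For illustration, parsing such a name expression could be quite small. This is a minimal sketch; the helper name and the flag defaults are hypothetical, not existing API:

// Hypothetical parser for expressions like 'import/1234abcd?open,multi'.
function _civi_parse_queue_expr(string $expr): array {
  // Split the expression into the bare name and the comma-separated flags.
  [$name, $flagStr] = array_pad(explode('?', $expr, 2), 2, '');
  $flags = ($flagStr === '') ? [] : explode(',', $flagStr);
  return [
    'name' => $name,
    // Assumed defaults: open-ended task list, single-threaded runner.
    'open' => !in_array('closed', $flags, TRUE),
    'multi' => in_array('multi', $flags, TRUE),
  ];
}

// _civi_parse_queue_expr('import/1234abcd?closed,single')
// => ['name' => 'import/1234abcd', 'open' => FALSE, 'multi' => FALSE]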
To run a queue, one could use an API like:
## Run any/all known queues
cv api4 Queue.run maxTime=300 maxTasks=20 +w 'name like "%"'

## Run only the import-related queues
cv api4 Queue.run maxTime=300 maxTasks=20 +w 'name like "import/%"'
Some tasks
Add a variant of CRM_Queue_Queue_Sql which supports non-blocking/parallel-processing behavior (e.g. CRM_Queue_Queue_ParaSql). @artfulrobot reported working on an implementation of this.
Add a facade Civi::queue($name) to create-or-load a queue.
Add an API call and default cron job which can work on tasks from the queue
Add an example of a dedicated worker script which runs continuously in the background
Add some drivers that use dedicated queuing systems instead of MySQL
For CRM_Queue_Task, add an on-error callback
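For the last item, one possible shape for the on-error callback. The onError property is hypothetical (CRM_Queue_Task does not currently accept it), and Civi::queue() is the facade proposed above:

$task = new CRM_Queue_Task(
  ['CRM_Core_BAO_MessageTemplates', 'sendTemplate'],
  [['groupName' => 'foo', 'messageTemplateID' => 123]],
  'Send welcome email'
);
// Hypothetical extension: a callback invoked if the task throws.
$task->onError = function (CRM_Queue_TaskContext $ctx, Throwable $e) {
  Civi::log()->error('Queued email failed: ' . $e->getMessage());
  // e.g. re-enqueue the task, or park it on a "failed" queue for later retry.
};
Civi::queue()->createItem($task);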
I think 'parallel' or 'async' might be better than 'multi', as we need to be clear that there's a likelihood of tasks not running in order. But as long as people know this, it doesn't matter.
Using the name field is basically pushing metadata into an existing field, which is good because it maintains backwards compatibility, but it feels like a bit of a shortcut to properly defining queue metadata. I don't know enough about theoretical queuing needs, but I suspect that there is other metadata that may be required, and the munge-it-all-into-a-name approach might quickly become restrictive. e.g. wouldn't it be better to link a queue item to a queue table which could define the runner class that's needed? That metadata table could also handle other running info, like locks and status stuff, so you could get an overview: "3 queue runners are active"...
e.g. For my new MailchimpSync extension I need to queue updates to Mailchimp. Later I need to process that queue, taking up to 1000 items at a time, batching them up, and submitting them to an external API; if that is accepted I need to move them off the queue - so I need to claim up to 1000 at a time, with the possibility of releasing those, leaving them on the queue again. Maybe this is a specialist situation and not suitable for the queue work here, but the queue design is agnostic of CRM_Queue_Task objects and there's definitely value in that for situations like this example - where it would be inefficient and awkward for each item to have to call a callback.
e.g. Another queue feature that I need is the possibility of items failing without holding up the rest of the queue, but also without being disposed of, so that the items can be retried later (possibly after manual intervention, or at a later time). This could be achieved by adding them to a separately named "failed" queue and then having a process to switch them back to the runnable queue. e.g. sending emails or a call to an external API might fail due to a temporary network failure.
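A rough sketch of that failed-queue pattern, using the existing claim/delete/create primitives. The queue names and the Civi::queue() facade are illustrative:

$queue  = Civi::queue('mailchimpsync/updates?open,single');
$failed = Civi::queue('mailchimpsync/failed?open,single');

// Worker: a failing item should not block the rest of the queue.
while ($item = $queue->claimItem()) {
  $ctx = new CRM_Queue_TaskContext();
  $ctx->queue = $queue;
  try {
    $ok = $item->data->run($ctx);
  }
  catch (Throwable $e) {
    $ok = FALSE;
  }
  if (!$ok) {
    $failed->createItem($item->data);  // park the payload for a later retry
  }
  $queue->deleteItem($item);           // either way, don't hold up the rest of the queue
}

// Retry (manually triggered or scheduled): push failed items back onto the runnable queue.
while ($item = $failed->claimItem()) {
  $queue->createItem($item->data);
  $failed->deleteItem($item);
}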
I like the idea of the examples. I think there's also scope for a bigger exploration of how queues are run. e.g. it's a good idea to include the max time and max items if there's some reason for this. Reasons could be:
you're running it over http and have to keep within timeouts
you're running it as a scheduled job and don't want to delay everything else those jobs do by processing a massive queue.
These are useful features for those who are limited to administering them through a UI/standard install without knowledge or access to the system itself. But if you have a sysadmin on hand then you have the possibility of using system's cron separate from civi's normal one, which could have different requirements, or even background services that continually look for items to process, but that's getting quite specialised and hard to generalise about.
But I do like the idea of an API for some standard queue running situations. I have just found that the examples in real life have been quite specific. I'll try to think more about it.
I think 'parallel' or 'async' might be better than 'multi', as we need to be clear that there's a likelihood of tasks not running in order. But as long as people know this, it doesn't matter.
Yup, those are fine by me. The names should probably be picked in pairs (e.g. single/multi, sync/async, parallel/linear).
Using the name field is basically pushing metadata into an existing field, which is good because it maintains backwards compatibility...
Yeah, the dialectic for me went like this:
"There should be formal declarations of the queues (i.e. in the container) so that the params are nice and tidy and hookable... but requiring the declaration is an extra step which makes people think it's too complicated to use."
"There should just be one default queue so that it's easy to throw stuff in... but how do we indicate that we need a single-threaded/blocking or multi-threaded/non-blocking queue?"
"There should be a few queues, but the names should follow Convention Over Configuration."
"If we have a naming Convention, there's enough room in a name fit in a descriptive word and a basic type."
One could reconcile those impulses by defining Civi::queue($name) along the lines of this pseudocode:
public static function queue($name) {
  $queueDrivers = ['linear' => 'CRM_Queue_Queue_Sql', 'parallel' => 'CRM_Queue_Queue_SqlParallel'];
  $c = Civi::container();
  if ($c->has($name)) {
    return $c->get($name);
  }
  list($basename, $type) = parse($name);
  $queueDriver = $queueDrivers[$type];
  return CRM_Queue_Service::singleton()->create([
    'name' => $name,
    'type' => $queueDriver,
    'reset' => FALSE,
  ]);
}
For my new MailchimpSync extension I need to queue updates to Mailchimp. Later I need to process that queue, taking up to 1000 items at a time, batching them up, and submitting them to an external API; if that is accepted I need to move them off the queue - so I need to claim up to 1000 at a time, with the possibility of releasing those, leaving them on the queue again.
FWIW, in the handful of general-purpose queues that I've used, I can't recall ever seeing an operator along the lines of "take 100 items from the queue". Could you adjust the grain of the queue s.t. each queue-item is a batch? To wit:
Thanks for the response @totten sorry this stalled while we focussed on other stuff at the sprint.
I understand your thinking - make it simple for people to use and they might use it instead of reinventing the wheel. Yep, definitely.
re batching: it's not such an issue for a general purpose queue, but it's an issue for CiviCRM's Task implementation on top of the general purpose queue infrastructure we have because that assumes/requires that each queue item can be processed on its own.
Some tasks must be processed in bulk. e.g. mailchimp updates. If we send 50k separate API requests they'll tell us off. We must use their Bulk API, max 1000 at a time.
We could queue the data (not tasks) one by one. Then have a batcher job which groups them into a separate queue using CiviCRM's tasks:
$jobQueue = ...; // not a task based queue
$count = 1000;
$batch = [];
do {
  $item = $jobQueue->takeFirstItem();
  if (!$item) break;
  $batch[] = $item;
  $count--;
} while ($count > 0);

$taskQueue = ...; // the task based queue
$taskQueue->createItem(new CRM_Queue_Task('_mailchimp_send_batch', [$batch]));
But this is not efficient, as there is no benefit to the time lag between the batcher finishing and the tasks that submit to mailchimp running. But there will always be special cases that require other solutions. As I say, I don't need Civi's queues to work with this particular case as I've settled on another solution. I'm just feeding in from academic interest and in case my experience needing queue-type systems was any help in thrashing out something for general use.
Perhaps another challenge for the theory of a general purpose queue might be: can we replace CiviMail's queues with this tool?
I'm just feeding in from academic interest and in case my experience needing queue-type systems was any help in thrashing out something for general use.
Yup, that's the same reason I'm following that thread of consideration. 👍
I'm fully with you on the goal of sending your 50k records in batches of 1k, but the pseudocode surprised me. ("Why would you want that?!") I'm trying to read between the lines, and maybe there's a distinction between these use-cases:
Planned/upfront/proactive batching: We do some search of the system and identify (e.g.) 50k records - for which we need to perform work in chunks of 1k. This would be appropriate for, say, CiviMail delivery or scheduled reminders.
This was the case I had in mind, and it should be easy: when scheduling tasks, just pass in parameters to identify the 1k items. I don't see a need for multiple queues or whatnot.
Organic/tailend/reactive batching: In this case, we monitor for changes as they happen organically (updates via contribution-pages, profiles, staff-edits, etc) and we need to arrange batches after-the-fact. For example, if you get an influx of 50k independent updates over an hour, and you're passing data along to a remote API with a throttle of 1 API-call per minute (payload max of 1k rows), then you would try to group them into batches.
Does this seem like a closer description of your example scenario? If so, then I can accept the hypothetical.
Perhaps another challenge for the theory of a general purpose queue might be: can we replace CiviMail's queues with this tool?
It feels to me like a typical case of "general" vs "bespoke": the generalist queue is simpler and cheaper; the bespoke queue is higher-quality and more-expensive.
In particular, I expect that both could do the job and both would be scalable (supporting concurrent work on multiple nodes). Using the general queue would be thinner (fewer bespoke codes+configurations), but it might require a trade-off based on the batch sizes for the last I/O step (e.g. larger batches in the last step can improve delivery efficiency, but they also make it harder to recover if a batch has a failure - the current bespoke queue system strikes a better balance).
Is the general "good enough"? Depends on the org/use-case. In the Java world, I vaguely recall using dev frameworks where email delivery was just a thin application of the queue system -- seemed fine at the time, and I imagine it was fine for other folks. OTOH, if you're pushing x*100k msgs per blast, then bespoke will be better...
I agree that with planned/upfront batching where you have the data already, it would be trivial to store each batch as a queue item as you suggested.
And yes, batches are difficult when there are failures. Not because it's about queues, but just for sympathy(!): mailchimp requires I send batches, here's what it means:
I need to send batches, as in the organic/tailend etc. way and track those.
I then receive notification when the batch completes.
I then request a download url to the batch results.
I download the file
I gunzip the file
I parse the file, for which I had to write a minimal tar extraction routine because the tar format used by mailchimp is not extractable by PHP's phar!
Inside the tar file are two randomly named json files which need processing in turn.
I then loop the responses from massive json arrays within each file and deal with the errors.
Bleaugh! Just one of the reasons I try hard as poss to push people away from Mailchimp!
Suppose you have a consumer monitoring a queue (eg it polls MySQL's civicrm_queue_item or awaits data on a socket-connection to some dedicated queue daemon). A new task comes into the consumer: "Go run some function ($theJobFunction) with CiviCRM!"
How does the consumer dispatch $theJobFunction? Is it isolated from other queue-tasks? Does it require extra bootstraps or initializations? Depends on the approach... here are a few approaches:
Single Thread: The queue-consumer starts CiviCRM itself. It runs for a while and serves multiple requests. Whenever a task comes in, it simply invokes call_user_func($theJobFunction).
HTTP Request Per Task: Whenever a new task comes in, issue an HTTP request to the CiviCRM site to execute the task (eg https://example.com/civicrm/queue/run?taskSpec={$signedJson})
Process Per Task: Whenever a new task comes in, launch a new process (eg proc_open("cv ev $theJobFunction()"))
Fork Per Task: Whenever a new task comes in, fork the active process. (eg pcntl_fork()... call_user_func($theJobFunction)).
Process Pool: The queue-consumer creates several workers (calling proc_open() a few times). Workers may be called multiple times. Workers may have specialized roles (per domain/user).
Fork Pool: The queue-consumer creates several workers (calling pcntl_fork() a few times). Workers may be called multiple times. Workers may have specialized roles (per domain/user).
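To make one of those concrete, a bare-bones "Process Per Task" dispatcher might look something like this. cv ev is the existing CLI tool; the task shape and the helper are illustrative assumptions:

// Run each task in a fresh, isolated PHP process via `cv ev`.
// $task is assumed to carry a callable name and JSON-safe arguments.
function dispatchTask(object $task): int {
  $phpCode = sprintf(
    'call_user_func(%s, ...%s);',
    var_export($task->callback, TRUE),
    var_export($task->arguments, TRUE)
  );
  // Array-style proc_open (PHP 7.4+) avoids shell-quoting problems.
  $process = proc_open(['cv', 'ev', $phpCode], [1 => ['pipe', 'w'], 2 => ['pipe', 'w']], $pipes);
  // Block until the worker finishes; a pool variant would keep several running at once.
  return proc_close($process);
}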
There are some important qualitative differences. To name a few:
Isolation: In "Single threaded", isolation is non-existent. The "Per Task" variants have good isolation. For "Pool" approaches, the level of isolation can be tuned (eg create separate pools for separate domains or users).
Performance: Each strategy has slightly different overhead costs and different affinity for parallel-processing.
Compatibility: All Civi deployments support HTTP. Other services (cron, proc_open(), pcntl_fork(), etc) are divided (eg often supported on dedicated hosting but not on highly-managed shared-hosts).
We are interested in allowing queue-tasks to "run-as" specific contacts - however, this makes "Isolation" of queue-tasks more important. (If a queue-consumer runs a task for Alice and then runs a task for Bob in the same PHP process, it is liable to mix-up via caches/global/singletons in either the CMS or Civi.) This raises a question: how much of a penalty (performance or compatibility) do we take for adding isolation?
I figured the performance question merited some empirical examination (it's better to see numbers than to speculate...) So I hacked together some prototypes+benchmarks:
The "Per Task" variants generally swing around (getting better+worse than baseline) depending on the #workers and #tasks. (Interpretation: As volumes go up, the parallelism benefits eventually outweigh the higher overhead).
"Fork Pool" clearly does better than the baseline across-the-board. (I haven't tested "Process Pool" - but expect it to be very similar to "Fork Pool".)
I'm not surprised the fork pool works well because to my understanding it's not having to do all the bootstrap each time.
compat: I don't value the http method; it seems to introduce a lot of extra could-go-wrong links in the chain. e.g. often the CLI version of PHP has quite different config: infinite run time, more memory, etc. http requests can time out, leading to situations that differ based on the way php is implemented. I feel like this sort of set-up is naturally quite sysadmin-heavy, so do we need to be able to offer this for off-the-shelf hosting?
I don't value the http method; this seems to introduce a lot of extra could-go-wrong links in the chain.
Interestingly, I had a chat with @seamuslee (before doing the prototypes/benchmarks), and it sounded like he had the mirror reaction, which I might paraphrase like this: many PHP admins+developers aren't comfortable writing or installing daemons, and there's a lot that could go wrong. But they all have daemons with public-facing HTTP-PHP worker-pools (PHP-FPM/mod_php/etc) attached.
e.g. often the CLI version of PHP has quite different config: infinite run time; more memory etc. http requests can time out leading to situations that differ based on the way php is implemented.
Right, it's a double-edged sword.
Thesis: Standard HTTP-PHP pools should have safe-guards (time-limits/memory-limits/single-use-processes/etc) to ensure overall availability of the system.
Antithesis: Many safe-guards (in the wild) limit performance and prohibit useful scenarios. Background-workers are specifically useful because they don't have these constraints.
Maybe the synthesis is like: Safe-guards are good all around, but the specific safe-guards (time-limit/memory-limit/single-use-processes) should be different for foreground-workers and background-workers.
I wonder if (conceptually) it makes sense for each queue-name (eg "upgrade_steps" vs "civimail_blast" vs "external_group_sync") to enforce different values for these parameters:
Memory limit (all workers handling queue $Q must stay within memory limit $X; or else the worker dies)
Time limit, per task (each task in queue $Q must stay within time limit $X; or else the worker is killed)
Time limit, per worker (any worker for queue $Q may run for $X seconds; then it stops taking new tasks)
Task limit, per worker (any worker for queue $Q may execute up to $X tasks; then it stops taking new tasks)
Worker limit (the queue $Q may have up to $X concurrent workers)
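If those were formalized, the per-queue policy might be declared as plain metadata along these lines. This is an entirely hypothetical structure, not an existing setting; the queue names come from the examples above:

// Hypothetical per-queue runner policies, keyed by queue name.
$queuePolicies = [
  'upgrade_steps' => [
    'memory_limit'      => '256M',  // worker dies if it exceeds this
    'task_time_limit'   => 600,     // seconds per task before the worker is killed
    'worker_time_limit' => 1800,    // seconds before a worker stops taking new tasks
    'worker_task_limit' => 50,      // tasks per worker before it stops taking new tasks
    'worker_limit'      => 1,       // max concurrent workers for this queue
  ],
  'external_group_sync' => [
    'memory_limit'      => '128M',
    'task_time_limit'   => 30,
    'worker_time_limit' => 300,
    'worker_task_limit' => 500,
    'worker_limit'      => 4,
  ],
];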
I feel like this sort of set-up is naturally quite sysadmin heavy, so do we need to be able to offer this for off-the-shelf hosting?
I do think it's a factor. In my mind, each of these deployment styles are widely used:
Mass Market (or CMS-Tuned) Web Host: Uploading PHP files via SFTP, GIT, or similar. Little control over services. (Buzzwords: Pantheon, WPEngine, Godaddy)
Dedicated Server/VPS: Install one Civi+CMS on a dedicated system. Good control over services. Sysadmin comfort+attentiveness vary. (Buzzwords: Linode, Digital Ocean).
Multisite Server: Deploy a dozen Civi sites on one server. Higher access/skill/attentiveness (Buzzwords: Aegir, civibuild+vdr, Ansible, Spark)
Cloud Style: Run several services across many servers. Higher access/skill/attentiveness (Buzzwords: Docker, AWS, K8s)
If we're going to really use background-processing (eg for tasks in core or core-extensions or popular-contrib-extensions), then it seems like each of these profiles needs some viable path to executing tasks. They don't have to be performant (eg I don't mind if mass-market web-hosts go a bit slower), but I think each of them should work.
However, mass market: I don't see how you can hope to offer the things you want to offer in this env? e.g. killing workers? Changing memory limits? Using http you have such little control over many of the params.
I'm left wondering what the benefits of async queues are for those environments? Seems the benefits of async are: faster processing of bulk jobs that can be monitored/managed. And, by being separate from cron, not blocking it.
I suppose you could have a 2nd (3rd...) cron job firing http requests, but each request would need to be so guarded against getting killed off by timeouts/memory etc. You couldn't rely on one request triggering a batch of updates, as it would be more likely to hit a timeout. So then you'd need a gazillion requests, and if these are done via cron, the async-ness becomes pretty limited (one a minute?)? And then you'd have no control or sight over what was actually running.
Maybe there's something I don't know or am not understanding here.
However, mass market: I don't see how you can hope to offer the things you want to offer in this env? e.g. killing workers? Changing memory limits? Using http you have such little control over many of the params.
Hmm, yeah.... mass market do impose harder constraints. But I need that to be more tangible, as in:
HTTP request timeout is 30s. (The official default for PHP's max_execution_time is 30s. The official default for Apache+Nginx HTTP requests is 60s. So that means 30s effectively. Some could be more limited, but my gut says that default is a realistic value.)
The official default for PHP's memory_limit is 128m. Civi's installer complains if the limit is <64mb.
There may be functional limits on certain APIs - eg pcntl, pthreads, proc_*, posix_*. (Basically, any API that can directly spawn longer work is a threat to the reliability of the frontend worker pool, so those APIs are liable to get blocked.)
So suppose the limit is 30s+64mb. Civi-oriented hosts are probably more generous, but 30s+64mb should cover a large portion of the mass-market hosts.
Then I guess we should sketch "the things you want to offer" and identify tasks that would (would not) fit. Like here are some examples that come to mind:
Send a transactional email (🆗that should fit into 30s+64mb)
Perform an incremental-update on some computed-field (🆗that should fit into 30s+64mb)
Re-compute a computed-field for all records in the DB (⚠that is unlikely to fit)
Populate geocoding cache for a handful of addresses (🆗that should fit into 30s+64mb)
Populate geocoding cache for a million addresses in one go (⚠that is unlikely to fit)
Generate a report on a site with a few thousand records (🆗that should fit into 30s+64mb)
Generate a report on a site with several million records (⚠that is unlikely to fit)
Make a backup of the full SQL DB and upload it somewhere offsite (⚠that is unlikely to fit)
Is that representative? Maybe there are some others to track?
I suppose you could have a 2nd (3rd...) cron job firing http requests, but each request would need to be so guarded against getting killed off by timeouts/memory etc. You couldn't rely on one request triggering a batch of updates, as it would be more likely to hit a timeout. So then you'd need a gazillion requests, and if these are done via cron, the async-ness becomes pretty limited (one a minute?)? And then you'd have no control or sight over what was actually running.
Let's try it with a concrete use-case for background work. Here's one that's already in civicrm-core. The scheduled-task Job.geocode is a daily task which purports to do a couple things: (a) split street_address/street_name/street_number/etc and (b) optionally resolve longitude/latitude for each address. This runs via cron which can be fired via HTTP.
The authors of Job.geocode appear to believe it needs some mechanism for pagination/throttling (hence options start=<contactId>, end=<contactId>, geocoding=<bool>, throttle=<int>). I think they're right to have pagination/throttling, but the current framing creates two possibilities - both weak:
Enable the task and ignore/omit parameters. The job runs without batching. This is easy, but it obviously fails (exceeds time limit) as the #contacts grows.
Call the task manually, twiddling the start/end parameters as needed to achieve different batches. This works, but it's hard for the sysadmin.
I submit that a queue (even if it used cron-based workers with curl https://.../bin/cron.php and even if it suffered HTTP/PHP time-limits) would allow a better+easy solution, with the following approach:
Job.geocode is still a daily task. All it does is scan the DB for the high and low contact IDs, then enqueue corresponding updates. This should fit within 30s+64mb.
[$low, $high] = sql('SELECT min(id), max(id) from civicrm_contact');
for ($i = $low; $i <= $high; $i += 1000) {
  enqueueTask('geocode', ['start' => $i, 'end' => $i + 999]);
}
Whenever Civi cron runs (aka Job.execute aka bin/cron.php aka hook_civicrm_cron), it pulls out one task and runs it.
Of course, it would run quicker with a queue-monitoring process and a worker-pool and larger page-size, but (even with http limits) the arrangement is better than the status-quo because it doesn't crash (as #contacts grows) and it doesn't require manual twiddling of batches.
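In the same pseudocode spirit, the cron-side consumption could use the existing claim/run/delete primitives. The queue name and the time budget here are illustrative:

// Run a few geocode tasks per cron tick, staying well under the HTTP time limit.
$queue = CRM_Queue_Service::singleton()->create([
  'type' => 'Sql',
  'name' => 'geocode',
  'reset' => FALSE,
]);
$deadline = time() + 20;   // leave headroom under a 30s request limit

while (time() < $deadline && ($item = $queue->claimItem())) {
  $ctx = new CRM_Queue_TaskContext();
  $ctx->queue = $queue;
  if ($item->data->run($ctx)) {
    $queue->deleteItem($item);     // done - remove from the queue
  }
  else {
    $queue->releaseItem($item);    // failed - leave it for a later run
  }
}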
I suspect defaults vary. e.g. I've noticed since using PHP-FPM and nginx that while nginx has a fastcgi_read_timeout default of 60s, that does not lead to it killing the php script - it's just when nginx stops listening; a PHP script with set_time_limit(0) will continue to completion, just with no way to report back to the client.
So do you imagine that there's one crontab lines per task?
And therefore this is async from a task perspective? Or do you imagine crons like this to fire 5 workers?
* * * * * for w in 1 2 3 4 5 ; do curl "https://mysite/queuedo?task=geocode" &>/dev/null & done; wait
I could imagine this getting into knots quite quickly. e.g. One day there's a network fault, geocoding requests take 10s each and eventually error. Each job is now relying on the php sapi to kill long running workers. If the job removed the time limit, it could keep running; then another; then another every minute.
So we'd probably need a way to try to monitor running workers via the SQL db. e.g. insert a record when the job starts; delete it when it completes; count these at start up to check that we're not exceeding sensible limits. You've got no way to kill a job that's gone rogue; and you'd need a clean up task that would have to assume jobs had crashed after a timeout, and remove them from the active jobs table.
It also means that you can't get through jobs very quickly unless you can guarantee accurately the time it takes to run them. e.g. I use queues with petition sites that can quickly generate 10k, 50k, 300k signatures in a queue. I have implemented a scheduled job to run these and the way it works is that it is assigned a max_run_time param. After each queue item it processes it checks if it's exceeded that time, and stops if so.
Even though I run this via CLI without time limit, I use this so that other cron tasks get a look in. I could, of course, not use Scheduled Jobs, and instead have cron entry that triggered just this job.
Cron-triggered http request based:
"background" is achieved by cron; sysadmins need to be able to set these up (we can assume they can as civi needs cron, but there's more to set up and maintain/manage)
need at least one cron job per task. "async" is achieved if you have multiple cron jobs, or one spawns several requests.
need some db backed management of running requests
devs who write queue runners need to implement their own way of ensuring the job can complete well within 30s, since it might get killed by the php supervisor after that. Yet they also need to account for queue items that take differing processing times.
potential for unused dead time, slowing down queue processing. e.g. if one run does x records, but actually it only takes it 20s, then there's 40s left before the next runner gets fired by cron.
not nice-able
jobs can't be killed.
can't do long jobs safely.
Q: is this better than what we have already? I think we can do REST requests to trigger a particular API job, so as long as the job is written to not take 30s, we already have the infra in place for background, async processing.
Contrast to a system of forked workers running CLI:
could have a generalised queue runner CLI script, allowing tasks to describe their queue items' needs, and allowing the queue runner to manage necessary workers. (It could even check system load and scale so that it works harder when the site is quieter.)
only one crontab entry required (I'm imagining the runner script to be long-running, but to avoid memory leaks, it could be programmed to quit after every N minutes/jobs, then cron would restart it)
all that could be API accessible and visualised via the admin UI
jobs could be killed; workers gracefully scaled up or down.
we don't need to worry about time-outs.
we can run long jobs
nice, ionice etc. able
once queue is idle, the process could even sit and poll every few seconds, and jump back into action quickly when a queue item appears, rather than exiting and waiting for cron to restart it.
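The skeleton of such a runner could be quite small - e.g. a script launched via `cv scr` so that CiviCRM is already bootstrapped. Everything here is illustrative, not an existing runner:

// Long-running CLI worker: poll the queue, nap when idle, and retire after a
// fixed lifetime so that cron can restart a fresh process (avoids slow memory leaks).
$queue = CRM_Queue_Service::singleton()->create(['type' => 'Sql', 'name' => 'default', 'reset' => FALSE]);
$retireAt = time() + 15 * 60;

while (time() < $retireAt) {
  $item = $queue->claimItem();
  if (!$item) {
    sleep(5);   // queue is idle - poll again shortly rather than exiting
    continue;
  }
  $ctx = new CRM_Queue_TaskContext();
  $ctx->queue = $queue;
  $item->data->run($ctx) ? $queue->deleteItem($item) : $queue->releaseItem($item);
}
// A single crontab entry restarts this script once it exits.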
I get quite excited about the latter, I can see it has a lot of potential for safe, manageable background processing. It could be combined with info on tasks to generate info that would help tweak the system over time, e.g. "Geocoding, average job time: 0.5s, max queue length today: 10,000 items, queue items processed today: 2,000, average memory use: 45MB"...
...while nginx has a fastcgi_read_timeout default of 60s, that does not lead to it killing the php script, it's just when nginx stops listening; a PHP script with set_time_limit(0) will continue to completion - just with no way to report back to the client.
Fascinating. I've been wondering about that scenario. :)
Worth noting that set_time_limit(0) is important for a process-management thread, but for worker threads it's a threat to reliability (even for background workers). I expect that mass-market hosts already have enforceable time limits, and it'd be good for the background pool to also allow enforceable limits.
I have implemented a scheduled job to run these and the way it works is that it is assigned a max_run_time param. After each queue item it processes it checks if it's exceeded that time, and stops if so.
👍👍 Agree, this is the ideal balance for the background workers of a bootstrapped PHP application.
So do you imagine that there's one crontab lines per task?
And therefore this is async from a task perspective? Or do you imagine crons like this to fire 5 workers?
* * * * * for w in 1 2 3 4 5 ; do curl "https://mysite/queuedo?task=geocode" &>/dev/null & done; wait
I imagine it without the task-specificity -- so that you only need one item in the crontab. The item could be any one of these:
* * * * * curl "https://mysite/queuedo"
* * * * * for w in 1 2 3 4 5 ; do curl "https://mysite/queuedo" & done; wait
* * * * * for w in 1 2 3 4 5 ; do curl "https://mysite/queuedo?wid=${w}" & done; wait
The last variant (with a worker ID) would allow you to limit the total number of concurrent workers; eg the queuedo logic is like:
$workerId = CRM_Utils_Request::value('wid', 'Positive');
if ($workerId < 1 || $workerId > $workerLimit) {
  exit('Invalid worker id');
}
if (!Civi::lockManager()->acquire('worker.queue.' . $workerId)) {
  exit('Worker already running');
}
$start = time();
$completedTasks = 0;
while (time() - $start < $workerTimeLimit && $completedTasks < $workerTaskLimit) {
  // (1) Dequeue a task. Execute it.
  // (2) If no tasks are pending, it's OK to sleep for a couple seconds and try again.
  // (3) If the next task requires an incompatible environment (eg change active domain_id or change
  //     active CMS user), then release the task and exit. We'll try again on the next run.
}
Agree that's not optimal in a few ways. I like your bulleted list about benefits of a pooling daemon. (Agree about describing queue-needs and longer runtimes. Hadn't thought of nice/ionice/ps/kill - those are great points. Disagree about crontab - both models can work with a single crontab or systemd-unit.)
But... suppose we proclaimed: "There shall be one way to run queued/background tasks, and it requires installing this service/daemon." Here are a few consequences that I would expect:
(Not fixable) Break compatibility with mass-market/CMS-oriented hosts where you cannot install daemons.
(Not fixable) Break compatibility with Windows development boxes where POSIX/PCNTL APIs don't work.
(Fixable) Break compatibility with distributed/cloud-style deployments that have relied on HTTP pools/php-fpm pools/reverse-proxies for their scale-out.
(Fixable) Break compatibility with existing Bitnami/docker arrangements (because they don't start the daemon)
Increase the effort required to setup new sites (for dev or prod)
Or maybe look at it like this - we can intervene in different ways (do nothing; implement Cron-HTTP runner; implement CLI runner) and have impacts on different deployments and use-cases. This gives a matrix:
| Intervention | Type of Deployment | Support for small background items (<30s)? | Support for big background items (>30s)? |
| --- | --- | --- | --- |
| Status quo | Dedicated Server, Multisite Server | Not really - chicken/egg problem. As a dev, I don't write background tasks because I worry that admins cannot setup runners. As an admin, I don't setup runners because devs don't require them. We end up with funny things like Job.geocode start=... end=.... | Not really |
| Status quo | All other | Not really | Not really |
| Support only PHP-CLI task runner | Dedicated Server, Multisite Server | Great | Great |
| Support only PHP-CLI task runner | All other | No/doesn't work | No/doesn't work |
| Support either Cron-HTTP or PHP-CLI task runner | Dedicated Server, Multisite Server | Great (w/PHP-CLI). | Great (w/PHP-CLI). |
| Support either Cron-HTTP or PHP-CLI task runner | All other | Decent (w/Cron-HTTP). | OK or terrible, depending on HTTP/PHP cfg. System-status shows warnings. |
Just to be clear: I have no interest in locking out mass market installs. If anything I don't want to make it harder for them, and I'm not sure we need to. I'm also in favour of http as long as it's not http or cli: i.e. both options would be good.
I would like this to be a progressive enhancement:
you're on mass market hosting: things work, but optimisation options are few. Queues get processed slowly.
you're on some civi-optimised hosting: things work better, with known constraints like max execution time and background/async job runs (may still use http)
you have control over the env: you have a lot of options available to improve efficiency like CLI jobs
you're not on Windows: you have more visibility/control over background jobs in (3)
For mass market, couldn't we have a fallback queue runner that is civi cron powered? It won't be super performant, but if you're in mass market, you can't expect that?
Windows process control: it's true I don't know about this, but maybe that level of management is for envs that support that?
Re our disagreement on cron: I don't get how your single http cron job is going to spawn N workers? Sure you could use, say, guzzle's async to make N http requests, but if you are then going to sit and wait for them in the process that is only allowed 30s itself, it's a short fuse before boom they all get killed (or not...). The only way I can see that working is if the webserver configuration is such that a request only needs to be started by http, and will continue even when the client has disconnected? (aside: I recently wrote something using a pool of hundreds of guzzle async requests simultaneously, I was surprised how efficient it was in time and memory use. But they were made to an endpoint a little more performant than Civi and I knew I could wait for them to complete.)
With the "do a job, get the next job, and if it requires a different env just stop" idea - I'm not sure I fully get this, but it seems like if your jobs ended up a bit fragmented (100 jobs, but with alternating requirements) it would be a big bottleneck.
I think we could document how to write queues in a way that allows progressive enhancement, 30s max run times, etc. and I think once that documentation is there, we might resolve the chicken and egg thing.
@totten pinged me about my take on this in regards to hosts like Pantheon. Pantheon can use either http or cli via "Terminus" https://github.com/pantheon-systems/terminus. We set up Terminus and a cronjob on our own server to call Civi's cron on Pantheon. (If that's too complicated, or someone is using a third party "ping" service, they would need to use http cron.) I'm trying to wrap my head around the task runner part. If the task runner needs to be able to "watch" or "listen" in order to execute, I'm not quite sure how Terminus would help. Pantheon also has https://pantheon.io/docs/quicksilver which can listen for certain events but they seem to be focused on devops.
(@herbdool) We set up Terminus and a cronjob on our own server to call Civi's cron on Pantheon.
Aah, thanks for explaining that approach. Do you have any sense for how long a request to Terminus/ssh/drush is allowed to execute?
Suspicion: It sounds like Terminus/SSH maybe resembles their HTTP service - eg you send a transactional request to port 22 ("Execute drush command foo"), and they dispatch it to some non-specific physical node, and then it closes the connection.
If the task runner needs to be able to "watch" or "listen" in order to execute, I'm not quite sure how Terminus would help
Yeah, that's tricky. You probably can't use Terminus to launch a long-term process (eg one that runs for many hours). Hypothetically (if one were really trying to tune things for Pantheon), maybe Terminus/SSH could be the transport-medium for the queue? SSH is better than HTTP for passing frequent bidirectional messages ("Here's a new task", "Thanks I started it", "All done").
Like imagine an offsite box which opens a control-connection over SSH (pseudocode):
function setupControlChannel() {
  $controlChannel = connect("ssh foo@bar.com -- drush civicrm-queue-watcher");
  $controlChannel->on('data', function ($job) use ($controlChannel) {
    fwrite($controlChannel, "reserve {$job->id} 30min\n");
    spawn($job)
      ->then(function () use ($controlChannel, $job) {
        // Once the job completes, acknowledge it.
        fwrite($controlChannel, "finish {$job->id}\n");
      })
      ->else(function () use ($controlChannel, $job) {
        // If the job fails, put it back on the queue.
        fwrite($controlChannel, "release {$job->id}\n");
      });
  });
  $controlChannel->on('close', function () {
    // Oops, somebody closed the control-channel. Let's make a new one.
    setupControlChannel();
  });
}
But... in the theme of "progressive enhancement", that's an optimization (reduces turn-around time on queued task). It sounds like an HTTP fallback approach (where Civi just uses Job.execute/cron.php) would be functional.
...I would like this to be a progressive enhancement...
...couldn't we have a fallback queue runner that is civi cron powered...
Re our disagreement on cron: I don't get how your single http cron job is going to spawn N workers?
:) OK, I think we agree on these things but crossed some wires because the words "cron" and "http" both appear in different arrangements, eg
(A) Stay close to the current cron-runner (Job.execute aka cron.php): Just have cron.php pick off a few tasks from the queue. (Pseudocode in this comment)
Upside: Same compatibility/deployment as current HTTP cron.
Downside: The polling-loop and the task-execution run in the same thread, so you don't get isolation. (The exit-if-different technique provides a mitigation/workaround.)
Comment: This is probably the most obvious approach for a fallback on less-configurable systems... since it's close to what already works there...
(B) Start the pool (process-manager) via Cron-HTTP: You make an HTTP request for /civicrm/queue/mgr. It becomes a long-running PHP process (within php-fpm) which acts as a process-manager, starting+stopping more processes in the same way that a daemon might.
Upside: Performance and isolation benefits of pooling and subprocesses. Use same logic as the PHP-CLI runner.
Downside: Only a superficial fix for mass-market hosts. We expect these hosts to have other resource limits (max_execution_time) and/or functional limits ("disable proc_open()") that would frustrate the process-manager.
Comment: @artfulrobot, it sounds like you were highlighting problems with (B), and I completely agree -- it's not viable for one HTTP request to spawn N workers.
(C) Start the pool (process-manager) via local PHP-CLI but launch workers via HTTP: This is what the prototype/benchmark did when I posted before.
Upside: Many of the performance and isolation benefits of pooling and subprocesses. The process-manager doesn't have to run directly on the main server (just like crontab doesn't have to run on the main server). It could be an offsite desktop, Raspberry Pi, or a tiny cloud VM that merely orchestrates HTTP requests.
Downside: Resource limits (max_execution_time, max_children, etc) are shared between frontend+backend requests - and (one presumes) the limits are tuned for frontend requests (eg max_execution_time=30s). To check the queue from off-site, you either need to traverse a firewall or provide a REST-y protocol for task management.
Comment: We haven't talked through (C) much?
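For reference, (C) could be as simple as a small CLI loop (run anywhere, even offsite) that keeps a handful of HTTP worker requests in flight. The /civicrm/queue/run endpoint, the wid parameter, and the pool size below are illustrative, not existing routes or settings:

// Rough sketch of (C): a CLI process-manager that fires short-lived HTTP worker requests.
$poolSize = 5;
$procs = [];

while (TRUE) {
  // Reap any workers that have finished.
  foreach ($procs as $wid => $p) {
    if (!proc_get_status($p)['running']) {
      proc_close($p);
      unset($procs[$wid]);
    }
  }
  // Keep up to $poolSize HTTP worker requests running.
  for ($wid = 1; $wid <= $poolSize; $wid++) {
    if (!isset($procs[$wid])) {
      $procs[$wid] = proc_open(
        ['curl', '-s', "https://example.org/civicrm/queue/run?wid=$wid"],
        [1 => ['file', '/dev/null', 'w'], 2 => ['file', '/dev/null', 'w']],
        $pipes
      );
    }
  }
  sleep(2);   // each worker request is bounded by HTTP limits, so re-spawn frequently
}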
I do like the progressive-enhancement philosophy that you mention.