74 Commits

Author SHA1 Message Date
Sybren A. Stüvel
3e72391cbf Restartable workers
When the worker is started with `-restart-exit-code 47` or has
`restart_exit_code=47` in `flamenco-worker.yaml`, it's marked as
'restartable'. This will enable two worker actions 'Restart
(immediately)' and 'Restart (after task is finished)' in the Manager web
interface. When a worker is asked to restart, it will exit with exit
code `47`. Of course any positive exit code can be used here.
2023-08-14 16:00:09 +02:00
Sybren A. Stüvel
02fac6a4df Change Go package name from git.blender.org to projects.blender.org
Change the package base name of the Go code, from
`git.blender.org/flamenco` to `projects.blender.org/studio/flamenco`.

The old location, `git.blender.org`, has no longer been use since the
[migration to Gitea][1]. The new package names now reflect the actual
location where Flamenco is hosted.

[1]: https://code.blender.org/2023/02/new-blender-development-infrastructure/
2023-08-01 12:42:31 +02:00
Sybren A. Stüvel
77db55bb14 Manager: when worker signs off, only remember specific statuses
Limit which worker statuses are remembered (when they go offline) to
those that we want to restore when they come back online. This is now
set to `awake` and `asleep`. This prevents workers from being told to go
to states that they cannot handle, such as `error` or `starting`.
2023-06-23 11:38:37 +02:00
Eveline Anderson
4d2200bb0c Fix #99549: Remember Previous Status (#104217)
Fix #99549: When sending Workers offline, remember their previous status

When the status of a worker goes offline, the Manager will now make the status of the worker to be remembered once it goes back online. So when the Worker makes this status change (so for example `X → offline`), Manager should immediately set `StatusRequested = "X" ` once it goes online.

Reviewed-on: https://projects.blender.org/studio/flamenco/pulls/104217
2023-06-02 22:50:07 +02:00
Sybren A. Stüvel
28cc7b7a3f Manager: improve logging when workers register
The info message that a worker registered now also includes its UUID.
Any failure hashing the password will now also log the worker name + UUID.
2023-04-04 12:13:21 +02:00
Sybren A. Stüvel
e77bd9b841 Fix workers immediately switching state on a lazy request
Fix an issue where workers would switch immediately on a state change
request, even if it was of the "after task is finished" kind.

The "may I keep running" endpoint wasn't checking the lazyness flag, and
thus any state change, lazy or otherwise, would interrupt the worker's
current task.
2022-10-20 12:30:37 +02:00
Sybren A. Stüvel
449c83b94a Manager: broadcast worker update after assigning task
The Manager now broadcasts a worker update to SocketIO clients when a
worker gets a new task assigned. This ensures the "current task" shown in
the worker details view is up to date.
2022-08-01 14:29:08 +02:00
Sybren A. Stüvel
b26374d480 Manager: when worker goes to sleep, log in task log which worker
When a worker's tasks get requeued because it goes to sleep, the task log
will now mention the worker identification (name + UUID). This aids in
figuring out what happened to tasks.
2022-07-28 14:27:44 +02:00
Sybren A. Stüvel
de80a09223 Manager: include job UUID in "last-rendered image received" log entries
This makes it possible to collect all "last-rendered image received"
entries for a single job.
2022-07-19 18:40:22 +02:00
Sybren A. Stüvel
696b97c553 Re-queue tasks of worker after changing to non-'awake' state
When a Worker changes state from `awake` to something else, it cannot
run tasks any more. This now triggers a requeue of its active task
(should be one at most, if things are sane) so that another worker can pick
it up.
2022-07-19 15:38:36 +02:00
Sybren A. Stüvel
3baac0a2d8 Manager: reduce log level when worker asks task but has wrong status
This can happen quite often and it's fine, so it's not worth a warning.
2022-07-18 19:26:49 +02:00
Sybren A. Stüvel
24f921b0c8 Manager: add more logging when worker cannot be marked as 'seen'
SQLite often errors out on this with only `interrupted (9)` as message.
This logging should at least tell us whether it's our own "background
context" timing out, or whether something else fishy is going on.
2022-07-18 19:04:15 +02:00
Sybren A. Stüvel
bfd6746f78 Manager: consult the sleep schedule on worker sign-on
If there is no status change queued for the Worker, the sleep schedule
should determine its initial status.
2022-07-18 18:25:24 +02:00
Sybren A. Stüvel
bc725ea7dc Manager: mark worker as 'seen' when calling the WorkerState operation
Fix workers timing out when they're `asleep`. When sleeping, the Worker
will call the `WorkerState` operation to see if they have to wake up, but
that didn't mark the workers as "seen". As a result, a sleeping worker
would always time out.
2022-07-18 17:56:56 +02:00
Sybren A. Stüvel
0697f71b62 Manager: run some operations in a background context
Run some API operations in a background context. This should prevent some
of the SQLite "interrupted" errors, as those can occur when the context
closes while a query is running.

The API operations that Workers use are now mostly running in a separate
background context, at least from the moment onward when they can run
independently of the Worker connection.
2022-07-18 16:26:06 +02:00
Sybren A. Stüvel
0e4ed1c54d Manager: move worker password hasher into a struct + interface
Move the Worker password hashing/comparison functions into a struct, and
use it via an interface. This will make it easier to switch to different
hashing algorithms.

Even with a low number of iterations, BCrypt is quite slow. That's good for
security, but not for Flamenco Worker authentication -- the password is
more as "nice check to avoid accidentally reusing the same ID" than
something for security.
2022-07-15 15:08:00 +02:00
Sybren A. Stüvel
d25151184d Add a "Last Rendered" view
Add a "Last Rendered" view to the webapp.

The Manager now stores (in the database) which job was the last
recipient of a rendered image, and serves that to the appropriate
OpenAPI endpoint.

A new SocketIO subscription + accompanying room makes it possible for
the web interface to receive all rendered images (if they survive the
queue, which discards images when it gets too full).
2022-07-01 12:34:40 +02:00
Sybren A. Stüvel
0fc5ba0bc6 Manager: broadcast last-rendered image info via SocketIO
After processing an image in the "last-rendered" processor, a SocketIO
object is sent to clients to indicate the last-rendered image needs to
be (re)loaded.

This also moves the previously existing "done callback" from a single
function to a per-image callback, so that it can be called with the
right information in there, and only when that particular image is
actually done processing.

The notification message sent via SocketIO also contains the necessary
info to render the image, so that the web client doesn't have to call
the `fetchJobLastRenderedInfo` operation.
2022-06-30 18:36:24 +02:00
Sybren A. Stüvel
e687c95e5d Manager: add "last rendered image" processing pipeline
Add a handler for the OpenAPI `taskOutputProduced` operation, and an
image thumbnailing goroutine.

The queue of images to process + the function to handle queued images
is managed by `last_rendered.LastRenderedProcessor`. This queue currently
simply allows 3 requests; this should be improved such that it keeps
track of the job IDs as well, as with the current approach a spammy job
can starve the updates from a more calm job.
2022-06-24 16:51:11 +02:00
Sybren A. Stüvel
1586c37b32 Manager: mark task as active as soon as it is assigned to a worker
Move the task to 'active' status so that it won't be assigned to another
worker. This also enables the task timeout monitoring.
2022-06-20 13:00:49 +02:00
Sybren A. Stüvel
b95bed1f96 Refactor: rename RequeueTasksOfWorker to RequeueActiveTasksOfWorker
Soon there will be another function to requeue tasks of workers by other
criteria, so being clear in the name helps.

No functional changes.
2022-06-17 15:49:16 +02:00
Sybren A. Stüvel
6feee74c54 Cleanup: Manager, move worker task update handling code into its own file
Move the code related to task updates from workers to
`worker_task_updates.go`. It's going to get more complex with the
blocklisting in there; this prepares for that.

No functional changes.
2022-06-17 11:46:07 +02:00
Sybren A. Stüvel
9ab41984ac Adjust Go code for Nickname -> Name change
This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, as that was (back then incorrectly) `workerupdate.name`.
Now that code is correct.
2022-06-16 11:03:18 +02:00
Sybren A. Stüvel
5f2712980e Manager: task scheduler, check for requested worker status change first
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
2022-06-16 10:48:38 +02:00
Sybren A. Stüvel
ee53373878 Cleanup: compare worker state to constant instead of hard-coded state
Use the `requiredStatusToGetTask` constant to compare the worker status,
and not just for logging.

No functional changes, just better code.
2022-06-16 10:46:50 +02:00
Sybren A. Stüvel
be0b10400f Manager: count workers as 'seen' even when there is no task
Fix a bug where a worker would only be counted as 'seen' by the task
scheduler if it actually got a task assigned.
2022-06-16 10:39:42 +02:00
Sybren A. Stüvel
6e12a2fb25 Manager: keep track of which worker failed which task
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.

This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.

NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
2022-06-13 18:41:38 +02:00
Sybren A. Stüvel
ec5b3aac52 Manager: on getting task update from Worker, write log before status change
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.

If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
5dac3c2dc0 Manager: mark workers as 'seen' when they send updates
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task

Note that this commit is necessary to not have the workers time out
immediately ;-)
2022-06-13 12:47:07 +02:00
Sybren A. Stüvel
c3525c3b1a Manager: move task requeueing to TaskStateMachine
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.

This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
2022-06-13 12:33:01 +02:00
Sybren A. Stüvel
24204084c1 Manager: move timestamping of log messages to task_logs package
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
2022-06-09 17:00:38 +02:00
Sybren A. Stüvel
819cad1d18 Manager: move broadcasting of task logs via SocketIO to task log service
To ensure all task logs also get broadcast via SocketIO, the responsibility
has moved from the `api_impl` to the `task_logs` package.
2022-06-09 16:49:48 +02:00
Sybren A. Stüvel
354fd29f9e Manager: Start timeout counting as soon as Worker gets task assigned
Set the task's "last touched" field in the database to "now" as soon as
the task is assigned to a worker.
2022-06-09 11:58:30 +02:00
Sybren A. Stüvel
87bce6be36 Manager: unify logging of task assignment and requeue-on-signoff
The requeue-task-on-worker-signoff operation also needs to log a timestamp.
The code for this, and the recently added code for timestamping the
"task assigned to worker" message, are now unified.
2022-06-09 11:30:46 +02:00
Sybren A. Stüvel
75903a2da3 Manager: prepend timestamp to "task assigned to worker" task log entries
Add a new `clock` service to the Flamenco struct, which allows us to mock
the passing of time, and thus test for timestamps in a stable fashion.
2022-06-09 11:24:02 +02:00
Sybren A. Stüvel
b186ea1828 Manager: write to task log when assigning it to a worker 2022-06-09 10:59:44 +02:00
Sybren A. Stüvel
b4d2fc4231 Manager: keep track of when a Worker last worked on a task
This will be used for keeping track of stuck tasks.
2022-06-03 16:33:50 +02:00
Sybren A. Stüvel
6cf82e5d43 Manager: cleanup, refactor Worker state change request persistence code
Move the setting & clearing of worker state change requests into separate
functions.

No functional changes.
2022-06-02 16:36:06 +02:00
Sybren A. Stüvel
132ce8f2ec Merge 'shutdown' and 'offline' states
Move the 'shutdown' state code to the 'offline' state, to match the
removal of the 'shutdown' state from the OpenAPI definition.
2022-06-02 16:35:07 +02:00
Sybren A. Stüvel
9ed6b6d931 Manager: adjust code for WorkerStatusChangeRequest extraction
See preceeding OpenAPI change.
2022-06-02 12:17:54 +02:00
Sybren A. Stüvel
2e11c1c240 Manager: Implement SocketIO worker updates 2022-05-31 15:19:12 +02:00
Sybren A. Stüvel
90b567f97c Manager: store software version on worker sign-on 2022-05-31 12:29:25 +02:00
Sybren A. Stüvel
19db947eb4 Manager: remove Worker.LastActivity
This removes the field both from the OpenAPI interface and the database.
2022-05-31 10:46:27 +02:00
Sybren A. Stüvel
f77b11d85e Manager: add a small wrapper around Google's UUID library
Add a small wrapper around github.com/google/uuid. That way it's clearer
which functionality is used by Flamenco, doesn't link most of the code to
any specific UUID library, and allows a bit of customisation.

The only customisation now is that Flamenco is a bit stricter in the
formats it accepts; only the `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` is
accepted. This makes things a little bit stricter, with the advantage
that we don't need to do any normalisation of received UUID strings.
2022-05-20 15:35:51 +02:00
Sybren A. Stüvel
792b4ab141 Manager: on worker signoff, add a note to any requeued task logs
When a worker signs off, its tasks get requeued. This is now also saved in
the task log, and broadcast via SocketIO as task log chunk.
2022-05-20 14:17:17 +02:00
Sybren A. Stüvel
3e5f681321 Task log broadcasting via SocketIO
Implement task log broadcasting via SocketIO. The logs aren't shown in the
web interface yet, but do arrive there in a Pinia store. That store is
capped at 1000 lines to keep memory requirements low-ish.
2022-05-20 13:03:41 +02:00
Sybren A. Stüvel
3c622264a4 Manager: include 'activity' in SocketIO task updates
This also changes the order in which the task is updated; the activity is
now saved first, so that it can be included in the task status change
notification sent to SocketIO clients.
2022-05-19 14:27:42 +02:00
Sybren A. Stüvel
43f244ecab Manager: move TaskUpdate API function from jobs.go to workers.go
The OpenAPI spec tags this operation as `workers`, so it should be in
`workers.go`.

No functional changes.
2022-05-19 14:20:02 +02:00
Sybren A. Stüvel
0b39f229a1 Implement may-I-keep-running protocol
Worker and Manager implementation of the "may-I-kee-running" protocol.

While running tasks, the Worker will ask the Manager periodically
whether it's still allowed to keep running that task. This allows the
Manager to abort commands on Workers when:

- the Worker should go to another state (typically 'asleep' or
  'shutdown'),
- the task changed status from 'active' to something non-runnable
  (typically 'canceled' when the job as a whole is canceled).
- the task has been assigned to a different Worker. This can happen when
  a Worker loses its connection to its Manager, resulting in a task
  timeout (not yet implemented) after which the task can be assigned to
  another Worker. If then the connectivity is restored, the first Worker
  should abort (last-assigned Worker wins).
2022-05-12 15:06:05 +02:00
Sybren A. Stüvel
ba34652cd1 Implement task status changes from web interface
This also reworks some of the logic due to the recently-removed
`cancel-requested` task status.
2022-05-05 16:44:09 +02:00