Replace old used-to-be-GORM datastructures (#104305) with sqlc-generated
structs. This also makes it possible to use more specific structs that
are more taylored to the specific queries, increasing efficiency.
This commit mostly deals with workers, including the sleep schedule and
task scheduler.
Functional changes are kept to a minimum, as the API still serves the
same data.
Because this work covers so much of Flamenco's code, it's been split up
into different commits. Each commit brings Flamenco to a state where it
compiles and unit tests pass. Only the result of the final commit has
actually been tested properly.
Ref: #104343
Introduce an "event bus"-like system. It's more like a fan-out
broadcaster for certain events. Instead of directly sending events to
SocketIO, they are now sent to the broker, which in turn sends it to any
registered "forwarder". Currently there is ony one forwarder, for
SocketIO.
This opens the door for a proper MQTT client that sends the same events
to an MQTT server.
When the worker is started with `-restart-exit-code 47` or has
`restart_exit_code=47` in `flamenco-worker.yaml`, it's marked as
'restartable'. This will enable two worker actions 'Restart
(immediately)' and 'Restart (after task is finished)' in the Manager web
interface. When a worker is asked to restart, it will exit with exit
code `47`. Of course any positive exit code can be used here.
Change the package base name of the Go code, from
`git.blender.org/flamenco` to `projects.blender.org/studio/flamenco`.
The old location, `git.blender.org`, has no longer been use since the
[migration to Gitea][1]. The new package names now reflect the actual
location where Flamenco is hosted.
[1]: https://code.blender.org/2023/02/new-blender-development-infrastructure/
Limit which worker statuses are remembered (when they go offline) to
those that we want to restore when they come back online. This is now
set to `awake` and `asleep`. This prevents workers from being told to go
to states that they cannot handle, such as `error` or `starting`.
Fix#99549: When sending Workers offline, remember their previous status
When the status of a worker goes offline, the Manager will now make the status of the worker to be remembered once it goes back online. So when the Worker makes this status change (so for example `X → offline`), Manager should immediately set `StatusRequested = "X" ` once it goes online.
Reviewed-on: https://projects.blender.org/studio/flamenco/pulls/104217
Fix an issue where workers would switch immediately on a state change
request, even if it was of the "after task is finished" kind.
The "may I keep running" endpoint wasn't checking the lazyness flag, and
thus any state change, lazy or otherwise, would interrupt the worker's
current task.
The Manager now broadcasts a worker update to SocketIO clients when a
worker gets a new task assigned. This ensures the "current task" shown in
the worker details view is up to date.
When a worker's tasks get requeued because it goes to sleep, the task log
will now mention the worker identification (name + UUID). This aids in
figuring out what happened to tasks.
When a Worker changes state from `awake` to something else, it cannot
run tasks any more. This now triggers a requeue of its active task
(should be one at most, if things are sane) so that another worker can pick
it up.
SQLite often errors out on this with only `interrupted (9)` as message.
This logging should at least tell us whether it's our own "background
context" timing out, or whether something else fishy is going on.
Fix workers timing out when they're `asleep`. When sleeping, the Worker
will call the `WorkerState` operation to see if they have to wake up, but
that didn't mark the workers as "seen". As a result, a sleeping worker
would always time out.
Run some API operations in a background context. This should prevent some
of the SQLite "interrupted" errors, as those can occur when the context
closes while a query is running.
The API operations that Workers use are now mostly running in a separate
background context, at least from the moment onward when they can run
independently of the Worker connection.
Move the Worker password hashing/comparison functions into a struct, and
use it via an interface. This will make it easier to switch to different
hashing algorithms.
Even with a low number of iterations, BCrypt is quite slow. That's good for
security, but not for Flamenco Worker authentication -- the password is
more as "nice check to avoid accidentally reusing the same ID" than
something for security.
Add a "Last Rendered" view to the webapp.
The Manager now stores (in the database) which job was the last
recipient of a rendered image, and serves that to the appropriate
OpenAPI endpoint.
A new SocketIO subscription + accompanying room makes it possible for
the web interface to receive all rendered images (if they survive the
queue, which discards images when it gets too full).
After processing an image in the "last-rendered" processor, a SocketIO
object is sent to clients to indicate the last-rendered image needs to
be (re)loaded.
This also moves the previously existing "done callback" from a single
function to a per-image callback, so that it can be called with the
right information in there, and only when that particular image is
actually done processing.
The notification message sent via SocketIO also contains the necessary
info to render the image, so that the web client doesn't have to call
the `fetchJobLastRenderedInfo` operation.
Add a handler for the OpenAPI `taskOutputProduced` operation, and an
image thumbnailing goroutine.
The queue of images to process + the function to handle queued images
is managed by `last_rendered.LastRenderedProcessor`. This queue currently
simply allows 3 requests; this should be improved such that it keeps
track of the job IDs as well, as with the current approach a spammy job
can starve the updates from a more calm job.
Move the code related to task updates from workers to
`worker_task_updates.go`. It's going to get more complex with the
blocklisting in there; this prepares for that.
No functional changes.
This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, as that was (back then incorrectly) `workerupdate.name`.
Now that code is correct.
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.
This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.
NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.
If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task
Note that this commit is necessary to not have the workers time out
immediately ;-)
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.
This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
The requeue-task-on-worker-signoff operation also needs to log a timestamp.
The code for this, and the recently added code for timestamping the
"task assigned to worker" message, are now unified.
Add a small wrapper around github.com/google/uuid. That way it's clearer
which functionality is used by Flamenco, doesn't link most of the code to
any specific UUID library, and allows a bit of customisation.
The only customisation now is that Flamenco is a bit stricter in the
formats it accepts; only the `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` is
accepted. This makes things a little bit stricter, with the advantage
that we don't need to do any normalisation of received UUID strings.
Implement task log broadcasting via SocketIO. The logs aren't shown in the
web interface yet, but do arrive there in a Pinia store. That store is
capped at 1000 lines to keep memory requirements low-ish.
This also changes the order in which the task is updated; the activity is
now saved first, so that it can be included in the task status change
notification sent to SocketIO clients.