Allow overriding the worker name by setting the `FLAMENCO_WORKER_NAME`
environment variable. This makes it easy to do from Docker configs, and,
more importantly, from the scripts I use to run multiple workers on the
same machine while developing Flamenco.
Move the code related to task updates from workers to
`worker_task_updates.go`. It's going to get more complex with the
blocklisting in there; this prepares for that.
No functional changes.
When a job or task gets requeued from the web interface, its task
failure lists (i.e. the list of workers that previously failed this
task) will be cleared.
This clearing doesn't happen in other situations, e.g. when a worker
signs off and its task gets requeued, the task's failure list will
remain as-is.
Rename functions `onTaskStatusX` to `updateJobOnTaskStatusX` to clarify
their responsibility is to update the job in reaction to a task status
change.
No functional changes.
Introduce `ParameterMissingError` and `ParameterInvalidError` structs, to
be returned from command executors. These replace free-form `fmt.Errorf()`
style errors.
Fix sqlite issues in the "upstream buffer" test. The test used
`:memory:` to have an in-memory DB to separate from other tests. The
"flush at shutdown" code runs in a different goroutine, though, and
creates a new DB connection. The SQLite separation was too strong,
making that function not find any tables. This is now solved by having
an in-memory database that's shared between all connections made from
the same unit test.
Within the shutdown procedure, signing off is now the last thing the
worker does. This makes things more consistent from the Manager's point
of view (like receiving last-second log entries while the Worker is still
online).
This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, as that was (back then incorrectly) `workerupdate.name`.
Now that code is correct.
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.
This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.
NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
The persistence layer can now store which worker failed which task, as
preparation for a blocklisting system. Such a system should be able to
determine whether there are still any workers left to do the work.
This is needed for a future unit test, and exposed the fact that SQLite
didn't enforce foreign key constraints (and thus also didn't handle
on-delete-cascade attributes). This has been fixed in the previous commit.
When creating tasks the inter-task dependencies are saved as a 2nd pass,by
updating the tasks in the database. This now only saves those dependencies,
and no longer saves the entire task again.
`persistence.Model` contains the common database fields for most model
structs. It is a copy of `gorm.Model`, but without the `DeletedAt`
field (which triggers Gorm's soft deletion).
Soft deletion is not used by Flamenco. If it ever becomes necessary to
support soft-deletion, see https://gorm.io/docs/delete.html#Soft-Delete
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.
If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
The canary test asserts that certain constants still have the expected
value. Lowering those constants is good for testing the timeout stuff with
the actual Flamenco Manager + Worker (without having to wait 5 minutes for
it to kick in), but it's too easy to accidentally run the unit tests and
get cryptic errors about everything failing horribly and miserably when
you leave those constants low.
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task
Note that this commit is necessary to not have the workers time out
immediately ;-)