This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, as that was (back then incorrectly) `workerupdate.name`.
Now that code is correct.
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.
This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.
NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
The persistence layer can now store which worker failed which task, as
preparation for a blocklisting system. Such a system should be able to
determine whether there are still any workers left to do the work.
This is needed for a future unit test, and exposed the fact that SQLite
didn't enforce foreign key constraints (and thus also didn't handle
on-delete-cascade attributes). This has been fixed in the previous commit.
When creating tasks the inter-task dependencies are saved as a 2nd pass,by
updating the tasks in the database. This now only saves those dependencies,
and no longer saves the entire task again.
`persistence.Model` contains the common database fields for most model
structs. It is a copy of `gorm.Model`, but without the `DeletedAt`
field (which triggers Gorm's soft deletion).
Soft deletion is not used by Flamenco. If it ever becomes necessary to
support soft-deletion, see https://gorm.io/docs/delete.html#Soft-Delete
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.
If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
The canary test asserts that certain constants still have the expected
value. Lowering those constants is good for testing the timeout stuff with
the actual Flamenco Manager + Worker (without having to wait 5 minutes for
it to kick in), but it's too easy to accidentally run the unit tests and
get cryptic errors about everything failing horribly and miserably when
you leave those constants low.
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task
Note that this commit is necessary to not have the workers time out
immediately ;-)
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.
This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
SQLite doesn't handle timezones by default, when you just use something
like `date1 < date2`, for example. This makes GORM explicitly use UTC
timestamps for the `CreatedAt`, `UpdatedAt`, and `DeletedAt` fields.
Our own code should also use UTC when saving timestamps. That way all
datetimes in the database are in the same timezone, and can be compared
naievely.
The `TestTaskTimeout()` unit test assumes specific durations for initial &
subsequent sleeps of the timeout checker. The test will fail quite
cryptically when that assumption doesn't hold, so just test for it at
the start of the unit test.
Tasks that are in state `active` but haven't been 'touched' by a Worker
for 10 minutes or longer will transition to state `failed`.
In the future, it might be better to move the decision about which state
is suitable to the Task State Machine service, so that it can be smarter
and take the history of the task into account. Going to `soft-failed`
first might be a nice touch.
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
The requeue-task-on-worker-signoff operation also needs to log a timestamp.
The code for this, and the recently added code for timestamping the
"task assigned to worker" message, are now unified.
A worker state change request can now be cancelled by requesting the worker
to go to its current state. In other words, a previously requested change
`A → B` can be cancelled by requesting the worker goes to state `A`.
Previously this would simply overwrite the last request, resulting in a
requested state change `A → A`. Having this non-lazy would even interrupt
the currently running task.
rFcfb17b178da2055ef12b2aa2ad8f7f778a952bc3 changed the semantics of
`SocketIOWorkerUpdate`, in the sense that any update that doesn't change
the worker status can omit `previous_status`. This commit adjusts the
unit test for this.