2210 Commits

Author SHA1 Message Date
Sybren A. Stüvel
ec55cf6ce1 FEATURES.md: more features 2022-06-17 16:37:06 +02:00
Sybren A. Stüvel
64c8fa851d Show assigned worker in task details
Show the worker assigned to the task in the task details view, as a link
to the worker itself.
2022-06-17 16:36:55 +02:00
Sybren A. Stüvel
7327896db9 Worker: allow overriding worker name from environment
Allow overriding the worker name by setting the `FLAMENCO_WORKER_NAME`
environment variable. This makes it easy to do from Docker configs, and,
more importantly, from the scripts I use to run multiple workers on the
same machine while developing Flamenco.
2022-06-17 16:24:03 +02:00
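The override described above can be sketched as follows. This is a minimal illustration, not the Worker's actual code: only the `FLAMENCO_WORKER_NAME` variable comes from the commit message, while `resolveWorkerName` and the hostname fallback are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// workerNameEnvVar is the variable named in the commit message.
const workerNameEnvVar = "FLAMENCO_WORKER_NAME"

// resolveWorkerName (hypothetical helper) returns the name found via lookup
// when it is set and non-empty, and the fallback otherwise.
func resolveWorkerName(lookup func(string) (string, bool), fallback string) string {
	if name, ok := lookup(workerNameEnvVar); ok && name != "" {
		return name
	}
	return fallback
}

func main() {
	// Falling back to the hostname is an assumption made for this sketch.
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "unknown-worker"
	}
	fmt.Println("worker name:", resolveWorkerName(os.LookupEnv, hostname))
}
```

Passing the lookup function in keeps the helper testable without touching the real environment.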
Sybren A. Stüvel
857704c184 Web: worker nickname → name
See 55676b000efbd04cd895da9068f375dfad473ff4
2022-06-17 15:55:36 +02:00
Sybren A. Stüvel
cdb7789f08 Refactor: Manager, move test code
Move code that covers `worker_task_updates.go` into
`worker_task_updates_test.go`.

No functional changes.
2022-06-17 15:51:15 +02:00
Sybren A. Stüvel
046853932d Manager: re-queue previously failed tasks of worker when blocklisting
When a Worker is blocked from a job, re-queue its previously failed tasks
so that other workers can give them a try.
2022-06-17 15:49:16 +02:00
Sybren A. Stüvel
b95bed1f96 Refactor: rename RequeueTasksOfWorker to RequeueActiveTasksOfWorker
Soon there will be another function to requeue tasks of workers by other
criteria, so being clear in the name helps.

No functional changes.
2022-06-17 15:49:16 +02:00
Sybren A. Stüvel
fd31a85bcd Manager: add blocking of workers when they fail certain tasks too much
When a worker fails too many tasks, of the same task type, on the same job,
it'll get blocked from doing those.
2022-06-17 15:49:16 +02:00
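A rough sketch of the counting behind such a block, assuming failures are tallied per worker, job, and task type. The `failureTracker` type and its threshold are hypothetical; the commit message does not state how the Manager stores or configures this.

```go
package main

import "fmt"

// failureKey identifies the combination named in the commit message:
// one worker, one job, one task type.
type failureKey struct {
	WorkerID, JobID, TaskType string
}

// failureTracker (hypothetical) counts failures per key in memory.
type failureTracker struct {
	threshold int // assumed setting: failures allowed before blocking
	counts    map[failureKey]int
}

func newFailureTracker(threshold int) *failureTracker {
	return &failureTracker{threshold: threshold, counts: map[failureKey]int{}}
}

// RecordFailure registers one failure and reports whether the worker should
// now be blocked from this task type on this job.
func (t *failureTracker) RecordFailure(key failureKey) bool {
	t.counts[key]++
	return t.counts[key] >= t.threshold
}

func main() {
	tracker := newFailureTracker(3)
	key := failureKey{WorkerID: "worker-1", JobID: "job-42", TaskType: "blender"}
	for i := 1; i <= 3; i++ {
		fmt.Printf("failure %d, block worker: %v\n", i, tracker.RecordFailure(key))
	}
}
```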
Sybren A. Stüvel
56abc825a6 Refactor: Manager, refactor handling of task failures
Split the handling of soft and hard failures into separate functions.

No functional changes intended.
2022-06-17 15:01:52 +02:00
Sybren A. Stüvel
0396919229 FEATURES: add new way in which jobs can get stuck 2022-06-17 14:59:26 +02:00
Sybren A. Stüvel
6feee74c54 Cleanup: Manager, move worker task update handling code into its own file
Move the code related to task updates from workers to
`worker_task_updates.go`. It's going to get more complex with the
blocklisting in there; this prepares for that.

No functional changes.
2022-06-17 11:46:07 +02:00
Sybren A. Stüvel
50e795c595 FEATURES.md: mark 'clear task failure list' as done 2022-06-17 11:39:57 +02:00
Sybren A. Stüvel
81f81d0e0a Show task failure list in the web frontend
Show the task failure list in the web frontend's `TaskDetails` component.
2022-06-17 11:37:56 +02:00
Sybren A. Stüvel
7f14dac62f OAPI: regenerate code 2022-06-17 11:37:54 +02:00
Sybren A. Stüvel
aaed1e0589 OAPI: include task failure list in Task schema
Include the list of workers who failed this task in the `Task` schema.
2022-06-17 11:37:28 +02:00
Sybren A. Stüvel
0b5140fc5f Manager: clear task failure list on requeueing of jobs & tasks
When a job or task gets requeued from the web interface, its task
failure lists (i.e. the list of workers that previously failed this
task) will be cleared.

This clearing doesn't happen in other situations, e.g. when a worker
signs off and its task gets requeued, the task's failure list will
remain as-is.
2022-06-17 11:37:28 +02:00
Sybren A. Stüvel
d8be9d95e8 README: document task status meanings 2022-06-17 11:37:28 +02:00
Sybren A. Stüvel
e9fca8d993 Cleanup: typo fix in comment 2022-06-17 11:03:43 +02:00
Sybren A. Stüvel
b991e5f446 Cleanup: Manager, clarify some function names of the task state machine
Rename functions `onTaskStatusX` to `updateJobOnTaskStatusX` to clarify
their responsibility is to update the job in reaction to a task status
change.

No functional changes.
2022-06-17 11:01:41 +02:00
Sybren A. Stüvel
8764f8f7c1 Manager: task scheduler, don't schedule tasks the worker failed before
When a worker asks for a task to perform, don't give it a task that it
failed before.
2022-06-16 16:02:28 +02:00
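The scheduling rule above could look roughly like this. It is an in-memory sketch; the `task` type and `nextTaskFor` are hypothetical, and the real scheduler presumably does this filtering in its database query.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for a queued task.
type task struct {
	ID       string
	FailedBy map[string]bool // IDs of workers that failed this task before
}

// nextTaskFor returns the first queued task that the given worker has not
// failed before, or false when no such task exists.
func nextTaskFor(workerID string, queue []task) (task, bool) {
	for _, t := range queue {
		if t.FailedBy[workerID] {
			continue // this worker already failed it; leave it for others
		}
		return t, true
	}
	return task{}, false
}

func main() {
	queue := []task{
		{ID: "task-1", FailedBy: map[string]bool{"worker-1": true}},
		{ID: "task-2"},
	}
	if t, ok := nextTaskFor("worker-1", queue); ok {
		fmt.Println("worker-1 gets", t.ID)
	}
}
```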
Sybren A. Stüvel
ec10128f85 Worker: Sleep command, return error when sleep time is negative
I need a way to reliably generate task errors, and having a more thorough
check on the sleep duration parameter seemed a nice way to create those.
2022-06-16 15:46:03 +02:00
Sybren A. Stüvel
d5d0893b05 Worker: use explicit types for command parameter errors
Introduce `ParameterMissingError` and `ParameterInvalidError` structs, to
be returned from command executors. These replace free-form `fmt.Errorf()`
style errors.
2022-06-16 15:45:09 +02:00
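As an illustration of why typed errors help here: callers can branch on the kind of failure with `errors.As` instead of matching message strings. Only the two struct names come from the commit message; their fields, the `duration_in_seconds` parameter name, and the helper functions are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ParameterMissingError reports a required parameter that was not given.
// The field name is an assumption.
type ParameterMissingError struct {
	Parameter string
}

func (e ParameterMissingError) Error() string {
	return fmt.Sprintf("missing parameter %q", e.Parameter)
}

// ParameterInvalidError reports a parameter with an unusable value.
// The field names are assumptions.
type ParameterInvalidError struct {
	Parameter string
	Reason    string
}

func (e ParameterInvalidError) Error() string {
	return fmt.Sprintf("invalid parameter %q: %s", e.Parameter, e.Reason)
}

// checkSleepParams is a hypothetical validation step for a sleep command.
func checkSleepParams(params map[string]float64) error {
	seconds, ok := params["duration_in_seconds"]
	if !ok {
		return ParameterMissingError{Parameter: "duration_in_seconds"}
	}
	if seconds < 0 {
		return ParameterInvalidError{Parameter: "duration_in_seconds", Reason: "cannot be negative"}
	}
	return nil
}

func isMissingParameter(err error) bool {
	var missing ParameterMissingError
	return errors.As(err, &missing)
}

func isInvalidParameter(err error) bool {
	var invalid ParameterInvalidError
	return errors.As(err, &invalid)
}

func main() {
	err := checkSleepParams(map[string]float64{"duration_in_seconds": -1})
	fmt.Println(err, "| invalid:", isInvalidParameter(err))
}
```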
Sybren A. Stüvel
8af1b9d976 Worker: fix sync issue in TestUpstreamBufferManagerUnavailable unit test
Fix synchronisation/goroutine issue in the "upstream buffer" test,
where very occasionally the queue size was checked at the wrong time.
2022-06-16 15:43:20 +02:00
Sybren A. Stüvel
da1b42f9fa Worker: fix sqlite connection issue in unit tests
Fix sqlite issues in the "upstream buffer" test. The test used
`:memory:` to have an in-memory DB to separate from other tests. The
"flush at shutdown" code runs in a different goroutine, though, and
creates a new DB connection. The SQLite separation was too strong,
making that function not find any tables. This is now solved by having
an in-memory database that's shared between all connections made from
the same unit test.
2022-06-16 15:42:52 +02:00
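One way to get an in-memory database that is shared between connections is SQLite's URI filename form with `mode=memory&cache=shared`. The sketch below only builds such a connection string per test; whether this is the exact form used in the fix is an assumption, as is the `sharedMemoryDSN` helper, and the SQLite driver in use must accept URI filenames.

```go
package main

import (
	"fmt"
	"net/url"
)

// sharedMemoryDSN (hypothetical helper) builds a SQLite URI for a named
// in-memory database. Connections that open the same name share one
// database; different names stay isolated from each other.
func sharedMemoryDSN(testName string) string {
	return "file:" + url.PathEscape(testName) + "?mode=memory&cache=shared"
}

func main() {
	fmt.Println(sharedMemoryDSN("TestUpstreamBufferManagerUnavailable"))
}
```

Using the test name as the database name keeps tests separated while still letting a second goroutine's connection see the same tables.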
Sybren A. Stüvel
7e28cfa69c Worker: add task failures to the task log as well
Task failures were only placed in the task's activity field, and are now
added to the log as well.
2022-06-16 12:22:05 +02:00
Sybren A. Stüvel
e1309ad8fc Worker: flush upstream buffer when shutting down
When shutting down, the worker now tries to flush any buffered task updates
before closing.
2022-06-16 12:21:17 +02:00
Sybren A. Stüvel
9ddf72fa37 Worker: sign off as last step of shutdown
Within the shutdown procedure, signing off is now the last thing the
worker does. This makes things more consistent from the Manager's point
of view (like receiving last-second log entries while the Worker is still
online).
2022-06-16 12:19:03 +02:00
Sybren A. Stüvel
5bc94101e8 Worker: Avoid sleep at shutdown
Make the sleep between fetching tasks interruptible, so that a shutdown
doesn't have to wait a few seconds.
2022-06-16 12:08:13 +02:00
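A common Go pattern for such an interruptible sleep is to `select` on a timer and a stop channel. The sketch below assumes that approach; `sleepOrStop` is a hypothetical name, and the Worker's real code may differ.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepOrStop waits for d, but returns early when stop is closed.
// It reports whether the full duration elapsed.
func sleepOrStop(d time.Duration, stop <-chan struct{}) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case <-stop:
		return false // interrupted, e.g. by a shutdown
	case <-timer.C:
		return true
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Simulate a shutdown request arriving shortly after the sleep starts.
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel()
	}()

	completed := sleepOrStop(5*time.Second, ctx.Done())
	fmt.Println("slept the full duration:", completed)
}
```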
Sybren A. Stüvel
9ab41984ac Adjust Go code for Nickname -> Name change
This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, because the web code read `workerupdate.name`, which
was incorrect at the time. With this change, that code is now correct.
2022-06-16 11:03:18 +02:00
Sybren A. Stüvel
61aad21e99 OAPI: regenerate code 2022-06-16 11:02:04 +02:00
Sybren A. Stüvel
55676b000e OAPI: change worker 'nickname' to just 'name'
There was no need to have the extra four letters 'nick', and some parts
of the code were already using just 'name' for the workers. This simplifies
and unifies things.
2022-06-16 11:01:27 +02:00
Sybren A. Stüvel
12f0a605a4 Manager: log configured worker timeout at startup 2022-06-16 10:51:17 +02:00
Sybren A. Stüvel
5f2712980e Manager: task scheduler, check for requested worker status change first
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
2022-06-16 10:48:38 +02:00
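The ordering described above, as a small sketch. `requiredStatusToGetTask` and the `awake` state appear in the commit messages; the `worker` and `reply` types, the reply kinds, and the `asleep` status in the example are hypothetical.

```go
package main

import "fmt"

// requiredStatusToGetTask mirrors the constant named in a neighbouring commit.
const requiredStatusToGetTask = "awake"

// worker is a hypothetical, simplified view of a worker record.
type worker struct {
	Status          string
	StatusRequested string // empty when no status change is queued
}

// reply is a hypothetical scheduler response.
type reply struct {
	Kind      string // "status-change", "no-work", or "task"
	NewStatus string
}

// scheduleTask checks for a queued status change first, so the worker hears
// about it even when it is not currently allowed to receive work.
func scheduleTask(w worker) reply {
	if w.StatusRequested != "" {
		return reply{Kind: "status-change", NewStatus: w.StatusRequested}
	}
	if w.Status != requiredStatusToGetTask {
		return reply{Kind: "no-work"}
	}
	return reply{Kind: "task"}
}

func main() {
	fmt.Println(scheduleTask(worker{Status: "asleep", StatusRequested: "awake"}))
}
```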
Sybren A. Stüvel
ee53373878 Cleanup: compare worker state to constant instead of hard-coded state
Use the `requiredStatusToGetTask` constant to compare the worker status,
and not just for logging.

No functional changes, just better code.
2022-06-16 10:46:50 +02:00
Sybren A. Stüvel
40f711bf69 Fix two unit tests for the previous commit
I pushed too soon :'(
2022-06-16 10:42:04 +02:00
Sybren A. Stüvel
be0b10400f Manager: count workers as 'seen' even when there is no task
Fix a bug where a worker would only be counted as 'seen' by the task
scheduler if it actually got a task assigned.
2022-06-16 10:39:42 +02:00
Sybren A. Stüvel
7d7c2b1bd6 Cleanup: blacklist → blocklist
Change "blacklist" to "blocklist", because that makes people happier.

No functional changes.
2022-06-16 10:36:36 +02:00
Sybren A. Stüvel
6e12a2fb25 Manager: keep track of which worker failed which task
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.

This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.

NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
2022-06-13 18:41:38 +02:00
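The soft-failure rule can be sketched as a function of the task's failure list. The `soft-failed` status comes from the commit message; the `failed` status name, the threshold parameter, and `statusAfterFailure` are assumptions for illustration.

```go
package main

import "fmt"

const (
	statusSoftFailed = "soft-failed" // named in the commit message
	statusFailed     = "failed"      // assumed name for the hard failure
)

// statusAfterFailure records that workerID failed the task and returns the
// status the task should get. failedBy is the task's failure list; threshold
// is a hypothetical setting for how many distinct workers must fail first.
func statusAfterFailure(failedBy map[string]bool, workerID string, threshold int) string {
	failedBy[workerID] = true
	if len(failedBy) >= threshold {
		return statusFailed
	}
	return statusSoftFailed
}

func main() {
	failedBy := map[string]bool{}
	for _, w := range []string{"worker-1", "worker-1", "worker-2", "worker-3"} {
		fmt.Println(w, "->", statusAfterFailure(failedBy, w, 3))
	}
}
```

Because the failure list is a set of workers, the same worker failing twice counts only once towards the threshold.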
Sybren A. Stüvel
c5debdeb70 Manager: add 'task failure list' to record workers failing tasks
The persistence layer can now store which worker failed which task, as
preparation for a blocklisting system. Such a system should be able to
determine whether there are still any workers left to do the work.
2022-06-13 18:41:30 +02:00
Sybren A. Stüvel
e35911d106 Manager: add ability to delete jobs
This is needed for a future unit test, and exposed the fact that SQLite
didn't enforce foreign key constraints (and thus also didn't handle
on-delete-cascade attributes). This has been fixed in the previous commit.
2022-06-13 18:41:19 +02:00
Sybren A. Stüvel
e5d0e987e1 Manager: enforce DB foreign key checks at startup
SQLite disables foreign key checks by default, so Flamenco has to enable
them explicitly.
2022-06-13 18:41:19 +02:00
Sybren A. Stüvel
6ec493d944 Manager, more efficiently create tasks
When creating tasks, the inter-task dependencies are saved as a 2nd pass, by
updating the tasks in the database. This now only saves those dependencies,
and no longer saves the entire task again.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
02bc03ae2b Manager: replace gorm.Model with our own persistence.Model struct
`persistence.Model` contains the common database fields for most model
structs. It is a copy of `gorm.Model`, but without the `DeletedAt`
field (which triggers Gorm's soft deletion).

Soft deletion is not used by Flamenco. If it ever becomes necessary to
support soft-deletion, see https://gorm.io/docs/delete.html#Soft-Delete
2022-06-13 18:40:42 +02:00
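A sketch of what such a struct might look like, based on the fields of `gorm.Model` with `DeletedAt` left out. The exact definition in Flamenco is not shown in this log, and the embedding `Worker` type is a hypothetical example.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Model holds the common database fields, like gorm.Model but without
// DeletedAt, so Gorm's soft deletion is never triggered.
type Model struct {
	ID        uint `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
}

// Worker is a hypothetical model struct that embeds Model.
type Worker struct {
	Model
	Name string
}

// hasField reports whether the struct value v has a field with this name,
// including fields promoted from embedded structs.
func hasField(v interface{}, name string) bool {
	_, ok := reflect.TypeOf(v).FieldByName(name)
	return ok
}

func main() {
	fmt.Println("Worker has ID:", hasField(Worker{}, "ID"))
	fmt.Println("Worker has DeletedAt:", hasField(Worker{}, "DeletedAt"))
}
```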
Sybren A. Stüvel
ec5b3aac52 Manager: on getting task update from Worker, write log before status change
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.

If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
25d5b01b3c Cleanup: test errors with assert.NoError() instead of assert.Nil()
No functional changes, just nicer way to test.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
6fc936d0a6 Revert accidental debug code
Revert change in rF01c45afc20854918d1f18e6859b4154499d500b6 that made
unit tests use an on-disk database.
2022-06-13 18:40:25 +02:00
Sybren A. Stüvel
b922722614 Manager: broadcast worker timeouts over SocketIO
This way the web interface will also show timed-out workers.
2022-06-13 13:05:20 +02:00
Sybren A. Stüvel
75ca0e652e Cleanup: timeout checker, improve readability of failed tests
No functional changes
2022-06-13 12:50:27 +02:00
Sybren A. Stüvel
1de1e3a9a5 Manager: add 'canary' test to all timeout checker tests
The canary test asserts that certain constants still have the expected
value. Lowering those constants is good for testing the timeout stuff with
the actual Flamenco Manager + Worker (without having to wait 5 minutes for
it to kick in), but it's too easy to accidentally run the unit tests and
get cryptic errors about everything failing horribly and miserably when
you leave those constants low.
2022-06-13 12:50:02 +02:00
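The idea of a canary check, sketched outside of a test framework. The constant name and its value are illustrative assumptions; in the real code this would be an assertion at the start of each timeout checker test.

```go
package main

import (
	"fmt"
	"time"
)

// workerTimeout is a hypothetical constant that a developer might lower
// temporarily while testing timeouts by hand.
const workerTimeout = 5 * time.Minute

// canary returns an explicit error when a constant no longer has the value
// the tests were written for.
func canary(name string, actual, expected time.Duration) error {
	if actual != expected {
		return fmt.Errorf("canary: %s is %v, expected %v; was it lowered for manual testing?",
			name, actual, expected)
	}
	return nil
}

// canaryCheck would be called first by every test that depends on the constant.
func canaryCheck() error {
	return canary("workerTimeout", workerTimeout, 5*time.Minute)
}

func main() {
	if err := canaryCheck(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("constants have their expected values")
}
```

Failing early with one clear message is the point: it replaces a wall of unrelated test failures with a hint about the actual cause.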
Sybren A. Stüvel
5dac3c2dc0 Manager: mark workers as 'seen' when they send updates
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task

Note that this commit is necessary to not have the workers time out
immediately ;-)
2022-06-13 12:47:07 +02:00