2180 Commits

Author SHA1 Message Date
Sybren A. Stüvel
55676b000e OAPI: change worker 'nickname' to just 'name'
There was no need to have the extra four letters 'nick', and some parts
of the code were already using just 'name' for the workers. This simplifies
and unifies things.
2022-06-16 11:01:27 +02:00
Sybren A. Stüvel
12f0a605a4 Manager: log configured worker timeout at startup 2022-06-16 10:51:17 +02:00
Sybren A. Stüvel
5f2712980e Manager: task scheduler, check for requested worker status change first
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated before saying "no work for you", so that the Worker can
actually respond to them.
2022-06-16 10:48:38 +02:00
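The ordering described in the commit above can be sketched roughly as follows. This is only an illustration: the `Worker` struct, its `StatusRequested` field, and the `scheduleResponse` function are hypothetical names, not taken from the Flamenco source.

```go
package main

import "fmt"

// WorkerStatus is a hypothetical stand-in for the Manager's worker status type.
type WorkerStatus string

const (
	StatusAwake  WorkerStatus = "awake"
	StatusAsleep WorkerStatus = "asleep"
)

// Worker holds only the fields this sketch needs (assumed, not the real model).
type Worker struct {
	Name            string
	Status          WorkerStatus
	StatusRequested WorkerStatus // empty when no status change is queued
}

// scheduleResponse describes what the scheduler would tell the worker.
func scheduleResponse(w Worker) string {
	// First communicate any queued-up status change...
	if w.StatusRequested != "" && w.StatusRequested != w.Status {
		return "change status to " + string(w.StatusRequested)
	}
	// ...and only then decide whether the worker is allowed to do work.
	if w.Status != StatusAwake {
		return "no work for you"
	}
	return "here is a task"
}

func main() {
	w := Worker{Name: "render-01", Status: StatusAsleep, StatusRequested: StatusAwake}
	fmt.Println(scheduleResponse(w)) // change status to awake
}
```

With the checks in the opposite order, the sleeping worker in `main` would only ever hear "no work for you" and never learn that it was asked to wake up.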
Sybren A. Stüvel
ee53373878 Cleanup: compare worker state to constant instead of hard-coded state
Use the `requiredStatusToGetTask` constant to compare the worker status,
and not just for logging.

No functional changes, just better code.
2022-06-16 10:46:50 +02:00
Sybren A. Stüvel
40f711bf69 Fix two unit tests for the previous commit
I pushed too soon :'(
2022-06-16 10:42:04 +02:00
Sybren A. Stüvel
be0b10400f Manager: count workers as 'seen' even when there is no task
Fix a bug where a worker would only be counted as 'seen' by the task
scheduler if it actually got a task assigned.
2022-06-16 10:39:42 +02:00
Sybren A. Stüvel
7d7c2b1bd6 Cleanup: blacklist → blocklist
Change "blacklist" to "blocklist", because that makes people happier.

No functional changes.
2022-06-16 10:36:36 +02:00
Sybren A. Stüvel
6e12a2fb25 Manager: keep track of which worker failed which task
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.

This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.

NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
2022-06-13 18:41:38 +02:00
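A minimal sketch of the soft-failure idea from the commit above, under stated assumptions: the commit message does not say how many workers count as "enough", so `failureThreshold` is a made-up value, and `failureList` and `addFailure` are hypothetical names rather than Flamenco's actual API.

```go
package main

import "fmt"

// failureThreshold is a hypothetical value; the commit message does not
// specify how many distinct workers must fail before the task hard-fails.
const failureThreshold = 3

// failureList records which workers failed which task
// (task ID -> set of worker names). In-memory stand-in for the database.
type failureList map[string]map[string]bool

// addFailure registers that the worker failed the task and returns the
// status the task should get as a result.
func (fl failureList) addFailure(taskID, worker string) string {
	if fl[taskID] == nil {
		fl[taskID] = make(map[string]bool)
	}
	fl[taskID][worker] = true // the same worker failing twice counts once

	if len(fl[taskID]) >= failureThreshold {
		return "failed"
	}
	return "soft-failed"
}

func main() {
	fl := failureList{}
	for _, worker := range []string{"worker-a", "worker-b", "worker-c"} {
		fmt.Println(worker, "->", fl.addFailure("task-1", worker))
	}
}
```

Note that, as in the commit message, nothing in this sketch resets the failure list when a task is requeued.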
Sybren A. Stüvel
c5debdeb70 Manager: add 'task failure list' to record workers failing tasks
The persistence layer can now store which worker failed which task, as
preparation for a blocklisting system. Such a system should be able to
determine whether there are still any workers left to do the work.
2022-06-13 18:41:30 +02:00
Sybren A. Stüvel
e35911d106 Manager: add ability to delete jobs
This is needed for a future unit test, and exposed the fact that SQLite
didn't enforce foreign key constraints (and thus also didn't handle
on-delete-cascade attributes). This has been fixed in the previous commit.
2022-06-13 18:41:19 +02:00
Sybren A. Stüvel
e5d0e987e1 Manager: enforce DB foreign key checks at startup
SQLite disables foreign key checks by default, so Flamenco has to enable
them explicitly.
2022-06-13 18:41:19 +02:00
Sybren A. Stüvel
6ec493d944 Manager, more efficiently create tasks
When creating tasks, the inter-task dependencies are saved as a 2nd pass, by
updating the tasks in the database. This now only saves those dependencies,
and no longer saves the entire task again.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
02bc03ae2b Manager: replace gorm.Model with our own persistence.Model struct
`persistence.Model` contains the common database fields for most model
structs. It is a copy of `gorm.Model`, but without the `DeletedAt`
field (which triggers Gorm's soft deletion).

Soft deletion is not used by Flamenco. If it ever becomes necessary to
support soft-deletion, see https://gorm.io/docs/delete.html#Soft-Delete
2022-06-13 18:40:42 +02:00
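For illustration, a struct along the lines described in the commit above might look like this. The field set follows `gorm.Model` as documented by Gorm, minus `DeletedAt`; the `Task` struct and the `hasField` helper are hypothetical and only serve to show that embedding still works.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Model mirrors the common fields of gorm.Model, but leaves out DeletedAt,
// so that Gorm's soft deletion is not triggered.
type Model struct {
	ID        uint `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
}

// Task is a hypothetical model struct that embeds Model.
type Task struct {
	Model
	Name string
}

// hasField reports whether the struct value has a (possibly embedded) field
// with the given name.
func hasField(v any, name string) bool {
	_, ok := reflect.TypeOf(v).FieldByName(name)
	return ok
}

func main() {
	fmt.Println(hasField(Task{}, "CreatedAt")) // true
	fmt.Println(hasField(Task{}, "DeletedAt")) // false
}
```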
Sybren A. Stüvel
ec5b3aac52 Manager: on getting task update from Worker, write log before status change
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.

If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
25d5b01b3c Cleanup: test errors with assert.NoError() instead of assert.Nil()
No functional changes, just a nicer way to test.
2022-06-13 18:40:42 +02:00
Sybren A. Stüvel
6fc936d0a6 Revert accidental debug code
Revert change in rF01c45afc20854918d1f18e6859b4154499d500b6 that made
unit tests use an on-disk database.
2022-06-13 18:40:25 +02:00
Sybren A. Stüvel
b922722614 Manager: broadcast worker timeouts over SocketIO
This way the web interface will also show timed-out workers.
2022-06-13 13:05:20 +02:00
Sybren A. Stüvel
75ca0e652e Cleanup: timeout checker, improve readability of failed tests
No functional changes
2022-06-13 12:50:27 +02:00
Sybren A. Stüvel
1de1e3a9a5 Manager: add 'canary' test to all timeout checker tests
The canary test asserts that certain constants still have the expected
value. Lowering those constants is good for testing the timeout stuff with
the actual Flamenco Manager + Worker (without having to wait 5 minutes for
it to kick in), but it's too easy to accidentally run the unit tests and
get cryptic errors about everything failing horribly and miserably when
you leave those constants low.
2022-06-13 12:50:02 +02:00
Sybren A. Stüvel
5dac3c2dc0 Manager: mark workers as 'seen' when they send updates
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task

Note that this commit is necessary to not have the workers time out
immediately ;-)
2022-06-13 12:47:07 +02:00
Sybren A. Stüvel
986b647967 Manager: re-queue tasks of timed-out workers
Allow other workers to pick up the task(s) assigned to a timed-out worker.
2022-06-13 12:38:35 +02:00
Sybren A. Stüvel
7d5aae25b5 Manager: add timeout checks for workers 2022-06-13 12:33:22 +02:00
Sybren A. Stüvel
e8171fc597 Cleanup: Manager, reduce log level of task timeout checks 2022-06-13 12:33:16 +02:00
Sybren A. Stüvel
67562856d3 Manager: let Gorm create an index on Task.LastTouchedAt
It's used in timeout queries, and there could be tens or hundreds of
thousands of tasks in the database.
2022-06-13 12:33:05 +02:00
Sybren A. Stüvel
c3525c3b1a Manager: move task requeueing to TaskStateMachine
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.

This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
2022-06-13 12:33:01 +02:00
Sybren A. Stüvel
e06bc484f4 Cleanup: manager, move task state machine interfaces to their own file
No functional changes.
2022-06-13 12:32:18 +02:00
Sybren A. Stüvel
01c45afc20 Manager: explicitly store timestamps as UTC
SQLite doesn't handle timezones by default when you use a plain comparison
such as `date1 < date2`. This makes GORM explicitly use UTC
timestamps for the `CreatedAt`, `UpdatedAt`, and `DeletedAt` fields.
Our own code should also use UTC when saving timestamps. That way all
datetimes in the database are in the same timezone, and can be compared
naively.
2022-06-13 12:10:11 +02:00
Sybren A. Stüvel
ec3a74f5f6 VSCode: disable 'cover on save' setting, it's too noisy 2022-06-10 16:53:39 +02:00
Sybren A. Stüvel
bf831aa0fd FEATURES: mark task timeout monitoring as done 2022-06-10 16:14:38 +02:00
Sybren A. Stüvel
fe1627dd85 Cleanup: timeout checker, move task-specific code to tasks.go
Just a cleanup to prepare for the addition of worker timeouts.
2022-06-10 14:58:44 +02:00
Sybren A. Stüvel
13307c5a24 Manager: add canary test to timeout checker unit test
The `TestTaskTimeout()` unit test assumes specific durations for initial &
subsequent sleeps of the timeout checker. The test will fail quite
cryptically when that assumption doesn't hold, so just test for it at
the start of the unit test.
2022-06-10 14:53:23 +02:00
Sybren A. Stüvel
09902d201c Manager: fix task timeout check logging of assigned workers
The task's worker wasn't fetched from the database, always causing
"unknown worker" messages in the task log.
2022-06-10 14:52:03 +02:00
Sybren A. Stüvel
734982ffbc Manager: log HTTP endpoints only at Trace level
Log available HTTP URLs only at trace level; it made the debug log too
noisy.
2022-06-10 14:50:41 +02:00
Sybren A. Stüvel
d90a8b987d Manager: Task Timeout Checker
Tasks that are in state `active` but haven't been 'touched' by a Worker
for 10 minutes or longer will transition to state `failed`.

In the future, it might be better to move the decision about which state
is suitable to the Task State Machine service, so that it can be smarter
and take the history of the task into account. Going to `soft-failed`
first might be a nice touch.
2022-06-10 14:32:02 +02:00
Sybren A. Stüvel
295891a17a Manager: ensure Gorm-generated timestamps are in UTC
SQLite should store all timestamps in UTC, as the database is woefully
unaware of timezones and will compare lexicographically.
2022-06-10 14:31:53 +02:00
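A small sketch of why the commit above matters: when timestamps are compared as text, mixed UTC offsets give wrong answers, while UTC-normalised strings sort correctly. The `toUTCString` helper is a hypothetical name, and the RFC 3339 layout is used here purely for demonstration; it is not necessarily the format Gorm writes to SQLite.

```go
package main

import (
	"fmt"
	"time"
)

// toUTCString parses an RFC 3339 timestamp and returns it converted to UTC.
func toUTCString(rfc3339 string) (string, error) {
	t, err := time.Parse(time.RFC3339, rfc3339)
	if err != nil {
		return "", err
	}
	return t.UTC().Format(time.RFC3339), nil
}

func main() {
	a := "2022-06-13T12:10:11+02:00" // 10:10:11 UTC, the earlier moment
	b := "2022-06-13T11:00:00Z"      // 11:00:00 UTC, the later moment

	// Lexicographic comparison of the raw strings claims a is later than b.
	fmt.Println(a > b) // true (wrong)

	ua, err := toUTCString(a)
	if err != nil {
		panic(err)
	}
	ub, err := toUTCString(b)
	if err != nil {
		panic(err)
	}
	// After normalising to UTC, the text comparison matches reality.
	fmt.Println(ua > ub) // false (correct)
}
```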
Sybren A. Stüvel
24204084c1 Manager: move timestamping of log messages to task_logs package
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
2022-06-09 17:00:38 +02:00
Sybren A. Stüvel
819cad1d18 Manager: move broadcasting of task logs via SocketIO to task log service
To ensure all task logs also get broadcast via SocketIO, the responsibility
has moved from the `api_impl` to the `task_logs` package.
2022-06-09 16:49:48 +02:00
Sybren A. Stüvel
04dd479248 Manager: protect task log writing with mutex
A per-task mutex is used to protect the writing of task logs, so that
multiple goroutines can safely write to the same task log.
2022-06-09 14:44:54 +02:00
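One possible shape of such a per-task mutex is sketched below. It keeps logs in memory instead of on disk, and all names (`Storage`, `taskLog`, `Write`, `LineCount`, `writeConcurrently`) are hypothetical; the actual `task_logs` package may be organised differently.

```go
package main

import (
	"fmt"
	"sync"
)

// taskLog is the log of a single task, guarded by its own mutex.
type taskLog struct {
	mu    sync.Mutex
	lines []string
}

// Storage is an in-memory stand-in for the task log storage service.
type Storage struct {
	mu   sync.Mutex          // protects the logs map itself
	logs map[string]*taskLog // one entry (and thus one mutex) per task ID
}

func NewStorage() *Storage {
	return &Storage{logs: make(map[string]*taskLog)}
}

// logFor returns the log for the task, creating it when needed.
func (s *Storage) logFor(taskID string) *taskLog {
	s.mu.Lock()
	defer s.mu.Unlock()
	tl, ok := s.logs[taskID]
	if !ok {
		tl = &taskLog{}
		s.logs[taskID] = tl
	}
	return tl
}

// Write appends a line to the task's log while holding only that task's mutex,
// so writers to different tasks do not block each other.
func (s *Storage) Write(taskID, line string) {
	tl := s.logFor(taskID)
	tl.mu.Lock()
	defer tl.mu.Unlock()
	tl.lines = append(tl.lines, line)
}

// LineCount returns the number of lines written to the task's log.
func (s *Storage) LineCount(taskID string) int {
	tl := s.logFor(taskID)
	tl.mu.Lock()
	defer tl.mu.Unlock()
	return len(tl.lines)
}

// writeConcurrently writes n lines to one task log from n goroutines.
func writeConcurrently(s *Storage, taskID string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Write(taskID, fmt.Sprintf("line %d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	s := NewStorage()
	writeConcurrently(s, "task-1", 100)
	fmt.Println(s.LineCount("task-1")) // 100
}
```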
Sybren A. Stüvel
92d6693871 Show Task's "last touched" in the web interface 2022-06-09 11:59:43 +02:00
Sybren A. Stüvel
1c9846bb8f OAPI: regenerate code 2022-06-09 11:59:32 +02:00
Sybren A. Stüvel
f020582bf7 OpenAPI: include last_touched in Task schema
Include the timestamp of when a Worker last touched the task in the OpenAPI
`Task` schema.
2022-06-09 11:59:01 +02:00
Sybren A. Stüvel
354fd29f9e Manager: Start timeout counting as soon as Worker gets task assigned
Set the task's "last touched" field in the database to "now" as soon as
the task is assigned to a worker.
2022-06-09 11:58:30 +02:00
Sybren A. Stüvel
87bce6be36 Manager: unify logging of task assignment and requeue-on-signoff
The requeue-task-on-worker-signoff operation also needs to log a timestamp.
The code for this, and the recently added code for timestamping the
"task assigned to worker" message, are now unified.
2022-06-09 11:30:46 +02:00
Sybren A. Stüvel
75903a2da3 Manager: prepend timestamp to "task assigned to worker" task log entries
Add a new `clock` service to the Flamenco struct, which allows us to mock
the passing of time, and thus test for timestamps in a stable fashion.
2022-06-09 11:24:02 +02:00
Sybren A. Stüvel
7c43b9e1bc Web: include status by name in job & task tables
Having only the status dot was hard to read. It requires you to learn &
remember the different colours, or to mouse-over and wait to see the
tooltip. For accessibility, we shouldn't be using just the colour to
convey information in the interface.
2022-06-09 11:01:03 +02:00
Sybren A. Stüvel
b186ea1828 Manager: write to task log when assigning it to a worker 2022-06-09 10:59:44 +02:00
Sybren A. Stüvel
b4d2fc4231 Manager: keep track of when a Worker last worked on a task
This will be used for keeping track of stuck tasks.
2022-06-03 16:33:50 +02:00
Sybren A. Stüvel
0be1ca30dd Cleanup: manager, move api_impl interfaces to interfaces.go
The number of interfaces declared by the `api_impl` package is getting
large, so they deserve their own file.

No functional changes.
2022-06-03 15:52:07 +02:00
Sybren A. Stüvel
bba5c6020d FEATURES.md: mark some features as done, add some others 2022-06-03 15:52:07 +02:00
Sybren A. Stüvel
b41feee313 Web: reduce workers table height
The 720px was almost filling up the entire height, making it hard to add
anything new at the top. Soon it should be auto-resizing anyway, making
this less relevant.
2022-06-03 13:02:23 +02:00