Send & handle `JobUpdate.refresh_tasks = true` when many tasks are
updated simultaneously. This applies to things like cancelling &
requeueing an entire job.
This partially rolls back 67bf77de13d99b1bc5d7344951068822c4fadd88, as
that change was too slow when 1000+ tasks were being updated at once.
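A minimal sketch of the idea; the Go type and the `send` callback below
are assumptions, only the `refresh_tasks` flag comes from the actual
schema:

```go
package main

import "fmt"

// JobUpdate mirrors the SocketIO payload; the Go field names here are
// assumptions, only the `refresh_tasks` flag is from the schema.
type JobUpdate struct {
	ID           string `json:"id"`
	Status       string `json:"status"`
	RefreshTasks bool   `json:"refresh_tasks"`
}

// broadcastMassTaskUpdate sends a single job-level update instead of a
// per-task update for each of the job's (potentially 1000+) tasks.
func broadcastMassTaskUpdate(jobID, newStatus string, send func(JobUpdate)) {
	send(JobUpdate{
		ID:           jobID,
		Status:       newStatus,
		RefreshTasks: true, // tells clients to re-fetch the task list
	})
}

func main() {
	broadcastMassTaskUpdate("18a9b5d", "canceled", func(u JobUpdate) {
		fmt.Printf("broadcast: %+v\n", u)
	})
}
```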
Worker and Manager implementation of the "may-I-keep-running" protocol.
While running a task, the Worker will periodically ask the Manager
whether it's still allowed to keep running that task (see the sketch
after this list). This allows the Manager to abort commands on Workers
when:
- the Worker should go to another state (typically 'asleep' or
'shutdown'),
- the task changed status from 'active' to something non-runnable
(typically 'canceled' when the job as a whole is canceled), or
- the task has been assigned to a different Worker. This can happen when
a Worker loses its connection to its Manager, resulting in a task
timeout (not yet implemented) after which the task can be assigned to
another Worker. If connectivity is then restored, the first Worker
should abort (last-assigned Worker wins).
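A minimal sketch of the Worker-side loop; the type, function, and
polling interval below are illustrative assumptions, not Flamenco's
actual identifiers:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// MayKeepRunning is the Manager's answer; field names are assumptions.
type MayKeepRunning struct {
	MayKeepRunning bool
	Reason         string
}

// superviseTask polls the Manager while a task's command runs, and
// aborts the command as soon as permission is withdrawn.
func superviseTask(
	ctx context.Context,
	taskID string,
	interval time.Duration,
	ask func(taskID string) MayKeepRunning,
	abort context.CancelFunc,
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // the task finished or was already aborted
		case <-ticker.C:
			if answer := ask(taskID); !answer.MayKeepRunning {
				fmt.Printf("aborting task %s: %s\n", taskID, answer.Reason)
				abort()
				return
			}
		}
	}
}

func main() {
	ctx, abort := context.WithCancel(context.Background())
	// Simulated Manager that revokes permission, e.g. job was canceled.
	deny := func(string) MayKeepRunning {
		return MayKeepRunning{MayKeepRunning: false, Reason: "task is in status 'canceled'"}
	}
	go superviseTask(ctx, "task-123", 100*time.Millisecond, deny, abort)
	<-ctx.Done() // unblocks once the supervisor calls abort()
}
```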
Improve how the task scheduler deals with tasks that already have a
Worker assigned to them (see the sketch after this list):
- When a Worker asks for a task, and there is already an active task
assigned to it, always return that task.
- Otherwise, never schedule active tasks, as those are already being run
by another Worker. If an active task is in fact not running anywhere,
its status should be changed to queued/failed, instead of handling that
situation in the task scheduler.
- Apart from the assigned-and-active case above, ignore the task's
Worker ID when scheduling tasks. If the status is 'queued' or
'soft-failed', the task's Worker ID just indicates who ran the task
last.
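Roughly, the rules above could look like this in code; the `Task` struct
and function are illustrative, not the actual scheduler implementation:

```go
package main

import "fmt"

type Task struct {
	ID       string
	Status   string
	WorkerID string
}

// scheduleTaskForWorker applies the rules above to an already-sorted
// candidate list (prioritisation is out of scope for this sketch).
func scheduleTaskForWorker(workerID string, tasks []Task) *Task {
	// An active task already assigned to this Worker is always returned.
	for i := range tasks {
		if tasks[i].Status == "active" && tasks[i].WorkerID == workerID {
			return &tasks[i]
		}
	}
	for i := range tasks {
		// Other active tasks are never scheduled; they run elsewhere.
		if tasks[i].Status == "active" {
			continue
		}
		// For queued/soft-failed tasks the Worker ID is only historical
		// ("who ran this last") and is deliberately ignored.
		if tasks[i].Status == "queued" || tasks[i].Status == "soft-failed" {
			return &tasks[i]
		}
	}
	return nil // nothing runnable for this Worker
}

func main() {
	tasks := []Task{
		{ID: "t1", Status: "active", WorkerID: "other-worker"},
		{ID: "t2", Status: "queued", WorkerID: "worker-1"},
	}
	fmt.Printf("scheduled: %+v\n", *scheduleTaskForWorker("worker-1", tasks))
}
```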
Check for jobs in 'cancel-requested' or 'requeued' statuses, and ensure
they transition to the right status. This happens at startup, before
even starting the web interface, so that a consistent state is presented.
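For illustration only, such a startup check could reduce to a small
status mapping; the target statuses ('canceled', 'queued') are
assumptions, the source statuses come from the text:

```go
package main

import "fmt"

// startupTransitions maps transitional job statuses to the status they
// should settle into. The targets here are assumptions.
var startupTransitions = map[string]string{
	"cancel-requested": "canceled",
	"requeued":         "queued",
}

// settleJobStatus returns the status a job should move to at startup,
// or ok=false when the job's status is already consistent.
func settleJobStatus(status string) (newStatus string, ok bool) {
	newStatus, ok = startupTransitions[status]
	return
}

func main() {
	if newStatus, ok := settleJobStatus("cancel-requested"); ok {
		fmt.Println("moving job to status:", newStatus)
	}
}
```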
When the job status changes, it impacts the task statuses as well. These
status changes are no longer done with a single database query; instead,
each affected task is fetched, changed, and saved (sketched below). This
unifies the regular & mass updates to the tasks, and causes the
resulting task changes to be broadcast to SocketIO clients.
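A sketch of that per-task loop, with the persistence and broadcast
layers stubbed out as callbacks (the names are illustrative, not
Flamenco's real API):

```go
package main

import "fmt"

type Task struct {
	ID     string
	Status string
}

// applyJobStatusToTasks loads each affected task, changes it, and saves
// it individually, so every change also reaches SocketIO clients.
func applyJobStatusToTasks(
	fetch func(jobID string) []Task,
	save func(Task) error,
	broadcast func(Task),
	jobID, newTaskStatus string,
) error {
	for _, task := range fetch(jobID) {
		task.Status = newTaskStatus
		if err := save(task); err != nil {
			return fmt.Errorf("saving task %s: %w", task.ID, err)
		}
		broadcast(task) // per-task update to SocketIO clients
	}
	return nil
}

func main() {
	fetch := func(string) []Task { return []Task{{ID: "t1", Status: "active"}} }
	save := func(Task) error { return nil }
	broadcast := func(t Task) { fmt.Printf("task update: %+v\n", t) }
	_ = applyJobStatusToTasks(fetch, save, broadcast, "job-1", "canceled")
}
```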
Manager now sends out task updates via SocketIO, and the web interface
handles those.
Note that there is a `BroadcastTaskUpdate()` function, but no
`BroadcastNewTask()`. The 'new job' broadcast is sent after the job's
tasks have been created, and thus there is no need for a separate
broadcast per task.
SocketIO clients can now send a message with the `/subscription` event
type in order to subscribe to or unsubscribe from job-related updates.
These job-related updates themselves aren't sent yet, so this change is
impossible to really test; the SocketIO code for joining/leaving rooms
is called, though.
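A hypothetical shape of the Manager-side handler; the payload fields and
the room callbacks are assumptions, as the real SocketIO library API
isn't shown here:

```go
package main

import "fmt"

// Subscription is the payload of a `/subscription` event.
// The field names are assumptions.
type Subscription struct {
	Op    string `json:"op"`     // "subscribe" or "unsubscribe"
	JobID string `json:"job_id"` // job to receive updates for
}

// handleSubscription joins or leaves the SocketIO room for a job.
// joinRoom/leaveRoom stand in for the real SocketIO library calls.
func handleSubscription(sub Subscription, joinRoom, leaveRoom func(room string)) error {
	room := "job-" + sub.JobID
	switch sub.Op {
	case "subscribe":
		joinRoom(room)
	case "unsubscribe":
		leaveRoom(room)
	default:
		return fmt.Errorf("unknown subscription operation %q", sub.Op)
	}
	return nil
}

func main() {
	join := func(room string) { fmt.Println("joined", room) }
	leave := func(room string) { fmt.Println("left", room) }
	_ = handleSubscription(Subscription{Op: "subscribe", JobID: "18a9b5d"}, join, leave)
}
```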
Add `fetchJobTasks` operation to the Jobs API. This returns a summary of
each of the job's tasks, suitable for display in a task list view.
The exact set of fields may need tweaking once there actually is a task
list view, but at least the functionality is there.
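For illustration, the per-task summary might look like this; the fields
are a guess and, as noted above, may still change:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TaskSummary is a guess at what `fetchJobTasks` returns per task.
type TaskSummary struct {
	ID      string    `json:"id"`
	Name    string    `json:"name"`
	Status  string    `json:"status"`
	Updated time.Time `json:"updated"`
}

func main() {
	summary := TaskSummary{
		ID: "t1", Name: "render-frame-001", Status: "queued", Updated: time.Now(),
	}
	asJSON, _ := json.MarshalIndent(summary, "", "  ")
	fmt.Println(string(asJSON))
}
```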
The password check of Worker API calls was two orders of magnitude
slower than actually handling the API call itself. Since Worker
authentication is not that important (it's all on the same network
anyway, and Worker account registration is automatic too), lowering the
BCrypt cost to the minimum helps.
On my machine, this reduces the time for a password check from 50 ms to
2 ms.
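With Go's `golang.org/x/crypto/bcrypt` package this is a one-constant
change; the snippet below shows the effect in isolation, not Flamenco's
actual call site:

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("worker-secret")

	// bcrypt's cost is logarithmic: each +1 doubles the hashing time.
	// bcrypt.MinCost (4) instead of bcrypt.DefaultCost (10) therefore
	// makes hashing roughly 64x cheaper.
	hash, err := bcrypt.GenerateFromPassword(password, bcrypt.MinCost)
	if err != nil {
		panic(err)
	}

	// Verification reads the cost back from the stored hash itself.
	err = bcrypt.CompareHashAndPassword(hash, password)
	fmt.Println("password ok:", err == nil)
}
```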
To prepare for job status changes being requestable from the API, store
the reason for any status change on the job itself.
Not yet part of the API, just on the persistence layer.
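A sketch of what the Gorm model could gain; the field name is an
assumption, only the idea of storing the reason comes from the text:

```go
package main

import (
	"fmt"

	"gorm.io/gorm"
)

// Job sketches the persistence-layer model; only the new reason field
// matters here, and its name is assumed.
type Job struct {
	gorm.Model
	UUID               string
	Status             string
	StatusChangeReason string // why the job last changed status
}

func main() {
	job := Job{UUID: "18a9b5d", Status: "canceled", StatusChangeReason: "canceled by user"}
	fmt.Printf("%s: %s (%s)\n", job.UUID, job.Status, job.StatusChangeReason)
}
```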
Avoid making users of the persistence layer test against Gorm errors, by
wrapping job/task errors in a new `PersistenceError` struct.
Instead of testing for `gorm.ErrRecordNotFound`, code can now test for
`persistence.ErrJobNotFound` or `persistence.ErrTaskNotFound`.
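A minimal sketch of such a wrapper, assuming error translation happens
in one place; only the sentinel names come from the text:

```go
package main

import (
	"errors"
	"fmt"

	"gorm.io/gorm"
)

// PersistenceError shields callers from Gorm-specific errors. The exact
// shape is an assumption; the sentinel names are from the text above.
type PersistenceError struct {
	Message string
	Err     error // underlying Gorm error, if any
}

func (e PersistenceError) Error() string { return e.Message }
func (e PersistenceError) Unwrap() error { return e.Err }

var (
	ErrJobNotFound  = PersistenceError{Message: "job not found"}
	ErrTaskNotFound = PersistenceError{Message: "task not found"}
)

// translateJobError maps a Gorm error onto a persistence-layer error.
func translateJobError(gormErr error) error {
	if errors.Is(gormErr, gorm.ErrRecordNotFound) {
		return ErrJobNotFound
	}
	return gormErr
}

func main() {
	err := translateJobError(gorm.ErrRecordNotFound)

	// Callers test against persistence sentinels, not Gorm's.
	fmt.Println(errors.Is(err, ErrJobNotFound)) // true
}
```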
By having all the info for the jobs table in the `JobUpdate` schema, the
web frontend won't have to query for the full job when a new job is
added. Fetching the full job is now delayed until someone clicks on the
table row.
Replace the Vue v2 webapp with a Vue v3 one, and embed the OpenAPI
client in the webapp itself (instead of being its own npm project).
- Vue v2.x -> v3.x
- Tabulator v4.x -> v5.1
- Moment JS -> replaced with Luxon JS
- Vue CLI/UI -> replaced with Vite
Use absolute paths to keep behaviour consistent on Windows and Linux;
otherwise a path like `/some/path` is absolute on one platform but not
on the other. This is mostly to get the unit tests in this package
working on Windows, but using absolute paths also makes error logging
clearer.
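The standard library makes the difference easy to see; this snippet is
illustrative, not code from this package:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// `/some/path` is absolute on Linux/macOS but not on Windows,
	// where an absolute path needs a volume name like `C:\`.
	fmt.Println(filepath.IsAbs("/some/path")) // true on Linux, false on Windows

	// Converting to an absolute path early makes behaviour (and error
	// messages) consistent on both platforms.
	abs, err := filepath.Abs("some/relative/path")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs)
}
```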
SQLite can return `SQLITE_BUSY` errors when it's doing too many things at
the same time. This is now improved a bit by setting a 5-second timeout,
during which the SQLite driver will wait for the database to become
available. If that doesn't happen, Flamenco Manager will return a
`503 Service Unavailable` response so that the client knows to back off
a little.
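Assuming Gorm on top of the mattn/go-sqlite3 driver, the timeout can be
set via the DSN; the exact configuration in Flamenco Manager may differ:

```go
package main

import (
	"log"

	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

func main() {
	// `_busy_timeout` is in milliseconds; this assumes the
	// mattn/go-sqlite3 driver, which accepts it as a DSN parameter.
	dsn := "flamenco-manager.sqlite?_busy_timeout=5000"
	db, err := gorm.Open(sqlite.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatalf("opening database: %v", err)
	}

	// If a lock outlasts the timeout, queries still fail with
	// SQLITE_BUSY; the Manager then responds 503 Service Unavailable.
	_ = db
}
```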