Add a few more unit tests for the persistence layer. The goal is to have
100% coverage of the happy flow, to aid in conversion from GORM to sqlc.
No functional changes.
As a safety measure, refuse to delete Workers from the Manager's database
when foreign key constraints are disabled.
In the long term, the underlying problem should be solved. This is a stop-
gap measure to ensure database consistency.
Before deleting a Worker Tag, check that foreign key constraints are
active for the current database connection.
Sometimes GORM decides to create a new database connection by itself,
without telling us, and then foreign key constraints are not active on
it. This commit is a workaround to avoid database corruption.
Move some of the Worker Tags test code into a function of its own, to have
a clearer separation between 'the test' and 'what needs to happen to do
this part of the test'.
Also it'll make an upcoming change easier to implement.
No functional changes.
Back in the days when I wrote the code, I didn't know about the
`require` package yet. Using `require.NoError()` makes the test code
more straight-forward.
No functional changes, except that when tests fail, they now fail
without panicking.
Back in the days when I wrote the code, I didn't know about the
`require` package yet. Using `require.NoError()` makes the test code
more straight-forward.
No functional changes, except that when tests fail, they now fail
without panicking.
Fix the database migration that adds `NOT NULL` clauses. It used
`INSERT INTO temp_x SELECT * from x;`, and the `*` returns the fields in
the order they are defined on the table. Since this might be different from
the order that the `INSERT INTO temp_x` expects, strange problems can
happen where columns get swapped (or constraints can fail on columns that
they should not fail for, because they got fed data from a different
column).
This makes it easier to later also create `query_workesr.sql`,
`query_meta.sql` etc. so that the sqlc-generated code can follow the
same subdivision as the persistence service code itself.
No functional changes.
GORM has certain downsides:
- Code-first approach, where queries have to be translated to the Go code
required to execute them.
- GORM comes with its own SQLite implementation, which doesn't provide an
on-connect callback. This means that new connections cannot correctly
enable foreign key constraints, causing database consistency issues.
[SQLC](https://sqlc.dev/) solves these issues for us.
This commit doesn't fully replace GORM with SQLC, but introduces it for
a few queries. Once all queries have been converted, GORM can be removed
completely.
There are still issues with foreign keys getting disabled, so enable them
in the periodic database consistency check.
A more permanent solution is likely to drop GORM and switch to something
else that gives us an on-connect-callback, which can then be used to
turn on foreign key constraints for every connection made.
Add a new API operation to get the overall farm status. This is based on
the jobs and workers, and their status.
The statuses are:
- `active`: Actively working on jobs.
- `idle`: Farm could be active, but has no work to do.
- `waiting`: Work has been queued, but all workers are asleep.
- `asleep`: Farm is idle, and all workers are asleep.
- `inoperative`: Cannot work: no workers, or all are offline/error.
- `starting`: Farm is starting up.
- `unknown`: Unexpected configuration of worker and job statuses.
Prevent logging an error in the persistence layer when an unknown worker
is requested.
This reduces the noise & confusion when the web interface is showing the
details of a worker, but the worker gets removed by someone else. Or when
the Manager doesn't know about a Worker and it's trying to connect.
See #104282.
Create a dedicated package `.../pkg/website` to contain constants for the
URLs of documentation, bug reporting, etc. That way it's easier to see
which parts of the website are being referred to from the Flamenco
binaries, and updates can happen in a central spot.
No functional changes.
Deleting jobs from the database can still sometimes cause consistency
errors, as if foreign key constraints aren't enabled. This check is there
to try and get a grip on things.
Just as a safety measure, before deleting a job, check that foreign key
constraints are enabled. These are optional in SQLite, and the deletion
function assumes that they are on.
Implement the API function to mass-mark jobs for deletion, based on
their 'updated_at' timestamp.
Note that the `last_updated_max` parameter is rounded up to entire
seconds. This may mark more jobs for deletion than you expect, if their
`updated_at` timestamps differ by less than a second.
GORM Automigration created a separate `job_storage_infos` table (because
we used it wrong, to be fair), which is actually only used as an
embedded struct in the `jobs` table. This means this table itself can be
dropped.
Replace GORM's auto-migration with Goose. The latter uses hand-written
SQL queries to apply database schema changes, which is safer and easier to
understand than what GORM is doing.
When the worker is started with `-restart-exit-code 47` or has
`restart_exit_code=47` in `flamenco-worker.yaml`, it's marked as
'restartable'. This will enable two worker actions 'Restart
(immediately)' and 'Restart (after task is finished)' in the Manager web
interface. When a worker is asked to restart, it will exit with exit
code `47`. Of course any positive exit code can be used here.
Change the package base name of the Go code, from
`git.blender.org/flamenco` to `projects.blender.org/studio/flamenco`.
The old location, `git.blender.org`, has no longer been use since the
[migration to Gitea][1]. The new package names now reflect the actual
location where Flamenco is hosted.
[1]: https://code.blender.org/2023/02/new-blender-development-infrastructure/
Instead of always performing the periodic integrity check, make it possible
to disable it or run it at different intervals.
Currently for the Blender Studio it's crunch time, so the check should
really only run when there is someone looking at the system (i.e. at
restarts for upgrade purposes).
Due to an issue (which has been fixed in the previous commit), all tasks
in the database were deleted when starting Flamenco. This tool attempts
to recompile the job and recreate its tasks.
The statuses of the tasks are set based on the job status. Basically:
- job active → tasks queued
- job completed → tasks completed
- job cancelled / failed → tasks cancelled
- otherwise → tasks queued
To ensure that the tool is only used to create tasks from scratch, it
refuses to work on a job that still has tasks in the database.
There is an issue with the GORM auto-migration, in that it doesn't
always disable foreign key constraints when it should. Due to
limitations of SQLite, not all 'alter table' commands you'd want to use
are available. As a workaround, these steps are performed:
1. create a new table with the desired schema,
2. copy the data over,
3. drop the old table,
4. rename the new table to the old name.
Step #3 will wreak havoc with the database when foreign key constraint
checks are active, so no we temporarily deactivate them while performing
database migration.
Mark the default value of `Worker.LazyStatusRequest` as `false`. The
previous default was configured as `0`, which was different enough to
always trigger a database migration of that column. However, since these
values do map to each other, the migration didn't do anything concrete,
and would be triggered again at the next startup.
As it was decided that the name "tags" would be better for the clarity
of the feature, all files and code named "cluster" or "worker cluster"
have been removed and replaced with "tag" and "worker tag". This is only
a name change, no other features were touched.
This addresses part of #104204.
Reviewed-on: https://projects.blender.org/studio/flamenco/pulls/104223
As a note to anyone who already ran a pre-release version of Flamenco
and configured some worker clusters, with the help of an SQLite client
you can migrate the clusters to tags. First build Flamenco Manager and
start it, to create the new database schema. Then run these SQL queries
via an sqlite commandline client:
```sql
insert into worker_tags
(id, created_at, updated_at, uuid, name, description)
select id, created_at, updated_at, uuid, name, description
from worker_clusters;
insert into worker_tag_membership (worker_tag_id, worker_id)
select worker_cluster_id, worker_id from worker_cluster_membership;
```
Perform these two SQL calls & check their result:
- `PRAGMA integrity_check`
- `PRAGMA foreign_key_check`:
See https: //www.sqlite.org/pragma.html for more info on these.
This also removes the unused `PeriodicMaintenanceLoop()` function.
Periodic checking while Flamenco Manager is running might be introduced
in a future commit, after the startup-time checks have been shown to not
get in the way.
When there is an error detected at startup, close the database connection.
Before, the connection could be kept open even when an error was returned,
causing the write-ahead log files to be kept around. These are now
properly integrated into the main database file before exiting.
Upgrade:
- `gorm.io/gorm` v1.23.8 → 1.25.2
- `github.com/glebarez/go-sqlite` v1.17.3 → v1.8.0
- `github.com/glebarez/sqlite` v1.4.6 → v1.8.0
and also some indirect dependencies.
This is in the hope that some weird cases at Blender Studio get resolved.
It appears that sometimes, for some unknown reason, when deleting a job,
its tasks get reassigned to another job (instead of also getting deleted).
Since there is no code in Flamenco itself to do this task deletion (it's
all depending on SQLite following the foreign keys and cascading to tasks),
I hope it was a bug in either GORM or SQLite that got fixed at some point.
Add extra job to the database before deleting one, to ensure that job
deletion doesn't do anything with other jobs (and their tasks).
No functional changes to Flamenco itself.
Workers can be soft-deleted, which means that they stay in the database.
As such, foreign key constraints `ON DELETE CASCADE` do not trigger, and
thus their sleep schedule can still be active. This is now detected and
handled gracefully.
The Shaman Checkout ID setter shouldn't update a job's "updated at"
timestamp. Its goal is to fake that the job was submitted with a new
enough Flamenco version, and thus should not touch the timestamps.
This is a command that can be run to retroactively set the Shaman
Checkout ID of jobs, allowing the job deletion to also remove the job's
Shaman checkout directory.
This is highly experimental, and not built by default or shipped with
Flamenco releases. It's only been used once at Blender Animation Studio
to help cleaning up. Run at your own risk. Make backups first.
- Add a little confirmation overlay before deleting a job. This overlay
also shows information about whether the Shaman checkout directory
will be deleted or not.
- Send job updates to the web frontend when jobs are marked for
deletion, and when they are actually deleted.
- Respond to those updates, and handle some corner cases where job info
is missing (because it just got deleted).
This closes T99401.