Add a new API operation to get the overall farm status. This is based on
the jobs and workers, and their status.
The statuses are:
- `active`: Actively working on jobs.
- `idle`: Farm could be active, but has no work to do.
- `waiting`: Work has been queued, but all workers are asleep.
- `asleep`: Farm is idle, and all workers are asleep.
- `inoperative`: Cannot work: no workers, or all are offline/error.
- `starting`: Farm is starting up.
- `unknown`: Unexpected configuration of worker and job statuses.
Send events on Manager startup & shutdown. To make this possible, events
sent to MQTT are now queued up until an MQTT server can be reached.
Otherwise the startup event would be sent before the MQTT connection was
established.
Introduce an "event bus"-like system. It's more like a fan-out
broadcaster for certain events. Instead of directly sending events to
SocketIO, they are now sent to the broker, which in turn sends it to any
registered "forwarder". Currently there is ony one forwarder, for
SocketIO.
This opens the door for a proper MQTT client that sends the same events
to an MQTT server.
In addition to logging `GOOS` and `GOARCH`, also log more info about the
system:
- Windows: the Windows version and edition.
- Linux: distribution, distribution version, and kernel version.
- macOS: just "macOS", until we know more about getting info there too.
Change the package base name of the Go code, from
`git.blender.org/flamenco` to `projects.blender.org/studio/flamenco`.
The old location, `git.blender.org`, has no longer been use since the
[migration to Gitea][1]. The new package names now reflect the actual
location where Flamenco is hosted.
[1]: https://code.blender.org/2023/02/new-blender-development-infrastructure/
Instead of always performing the periodic integrity check, make it possible
to disable it or run it at different intervals.
Currently for the Blender Studio it's crunch time, so the check should
really only run when there is someone looking at the system (i.e. at
restarts for upgrade purposes).
Implement the `deleteJob` API endpoint. Calling this endpoint will mark
the job as "deletion requested", after which it's queued for actual
deletion. This makes the API response fast, even when there is a lot of
work to do in the background.
A new background service "job deleter" keeps track of the queue of such
jobs, and performs the actual deletion. It removes:
- Shaman checkout for the job (but see below)
- Manager-local files of the job (task logs, last-rendered images)
- The job itself
The removal is done in the above order, so the job is only removed from the
database if the rest of the removal was succesful.
Shaman checkouts are only removed if the job was submitted with Flamenco
version 3.2. Earlier versions did not record enough information to reliably
do this.
When launching Flamenco from a server system with no way to open a web
browser, just ask the user to launch one. Quitting the Manager because
of this was a bit too dramatic.
Extract some code from `cmd/flamenco-manager/main.go` into `webservice.go`
in the same directory, just to make `main.go` a little smaller.
No functional changes.
Include `RELEASE_CYCLE` in the Makefile. This is mentioned at startup of
Manager and Worker, and reflects in the software version they report.
If `RELEASE_CYCLE == "release"`, Manager and Worker report their version
as `ApplicationVersion`. If it's any other string, the Git hash will get
appended.
This commit does not introduce functional changes, besides renaming
every mention of 'wizard' with 'setup assistant'. In order to run the
manager setup assistant use:
./flamenco-manager -setup-assistant
The change was introduced to favor more neutral and descriptive working
for this functionality. Thanks to Sybren for helping to get this done!
Logging the URLs at which the Manager can be reached as the last thing
when starting up, in the hope that this makes them more noticable and
inviting to actually visit.
Flamenco now no longer uses the Git tags + hash for the application
version, but an explicit `VERSION` variable in the `Makefile`.
After changing the `VERSION` variable in the `Makefile`, run
`make update-version`.
Not every part of Flamenco looks at this variable, though. Most
importantly: the Blender add-on needs special handling, because that
doesn't just take a version string but a tuple of integers. Running
`make update-version` updates the add-on's `bl_info` dict with the new
version. If the version has any `-blabla` suffix (like `3.0-beta0`) it
will also set the `warning` field to explain that it's not a stable
release.
For development of the web interface, to get a less predictable order of
asynchronous requests, the API responses were artificially delayed. This
was supposed to be optional, to be enabled via the `-delay` CLI argument,
but somehow the optionalness either never made it in or was mysteriously
removed.
Add a `-pprof` CLI option to enable the profiler. It will expose profiler
info on the web interface at `/debug/pprof/`.
To have a nice view of this, including flame graphs, run:
```
go tool pprof -http localhost:8082 http://localhost:8080/debug/pprof/profile
```
This adds a `-wizard` CLI option to the Manager, which opens a webbrowser
and shows the First-Time Wizard to aid in configuration of Flamenco.
This is work in progress. The wizard is just one page, and doesn't save
anything yet to the configuration.
Rather than just print the error message ("error creating UPnP/SSDP
server"), it now explains what the effect is of this error (workers
unable to automatically find this Manager) and how to solve it (pass
`-manager URL` to the Worker).
The task logs storage system is refactored to use the `local_storage`
package. Configuration options have also changed:
- `task_logs_path` is renamed to `local_manager_storage_path`, to
emphasise that only the Manager deals with those files, with default
value `./flamenco-manager-storage`.
- `storage_path` is renamed to `shared_storage_path`, to emphasise this
is the storage shared between Manager and Workers, with default value
`./flamenco-shared-storage`.
Task logs are still stored in
`${local_manager_storage_path}/job-{jobUUID[0:4]}/{jobUUID}/task-{taskUUID}.txt`
Manifest task: T99409
`os.IsNotExist()` is from before `errors.Is()` existed. The latter is the
recommended approach, as it also recognised wrapped errors.
No functional changes, except for recognising more cases of "does not
exist" errors as such.
Vue Router generates URLs for which there are no static files on the
filesystem (like `/jobs/{job ID}`). To make this work, the webapp's
`index.html` has to be served for such requests. The client-side JavaScript
then figures out how things fit together, and can even render a nice 404
page if necessary.
This shouldn't happen for non-webapp URLs, though. Because of this, the
entire webapp (including the "serve `index.html` if file not found logic)
is moved to a `/app/` base URL.
`make flamenco-manager` now also builds the webapp and embeds the static
files into the binary.
`make flamenco-manager_race` does NOT rebuild the static web files, to
help speed up of debug cycles. Run `make webapp-static` to rebuild the
webapp itself, if necessary, or run a separate web development server with
`yarn --cwd web/app run dev --host`.
Add a handler for the OpenAPI `taskOutputProduced` operation, and an
image thumbnailing goroutine.
The queue of images to process + the function to handle queued images
is managed by `last_rendered.LastRenderedProcessor`. This queue currently
simply allows 3 requests; this should be improved such that it keeps
track of the job IDs as well, as with the current approach a spammy job
can starve the updates from a more calm job.
The OpenAPI library we use for request validation needs to know per mime
type how to handle the contents. The same function for
`application/octet-stream` is now used for `image/png` and `image/jpeg`
as well.
Change "accepted CORS origins" to "acceptable CORS origins", as the former
is too ambiguous (it can mean "I just accepted these" or "These are the
acceptable ones").
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.
This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
Tasks that are in state `active` but haven't been 'touched' by a Worker
for 10 minutes or longer will transition to state `failed`.
In the future, it might be better to move the decision about which state
is suitable to the Task State Machine service, so that it can be smarter
and take the history of the task into account. Going to `soft-failed`
first might be a nice touch.
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
add `-delay` CLI argument, which adds a random delay of around 250ms to
all HTTP responses.
The web interface is quite asynchronous in nature, and having more
randomness and more visible delays in there will help development.
This CLI argument should not be used in production.
Check for jobs in 'cancel-requested' or 'requeued' statuses, and ensure
they transition to the right status. This happens at startup, before
even starting the web interface, so that a consistent state is presented.