design : alchemiscale strategist
contents
This design document proposes our design for the alchemiscale Strategist
service.
For prior work on this subject, see Ian Kenney’s development log.
the problem#
Users of alchemiscale currently submit an AlchemicalNetwork
they wish to compute via the AlchemiscaleClient
, then create and action Task
s for each Transformation
in the AlchemicalNetwork
.
This works just fine, and gives users ultimate control over how to allocate compute across their Transformation
s, but it also means that users either:
waste compute on
Transformation
s that already have converged results by indiscriminantly performing moreTask
s than are needed across theTransformation
s in anAlchemicalNetwork
.have to micromanage the compute they allocate by making decisions on which
Transformation
s to perform moreTask
s for.
our solution#
We aim to provide users with another option: define and submit a Strategy
alongside their AlchemicalNetwork
, and let this Strategy
efficiently allocate compute effort across the Transformation
s until it is satisfied.
This will allow users to largely “fire-and-forget” their AlchemicalNetwork
s to alchemiscale, only concerning themselves with retrieving results as Task
s complete.
user interaction#
Users can create a parameterized Strategy
from stratocaster
, then set this as the Strategy
on their AlchemicalNetwork
:
from alchemiscale import AlchemiscaleClient, Scope, ScopedKey
from stratocaster import ConnectivityStrategy
# instantiate client for alchemiscale instance
asc = AlchemiscaleClient('https://api.alchemiscale.localdomain')
# read in a pre-built network; choose a scope to submit to
network = AlchemicalNetwork.from_json('network.json')
scope = Scope('org', 'campaign', 'project')
# submit the network
network_sk: ScopedKey = asc.create_network(network, scope)
# create an instance of a strategy, using default settings
strategy = ConnectivityStrategy(ConnectivityStrategy.default_settings())
# set the strategy for this network
asc.set_network_strategy(network_sk, strategy)
...
# later, retrieve results for the network
results = asc.get_network_results(an_sk)
The Strategy
will automatically be applied by alchemiscale to the AlchemicalNetwork
.
Task
s will be periodically created and actioned on the AlchemicalNetwork
as needed based on the Strategy
’s proposal for how much additional effort to allocate to each Transformation
given the results accumulated so far.
additional options#
The AlchemiscaleClient.set_network_strategy()
method also features the following keyword arguments that adjust how the Strategy
is performed by alchemiscale, independent of the Strategy
’s own settings:
max_tasks_per_transformation
: the maximum number of actionedTask
s allowed on aTransformation
at once; default 3max_tasks_per_network
: the max number of actionedTask
s allowed on theAlchemicalNetwork
at once; defaultNone
task_scaling
: modulates how to translate weights intoTask
counts;"linear"
scales this count directly by weight, while"exponential"
operates more conservatively by requiring higher weights to yield higher countssleep_interval
: wait time between iterations of theStrategy
; theStrategist
service (see below) will have also have a minimumsleep_interval
, and the larger of the two will take effect
A user can replace the Strategy
on the AlchemicalNetwork
using set_network_strategy()
as above, or can drop the Strategy
entirely by calling:
# drop strategy from the network
asc.set_network_strategy(network_sk, None)
mode, status, and introspection#
A Strategy
assigned to an AlchemicalNetwork
can be in one of the following mode
s:
full : the
Strategy
can create, action, and cancelTask
s on theAlchemicalNetwork
based on its proposedTransformation
weightspartial : the
Strategy
can only create and actionTask
s on theAlchemicalNetwork
based on its proposedTransformation
weights; it cannot ever cancelTask
sdisabled : the
Strategy
is switched off, and won’t be performed by the server
Independent of mode
, an assigned Strategy
also has a status
.
status
may be one of the following:
awake : the
Strategy
has not hit any stop conditions, and has not errored; it will performed according to itsmode
dormant : the
Strategy
has reached stop conditions, and will no longer be performed until new results appearerror : the
Strategy
has encountered anException
, and will no longer be performed
The mode
is set by the user, and changes the way the Strategist
service handles the Strategy
.
The status
of the Strategy
is changed by the Strategist
over time based on execution conditions, and can be manually set back to awake by a user.
See the Strategist
service section for details.
Users can interrogate Strategy
state with:
> asc.get_network_strategy_state(an_sk)
StrategyState(mode: 'partial', status: 'awake', iterations: 4, sleep_interval: 3600,
last_iteration: '2025-05-30T18:24:30.540413+00:00',
last_iteration_result_count: 1213,
max_tasks_per_transformation: 5,
max_tasks_per_network: None,
task_scaling: 'exponential',
)
Or specifically ask for its status
:
> asc.get_network_strategy_status(an_sk)
'partial'
A Strategy
with status
error will also feature the ability to get exception
and traceback
information from StrategyState
:
> state = asc.get_network_strategy_state(an_sk)
> state.status
'error'
> state.exception
("KeyError", "No such key 'foo'")
> state.traceback
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/stratocaster/base/strategy.py", line 127, in propose
return self._propose(alchemical_network, protocol_results)
...
Users can also retrieve the Strategy
itself with:
> asc.get_network_strategy(an_sk)
<ConnectivityStrategy-d78edfd56996deef89deb19e231d0e79>
This is useful if a user wants to make a new Strategy
based on the settings of the current one, but with some modifications.
If the Strategy
has gone dormant or entered error status, users can kick it awake with:
asc.set_network_strategy_awake(an_sk)
the Strategist service#
All Strategy
s submitted by users are performed server-side by the Strategist
service.
This service directly interfaces with the state and object stores in the same way as the API services do, minimizing latency and complexity.
--- config: theme: neutral --- graph LR; subgraph user client client[[user client]] end subgraph alchemiscale server api[[user api]] & strategist[[strategist]] & computeapi[[compute api]]--> statestore[(state store)] & objectstore[(object store)] end subgraph HPC computehpc[[compute service]] end subgraph k8s computek8s[[compute service]] end subgraph Folding at Home computefah[[compute service]] end client-->api computehpc-->computeapi computek8s-->computeapi computefah-->computeapi
The service performs the following sequence as cycles in an infinite loop:
Query the state store for all non-disabled and non-error
Strategy
s.For each
Strategy
due for an iteration based onsleep_interval
, dispatch the following toProcessPoolExecutor
:If the
Strategy
is dormant, check the number of successfulProtocolDAGResult
s against last recorded count.- If different, switch
status
toawake
and proceed; if not, remaindormant
and skip.
- If different, switch
Pull the
AlchemicalNetwork
and all successfulProtocolDAGResult
s for eachTransformation
, and gather intoProtocolResult
s.Feed these to
Strategy.propose()
to yield aStrategyResult
, and acquire normalizedTransformation
weights withStrategyResult.resolve()
.If weights are all
None
, setStrategy
status
to dormant and skip. Additionally, ifStrategy
mode
is full, cancel all actionedTask
s on theAlchemicalNetwork
.For
Transformation
s witherror
edTask
s, set weights toNone
.
Convert these weights into
Task
counts for eachTransformation
(see below for how we do this).Set the count of actioned
Task
s for eachTransformation
, creating newTask
s as necessary.- If
Strategy
mode
is full, cancelTask
s as necessary onTransformation
s that have too manyTask
s actioned.
- If
Update the
Strategy
iteration
count, number of successfulProtocolDAGResult
s encountered,last_iteration
datetime.If any of the above steps failed, set
Strategy
status
to error.
Sleep for configured
Strategist
sleep_interval
.
proposal weights to Task counts#
We require a mechanism for translating normalized Strategy
proposal weights (continuous values from 0 to 1) into discrete Task
counts for the Transformation
s in an AlchemicalNetwork
.
There are likely many reasonable ways to do this, but we propose the following per Transformation
:
w: float # proposed Transformation weight
max_tasks_per_transformation: int
task_scaling: str # 'linear' or 'exponential'
if w is None or w == 0:
tasks = 0
elif w == 1:
tasks = max_tasks_per_transformation
else:
if task_scaling == 'linear':
tasks = int(1 + w * max_tasks_per_transformation)
elif task_scaling == 'exponential':
tasks = int((1 + max_tasks_per_transformation)**w)
Which gives the following qualitative relationship between weight and Task
counts, assuming max_tasks_per_transformation = 6
:
max_tasks_per_transformation = 6
# linear
tasks 1 2 3 4 5 6
|----------|-----------|----------|-----------|----------|-----------|
weight 0 1
# exponential
tasks 1 2 3 4 5 6
|--------------------------------|----------------|--------|----|--|-|
weight 0 1
Following this, we (potentially) scale down the Task
counts based on max_tasks_per_network
:
import numpy as np
task_counts: list[int] # proposed task counts for each Transformation
total_task_counts = sum(task_counts)
if (max_tasks_per_network is not None) and
(total_task_counts > max_tasks_per_network):
task_counts_scaled = (np.array(task_counts) *
max_tasks_per_network/total_task_counts)
task_counts_scaled = list(task_counts_scaled.astype(int))
else:
task_counts_scaled = task_counts
Importantly, this rescaling retains the determinism in the Task
counts emitted by Strategist
.
Strategy status transitions#
A newly-set Strategy
begins in the awake status
, but can transition to dormant or error (and back again):
stateDiagram-v2 direction LR dormant --> awake : count(ok PDRs) != last_iteration_result_count dormant --> awake : user reset awake --> dormant : Strategy stop condition error --> awake : user reset awake --> error : Strategy raise exception
Strategy database schema#
A Strategy
submitted to alchemiscale for a given AlchemicalNetwork
is represented in the state store (Neo4j) as:
graph LR; Strategy-- PROGRESSES -->AlchemicalNetwork
Where the PROGRESSES
relationship features the (Neo4j-friendly versions of) attributes of StrategyState
:
class StrategyState:
mode: StrategyModeEnum
status: StrategyStatusEnum
iterations: int
sleep_interval: int
last_iteration: datetime
last_iteration_result_count: int
max_tasks_per_transformation: int
max_tasks_per_network: int | None
task_scaling: StrategyTaskScalingEnum
Since Strategy
is a GufeTokenizable
, this allows the same Strategy
object to serve multipe AlchemicalNetwork
s in the same Scope
if already present, while the StrategyState
specific to each AlchemicalNetwork
is encoded in the relationship to that AlchemicalNetwork
.
An alternative to this approach would be to make StrategyState
a separate node with relationships to both the Strategy
and AlchemicalNetwork
, but this is likely unnecessarily complex.
miscellanea#
Additional notes:
The
Strategist
should not create and action any newTask
s on aTransformation
featuringTask
s withstatus = error
.- users should deal with these
error
edTask
s in order to clear theTransformation
for handling by theStrategy
- users should deal with these
The
Strategist
must feature an LRU cache of substantial max size forProtocolDAGResult
s, reducing the need to pull them from the object store each time it performs aStrategy
iteration.- instead of
ProtocolDAGResult
s, this cache could retainProtocolResult
s along with the count ofProtocolDAGResult
used to create them; cache invalidation would compare this count with the count present in the state store; this approach could substantially reduce cache size while still offering sufficient performance
- instead of
When a
Strategy
is in fullmode
, it should prioritize canceling unclaimedTask
s when possible to avoid wasting compute.We may later want to include a mechanism for making
Strategy
s perform extensions from existingTask
s on aTransformation
. One idea is to include anextends_preference
keyword argument toAlchemiscaleClient.set_network_strategy()
with the following behavior:- if
0
, perform no extensions at all - if
1
, always perform extension when possible - if between
0
and1
, extend or not extend in proportion to this value (e.g.0.7
would mean around 70% ofTask
s created would extend from anotherTask
)
- if
Must be careful that setting
Strategy
toNone
doesn’t deleteStrategy
object if otherPROGRESSES
relationship(s) present.