If you find yourself needing to run a Daft workload on a Ray cluster without having to figure out all the little details, this is for you.
First, write your script. Here's an example. Note that you can declare dependencies at the top using inline script metadata, like so:
```python
# /// script
# dependencies = ['numpy']
# ///

import daft

if __name__ == "__main__":
    # Run Daft on the Ray runner instead of the default local runner
    daft.context.set_runner_ray()

    df = daft.from_pydict({"nums": [1, 2, 3]})
    df = df.with_column("result", daft.col("nums").cbrt()).collect()
    df.show()
```
Now you can run this script locally (`python myscripts/myscript.py`) for your own testing. When you're ready, you can also run it on a Ray cluster through GitHub Actions:

```bash
uv run tools/gha_run_cluster_job.py myscripts/myscript.py
```
You can then watch your script being executed at: https://github.com/Eventual-Inc/Daft/actions/workflows/run-cluster.yaml
Once your job runs, you can also view any logs (including a Daft RayRunner trace that is produced by default) by navigating to the Summary page of your job and downloading the `ray-daft-logs` zip file.
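If you prefer the command line, the same artifact should also be retrievable with the GitHub CLI. A rough sketch, assuming `gh` is installed and authenticated (`<run-id>` is a placeholder for your run's ID):

```bash
# List recent runs of the run-cluster workflow to find your run ID
gh run list --workflow=run-cluster.yaml --repo Eventual-Inc/Daft

# Download the ray-daft-logs artifact from that run
gh run download <run-id> --name ray-daft-logs --repo Eventual-Inc/Daft
```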
## Custom Script Arguments
You can specify custom arguments to your script after a `--` delimiter like so:

```bash
uv run tools/gha_run_cluster_job.py myscripts/myscript.py -- --my-script=args
```
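For these arguments to have any effect, your script needs to parse them; everything after the `--` is presumably forwarded to the script as normal CLI arguments. A minimal sketch using `argparse` (the `--my-script` flag is just the placeholder from the example above):

```python
import argparse

if __name__ == "__main__":
    # Arguments after the `--` delimiter should arrive here as normal CLI args
    parser = argparse.ArgumentParser()
    parser.add_argument("--my-script", help="placeholder flag from the example above")
    args = parser.parse_args()
    print(f"Received --my-script={args.my_script}")
```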
## Daft Version
- `--daft-version` lets you use a specific released version of Daft.
- `--daft-wheel-url` lets you use a specific built wheel of Daft.
You can build a custom wheel for Daft using this workflow. Then copy the public GitHub artifact wheel link and pass it to `--daft-wheel-url`.
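For example (the version number and wheel URL below are placeholders; presumably these flags go before the `--` delimiter so they are consumed by the tool itself rather than forwarded to your script):

```bash
# Pin a released Daft version (placeholder version number)
uv run tools/gha_run_cluster_job.py myscripts/myscript.py --daft-version=0.3.0

# Or run against a custom-built wheel (placeholder URL)
uv run tools/gha_run_cluster_job.py myscripts/myscript.py \
  --daft-wheel-url=https://example.com/artifacts/daft-0.3.0-py3-none-any.whl
```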
More work will eventually be done to make this more seamless and part of the workflow itself. Feel free to contribute functionality to make this happen :)
## Cluster Configuration
The cluster is currently hardcoded to a fixed configuration of 4 `i3.2xlarge` workers. More work will need to be done to make this configurable. Feel free to make this happen :)
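For reference, a fixed setup like that corresponds roughly to the following in a standard Ray autoscaler YAML. This is purely an illustrative sketch of the shape of such a config, not the file the workflow actually uses; the cluster name, region, and head node type are assumptions:

```yaml
# Illustrative only -- not the workflow's actual configuration
cluster_name: daft-gha-cluster  # hypothetical name

provider:
  type: aws
  region: us-west-2  # assumption; the real region may differ

max_workers: 4

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: i3.2xlarge  # assumption for the head node
  ray.worker.default:
    # Fixed pool of 4 workers, matching the hardcoded setup described above
    min_workers: 4
    max_workers: 4
    node_config:
      InstanceType: i3.2xlarge

head_node_type: ray.head.default
```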