Why do you need this?

If you find yourself needing to run a Daft workload on a Ray cluster without having to figure out all the little details, this is for you.

Usage

First, write your script. Here's an example. Note that you can declare dependencies at the top in an inline script metadata block (PEP 723), like so:

# /// script
# dependencies = ['numpy']
# ///

import daft

if __name__ == "__main__":
    daft.context.set_runner_ray()

    df = daft.from_pydict({"nums": [1, 2, 3]})
    df = df.with_column("result", daft.col("nums").cbrt()).collect()
    df.show()

Now you can run this script locally (python myscripts/myscript.py) for your own testing. When you're ready, you can run it on a Ray cluster through GitHub Actions:

uv run tools/gha_run_cluster_job.py myscripts/myscript.py

Now you can view your script being executed at: https://github.com/Eventual-Inc/Daft/actions/workflows/run-cluster.yaml

Once your job has run, you can also view its logs (including a Daft RayRunner trace that is produced by default) by navigating to the job's Summary page and downloading the ray-daft-logs zip file.


Other Usage

Custom Script Arguments

You can specify custom arguments to your script after a -- delimiter like so:

uv run tools/gha_run_cluster_job.py myscripts/myscript.py -- --my-script=args
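Everything after the -- delimiter is forwarded to your script unchanged, so your script has to parse those arguments itself. Here's a minimal sketch using argparse; the --my-script flag name just mirrors the placeholder above and is purely illustrative, not an option of gha_run_cluster_job.py:

```python
# Minimal sketch of parsing forwarded arguments inside myscript.py.
# The --my-script flag mirrors the placeholder above; gha_run_cluster_job.py
# simply passes everything after `--` through to your script.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--my-script", default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.my_script)
```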

Daft Version

--daft-version lets you use a specific released version of Daft.

--daft-wheel-url lets you use a specific built wheel of Daft.

You can build a custom wheel for Daft using this workflow. Then you can copy the public GitHub artifact wheel link and pass it to --daft-wheel-url.
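For example (the version number and wheel URL below are placeholders, not real values):

```shell
# Run against a specific released version of Daft (version is illustrative)
uv run tools/gha_run_cluster_job.py myscripts/myscript.py --daft-version 0.3.0

# Or run against a custom-built wheel (URL is a placeholder)
uv run tools/gha_run_cluster_job.py myscripts/myscript.py \
    --daft-wheel-url https://example.com/path/to/daft.whl
```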

More work will eventually be done to make this more seamless and part of the workflow itself. Feel free to contribute functionality to make this happen :)

Cluster Configuration

The cluster is currently hardcoded to a fixed configuration of 4 i3.2xlarge workers. More work is needed to make this configurable. Feel free to make this happen :)