Metaxy + Dagster-Slurm for Efficient Multimodal Pipelines

At ASCII (Supply Chain Intelligence Institute Austria), we process large-scale web crawls and extract structured knowledge from messy data. For us, fast iteration is essential: researchers need to test and compare extraction approaches quickly.
We had been using Spark, but it became a bottleneck for fast experimentation because of:
- its complexity
- a processing model that is not well aligned with AI workloads
- missing sample-level versioning and field-level change semantics
This led us to build a new stack around two tools:
- Metaxy connects orchestrators such as Dagster (table/asset-level control) with low-level engines such as Ray, so each step processes exactly the samples that need work, and no more.
- dagster-slurm gives us a clean control plane and observability across local, cloud, and Slurm execution. For a research institute, cloud GPU compute is often expensive; luckily, we also have access to fairly affordable EU-sovereign HPC AI systems such as MUSICA.
Introducing Metaxy
Why Metaxy (visual intuition)
In multimodal pipelines, one upstream change should only affect the downstream fields that actually depend on it. Imagine a pipeline that branches after a video-processing step, and consider a simple change: a new crop resolution in one video stage. After the fan-out, multiple downstream branches depend on that object's lineage.
Topology matters for lineage.
Some branches depend on frames, others only on audio-derived artifacts.
Without field-aware versioning, the orchestration layer often marks the whole downstream region as stale. That triggers expensive recomputation for branches that are technically unaffected by the change, including model inference and embedding jobs that do not consume the updated field at all. The challenge is therefore not only graph topology; it is field-level dependency tracking within each sample.
Metaxy addresses this by tracking provenance per sample and per field, then resolving incremental sets (new, stale, orphaned) with field-level lineage instead of coarse asset-level invalidation.
Partial data dependencies
Consider three Metaxy features (data produced at each step): video/full, transcript/whisper, and video/face_crop.
Dependencies are field-level, not only feature-level. Each field version is computed from the upstream fields it depends on, and field versions are then combined into a feature version.
Example: transcript/whisper.text depends on the audio field of video/full. If video/full is resized or recropped (frame-side change), transcript/whisper does not need recomputation.
Metaxy detects these irrelevant updates by storing separate data versions per field, per sample, per feature, and only recomputes downstream fields affected by upstream changes.
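To make this concrete, here is a minimal, library-free sketch (plain Python, not Metaxy's actual API; the hash inputs are hypothetical) of why a frame-side change leaves the transcript's field version, and therefore its recomputation status, unchanged:

```python
import hashlib

def field_version(*parts: str) -> str:
    """Derive a deterministic version hash from the inputs a field depends on."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode())
    return h.hexdigest()[:12]

# Per-sample field versions for video/full (hypothetical values)
video_full = {"frames": field_version("crop=720p"), "audio": field_version("pcm-v1")}

# transcript/whisper.text depends only on video/full.audio
transcript_v1 = field_version(video_full["audio"], "whisper-large-v3")

# A frame-side change (new crop resolution) only touches the frames field
video_full["frames"] = field_version("crop=1080p")
transcript_v2 = field_version(video_full["audio"], "whisper-large-v3")

assert transcript_v1 == transcript_v2  # transcript unaffected: no recompute needed
```

Because the transcript's version is derived only from the audio field version, the frame-side edit never propagates to it.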
Incremental computations
The core operation is resolve_update. For a downstream feature, Metaxy computes the expected provenance and compares it with current state, returning exactly what changed.
```python
with store:  # MetadataStore
    # Metaxy computes provenance_by_field and identifies changes
    increment = store.resolve_update(DownstreamFeature)
    # Process only the changed samples
```
In practice, pipeline code only needs to handle increment.new, increment.stale, and increment.orphaned, instead of reprocessing full datasets.
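Conceptually, resolve_update is a provenance diff. A stand-alone sketch of the three sets in plain Python (Metaxy's actual resolution is per field and engine-backed; the sample IDs and hashes here are made up):

```python
# Expected provenance is derived from upstream state; stored is the current state.
expected = {"s1": "v2", "s2": "v1", "s4": "v1"}  # sample_uid -> provenance hash
stored   = {"s1": "v1", "s2": "v1", "s3": "v1"}

new      = [uid for uid in expected if uid not in stored]
stale    = [uid for uid in expected if uid in stored and stored[uid] != expected[uid]]
orphaned = [uid for uid in stored if uid not in expected]

assert new == ["s4"]       # never processed before
assert stale == ["s1"]     # upstream provenance changed
assert orphaned == ["s3"]  # upstream sample disappeared
```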
Composability
Metaxy is designed to stay composable with existing infrastructure choices. Dagster provides orchestration and observability, while engines such as Ray handle scale-out execution. This separation keeps control-plane logic stable while compute backends remain flexible across local, cloud, and HPC environments. Currently ClickHouse, BigQuery, DuckDB, Lance, DuckLake, and Delta Lake are supported as backends for Metaxy.
Use Case: RAG
Why Docling for RAG preprocessing?
Docling gives us a practical bridge from raw documents (PDF, Word, etc.) to structured markdown/text that is easier to use in LLM pipelines for chunking and embeddings.
With Metaxy on top, document conversion becomes incremental: changed documents are reprocessed, unchanged documents are skipped. More importantly, we can run extraction experiments on a subset of the corpus, materialize the full end-to-end subgraph, and merge the result back into main.
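A minimal sketch of the incremental-conversion idea, with a content hash as the document version (plain Python; the convert function is a hypothetical stand-in for the actual Docling conversion step):

```python
import hashlib
from pathlib import Path

calls = []  # record actual conversions so skipping is observable

def doc_version(path: Path) -> str:
    """Version a source document by its content hash."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def convert(path: Path) -> str:
    # Placeholder for the real Docling conversion in the actual pipeline.
    calls.append(path.name)
    return f"markdown for {path.name}"

cache: dict[str, str] = {}  # content version -> converted output

def convert_incremental(paths: list[Path]) -> dict[str, str]:
    """Convert only documents whose content version has not been seen."""
    out = {}
    for p in paths:
        v = doc_version(p)
        if v not in cache:      # new or changed document -> reprocess
            cache[v] = convert(p)
        out[str(p)] = cache[v]  # unchanged document -> cache hit, skipped
    return out
```

Running this twice over an unchanged corpus performs zero new conversions on the second pass; editing one file triggers exactly one.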
Why Ray?
Ray is useful because it keeps scale-out simple:
- simple parallel map-style processing model
- flexible worker sizing and batching
- practical integration with existing Python data workloads
- good fit for AI workloads with dynamic resource needs
Combined with Dagster-Slurm, this gives a clean control plane while still using HPC-native scheduling as the resource provider.
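As a rough stdlib analogy (not Ray itself), the map-batches pattern is a parallel map over fixed-size batches; Ray adds distributed scheduling, stateful actors for model reuse, and dynamic resource allocation on top. The workload below is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(batch: list[str]) -> list[int]:
    # Stand-in for per-batch model inference (hypothetical workload).
    return [len(doc) for doc in batch]

def map_batches(items: list[str], batch_size: int, workers: int = 4) -> list[int]:
    """Parallel map over fixed-size batches, in the style of Ray Data's map_batches."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # preserves batch order
    return [x for batch in results for x in batch]
```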
Hands-on example - Metaxy + Docling + Dagster-Slurm
Here is a minimal pseudocode sketch of the pipeline. For full details, see the referenced repository. We first register raw documents, then let Metaxy track provenance and resolve exactly what needs processing.
```python
# 1) Register discovered docs with provenance fields
sources_samples = to_dataframe(files).with_columns(
    metaxy_provenance_by_field={
        "source_path": source_path,
        "file_size_bytes": file_size_bytes,
    }
)
with store:
    src_inc = store.resolve_update("docling/source_documents", samples=sources_samples)
    with store.open("w"):
        store.write("docling/source_documents", src_inc.new)
        store.write("docling/source_documents", src_inc.stale)

# 2) Resolve what must be processed downstream
with store:
    inc = store.resolve_update("docling/converted_documents", versioning_engine="polars")
    # Metaxy returns Increment(new, stale, orphaned)
    to_process = pl.concat([inc.new.to_polars(), inc.stale.to_polars()]).filter(current_input_doc_uids)
    if to_process.is_empty():
        return "up_to_date"

# 3) Scale-out execution with Ray (cost-balanced sharding)
shards = build_weighted_shards(to_process, num_workers)
result_ds = ray_dataset(shards).map_batches(DoclingMapper, batch_size=batch_size)

# 4) Persist results through Metaxy datasink
result_ds.write_datasink(MetaxyDatasink(feature="docling/converted_documents", store=store))
```
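The sketch above leaves build_weighted_shards unspecified. One plausible implementation, an assumption rather than the repository's actual code, is greedy longest-processing-time balancing over a per-document cost such as file size:

```python
def build_weighted_shards(docs: list[dict], num_workers: int) -> list[list[dict]]:
    """Greedily assign docs to the currently lightest shard (LPT heuristic)."""
    shards: list[list[dict]] = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    # Place heavy documents first so large costs get balanced early.
    for doc in sorted(docs, key=lambda d: d["file_size_bytes"], reverse=True):
        i = loads.index(min(loads))  # lightest shard so far
        shards[i].append(doc)
        loads[i] += doc["file_size_bytes"]
    return shards
```

Balancing by estimated cost, rather than by document count, keeps Ray workers roughly equally busy even when document sizes are highly skewed.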
For full implementation details, see:
- dagster-slurm repository: https://github.com/ascii-supply-networks/dagster-slurm/
- Asset definition (metaxy_docling_assets.py)
- Processing implementation (process_documents_docling_metaxy.py)
References and related
- Dagster asset versioning and caching
- Unlocking flexible pipelines: customizing asset decorator
- Building a modern machine learning platform with Ray
- Sanas AI + Dagster + Ray project
- YouTube: Sanas AI + Dagster + Ray
- PaaS as implementation detail
- YouTube: PaaS as implementation detail
- Narwhals
- Ibis
- Announcement post by Anam
- Dagster + Metaxy by Dagster Labs
Contributing
We welcome contributions to both projects:
- dagster-slurm: https://github.com/ascii-supply-networks/dagster-slurm
- Metaxy: https://github.com/anam-org/metaxy
Special thanks to Daniel Gafni for the collaboration around Metaxy, and to Hernan for the Docling/document-processing collaboration.



Researcher at the Supply Chain Intelligence Institute Austria (ASCII).
My research interest lies at the intersection of forecasting extreme events and causal analysis in high-frequency time series.