flytekit: The Flyte SDK for Python

flytekit is a Python library for creating reliable and reproducible data and machine learning workflows. Define your logic as simple Python functions, and flytekit helps you compose, containerize, and execute them at scale on the Flyte orchestration platform.

Why does this exist?

Modern data and ML pipelines are often complex, involving multiple steps, diverse tools (like Spark, PyTorch, or dbt), and strict requirements for reproducibility. Managing these pipelines with simple scripts can lead to several pain points:

  • Brittle Connections: Chaining scripts together manually is error-prone. A failure in one step can leave intermediate data in an inconsistent state.
  • "Works on my machine" Syndrome: A script that runs locally might fail in production due to differences in dependencies, environment variables, or data access.
  • Poor Scalability: A single script can't easily orchestrate distributed jobs or manage parallel execution without significant boilerplate code.
  • Lack of Versioning: It's hard to track which version of the code produced a specific output, making it difficult to reproduce results or debug issues.

flytekit solves these problems by providing a framework to define your pipeline as a structured, typed, and versioned workflow, right from your Python code.

A Simple Mental Model

Think of building a workflow with flytekit like writing a recipe. A recipe consists of individual steps and the instructions to combine them.

  • A @task is an ingredient or a single cooking step. It's a regular Python function that does one thing well, like fetching data, training a model, or running a query. Tasks are the fundamental building blocks of any workflow.

  • A @workflow is the full recipe. It defines the order of operations by calling tasks and wiring their inputs and outputs together. The workflow itself doesn't contain business logic; it only orchestrates the flow of data between tasks.

  • Type hints are your measurements. By using Python's type hints (e.g., def my_task(a: int) -> str:), you tell Flyte what kind of data each task expects and returns. Flyte uses this information to ensure data flows correctly and to provide a better user experience.

  • Plugins are your specialized kitchen appliances. Need to run a Spark job or a dbt model? flytekit's plugin system allows you to seamlessly integrate with other tools as if they were native Python functions.
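
Putting those pieces together, a recipe might look like the sketch below. The function names and values are purely illustrative:

from typing import List
from flytekit import task, workflow

# Ingredients / cooking steps: each task does one typed thing
@task
def fetch_numbers(n: int) -> List[int]:
    return list(range(n))

@task
def add_them_up(values: List[int]) -> int:
    return sum(values)

# The recipe: the workflow only wires tasks together; no business logic here
@workflow
def cook(n: int = 5) -> int:
    values = fetch_numbers(n=n)
    return add_them_up(values=values)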

How It Works

flytekit helps you translate your Python code into a format that the Flyte orchestration platform can understand and execute reliably.

  1. Define: You write your logic in Python files, using the @task and @workflow decorators.
  2. Package: You use the pyflyte command-line tool to package your code: your workflow definitions are serialized for the backend, and your code and its dependencies are bundled into a container image.
  3. Register: The packaged workflow is registered with a Flyte backend, making it available to be launched.
  4. Execute: You can trigger an execution via the UI, API, or CLI. Flyte's backend then orchestrates the execution, running each task in its own container, managing data passing, and handling retries or caching automatically.
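
As a rough illustration, this lifecycle maps to commands like the ones below. The project, domain, and image names are placeholders, and exact flags can differ between flytekit and flytectl versions:

# Package: serialize the workflows in the 'workflows' package into an archive
pyflyte --pkgs workflows package --image my-registry/my-image:latest

# Register: upload the packaged workflows (flyte-package.tgz) to a Flyte backend
flytectl register files --project my-project --domain development --archive flyte-package.tgz --version v1

# Execute: or package, register, and launch in one step against the cluster
pyflyte run --remote workflows/hello.py hello_workflow --name "flyte"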

Common Use Cases

Here are a few minimal examples to give you a feel for flytekit.

A Simple "Hello, World" Workflow

This example shows a basic workflow with a single task that takes an input and produces an output.

# workflows/hello.py
from flytekit import task, workflow

@task
def say_hello(name: str) -> str:
    return f"Hello, {name}!"

@workflow
def hello_workflow(name: str = "world") -> str:
    return say_hello(name=name)

Run it locally:

pyflyte run workflows/hello.py hello_workflow --name "flyte"

Data Processing with Pandas

Tasks can work with complex data types like Pandas DataFrames. Flyte automatically handles serializing and passing the data between tasks.

# workflows/data_processing.py
import pandas as pd
from flytekit import task, workflow

@task
def create_dataframe() -> pd.DataFrame:
    return pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

@task
def get_column_sum(df: pd.DataFrame) -> int:
    return int(df["col1"].sum())

@workflow
def data_workflow() -> int:
    df = create_dataframe()
    return get_column_sum(df=df)

Run it locally:

pyflyte run workflows/data_processing.py data_workflow
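
Because Flyte stores task outputs, expensive steps can also be cached and retried automatically. The sketch below revisits create_dataframe with caching and retries turned on; cache, cache_version, and retries are standard @task arguments, and the version string is arbitrary (bumping it invalidates previous cache entries):

import pandas as pd
from flytekit import task

@task(cache=True, cache_version="1.0", retries=3)
def create_dataframe() -> pd.DataFrame:
    # Re-runs with the same inputs and cache_version reuse the stored output
    return pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})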

Using a Plugin for dbt

flytekit integrates with other tools through plugins. Here's how you might run a dbt model as a task.

# workflows/dbt_example.py
# Assumes you have a dbt project in a directory named 'my_dbt_project'
# and you've installed the dbt plugin with: pip install flytekitplugins-dbt

from flytekit import workflow
from flytekitplugins.dbt.schema import DBTRunInput, DBTRunOutput
from flytekitplugins.dbt.task import DBTRun

# Create a dbt "run" task; the name identifies it in the Flyte UI
dbt_run_task = DBTRun(name="my-dbt-run")

@workflow
def dbt_workflow() -> DBTRunOutput:
    # Run all models in the dbt project (the directory, profiles path, and
    # profile name are placeholders; adjust them to your own dbt setup)
    return dbt_run_task(
        input=DBTRunInput(
            project_dir="my_dbt_project",
            profiles_dir="my_dbt_project",
            profile="my_profile",
        )
    )

When should I use flytekit?

flytekit is a powerful tool, but it's not always the right one. Here’s a guide to help you decide.

Use flytekit when...

  • You have a multi-step data or ML pipeline.
  • You need to orchestrate different tools (e.g., Python, Spark, SQL).
  • Reproducibility, versioning, and caching are critical for your work.
  • You need to scale your local code to run in a production environment with retries, error handling, and monitoring.

Consider other tools when...

  • You have a single, simple script that runs in one go.
  • Your entire process runs within a single environment (e.g., a single Jupyter notebook or a monolithic application).
  • You are doing exploratory analysis where reproducibility is not a primary concern.
  • You are building a real-time, low-latency service like a web API; Flyte is designed for batch and asynchronous processing.

Integrations

flytekit is a Python library. Because workflows execute in containers, they run on Linux-based systems when deployed to a cluster.

  • Language: Python (>=3.9, <3.13)
  • Installation: uv pip install flytekit or pip install flytekit
  • Ecosystem: flytekit integrates with dozens of tools through its plugin system. A few examples include:
    • Distributed Computing: Spark, Dask, Ray
    • Data Warehouses: Snowflake, BigQuery, Hive
    • Data Quality: Great Expectations, Pandera
    • ML Frameworks: PyTorch, TensorFlow, Scikit-learn
    • Transformation: dbt

Getting Started

  1. Install flytekit:

    pip install flytekit
  2. Write a workflow in a Python file (e.g., my_app.py):

    from flytekit import task, workflow

    @task
    def greet(name: str) -> str:
        return f"Welcome, {name}!"

    @workflow
    def main_workflow(name: str = "to flytekit") -> str:
        return greet(name=name)
  3. Run it locally:

    pyflyte run my_app.py main_workflow --name "developer"
  4. Run on a Flyte cluster: To run in a production environment, you build a Docker image containing your code, push it to a registry your cluster can pull from, and tell pyflyte to use it.

    # 1. Build and push an image containing your code and dependencies
    docker build . -t my-registry/my-flyte-image:latest
    docker push my-registry/my-flyte-image:latest

    # 2. Run the workflow on the cluster using that image
    pyflyte run --remote --image my-registry/my-flyte-image:latest my_app.py main_workflow --name "developer"

Limitations & Assumptions

  • Flyte Backend: flytekit is the SDK for the Flyte orchestration platform. For production use, you need a running Flyte backend. Local execution with pyflyte run is primarily for development and testing.
  • Containerization: Workflows are executed in containers. You'll need Docker installed for local development and a container registry for production.
  • Python-Centric: While tasks can run code in any language (e.g., via shell scripts), the orchestration logic itself is defined in Python.
  • Plugin Architecture: To integrate with external tools, you'll often need to install corresponding plugin packages (e.g., flytekitplugins-spark).

Frequently Asked Questions

1. What's the difference between a @task and a @workflow? A @task is a single, executable piece of code, like a Python function. A @workflow is a pipeline that connects multiple tasks together by defining the flow of data between them. Workflows orchestrate; tasks execute.

2. How do I pass data between tasks? You don't have to manage it manually. Simply return a value from an upstream task and pass it as an argument to a downstream task. flytekit and the Flyte backend handle the data passing, serialization, and storage for you.

3. Do I need to know Docker? For local development, you don't need to write Dockerfiles. For running on a Flyte cluster, your code is packaged into a Docker image. flytekit provides tools to help build this image, but a basic understanding of Docker is beneficial.
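
For example, recent flytekit releases include ImageSpec, which lets you describe a task's image in Python instead of writing a Dockerfile. This is a minimal sketch; the registry and package list are placeholders, and how the image gets built depends on your local setup:

import pandas as pd
from flytekit import task, ImageSpec

# Describe the container image for this task in code
pandas_image = ImageSpec(
    registry="my-registry",   # placeholder: a registry your cluster can pull from
    packages=["pandas"],      # extra Python dependencies baked into the image
)

@task(container_image=pandas_image)
def head_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    return df.head(n)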

4. How does flytekit relate to Flyte? Flyte is the orchestration platform that runs and manages your workflows. flytekit is the Python SDK you use to write those workflows.

5. How do I use libraries like Spark or PyTorch? flytekit has a rich plugin system. You can install a plugin (e.g., pip install flytekitplugins-spark) and use its specialized tasks to run Spark jobs, PyTorch training, and more, as if they were native Python functions.
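
As one example, the Spark plugin (flytekitplugins-spark) lets a task request its own ephemeral Spark cluster through task_config. This is a sketch, and the Spark settings and values are illustrative:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        # Spark settings for the cluster spun up just for this task
        spark_conf={"spark.executor.instances": "2", "spark.executor.memory": "1g"},
    )
)
def count_partitions(n: int) -> int:
    # The Spark plugin exposes a session on the task's execution context
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(n)).getNumPartitions()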

6. Can I run my workflows locally without connecting to a cluster? Yes! The pyflyte run command executes your entire workflow on your local machine. This is a great way to quickly test and debug your code before deploying it.