flytekit: The Python SDK for Production-Grade Data & ML Pipelines
Your Python code, but production-ready. Build robust, scalable, and maintainable data and ML pipelines with ease.
Overview
flytekit is the Python SDK for Flyte, the open-source, container-native orchestration platform designed for large-scale data and machine learning workflows. It lets you turn standard Python functions into versioned, containerized, and strongly typed tasks that can be composed into scalable and reproducible pipelines.
At its core, flytekit brings the best of software engineering practices—like modularity, testability, and CI/CD—to your data and ML workflows without sacrificing the flexibility of Python. By simply adding decorators to your functions, flytekit handles the heavy lifting of dependency management, containerization, and data passing. This allows you to focus on your logic while flytekit ensures your pipelines are robust, observable, and ready for production.
With a rich, extensible plugin ecosystem, flytekit integrates with popular tools like Spark, Dask, Kubernetes, and various ML frameworks. It also offers features like Flyte Decks for creating rich, visual reports from your tasks, which make pipeline runs easier to inspect.
Key Concepts
- Tasks: The fundamental units of execution. A task is a versioned, containerized Python function with a well-defined interface of typed inputs and outputs, making it independently executable and testable.
- Workflows: The orchestration layer. Workflows are Python functions that compose tasks and other workflows into a Directed Acyclic Graph (DAG), defining the flow of data and dependencies between them.
- Launch Plans: The mechanism for executing workflows. A launch plan binds a specific version of a workflow with inputs, schedules, and notifications, separating the workflow's definition from its execution context.
- Type System: flytekit is built on a powerful, extensible type system that ensures data consistency, catches errors early, and enables automatic validation and UI rendering for inputs and outputs.
- Remote Interaction: Beyond defining pipelines, flytekit provides a client (FlyteRemote) and CLI (pyflyte) to programmatically interact with a remote Flyte cluster for registering, launching, and monitoring executions.
- Flyte Decks: A unique observability feature that allows tasks to generate rich HTML reports (including tables, charts, and images) that are displayed directly in the Flyte UI, providing deep insights into your task executions.
Common Use Cases
- Building and scheduling robust, daily ETL and data processing pipelines.
- Orchestrating end-to-end machine learning pipelines: from data preparation and feature engineering to model training, evaluation, and deployment.
- Running large-scale, distributed data processing jobs using integrations like Spark and Dask.
- Creating reproducible research environments and sharing complex computational experiments.
- Automating business processes that involve multiple dependent steps and services.
Getting Started
New to flytekit? Start with the Getting Started guide to install the library and run your first workflow. To build a solid foundation, dive into the Core Concepts & Entities section. Once you're comfortable, explore the vast Plugin Integrations to connect flytekit with your favorite tools.