Image and Serialization Settings
These settings define the execution environment and metadata for tasks and workflows during the serialization process, which precedes registration. They ensure that tasks and workflows execute consistently and are correctly identified within the platform.
Image Configuration
Image configuration, managed by ImageConfig, specifies the container images used for task execution. It allows defining a default image for general tasks and additional named images for specialized requirements.
Core Features:
- Default Image: A primary container image used when no specific image is designated for a task.
- Named Images: Additional images identified by unique names, enabling different tasks within a workflow to run in distinct environments.
- Automatic Resolution: ImageConfig can automatically determine a default image based on the current Python version and the installed Flytekit version.
- Configuration Loading: Images can be specified in a configuration file, allowing for centralized management.
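The idea behind automatic resolution can be illustrated with a small standalone sketch. It does not call Flytekit; the function name `sketch_default_image` and the `IMAGE_REPO` constant are hypothetical, and the tag pattern is only assumed from the example image name used in this document.

```python
import sys
from typing import Optional, Tuple

# Assumed repository prefix, copied from the example tag in this document.
IMAGE_REPO = "cr.flyte.org/flyteorg/flytekit"


def sketch_default_image(
    flytekit_version: str,
    python_version: Optional[Tuple[int, int]] = None,
) -> str:
    """Combine a Python version and a Flytekit version string into an image reference."""
    if python_version is None:
        # Fall back to the interpreter that is running this code.
        python_version = (sys.version_info[0], sys.version_info[1])
    major, minor = python_version
    return f"{IMAGE_REPO}:py{major}.{minor}-{flytekit_version}"


print(sketch_default_image("v1.2.3", python_version=(3, 10)))
```

The version string is used verbatim here; the real resolution logic may normalize it differently.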
Creating Image Configurations:
- Automatic Detection: For most use cases, ImageConfig.auto_default_image() provides a convenient way to create an ImageConfig that includes a default image derived from the current Python environment and Flytekit version.
    from flytekit.configuration.serialization import ImageConfig
    # Automatically determines the default image (e.g., cr.flyte.org/flyteorg/flytekit:py3.X-vY.Z.A)
    image_config = ImageConfig.auto_default_image()
- From a Specific Image Name: To use a specific image as the default, ImageConfig.auto() can be used.
    from flytekit.configuration.serialization import ImageConfig
    # Uses a specified image as the default
    image_config = ImageConfig.auto(img_name="ghcr.io/my-org/my-custom-image:v1.0.0")
- Programmatic Definition: For more control, ImageConfig.from_images() allows defining a default image and multiple named images directly in code.
    from flytekit.configuration.serialization import ImageConfig
    # One default image plus two named images, keyed by the name tasks refer to
    image_config = ImageConfig.from_images(
        default_image="ghcr.io/flyteorg/flytecookbook:v1.0.0",
        m={
            "spark": "ghcr.io/flyteorg/myspark:3.2.0",
            "gpu_task": "ghcr.io/flyteorg/my-gpu-image:cuda11.3",
        },
    )
- CLI Validation: The ImageConfig.validate_image() method is used internally by the command-line interface (CLI) to parse and validate user-supplied image arguments. It checks that each argument is correctly formatted and rejects input that specifies more than one default image.
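The kind of checking described for CLI image arguments might look roughly like the following standalone sketch. It is not the Flytekit implementation: the function `parse_image_args`, the `name=image` argument shape, and the use of `"default"` as the reserved name are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

DEFAULT_NAME = "default"  # assumed reserved name for the default image


def parse_image_args(values: Iterable[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Split CLI-style image arguments into one default image and named images."""
    default_image: Optional[str] = None
    named: Dict[str, str] = {}
    for value in values:
        if "=" in value:
            # "name=image" assigns the image to an explicit name.
            name, _, ref = value.partition("=")
        else:
            # A bare image reference is treated as the default image.
            name, ref = DEFAULT_NAME, value
        name, ref = name.strip(), ref.strip()
        if not name or not ref:
            raise ValueError(f"malformed image argument: {value!r}")
        if name == DEFAULT_NAME:
            if default_image is not None:
                raise ValueError("only one default image may be specified")
            default_image = ref
        elif name in named:
            raise ValueError(f"image name {name!r} given more than once")
        else:
            named[name] = ref
    return default_image, named


default, named = parse_image_args(
    ["ghcr.io/my-org/base:v1", "spark=ghcr.io/my-org/spark:3.2.0"]
)
print(default, named)
```

Rejecting a second default early gives the user a clear error at the command line instead of a silently overridden image.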
Retrieving Images:
The find_image(name) method on an ImageConfig instance retrieves an Image object by its name. If the name matches the default image, it returns the default.
    # Assuming image_config was created with named images "spark" and "gpu_task"
    spark_image = image_config.find_image("spark")
    default_image = image_config.find_image("default")  # Or the name used for the default image
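A name-based lookup of this kind can be modeled with plain dataclasses. The classes `SketchImage` and `SketchImageConfig` below are hypothetical stand-ins, not Flytekit types, and returning `None` for an unknown name is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SketchImage:
    name: str
    reference: str


@dataclass(frozen=True)
class SketchImageConfig:
    default_image: Optional[SketchImage] = None
    images: List[SketchImage] = field(default_factory=list)

    def find_image(self, name: str) -> Optional[SketchImage]:
        """Search the default image first, then the named images."""
        candidates = ([self.default_image] if self.default_image else []) + list(self.images)
        for image in candidates:
            if image.name == name:
                return image
        return None  # assumed behavior for an unknown name


config = SketchImageConfig(
    default_image=SketchImage("default", "ghcr.io/flyteorg/flytecookbook:v1.0.0"),
    images=[SketchImage("spark", "ghcr.io/flyteorg/myspark:3.2.0")],
)
print(config.find_image("spark"))
print(config.find_image("missing"))
```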
Serialization Settings
SerializationSettings encapsulates all parameters necessary for serializing tasks and workflows before registration. This includes image configuration, project and domain metadata, environment variables, and options for fast serialization.
Key Attributes:
- image_config: An ImageConfig instance defining the container images.
- project, domain, version: Metadata used for registering entities.
- env: A dictionary of environment variables to inject into task containers.
- python_interpreter: The path to the Python executable used for task execution.
- flytekit_virtualenv_root: The root directory of the Python virtual environment.
- fast_serialization_settings: Optional settings for enabling fast registration.
- source_root: The root directory of the source code, used for packaging.
Creating Serialization Settings:
- Direct Instantiation: SerializationSettings can be instantiated directly, providing all necessary parameters.
    from flytekit.configuration.serialization import ImageConfig, SerializationSettings
    img_config = ImageConfig.auto_default_image()
    settings = SerializationSettings(
        project="my-project",
        domain="development",
        version="v123",
        image_config=img_config,
        env={"MY_VAR": "value"},
    )
- Convenience Method: SerializationSettings.for_image() simplifies creating settings for a single default image.
    from flytekit.configuration.serialization import SerializationSettings
    settings = SerializationSettings.for_image(
        image="ghcr.io/flyteorg/flytecookbook:v1.0.0",
        version="v123",
        project="my-project",
        domain="development",
    )
- Builder Pattern: The new_builder() method creates a SerializationSettings.Builder instance pre-populated from existing settings, so a modified copy can be built without mutating the original object.
    from flytekit.configuration.serialization import FastSerializationSettings
    # Assuming 'settings' is an existing SerializationSettings object
    updated_settings = settings.new_builder().with_fast_serialization_settings(
        FastSerializationSettings(enabled=True, destination_dir="/tmp/code")
    ).build()
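The builder approach can be illustrated in isolation with frozen dataclasses. Everything in this sketch (`SketchSettings`, `SketchFastSettings`, `SketchBuilder`, `with_fast`) is a hypothetical stand-in meant to show the copy-then-rebuild idea, not the Flytekit classes themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchFastSettings:
    enabled: bool = False
    destination_dir: Optional[str] = None


@dataclass(frozen=True)
class SketchSettings:
    project: str
    domain: str
    version: str
    fast: Optional[SketchFastSettings] = None

    def new_builder(self) -> "SketchBuilder":
        return SketchBuilder(self)


class SketchBuilder:
    """Collects field values copied from existing settings, then builds a new object."""

    def __init__(self, settings: SketchSettings) -> None:
        # Copy the field values so the original settings are never touched.
        self._fields = dict(vars(settings))

    def with_fast(self, fast: SketchFastSettings) -> "SketchBuilder":
        self._fields["fast"] = fast
        return self

    def build(self) -> SketchSettings:
        return SketchSettings(**self._fields)


original = SketchSettings(project="my-project", domain="development", version="v123")
updated = original.new_builder().with_fast(
    SketchFastSettings(enabled=True, destination_dir="/tmp/code")
).build()
print(original.fast)
print(updated.fast)
```

Because the settings objects are frozen, every change has to go through a builder, which keeps previously created settings safe to share.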
Fast Serialization (FastSerializationSettings):
Fast serialization allows registering tasks and workflows without rebuilding and pushing a new Docker image for every code change. Instead, the code is packaged and uploaded separately.
- enabled: A boolean indicating whether fast serialization is active.
- destination_dir: The target directory within the container where the packaged code will be placed.
- distribution_location: The location (e.g., an S3 URI) where the packaged code (e.g., a compressed archive) is uploaded.
When fast_serialization_settings are enabled, the platform uses the existing base image and injects the updated code package at runtime, significantly accelerating the development iteration cycle.
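The package-then-inject flow can be sketched with the standard library alone. The helpers `package_source` and `unpack_source` are hypothetical, the gzipped tarball format and digest-based file name are assumptions, and the upload to a remote location is left out entirely.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def package_source(source_root: Path, output_dir: Path) -> Path:
    """Archive source_root into a gzipped tarball named after a digest of its contents."""
    files = sorted(p for p in source_root.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(source_root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    archive = output_dir / f"code-{digest.hexdigest()[:16]}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=path.relative_to(source_root).as_posix())
    return archive


def unpack_source(archive: Path, destination_dir: Path) -> None:
    """Extract the archive into destination_dir, as a container would at startup."""
    destination_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination_dir)


with tempfile.TemporaryDirectory() as tmp:
    src, out, dest = Path(tmp) / "src", Path(tmp) / "out", Path(tmp) / "dest"
    src.mkdir()
    out.mkdir()
    (src / "workflow.py").write_text("print('hello')\n")
    archive = package_source(src, out)
    unpack_source(archive, dest)
    print(archive.name, sorted(p.name for p in dest.iterdir()))
```

Naming the archive after a content digest means an unchanged source tree maps to the same file name, so a repeated upload can be skipped.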
Entrypoint Settings (EntrypointSettings):
EntrypointSettings specifies the path to the pyflyte-execute script, which is the entrypoint for Flyte tasks within the container. This is particularly relevant for environments like PySpark, where the Python interpreter and virtual environment paths might need explicit configuration. The venv_root_from_interpreter() static method helps derive the virtual environment root from a given Python interpreter path.
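Deriving a virtual environment root from an interpreter path can be as simple as walking up two directory levels. The sketch below assumes the conventional POSIX layout `<venv>/bin/python`; the function names `sketch_venv_root` and `sketch_entrypoint_path` are hypothetical and not part of Flytekit.

```python
import posixpath


def sketch_venv_root(interpreter_path: str) -> str:
    """Return the directory two levels above the interpreter, e.g. <venv>/bin/python3 -> <venv>."""
    return posixpath.dirname(posixpath.dirname(interpreter_path))


def sketch_entrypoint_path(interpreter_path: str, script: str = "pyflyte-execute") -> str:
    """Locate an entrypoint script in the same bin directory as the interpreter (assumed layout)."""
    return posixpath.join(sketch_venv_root(interpreter_path), "bin", script)


print(sketch_venv_root("/opt/venv/bin/python3"))        # /opt/venv
print(sketch_entrypoint_path("/opt/venv/bin/python3"))  # /opt/venv/bin/pyflyte-execute
```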
Serialized Context:
SerializationSettings can serialize itself into a base64-encoded, gzipped JSON string and inject it as an environment variable (SERIALIZED_CONTEXT_ENV_VAR) into the task container. The with_serialized_context() method facilitates this, creating a new SerializationSettings object that includes this environment variable. This allows tasks to access their own serialization context at runtime, which can be useful for dynamic behavior or debugging.
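The encoding described above (JSON, then gzip, then base64) can be reproduced with the standard library. In this sketch the settings are a plain dictionary, and the helper names and the environment variable name `EXAMPLE_SERIALIZED_CONTEXT` are hypothetical placeholders rather than Flytekit's actual values.

```python
import base64
import gzip
import json
from typing import Any, Dict

EXAMPLE_ENV_VAR = "EXAMPLE_SERIALIZED_CONTEXT"  # hypothetical variable name


def encode_context(settings: Dict[str, Any]) -> str:
    """Serialize settings to JSON, gzip the bytes, and base64-encode the result."""
    raw = json.dumps(settings, sort_keys=True).encode("utf-8")
    # mtime=0 keeps the gzip header stable so equal input gives equal output.
    return base64.b64encode(gzip.compress(raw, mtime=0)).decode("ascii")


def decode_context(encoded: str) -> Dict[str, Any]:
    """Reverse encode_context, as a task could do at runtime."""
    return json.loads(gzip.decompress(base64.b64decode(encoded)).decode("utf-8"))


def with_context(env: Dict[str, str], settings: Dict[str, Any]) -> Dict[str, str]:
    """Return a new environment mapping that also carries the encoded settings."""
    return {**env, EXAMPLE_ENV_VAR: encode_context(settings)}


settings = {"project": "my-project", "domain": "development", "version": "v123"}
env = with_context({"MY_VAR": "value"}, settings)
print(decode_context(env[EXAMPLE_ENV_VAR]) == settings)  # True
```

Compressing before encoding keeps the environment variable small, which matters because environment size limits apply to the whole container specification.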
Common Use Cases
- Standardizing Execution Environments: Define a default ImageConfig to ensure all tasks within a project or domain use a consistent base image, promoting reproducibility and reducing configuration overhead.
- Heterogeneous Task Execution: Use named images within ImageConfig to support workflows that combine tasks requiring different runtime environments, such as Python tasks, Spark tasks, or GPU-accelerated tasks.
- Streamlining Local Development: Enable FastSerializationSettings to iterate on code changes without waiting for Docker image builds and pushes.
- Consistent Entity Registration: Ensure all tasks and workflows are registered with the correct project, domain, and version by configuring SerializationSettings appropriately.
- Injecting Runtime Configuration: Pass environment variables through SerializationSettings.env to provide tasks with configuration values or feature flags at execution time.
- Custom Python Environments: Specify python_interpreter and flytekit_virtualenv_root when tasks require a specific Python version or a custom virtual environment setup.
Best Practices and Considerations
- For most standard Python tasks, leverage ImageConfig.auto_default_image() to automatically pick up the correct Flytekit base image for your Python version.
- Use named images judiciously for tasks with unique dependencies or runtime requirements. Avoid creating too many distinct images if a single, well-provisioned default image suffices.
- Understand that fast serialization is primarily for development and rapid iteration. For production deployments, it is often recommended to build and push dedicated Docker images to ensure a fully self-contained and immutable execution environment.
- The SerializationSettings object is crucial for the pyflyte CLI and SDK when packaging and registering entities. Ensure your build and deployment processes correctly configure these settings.
- When passing sensitive information, prefer platform-native secret management over injecting secrets directly via env in SerializationSettings.