Data Persistence Configuration

Data Persistence Configuration provides a unified and extensible mechanism for managing settings related to various data storage backends. It centralizes configurations for cloud storage providers like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage, alongside generic persistence options. This system ensures that data persistence plugins can consistently access the necessary parameters for their operations.

Core Configuration Object: DataConfig

The DataConfig class serves as the primary container for all data persistence settings. It aggregates provider-specific configurations, allowing for a single point of access to all relevant parameters.

DataConfig includes instances of the following configuration classes:

  • S3Config: For Amazon S3 specific settings.
  • GCSConfig: For Google Cloud Storage specific settings.
  • AzureBlobStorageConfig: For Azure Blob Storage specific settings.
  • GenericPersistenceConfig: For general persistence settings that apply across all providers.

Automatic Configuration Loading

The DataConfig.auto() class method simplifies loading configurations from a specified ConfigFile. This method intelligently populates the nested configuration objects by reading values from the configuration file, reducing the need for manual instantiation and parameter passing. Each nested configuration class also provides its own auto() method for granular loading.
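
For reference, a configuration file consumed by auto() might be structured as in the following sketch. The section names (aws, gcp, azure, persistence) match those mentioned below, but the exact keys are assumptions; consult the internal configuration classes (AWS, GCP, AZURE, Persistence) for the keys your version expects.

aws:
  endpoint: "http://localhost:9000"
  access_key_id: "minioadmin"
  secret_access_key: "minioadmin"
gcp:
  gsutil_parallelism: true
azure:
  storage_account_name: "myazurestorage"
persistence:
  attach_execution_metadata: true

The Python example below loads such a file with DataConfig.auto(), or alternatively constructs the nested configuration objects manually.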

from your_package_name import DataConfig, S3Config, GCSConfig, AzureBlobStorageConfig, GenericPersistenceConfig
# Assuming ConfigFile and internal config readers are available
# from your_package_name.internal import ConfigFile

# Example: Loading all configurations automatically from a config file
# (e.g., 'config.yaml' containing relevant sections like 'aws', 'gcp', 'azure', 'persistence')
# config_file_path = "path/to/your/config.yaml"
# data_config = DataConfig.auto(config_file_path)

# If you need to create a DataConfig manually, you can pass individual config objects:
s3_config = S3Config(endpoint="http://localhost:9000", access_key_id="minioadmin", secret_access_key="minioadmin")
gcs_config = GCSConfig(gsutil_parallelism=True)
azure_config = AzureBlobStorageConfig(account_name="myazurestorage")
generic_config = GenericPersistenceConfig(attach_execution_metadata=False)

manual_data_config = DataConfig(
    s3=s3_config,
    gcs=gcs_config,
    azure=azure_config,
    generic=generic_config,
)

print(f"S3 Endpoint: {manual_data_config.s3.endpoint}")
print(f"GCS Parallelism: {manual_data_config.gcs.gsutil_parallelism}")
print(f"Azure Account Name: {manual_data_config.azure.account_name}")
print(f"Attach Execution Metadata: {manual_data_config.generic.attach_execution_metadata}")

Provider-Specific Configurations

S3 Configuration (S3Config)

S3Config manages settings specific to Amazon S3 and S3-compatible storage services.

  • enable_debug (bool): Enables debug logging for S3 operations. Defaults to False.
  • endpoint (str, optional): Specifies a custom S3 endpoint. This is useful for S3-compatible storage solutions (e.g., MinIO) or local development.
  • retries (int): Defines the number of retries for failed S3 operations. Defaults to 3.
  • backoff (datetime.timedelta): Sets the time duration to wait between retries. Defaults to 5 seconds.
  • access_key_id (str, optional): The access key ID for S3 authentication.
  • secret_access_key (str, optional): The secret access key for S3 authentication.

Important Note on Secrets: While access_key_id and secret_access_key can be configured directly, it is strongly recommended to use environment variables, IAM roles, or a dedicated secrets management service for production environments to avoid hardcoding sensitive credentials.
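
As a concrete sketch of that recommendation, credentials can be pulled from environment variables at construction time. The variable names below (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) follow the common AWS convention and are assumptions, not requirements of this API:

import os

from your_package_name import S3Config

# Hypothetical: read credentials from the environment instead of hardcoding them.
s3_config = S3Config(
    access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)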

GCS Configuration (GCSConfig)

GCSConfig manages settings specific to Google Cloud Storage.

  • gsutil_parallelism (bool): A flag to enable or disable parallel operations when using gsutil. Defaults to False. Enabling this can improve performance for large transfers.

Azure Blob Storage Configuration (AzureBlobStorageConfig)

AzureBlobStorageConfig manages settings specific to Azure Blob Storage.

  • account_name (str, optional): The name of the Azure storage account.
  • account_key (str, optional): The access key for the Azure storage account.
  • tenant_id (str, optional): The Azure Active Directory (AAD) tenant ID for client-based authentication.
  • client_id (str, optional): The client ID for AAD-based authentication.
  • client_secret (str, optional): The client secret for AAD-based authentication.

Important Note on Secrets: Similar to S3, for production deployments, consider using Azure Managed Identities or environment variables for account_key and client_secret instead of direct configuration.

Generic Persistence Configuration (GenericPersistenceConfig)

GenericPersistenceConfig handles settings that apply across any data persistence provider.

  • attach_execution_metadata (bool): A flag that determines whether execution-related metadata should be attached to persisted data. Defaults to True. Disabling this can reduce storage overhead if metadata is not required.

Common Use Cases

Configuring S3 Access for a Local MinIO Instance

To use a local MinIO server for development or testing, configure the endpoint, access_key_id, and secret_access_key within S3Config.

import datetime
from your_package_name import DataConfig, S3Config

s3_local_config = S3Config(
    endpoint="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    enable_debug=True,
    retries=1,
    backoff=datetime.timedelta(seconds=1),
)
data_config = DataConfig(s3=s3_local_config)

print(f"Configured S3 endpoint: {data_config.s3.endpoint}")
# A persistence plugin would now use these settings for S3 operations.

Enabling GCS Parallelism for Faster Transfers

For applications dealing with large volumes of data on Google Cloud Storage, enabling gsutil_parallelism can significantly improve transfer speeds.

from your_package_name import DataConfig, GCSConfig

gcs_performance_config = GCSConfig(gsutil_parallelism=True)
data_config = DataConfig(gcs=gcs_performance_config)

if data_config.gcs.gsutil_parallelism:
    print("GCS operations will attempt to use gsutil parallelism.")

Setting Azure Blob Storage Credentials via Environment Variables

While the auto() methods can load settings from config files, production deployments commonly source sensitive credentials from environment variables, and the underlying configuration system supports this.

Assuming your config.yaml or environment variables are set up to provide these:

azure:
  storage_account_name: "myprodstorageaccount"
  tenant_id: "your-tenant-id"
  client_id: "your-client-id"
  client_secret: "env:AZURE_CLIENT_SECRET"  # Reads from the AZURE_CLIENT_SECRET environment variable

from your_package_name import AzureBlobStorageConfig, DataConfig
# from your_package_name.internal import ConfigFile # Assuming ConfigFile is available

# Example: Load from a config file that references environment variables
# data_config = DataConfig.auto("path/to/config.yaml")
# print(f"Azure Account Name: {data_config.azure.account_name}")
# print(f"Azure Client ID: {data_config.azure.client_id}")

# Manually setting for demonstration
azure_config = AzureBlobStorageConfig(
    account_name="myprodstorageaccount",
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret-from-env",  # In a real scenario, this would be read from the environment
)
data_config = DataConfig(azure=azure_config)
print(f"Azure Account Name: {data_config.azure.account_name}")

Disabling Execution Metadata Attachment

If your application does not require additional metadata to be stored alongside persisted data, you can disable attach_execution_metadata to potentially reduce storage costs and simplify data structures.

from your_package_name import DataConfig, GenericPersistenceConfig

no_metadata_config = GenericPersistenceConfig(attach_execution_metadata=False)
data_config = DataConfig(generic=no_metadata_config)

if not data_config.generic.attach_execution_metadata:
    print("Execution metadata will not be attached to persisted data.")

Important Considerations

  • Secrets Management: Never hardcode sensitive credentials (e.g., access_key_id, secret_access_key, account_key, client_secret) directly into source code or version-controlled configuration files for production environments. Leverage environment variables, cloud provider IAM roles, or dedicated secret management services. The configuration system is designed to support reading values from environment variables.
  • Sandbox Environments: The system allows for direct storage of access keys and secrets in sandbox environments for ease of local development and testing. This practice is strictly for non-production use.
  • Extensibility: The modular design, with separate configuration classes for each cloud provider, facilitates easy extension to support new storage backends or custom persistence solutions without requiring changes to the core DataConfig structure.
  • Configuration Source: The auto() methods abstract away the details of reading from various configuration sources (e.g., legacy formats, YAML files). Developers should ensure their ConfigFile is correctly structured according to the expected keys defined in the internal configuration classes (AWS, GCP, AZURE, Persistence).