Connecting to AWS S3

Promethium integrates with AWS S3 to enable metadata discovery and distributed querying. Once connected, Promethium crawls S3 prefixes and builds a catalog of files in supported formats.

This guide walks you through the steps to securely connect your AWS S3 storage account to Promethium using read-only access.

Prerequisites

S3 buckets are accessed via IAM policies. Any S3 bucket that Promethium needs to crawl, or use for persistence, requires the following configuration:

  1. The S3 bucket policy must include the trino service_account role (ARN) as an allowed principal.
  2. The bucket ARN arn:aws:s3:::<bucket_name> must be added to the IAM policy attached to the trino service_account role.
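As an illustration of step 1, a bucket policy statement granting the trino role access might look like the following. The Sid, account ID, and the exact action list are placeholders; adapt them to your bucket's existing policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPromethiumTrinoAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:role/promethium-prod-<company_name>-trino-oidc-role"
      },
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}
```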

The default trino service account role follows the naming convention promethium-prod-<company_name>-trino-oidc-role. The following are the minimum permissions required for Promethium to crawl and write to an S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListingTheBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<bucket-name>"]
    },
    {
      "Sid": "AllowReadWrite",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectTagging",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:GetObjectVersion",
        "s3:GetObjectVersionTagging",
        "s3:GetObjectACL",
        "s3:PutObjectACL"
      ],
      "Resource": ["arn:aws:s3:::<bucket-name>/<path>"]
    }
  ]
}
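Before attaching a policy document, it can be useful to sanity-check that it grants all of the minimum actions listed above. The helper below is an illustrative sketch (not part of Promethium) that parses a policy JSON and reports any missing actions:

```python
import json

# Minimum actions Promethium requires, per the policy document above.
REQUIRED_ACTIONS = {
    "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:GetObjectTagging",
    "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:GetObjectVersion",
    "s3:GetObjectVersionTagging", "s3:GetObjectACL", "s3:PutObjectACL",
}

def missing_actions(policy_json: str) -> set:
    """Return the required actions not granted by any Allow statement."""
    policy = json.loads(policy_json)
    granted = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") == "Allow":
            actions = stmt.get("Action", [])
            # "Action" may be a single string or a list of strings.
            granted.update([actions] if isinstance(actions, str) else actions)
    return REQUIRED_ACTIONS - granted
```

Running `missing_actions` on your policy file should return an empty set; any entries it returns are permissions still to be added.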

Supported File Formats

  • Delimited files
  • Parquet
    • Partitioned Parquet requires Hive partitioning conventions

General Usage

The S3 crawler expects files to follow these conventions:

  • Each unique schema and table resides in a folder
    • <container-root>/<schema>/<table>/some_*.<parquet-or-csv>
    • <container-root>/<table>/some_*.<parquet-or-csv>
  • Files and partitions (e.g. folders like key=value) at the root will be ignored
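The layout conventions above can be sketched as a path classifier. This is a hypothetical helper for illustration only, assuming `<container-root>` has already been stripped from the path:

```python
def classify(relative_path: str):
    """Map a container-relative file path to (schema, table), or None if ignored.

    Layout conventions:
      <schema>/<table>/file        -> (schema, table)
      <table>/file                 -> (None, table)
      file or partition at root    -> ignored (None)
    """
    folders = relative_path.split("/")[:-1]  # drop the file name
    # Drop trailing key=value partition folders (Hive conventions).
    while folders and "=" in folders[-1]:
        folders.pop()
    if not folders:
        return None  # files and partitions at the root are ignored
    if len(folders) == 1:
        return (None, folders[0])
    return (folders[0], folders[1])
```

For example, `sales/orders/year=2024/part-0.parquet` maps to the table `orders` in schema `sales`, while `part-0.parquet` at the root is ignored.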