Connecting to AWS S3
Promethium integrates with AWS S3 to enable metadata discovery and distributed querying. Once connected, Promethium crawls AWS S3 prefixes and builds a catalog of files in known formats.
This guide walks you through the steps to securely connect your AWS S3 buckets to Promethium using a least-privilege IAM policy.
Prerequisites
S3 buckets are accessed via IAM policies. Any S3 bucket that Promethium needs to crawl or use for persistence requires the following configuration:
- The ARN of the Trino service account role must be appended to the S3 bucket policy.
- The bucket ARN arn:aws:s3:::<bucket_name> needs to be added to the IAM policy attached to the Trino service account role.
The default Trino service account role follows the naming convention promethium-prod-<company_name>-trino-oidc-role.
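As a sketch, a bucket policy statement granting that role read access might look like the following; the Sid and account ID are placeholders, and your bucket may need additional actions (such as the write actions listed below) depending on how Promethium uses it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPromethiumTrinoRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_id>:role/promethium-prod-<company_name>-trino-oidc-role"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}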
The following are the minimum permissions required for Promethium to crawl and write to an S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListingTheBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<bucket-name>"]
    },
    {
      "Sid": "AllowReadWrite",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectTagging",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:GetObjectVersion",
        "s3:GetObjectVersionTagging",
        "s3:GetObjectAcl",
        "s3:PutObjectAcl"
      ],
      "Resource": ["arn:aws:s3:::<bucket-name>/<path>"]
    }
  ]
}
Supported File Formats
- Delimited files
- Parquet
  - Partitioned Parquet requires Hive partitioning conventions (see the example below)
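Hive partitioning encodes partition keys as key=value folder names inside the table folder. For illustration, a hypothetical table partitioned by year and month might be laid out like this (the bucket, schema, and table names are placeholders):
s3://<bucket-name>/sales/orders/year=2024/month=01/part-00000.parquet
s3://<bucket-name>/sales/orders/year=2024/month=02/part-00000.parquet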
General Usage
The S3 crawler expects files to follow these conventions:
- Each unique schema and table resides in its own folder, using one of the following layouts:
  <container-root>/<schema>/<table>/some_*.<parquet-or-csv>
  <container-root>/<table>/some_*.<parquet-or-csv>
- Files and partition folders (e.g., key=value) at the root will be ignored
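Putting these conventions together, a crawlable bucket might be organized as follows; the schema, table, and file names are purely illustrative:
<container-root>/
  analytics/                      (schema folder)
    customers/                    (table folder)
      customers_part_0.csv
      customers_part_1.csv
    orders/                       (table folder with Hive-style partitions)
      year=2024/month=01/part-00000.parquet
  stray_file.csv                  (ignored: file at the root)
  region=us/                      (ignored: partition folder at the root)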