Exploring Cross-Account Data Access Control: Insights from AWS Lake Formation Deployment
Introduction
In today's data-driven landscape, organizations that want to get the most from their data need efficient data sharing and access control. AWS Lake Formation addresses this need by centralizing the management of data access permissions within the AWS ecosystem.
In this blog post, I'll take you through my journey of tackling a specific cross-account data access challenge using Delta Lake format Glue tables and AWS Lake Formation. I'll discuss the steps I followed, the obstacles I encountered, and the solutions I devised along the way. By sharing my experiences, I hope to provide valuable guidance to those navigating similar scenarios in the realm of data management and access control within the AWS ecosystem.
Use-case overview
This use-case focuses on implementing cross-account data access control. In this project, the goal is for a data producer AWS account to share data with multiple consumer AWS accounts while maintaining strong access control measures.
In the current setup, the data is stored in Amazon S3 buckets in Delta Lake table format, with metadata managed by the AWS Glue Catalog. As illustrated in the architecture diagram, the data consumer services reside in AWS accounts separate from the account that holds the data. The access control setup is very basic and is governed mainly by IAM policies associated with the data resources and consumer accounts.
While AWS IAM policies offer a degree of authorization control, they remain relatively high-level and may not offer the necessary granularity for all use-cases. Additionally, in this setup, both the data lake admin and consumers consistently find themselves needing to update their IAM policies.
That being said, the goal is to come up with a solution that works alongside the existing IAM policy access control method. This ensures that a fallback option remains available and guarantees compatibility with clients that do not yet support the new access control method.
AWS Lake Formation
Although numerous data governance solutions exist in the market, I opted for one readily available within the AWS cloud infrastructure. As implied by the title, this post focuses solely on implementing a data governance solution using AWS Lake Formation.
Essentially, AWS Lake Formation operates as a credential vending service governed by RDBMS-style authorization policies (grants on databases and tables). Note that this blog post assumes readers are familiar with the workings of AWS Lake Formation.
Certain access pattern and strategy choices warrant consideration in the context of our case study.
Hybrid access mode
In Hybrid access mode, AWS Lake Formation permissions work alongside IAM policy-based access control. By selecting this mode, we effectively tell AWS Lake Formation to fall back to AWS IAM policies for resources not governed by AWS Lake Formation access control policies.
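To make the fallback behavior concrete, here is a minimal sketch of the decision as I understand it. It is a toy model, not how Lake Formation is implemented: the `HybridResource` class, the `authorization_path` function, and the example ARNs are all hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HybridResource:
    """Illustrative model of a Data Catalog resource registered in hybrid access mode."""
    name: str
    # Principals that have been opted in to Lake Formation enforcement (assumed model).
    opted_in_principals: set[str] = field(default_factory=set)


def authorization_path(resource: HybridResource, principal_arn: str) -> str:
    """Return which permission model applies to the principal for this resource."""
    if principal_arn in resource.opted_in_principals:
        return "lake_formation"
    # Every other principal keeps using the existing IAM and S3 policies.
    return "iam"


# Hypothetical example: one analyst role is opted in, a legacy ETL role is not.
orders = HybridResource("sales.orders", {"arn:aws:iam::111122223333:role/analyst"})
analyst_path = authorization_path(orders, "arn:aws:iam::111122223333:role/analyst")
legacy_path = authorization_path(orders, "arn:aws:iam::111122223333:role/legacy-etl")
```

The point of the model is that opting in is per principal and per resource, which is what lets existing IAM-based clients keep working untouched.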
Data sharing patterns
AWS Lake Formation offers a few data sharing and access control patterns.
In the producer-centric pattern, all controls are managed within the AWS account hosting the data, providing clearer visibility into access patterns.
Conversely, the consumer-centric pattern delegates access control to the consumers, with the data account solely responsible for sharing data with the consumer account's data lake admin.
The choice between these patterns varies depending on the specific use-case. In my scenario, I opted for the consumer-centric pattern, and the rationale behind this decision will be explored in subsequent sections.
Authorization strategies
AWS Lake Formation offers two authorization strategies:
Named resource method: This involves granting Lake Formation permissions to IAM principals on specific Data Catalog databases, tables, and views.
Tag-based access control: This method defines permissions based on attributes, known as LF-Tags in AWS Lake Formation. LF-Tags can be attached to Data Catalog resources, and permissions can be granted to Lake Formation principals based on these tags.
Tag-based access control simplifies scaling and reduces permission management overhead. This is what I opted for.
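The reason tags scale better is easier to see with a small sketch of how a tag expression selects resources. This is only an illustration of the matching idea (every key in the expression must be present on the resource with one of the allowed values); the function name, tag keys, and table names are hypothetical.

```python
from __future__ import annotations


def matches_tag_expression(resource_tags: dict[str, str],
                           expression: dict[str, set[str]]) -> bool:
    """True if the resource carries an allowed value for every key in the expression."""
    return all(resource_tags.get(key) in allowed for key, allowed in expression.items())


# Hypothetical catalog: table name -> LF-Tags attached to it.
catalog = {
    "sales.orders": {"domain": "sales", "sensitivity": "public"},
    "sales.customers": {"domain": "sales", "sensitivity": "pii"},
    "hr.salaries": {"domain": "hr", "sensitivity": "pii"},
}

# One grant on this expression covers every current and future matching table.
expression = {"domain": {"sales"}, "sensitivity": {"public", "internal"}}
shared_tables = sorted(name for name, tags in catalog.items()
                       if matches_tag_expression(tags, expression))
```

With named resources, each new table would need its own grant; with tags, tagging the table is enough for it to fall under an existing grant.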
Unexpected turns: Adjusting expectations
While the steps outlined in the official AWS Lake Formation documentation typically address a wide range of use-cases, our setup presents unique challenges. The primary distinctions include:
Requirement to support Hybrid access mode in addition to Tag-based access control
- Currently, AWS Lake Formation lacks certain features. Specifically, resources authorized via LF-tags must be explicitly opted in to Hybrid access mode.
Athena limitations when accessing Delta Lake Table format from an external AWS Glue Catalog
- During testing, I encountered an issue where the Athena query engine was unable to access tables in Delta Lake format registered in a separate AWS account. This appears to be a current limitation of Athena, and I hope that the AWS team is actively addressing it. As a workaround, I opted for the consumer-centric data sharing pattern, wherein the resource definition is accessible in the local AWS Glue Catalog via a resource link.
Metadata compatibility issues between AWS Lake Formation and Spark Delta framework
- When Spark jobs use the Glue Catalog as a metastore, they generate EXTERNAL tables with a _PLACEHOLDER_ suffix in the location. This is default behavior resulting from a bug in Spark. AWS Lake Formation vends credentials only for the location specified in the LOCATION property of the table, which may not match the actual data storage location, and this discrepancy can lead to access issues. I implemented a workaround involving a Lambda function triggered by a Glue:UpdateTable API call. This Lambda function updates the table location to match the expected S3 folder location.
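The core of such a Lambda function can be sketched as a pure function that takes the table definition returned by Glue and produces a corrected `TableInput`. This is a sketch under assumptions, not my exact production code: it assumes the expected folder is simply the location with the placeholder suffix removed, the exact spelling of the suffix in the regular expression is an assumption you should check against your own catalog, and the function and constant names are hypothetical.

```python
from __future__ import annotations

import copy
import re

# Assumed spelling of the suffix: optional dash, underscores, PLACEHOLDER, underscores.
_PLACEHOLDER_RE = re.compile(r"-?_+PLACEHOLDER_+/?$")

# get_table returns read-only fields (DatabaseName, CreateTime, ...) that
# update_table rejects, so only fields accepted in TableInput are kept.
_TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def corrected_table_input(table: dict) -> dict | None:
    """Return a TableInput with the placeholder suffix removed, or None if unchanged."""
    location = table.get("StorageDescriptor", {}).get("Location", "")
    fixed = _PLACEHOLDER_RE.sub("", location)
    if fixed == location:
        # Nothing to fix. Returning None also stops the Lambda from re-triggering
        # itself through its own UpdateTable call.
        return None
    table_input = {k: copy.deepcopy(v) for k, v in table.items() if k in _TABLE_INPUT_KEYS}
    table_input["StorageDescriptor"]["Location"] = fixed
    return table_input


# Inside the Lambda handler (not executed here), with glue = boto3.client("glue"):
#   table = glue.get_table(DatabaseName=db, Name=name)["Table"]
#   table_input = corrected_table_input(table)
#   if table_input is not None:
#       glue.update_table(DatabaseName=db, TableInput=table_input)
```

Because the function is triggered by the same API call it issues, the "no change needed" check matters: without it, every corrective update would trigger another invocation.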
With these challenges and their workarounds, we will now look into the implementation architecture and steps.
Implementation
Architecture
As depicted in this architecture, our deployment incorporates a Hybrid access mode for AWS Lake Formation, complemented by Tag-based access control and a consumer-centric data sharing pattern.
In this setup, the data producer employs broad IAM policies for authorization, while consumers leverage AWS Lake Formation for fine-grained access control.
Steps
In the Data Producer AWS account
Modify the Glue Resource policy to enable hybrid mode and allow RAM sharing
{
  "Sid": "RAMAccessLakeFormation",
  "Effect": "Allow",
  "Principal": { "Service": "ram.amazonaws.com" },
  "Action": "glue:ShareResource",
  "Resource": [
    "arn:aws:glue:<local_aws_region>:<local_account_id>:table/*/*",
    "arn:aws:glue:<local_aws_region>:<local_account_id>:database/*",
    "arn:aws:glue:<local_aws_region>:<local_account_id>:catalog"
  ]
}
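If you prefer to generate this policy rather than paste it, the statement above can be wrapped into a full policy document with a few lines of Python. This is a minimal sketch: the function name, region, and account ID are placeholders of my own, and the commented boto3 call is shown only to indicate where the result would go.

```python
import json


def glue_ram_share_policy(region: str, account_id: str) -> str:
    """Build the Glue resource policy document as a JSON string."""
    arn_prefix = f"arn:aws:glue:{region}:{account_id}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RAMAccessLakeFormation",
                "Effect": "Allow",
                "Principal": {"Service": "ram.amazonaws.com"},
                "Action": "glue:ShareResource",
                "Resource": [
                    f"{arn_prefix}:table/*/*",
                    f"{arn_prefix}:database/*",
                    f"{arn_prefix}:catalog",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder region and account ID, for illustration only.
policy_json = glue_ram_share_policy("eu-west-1", "111122223333")

# With boto3 installed and credentials configured, it could be applied with:
#   boto3.client("glue").put_resource_policy(PolicyInJson=policy_json, EnableHybrid="TRUE")
```

Keep in mind that putting a resource policy replaces the existing one, so any statements already present in your Glue resource policy need to be merged into the document first.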
Establish or update an IAM role with access to the S3 data bucket and necessary describe/decrypt privileges on the KMS key if the data in the bucket is encrypted. Ensure the Trust policy on this role permits the Lake Formation and Glue services to assume this role
{
  "Sid": "TrustLF",
  "Effect": "Allow",
  "Principal": { "Service": "lakeformation.amazonaws.com" },
  "Action": "sts:AssumeRole"
},
{
  "Sid": "TrustGlue",
  "Effect": "Allow",
  "Principal": { "Service": "glue.amazonaws.com" },
  "Action": "sts:AssumeRole"
}
Enroll the data S3 location in Lake Formation's hybrid access mode while associating it with the aforementioned IAM role
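For those scripting this step, the registration can be expressed as a single Lake Formation API request. The sketch below only builds the request parameters; the helper name, bucket, and role ARN are hypothetical, and the actual call is left as a comment.

```python
def hybrid_location_request(bucket: str, role_arn: str, prefix: str = "") -> dict:
    """Build the parameters for registering an S3 location in hybrid access mode."""
    resource_arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        resource_arn += "/" + prefix.strip("/")
    return {
        "ResourceArn": resource_arn,
        "RoleArn": role_arn,            # the IAM role from the previous step
        "HybridAccessEnabled": True,    # keep IAM-based access working alongside LF
    }


# Placeholder names, for illustration only.
request = hybrid_location_request(
    "my-data-bucket",
    "arn:aws:iam::111122223333:role/lf-data-access",
    prefix="/delta/",
)

# With boto3 installed and credentials configured:
#   boto3.client("lakeformation").register_resource(**request)
```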
Update the cross-account version in the Lake Formation Data Catalog settings to Version 4 (this step is a prerequisite for enabling Hybrid mode)
Establish Lake Formation tags and link them to the data resources designated for sharing with consumer accounts
Register the database with all tables in Hybrid access mode and provide access to the External AWS account
Grant access to the Lake Formation tags to the Consumer AWS account ID (which will automatically confer permissions to the Lake Formation Data Lake administrator)
Provide the Consumer AWS account access to any table or database that matches the Lake Formation tag expression (permitting Describe for Databases, and Select and Describe for tables with grantable permissions)
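The producer-side settings and grants can be scripted as well. The following sketch builds the request payloads for the cross-account version setting and the two tag-related grants. The function names, account IDs, and tag values are hypothetical, and granting Describe on the tag itself is my assumption about what "access to the tags" means here.

```python
from __future__ import annotations

import copy


def with_cross_account_version(settings: dict, version: str = "4") -> dict:
    """Return a copy of the data lake settings with the cross-account version set."""
    updated = copy.deepcopy(settings)
    updated.setdefault("Parameters", {})["CROSS_ACCOUNT_VERSION"] = version
    return updated


def tag_grant(producer_id: str, consumer_id: str, key: str, values: list[str]) -> dict:
    """Grant the consumer account access to an LF-Tag (DESCRIBE is an assumption)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_id},
        "Resource": {"LFTag": {"CatalogId": producer_id, "TagKey": key,
                               "TagValues": list(values)}},
        "Permissions": ["DESCRIBE"],
    }


def tag_expression_grants(producer_id: str, consumer_id: str,
                          expression: dict[str, list[str]]) -> list[dict]:
    """Grant on everything matching the tag expression, with grant option."""
    wanted = {"DATABASE": ["DESCRIBE"], "TABLE": ["SELECT", "DESCRIBE"]}
    tag_filter = [{"TagKey": k, "TagValues": list(v)} for k, v in expression.items()]
    return [
        {
            "Principal": {"DataLakePrincipalIdentifier": consumer_id},
            "Resource": {"LFTagPolicy": {"CatalogId": producer_id,
                                         "ResourceType": resource_type,
                                         "Expression": tag_filter}},
            "Permissions": permissions,
            "PermissionsWithGrantOption": permissions,
        }
        for resource_type, permissions in wanted.items()
    ]


# Placeholder account IDs and tag, for illustration only.
grants = tag_expression_grants("111122223333", "444455556666", {"domain": ["sales"]})

# With lf = boto3.client("lakeformation"):
#   settings = lf.get_data_lake_settings()["DataLakeSettings"]
#   lf.put_data_lake_settings(DataLakeSettings=with_cross_account_version(settings))
#   lf.grant_permissions(**tag_grant("111122223333", "444455556666", "domain", ["sales"]))
#   for grant in grants:
#       lf.grant_permissions(**grant)
```

The grant option on the tag expression is what allows the consumer's data lake administrator to pass these permissions on to individual IAM principals in the consumer account.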
In the Data Consumer AWS account
Ensure that the consumer IAM principals are granted the following:
Get permissions on all Glue resources from the producer AWS account
{
  "Action": [
    "glue:GetTables",
    "glue:GetTable",
    "glue:SearchTables",
    "glue:GetDatabases",
    "glue:GetDatabase",
    "glue:GetPartitions",
    "glue:GetPartition",
    "glue:BatchGetPartition"
  ],
  "Effect": "Allow",
  "Resource": [
    "arn:aws:glue:<local_aws_region>:<producer_account_id>:table/*/*",
    "arn:aws:glue:<local_aws_region>:<producer_account_id>:database/*",
    "arn:aws:glue:<local_aws_region>:<producer_account_id>:catalog"
  ]
}
Get permissions on all Lake Formation resources
{
  "Action": [
    "lakeformation:GetDataAccess",
    "lakeformation:GetResourceLFTags",
    "lakeformation:ListLFTags",
    "lakeformation:GetLFTag",
    "lakeformation:SearchTablesByLFTags",
    "lakeformation:SearchDatabasesByLFTags"
  ],
  "Effect": "Allow",
  "Resource": "*"
}
If applicable, accept the AWS Resource Access Manager share from the Producer AWS account. Note that this step may not be necessary, as the share is accepted automatically when the producer and consumer accounts belong to the same AWS Organization and resource sharing within the organization is enabled in AWS RAM.
Establish a Glue Database resource link pointing to the shared database from the producer AWS account
Register the shared database and its tables in Hybrid access mode, and share them with the requisite consumer IAM principals
Grant Data Lake Permissions on the Resource link database and its tables to the designated consumer IAM principals
Grant the IAM principals access to any table or database that matches the Lake Formation tag expression (permitting Describe for Databases, and Select and Describe for tables)
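The consumer-side steps map to a handful of Glue and Lake Formation requests. The sketch below builds the payloads for the resource link, the hybrid access mode opt-ins, and the grants to a consumer IAM principal. All names, account IDs, and ARNs are placeholders, and the helper functions are my own illustration rather than an official recipe.

```python
from __future__ import annotations


def resource_link_input(link_name: str, producer_id: str, shared_db: str) -> dict:
    """DatabaseInput for a Glue resource link pointing at the producer's database."""
    return {"Name": link_name,
            "TargetDatabase": {"CatalogId": producer_id, "DatabaseName": shared_db}}


def opt_in_requests(principal_arn: str, producer_id: str, shared_db: str) -> list[dict]:
    """Opt a principal in to Lake Formation enforcement for the shared database and tables."""
    resources = [
        {"Database": {"CatalogId": producer_id, "Name": shared_db}},
        {"Table": {"CatalogId": producer_id, "DatabaseName": shared_db,
                   "TableWildcard": {}}},
    ]
    return [{"Principal": {"DataLakePrincipalIdentifier": principal_arn}, "Resource": r}
            for r in resources]


def principal_grants(principal_arn: str, link_name: str, producer_id: str,
                     expression: dict[str, list[str]]) -> list[dict]:
    """Describe on the resource link, plus tag-expression grants on the shared data."""
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    tag_filter = [{"TagKey": k, "TagValues": list(v)} for k, v in expression.items()]
    grants = [{"Principal": principal,
               "Resource": {"Database": {"Name": link_name}},
               "Permissions": ["DESCRIBE"]}]
    for resource_type, permissions in (("DATABASE", ["DESCRIBE"]),
                                       ("TABLE", ["SELECT", "DESCRIBE"])):
        grants.append({"Principal": principal,
                       "Resource": {"LFTagPolicy": {"CatalogId": producer_id,
                                                    "ResourceType": resource_type,
                                                    "Expression": tag_filter}},
                       "Permissions": permissions})
    return grants


# Placeholder identifiers, for illustration only.
role = "arn:aws:iam::444455556666:role/analyst"
link = resource_link_input("sales_link", "111122223333", "sales")
opt_ins = opt_in_requests(role, "111122223333", "sales")
grants = principal_grants(role, "sales_link", "111122223333", {"domain": ["sales"]})

# With glue = boto3.client("glue") and lf = boto3.client("lakeformation"):
#   glue.create_database(DatabaseInput=link)
#   for req in opt_ins:
#       lf.create_lake_formation_opt_in(**req)
#   for grant in grants:
#       lf.grant_permissions(**grant)
```

Note that the tag-expression grants reference the producer's catalog ID, because the LF-Tags live in the producer account even though the grants are made in the consumer account.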
Conclusion
As I explored cross-account data access control with AWS Lake Formation, I encountered numerous challenges and unexpected realities. Despite the complexities, this journey offered invaluable insights and lessons.
An essential takeaway is the need to adapt to unforeseen limitations and evolving requirements. Hurdles such as the compatibility issues between AWS Lake Formation and the Spark Delta framework, and Athena's constraints when accessing Delta Lake tables, required workarounds.
With AWS services continuously evolving, I anticipate new solutions and enhancements to address existing limitations and streamline processes. By remaining informed and proactive, we can effectively navigate the ever-changing landscape of data management and access control within the AWS ecosystem.
In conclusion, my journey with AWS Lake Formation has provided profound insights into the intricacies of cross-account data access control. By sharing my experiences and lessons learned, I aim to empower others to confront similar challenges with confidence and achieve success in their endeavors.