
With IRSA

Using your own S3 bucket for Datasaur projects with IRSA delegated permission


For self-hosted deployments from the AWS Marketplace, the delegated permission method used is IAM Roles for Service Accounts (IRSA). Although the overall approach is almost the same as the original approach described on the parent page (AWS S3), this page exists to avoid confusion between the two and to make the steps clear for self-hosted users on AWS Marketplace. One of the main differences is in creating the IAM role, specifically the 4th step.

File Key

This attribute is used when you create a project to tell Datasaur which file should be used. You can get it from the path after the bucket name in the S3 URI. See the example below.

  • Bucket name: datasaur-test

  • S3 URI: s3://datasaur-test/some-folder/image.png

  • File key: /some-folder/image.png
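If you derive file keys programmatically, the convention above amounts to taking the path component of the S3 URI. A minimal sketch in Python, assuming the leading-slash convention shown above (the helper name is ours, not part of any Datasaur API):

from urllib.parse import urlparse

def file_key_from_s3_uri(s3_uri: str) -> str:
    """Return the file key: the path after the bucket name in an S3 URI."""
    parsed = urlparse(s3_uri)  # scheme="s3", netloc=bucket, path="/some-folder/image.png"
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI")
    return parsed.path  # keeps the leading slash, matching the example above

print(file_key_from_s3_uri("s3://datasaur-test/some-folder/image.png"))
# -> /some-folder/image.png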

Setup

By integrating your bucket with Datasaur, you will be able to create projects using files directly from your S3 bucket.

1. Set up the External Object Storage integration in Datasaur Team Settings

Let's begin by setting up an integration in Team Settings. By default, Datasaur uses its own storage to manage your projects. By adding an external object storage, you can use your preferred storage provider when creating projects.

  1. Open your team page, then go to Settings > Integrations.

  2. Click on "Add External Object Storage". A new window will pop up. Do not close the pop up because we will use the External ID and it will be generated each time you close the form.

  3. You can start by filling in the Bucket name attribute. It will be used to reference and differentiate between external object storages.

We'll get back to this window later. Let's leave it for now.

2. Set up CORS for your S3 bucket

This step allows Datasaur to access resources in your bucket.

  1. Log in to your AWS account, then go to the S3 management console.

  2. Click on your preferred bucket. It is also highly recommended to enable lifecycle rules that remove objects under the temp/ and export/ prefixes after 7 days; a sketch of such rules follows at the end of this step.

  3. Open Permissions, edit the Cross-origin resource sharing (CORS) section, and paste the following configuration.

[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": [
      "GET",
      "PUT",
      "POST",
      "HEAD",
      "DELETE"
    ],
    "AllowedOrigins": ["<FILL_THIS_WITH_YOUR_DOMAIN>"],
    "ExposeHeaders": []
  }
]

  • Bucket name: Fill in the name of the bucket that you just configured CORS for.

  • Bucket prefix: This will be prepended to paths in the bucket so that you can group files according to your needs, e.g. test refers to /{bucket-name}/test.

  • Allowed origins: Change this to your self-hosted domain for the Datasaur app.
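The lifecycle recommendation from step 2 can be applied from the console or programmatically. A hedged sketch with boto3, assuming 7-day expiration rules for the temp/ and export/ prefixes (the rule IDs are ours):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="<your-bucket-name>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-after-7-days",    # rule name is ours
                "Filter": {"Prefix": "temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {
                "ID": "expire-export-after-7-days",  # rule name is ours
                "Filter": {"Prefix": "export/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
        ]
    },
)

Note that put_bucket_lifecycle_configuration replaces any existing lifecycle configuration on the bucket, so merge these rules with yours if you already have some.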

3. Create a policy for the Datasaur role in AWS

You need to create a policy that grants access to your S3 bucket. If you have already set up a policy for accessing the bucket, feel free to skip this step.

  1. In your AWS IAM management console, go to Policies, then click on Create Policy.

  2. Choose the JSON tab, and paste the following configuration. Don't forget to replace the resource with your bucket name. The write permissions will be used to upload the selected files to your bucket, whereas s3:GetBucketLocation will be used to configure requests based on your bucket's region. (A programmatic alternative is sketched after these steps.)

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "s3:ListBucket",
            "s3:ListBucketVersions",
            "s3:PutObjectAcl",
            "s3:PutObject",
            "s3:GetObjectAcl",
            "s3:GetObject",
            "s3:DeleteObjectVersion",
            "s3:DeleteObject",
            "s3:GetBucketLocation"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::<your-bucket-name>/*",
            "arn:aws:s3:::<your-bucket-name>"
          ]
        }
      ]
    }
  3. Click on Next: Tags. We don't require tags to be added, but you can add tags here if you want.

  4. Click on Next: Review. Input a name for the AWS Policy, a description (optional), and click on Create Policy.
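If you prefer to create the policy programmatically rather than through the console, here is a sketch with boto3; the policy name is hypothetical, and the document mirrors the JSON above:

import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket", "s3:ListBucketVersions",
                "s3:PutObjectAcl", "s3:PutObject",
                "s3:GetObjectAcl", "s3:GetObject",
                "s3:DeleteObjectVersion", "s3:DeleteObject",
                "s3:GetBucketLocation",
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<your-bucket-name>/*",
                "arn:aws:s3:::<your-bucket-name>",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="datasaur-s3-access",  # hypothetical name; choose your own
    PolicyDocument=json.dumps(policy_document),
)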

4. Create a role for Datasaur

Now that we've created a policy for your S3 bucket, we need to attach it to a role that Datasaur will assume to access your bucket.

  1. Back in the IAM management console, go to Roles, then click on Create role.

  2. Choose AWS account in the trusted entity type section.

  3. Select Custom trust policy for the trusted entity type attribute, then paste the configuration below. (A sketch of how this trust policy is exercised follows these steps.)

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<DATASAUR_AWS_ACCOUNT_ID>:role/<IRSA_ROLE_NAME>"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<YOUR_EXTERNAL_ID>"
            }
          }
        }
      ]
    }
  4. Replace the values for the AWS account ID, IRSA role name, and external ID accordingly. Use the AWS Account ID displayed in the external object storage form. You can define your own external ID; just be sure to update the value in the form as well.

  5. In the Add permissions section, pick the policy that we created in the previous step. Then, click on Next.

  6. Input a name, an optional description, and click on Create role.

  7. After that, back on the Roles page, click on your newly created role.

  8. Copy the Role ARN from the page and paste it into the Datasaur Team Settings page.
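To clarify what the trust policy from step 3 does: the trusted principal obtains temporary credentials by calling sts:AssumeRole, and the call only succeeds when it carries the matching external ID. Datasaur performs this call on its side; the sketch below is illustrative only, with placeholder ARNs and a session name of our choosing:

import boto3

sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::<your-account-id>:role/<your-new-role>",
    RoleSessionName="datasaur-object-storage",  # session name is ours
    ExternalId="<YOUR_EXTERNAL_ID>",            # must match sts:ExternalId in the trust policy
)
credentials = response["Credentials"]  # temporary AccessKeyId / SecretAccessKey / SessionToken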

5. Check connection

Before you create the integration, run a connection check to make sure your setup is done correctly. If it succeeds, you can continue to create the external object storage.

6. Good to go!

Now you will be able to create projects using files directly from your S3 bucket, and you can change the Default Storage option to whichever one you want from the Team Settings page.

If you have any questions or comments, please let us know, and we'll be happy to support you.
