MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
Connectors
Cloud Storage Connectors

Cloud storage connectors extract data from files stored in object storage systems and file servers. They support reading structured files (CSV, Parquet, JSON, Avro) from cloud buckets and SFTP servers, making it possible to ingest data from data lakes, file drops, and partner data exchanges.


Amazon S3

The S3 connector reads files from Amazon S3 buckets. It supports path pattern matching, file format detection, and incremental sync based on file modification time.
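Incremental sync works by comparing each object's last-modified timestamp against a stored cursor and ingesting only newer files. The selection logic can be sketched as below; the `FileRecord` type and `select_new_files` helper are illustrative, not part of the connector API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FileRecord:
    key: str            # object key, e.g. "raw/sales/2024/01.parquet"
    last_modified: datetime

def select_new_files(files, cursor):
    """Return files modified after the cursor, oldest first, plus the
    new cursor value to persist for the next sync run."""
    fresh = sorted(
        (f for f in files if f.last_modified > cursor),
        key=lambda f: f.last_modified,
    )
    new_cursor = fresh[-1].last_modified if fresh else cursor
    return fresh, new_cursor

# Example: only the file written after the last sync is picked up.
cursor = datetime(2024, 6, 1, tzinfo=timezone.utc)
files = [
    FileRecord("raw/sales/may.parquet", datetime(2024, 5, 20, tzinfo=timezone.utc)),
    FileRecord("raw/sales/june.parquet", datetime(2024, 6, 15, tzinfo=timezone.utc)),
]
fresh, cursor = select_new_files(files, cursor)
print([f.key for f in fresh])  # ['raw/sales/june.parquet']
```

Persisting the returned cursor between runs is what makes the sync incremental rather than a full re-read of the bucket.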

Configuration

{
  "name": "s3-data-lake",
  "connectorType": "s3",
  "connectionConfig": {
    "bucket": "company-data-lake",
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "********",
    "region_name": "us-east-1",
    "path_prefix": "raw/sales/",
    "streams": [
      {
        "name": "sales_data",
        "format": {
          "filetype": "parquet"
        },
        "globs": ["raw/sales/**/*.parquet"]
      }
    ]
  }
}

Configuration Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| bucket | string | Yes | -- | S3 bucket name |
| aws_access_key_id | string | Yes* | -- | AWS access key ID |
| aws_secret_access_key | string | Yes* | -- | AWS secret access key |
| region_name | string | No | us-east-1 | AWS region |
| path_prefix | string | No | -- | Prefix to filter S3 objects |
| streams[].name | string | Yes | -- | Logical stream name for the extracted data |
| streams[].format.filetype | string | Yes | -- | File format: csv, parquet, jsonl, avro |
| streams[].globs | string[] | No | -- | Glob patterns to match files |

\* Not required when using IAM role authentication (see below).

Supported File Formats

| Format | Extensions | Features |
|---|---|---|
| CSV | .csv, .tsv, .txt | Configurable delimiter, quoting, encoding, header detection |
| Parquet | .parquet | Full schema preservation, predicate pushdown |
| JSON Lines | .jsonl, .ndjson | One JSON object per line |
| Avro | .avro | Schema registry integration |

IAM Role Authentication

For deployments running on AWS, you can use IAM role-based authentication instead of access keys. Attach an IAM role with s3:GetObject and s3:ListBucket permissions to the Airbyte worker pods.
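With IAM role authentication, the access-key fields are simply omitted from the connection config. A sketch of the resulting configuration (bucket, prefix, and stream names reuse the illustrative values from the example above):

```json
{
  "name": "s3-data-lake",
  "connectorType": "s3",
  "connectionConfig": {
    "bucket": "company-data-lake",
    "region_name": "us-east-1",
    "path_prefix": "raw/sales/",
    "streams": [
      {
        "name": "sales_data",
        "format": { "filetype": "parquet" },
        "globs": ["raw/sales/**/*.parquet"]
      }
    ]
  }
}
```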


Google Cloud Storage

The GCS connector reads files from Google Cloud Storage buckets.

Configuration

{
  "name": "gcs-analytics",
  "connectorType": "gcs",
  "connectionConfig": {
    "service_account": "{\"type\":\"service_account\",\"project_id\":\"my-project\",...}",
    "bucket": "analytics-data",
    "streams": [
      {
        "name": "web_events",
        "format": {
          "filetype": "jsonl"
        },
        "globs": ["events/**/*.jsonl"]
      }
    ]
  }
}

Configuration Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| service_account | string | Yes | -- | GCP service account JSON key (stringified) |
| bucket | string | Yes | -- | GCS bucket name |
| streams[].name | string | Yes | -- | Logical stream name |
| streams[].format.filetype | string | Yes | -- | File format: csv, parquet, jsonl, avro |
| streams[].globs | string[] | No | -- | Glob patterns to match objects |
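Because service_account must be a stringified JSON key rather than a nested object, it is easiest to produce with json.dumps. A minimal sketch (the key below is a truncated placeholder, not a real credential):

```python
import json

# Hypothetical, truncated service-account key; real keys contain more fields.
sa_key = {
    "type": "service_account",
    "project_id": "my-project",
    "client_email": "reader@my-project.iam.gserviceaccount.com",
}

config = {
    "name": "gcs-analytics",
    "connectorType": "gcs",
    "connectionConfig": {
        "service_account": json.dumps(sa_key),  # stringified, not nested JSON
        "bucket": "analytics-data",
    },
}

# The stringified key round-trips back to the original structure.
print(json.loads(config["connectionConfig"]["service_account"])["project_id"])  # my-project
```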

Required Permissions

The service account must have the following IAM roles:

  • roles/storage.objectViewer -- read access to objects
  • roles/storage.legacyBucketReader -- list objects in the bucket

Azure Blob Storage

The Azure Blob Storage connector reads files from Azure Storage containers.

Configuration

{
  "name": "azure-blob-reports",
  "connectorType": "azure-blob-storage",
  "connectionConfig": {
    "azure_blob_storage_account_name": "mystorageaccount",
    "azure_blob_storage_account_key": "********",
    "azure_blob_storage_container_name": "reports",
    "streams": [
      {
        "name": "monthly_reports",
        "format": {
          "filetype": "csv"
        },
        "globs": ["reports/2024/**/*.csv"]
      }
    ]
  }
}

Configuration Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| azure_blob_storage_account_name | string | Yes | -- | Storage account name |
| azure_blob_storage_account_key | string | Yes* | -- | Storage account access key |
| azure_blob_storage_sas_token | string | Yes* | -- | SAS token (alternative to account key) |
| azure_blob_storage_container_name | string | Yes | -- | Container name |
| streams[].name | string | Yes | -- | Logical stream name |
| streams[].format.filetype | string | Yes | -- | File format: csv, parquet, jsonl, avro |
| streams[].globs | string[] | No | -- | Glob patterns to match blobs |

\* Provide either the account key or a SAS token, not both.
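As the field table notes, a SAS token can be supplied in place of the account key. A sketch of that variant (account, container, and token values are placeholders):

```json
{
  "name": "azure-blob-reports",
  "connectorType": "azure-blob-storage",
  "connectionConfig": {
    "azure_blob_storage_account_name": "mystorageaccount",
    "azure_blob_storage_sas_token": "********",
    "azure_blob_storage_container_name": "reports",
    "streams": [
      {
        "name": "monthly_reports",
        "format": { "filetype": "csv" },
        "globs": ["reports/2024/**/*.csv"]
      }
    ]
  }
}
```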

SFTP

The SFTP connector reads files from SFTP servers. This is commonly used for partner data exchanges and legacy system integrations.

Configuration

{
  "name": "sftp-partner-data",
  "connectorType": "sftp-bulk",
  "connectionConfig": {
    "host": "sftp.partner.com",
    "port": 22,
    "username": "matih_user",
    "credentials": {
      "auth_type": "password",
      "password": "********"
    },
    "streams": [
      {
        "name": "daily_feed",
        "format": {
          "filetype": "csv",
          "delimiter": "|",
          "encoding": "utf-8"
        },
        "globs": ["/data/feeds/daily_*.csv"]
      }
    ]
  }
}
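For key-based authentication, the credentials block uses auth_type ssh_key with a private_key instead of a password. A sketch of that variant (host, username, and key are placeholders):

```json
{
  "name": "sftp-partner-data",
  "connectorType": "sftp-bulk",
  "connectionConfig": {
    "host": "sftp.partner.com",
    "port": 22,
    "username": "matih_user",
    "credentials": {
      "auth_type": "ssh_key",
      "private_key": "-----BEGIN OPENSSH PRIVATE KEY-----\n********"
    },
    "streams": [
      {
        "name": "daily_feed",
        "format": { "filetype": "csv", "delimiter": "|", "encoding": "utf-8" },
        "globs": ["/data/feeds/daily_*.csv"]
      }
    ]
  }
}
```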

Configuration Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| host | string | Yes | -- | SFTP server hostname |
| port | integer | No | 22 | SFTP server port |
| username | string | Yes | -- | SFTP username |
| credentials.auth_type | string | Yes | -- | password or ssh_key |
| credentials.password | string | Conditional | -- | Password (if auth_type is password) |
| credentials.private_key | string | Conditional | -- | SSH private key (if auth_type is ssh_key) |
| streams[].name | string | Yes | -- | Logical stream name |
| streams[].format.filetype | string | Yes | -- | File format: csv, parquet, jsonl, avro |
| streams[].format.delimiter | string | No | , | Column delimiter for CSV files |
| streams[].format.encoding | string | No | utf-8 | File encoding |
| streams[].globs | string[] | No | -- | Glob patterns to match files |
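The delimiter and encoding options correspond to standard CSV parsing behavior, as Python's csv module illustrates. A minimal sketch of reading a pipe-delimited feed in the shape the example stream describes (the sample row is invented):

```python
import csv
import io

# A pipe-delimited sample matching the delimiter configured above.
raw = "order_id|amount|currency\n1001|49.90|EUR\n"

reader = csv.reader(io.StringIO(raw), delimiter="|")
header = next(reader)   # first line is treated as the header row
rows = list(reader)     # remaining lines become data rows

print(header)  # ['order_id', 'amount', 'currency']
print(rows)    # [['1001', '49.90', 'EUR']]
```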

Cloud Storage vs. File Import

The platform provides two ways to ingest file-based data. Choose the appropriate method based on your use case.

| Feature | Cloud Storage Connector | File Import (Direct Upload) |
|---|---|---|
| Source | Files in S3, GCS, Azure, SFTP | Files on your local machine |
| Schedule | Automated on cron schedule | Manual, one-time upload |
| Volume | Unlimited (streams entire buckets) | Single file per upload |
| Format support | CSV, Parquet, JSON Lines, Avro | CSV, Excel, Parquet, JSON, Avro |
| Schema | Auto-detected from file structure | Auto-detected with preview and manual override |
| Use case | Recurring data feeds, data lake ingestion | Ad-hoc data loading, one-time imports |
| Documentation | This page | File Import |