
Data Lake

/publish

Load data from files in S3 or GCS into the data lake.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| file | String | Yes | Fully qualified file path, e.g. `s3://yourbucket/path/to/file.json` or `gs://yourgcsbucket/path/to/file.json`. |
| rules | TransformRuleSet | No | |
| format | String | No | One of: txt, csv, json, jsonl. If not present, the filename will be inspected for the format. |
| compression | String | No | One of: gz, zip. If not present, the filename will be inspected. |
| encoding | String | No | Any Java encoding format. Default UTF-8. |
| delimiter | char | No | Character for delimited file formats. Tabs and commas are assumed for txt/csv unless specified. |
| headers | boolean | No | Whether headers are present for delimited formats. Default true. If false, a TransformRuleSet is required. |
| rows | Integer | No | Max rows to load. Negative or omitted loads all rows. |
```bash
curl https://test-m1.minusonedb.com/publish \
  -d "file=s3://m1-public/reddit/us.jsonl.gz" \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 publish -file "s3://m1-public/reddit/us.jsonl.gz"
```

200 OK: Publish progress
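When the filename alone doesn't convey how a file should be parsed, the optional parameters above can be passed explicitly. A minimal sketch of assembling such a form body; the helper name and bucket path are hypothetical, and `rows=0` loads nothing, which is handy for validating access first:

```shell
# Hypothetical helper: assemble the form body for a /publish call that
# overrides format detection. Parameter names come from the table above;
# the bucket path is a placeholder.
build_publish_body() {
  local file="$1" format="$2" headers="$3" rows="$4"
  printf 'file=%s&format=%s&headers=%s&rows=%s' "$file" "$format" "$headers" "$rows"
}

# headers=true so no TransformRuleSet is needed; rows=0 loads no rows
build_publish_body "s3://yourbucket/path/to/file.csv" csv true 0
```

The resulting string is what the curl example above passes via `-d`.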

To load data via /publish you must configure permissions so that the EC2 instance profile role associated with your environment can read the file(s) you are attempting to /publish.

```json
// Example ReadAccess policy for S3 configuration
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::248899197673:role/$instanceProfileRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::$yourBucket",
        "arn:aws:s3:::$yourBucket/*"
      ]
    }
  ]
}
```
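Substituting your own values into this policy can be scripted. The sketch below is hypothetical: `BUCKET` and `ROLE` are placeholders (the role comes from /env/get), and only the account id 248899197673 is taken from the example above.

```shell
# Emit the ReadAccess policy with your bucket and instance profile role
# substituted in. BUCKET and ROLE are placeholders; 248899197673 is the
# fixed m1db AWS account from the example policy.
BUCKET="yourbucket"
ROLE="yourInstanceProfileRole"
POLICY=$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::248899197673:role/$ROLE" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$BUCKET", "arn:aws:s3:::$BUCKET/*"]
    }
  ]
}
EOF
)
printf '%s\n' "$POLICY"
```

Once saved to a file, the document can be attached with `aws s3api put-bucket-policy --bucket "$BUCKET" --policy file://policy.json`.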

Configuration Steps When Publishing from S3

1. Run /env/bucket/register to register the bucket with your environment

2. In your AWS account, configure a bucket policy so that the instance profile role of your environment has read access to your data. You can obtain the instanceProfileRole from /env/get.

3. Validate your bucket access by calling /publish with one of your files and rows=0. If you get an AccessDeniedException, recheck your IAM permissions above.

4. You can now call /publish on all the files you wish to load into your environment.
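The check in step 3 can be scripted. The helper below is hypothetical and the canned responses stand in for real server output; it only illustrates the pattern of treating an AccessDeniedException in the /publish response as an IAM misconfiguration:

```shell
# Hypothetical helper: classify a /publish validation response.
check_publish_response() {
  if printf '%s' "$1" | grep -q "AccessDeniedException"; then
    echo "access denied: recheck the bucket policy and IAM permissions"
  else
    echo "bucket access OK"
  fi
}

check_publish_response '{"error":"AccessDeniedException: s3://yourbucket"}'
check_publish_response 'Publish progress: 0 rows loaded'
```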

```bash
# You can use an existing service account or create one. Skip this step if you
# already have a service account. You can replace the account name and display
# name with anything you like.
gcloud iam service-accounts create m1db-archive-bucket-reader \
  --display-name="m1db archive bucket reader"
# You will need the email address of the account you just created. You can find
# it in the console (under IAM > Service Accounts), or run the below, replacing
# the email filter with the name you gave your service account.
export SA_EMAIL=$(gcloud iam service-accounts list \
  --filter="email:m1db-archive-bucket-reader" \
  --format="value(email)")
# Replace <bucket> with your bucket name
gcloud storage buckets add-iam-policy-binding gs://<bucket> \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.objectViewer"
```

Configuration Steps When Publishing from GCS

1. Create a service account in your Google Cloud account and allow it to read/download files in your bucket(s)

(You can skip this step if you already have a service account with sufficient access to the data you wish to load into your m1db environment)

2. Create a trust relationship between the EC2 instance profile role associated with your environment and your service account

Select a configuration that refers to the project containing your GCS bucket (`gcloud config configurations activate <config>`). Alternatively, you can add `--configuration <config>` to each of the commands below. The account in the `<config>` must have at least these permissions on the relevant project:

  • iam.serviceAccounts.create
  • iam.serviceAccounts.setIamPolicy
  • storage.buckets.setIamPolicy
```bash
# Replace <env> with your environment name
export M1_INSTANCE_PROFILE=$(m1 ops env/get -env <env> | grep instanceProfileRole | sed -e 's/\(.*\): "\(.*\)",/\2/')
# Create a trust relationship between your service account and your
# environment's instance role. Note that this role is mediated by an
# m1db GCP account.
gcloud iam service-accounts add-iam-policy-binding $SA_EMAIL \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$M1_INSTANCE_PROFILE"
```
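The long `--member` value above is mechanical: a fixed prefix identifying m1db's workload identity pool, plus your environment's instance profile role. A sketch of how it is assembled; the role name here is a placeholder:

```shell
# Assemble the --member principal for the workload identity binding.
# Project number 980494489932 and AWS account 248899197673 are the fixed
# m1db values from the command above; the role name is a placeholder.
M1_INSTANCE_PROFILE="yourInstanceProfileRole"
POOL="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool"
MEMBER="$POOL/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$M1_INSTANCE_PROFILE"
echo "$MEMBER"
```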

3. Set the gcs-service-account system property to the email address of your service account.

```bash
# Set the gcs-service-account system property to point to your service account
m1 <env> system -gcs-service-account $SA_EMAIL
```

4. Enable outbound connectivity for your environment

```bash
# Enable outbound connectivity for your environment
m1 ops env/outbound -env <env> -enable true
```

5. Validate your GCS bucket access by calling /publish with one of your files and rows=0. If you get an access error, recheck your configuration steps. Note that it may take a few minutes for your configuration to take effect.

6. You can now call /publish on all the files you wish to load into your environment.

/modify

Modify the data lake with inserts, updates, or deletions. This method is a generalization of /insert, /update, and /delete. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
| delete | String[] | No | Raw ids to delete. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/modify \
  -d 'e=[{"score":"199","downs":"10","author":"Alice"}]&delete=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 modify -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]' -delete '["82270000"]'
```

200 OK: Retrieved records
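Since /modify generalizes /insert and /delete, one request can carry both an entities array and a deletion list. A hypothetical payload builder, useful when scripting bulk changes:

```shell
# Hypothetical helper: build a /modify body carrying both inserts/updates
# (e=) and deletions (delete=) in a single request.
make_modify_body() {
  printf 'e=%s&delete=%s' "$1" "$2"
}

make_modify_body '[{"score":"199","downs":"10","author":"Alice"}]' '["82270000"]'
```

The resulting string is what the curl example above passes via `-d`.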

/insert

Insert raw entities into the lake.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/insert \
  -d 'e=[{"score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 insert -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]'
```

200 OK: Retrieved records

/update

Update documents in the lake by passing in entities associated with their _m1key. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: delete
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/update \
  -d 'e=[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 update -e '[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]'
```

200 OK: Retrieved records
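The only difference from /insert is that each entity must carry the `_m1key` of the document it replaces. A hypothetical helper making that explicit:

```shell
# Hypothetical helper: wrap updated fields with the _m1key that targets the
# existing document; without _m1key the server cannot match the record.
make_update_entity() {
  printf '[{"_m1key":"%s","score":"%s"}]' "$1" "$2"
}

make_update_entity 82270000 199
```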

/delete

Delete records from the data lake for the specified list of _m1key ids.

  • rights: delete
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| ids | String[] | No | List of m1key ids to be deleted. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/delete \
  -d 'ids=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 delete -ids '["82270000"]'
```

200 OK: Retrieved records

/next

Retrieve the next _m1key that will be assigned to a document added to the lake (via /publish, for example).

  • rights: admin, publish
  • verbs: GET
  • parameters: none
```bash
curl https://test-m1.minusonedb.com/next \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 next
```

200 OK: Next available key

/get

Retrieve any number of rows from the data lake via the _m1key property.

  • rights: get
  • verbs: GET, POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| ids | long[] | Yes | IDs of records to be retrieved. |
| properties | Array | No | List of properties from the schema to include in records. If null, all columns are returned. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/get \
  -d 'ids=[10000,20000,30000]&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 get -ids "[10000,20000,30000]" -properties '["_m1key","session.id"]'
```

200 OK: Retrieved records

/range

Retrieve all rows from the data lake with _m1key values between start (inclusive) and end (exclusive).

  • rights: get
  • verbs: GET, POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| start | long | Yes | Inclusive. |
| end | long | Yes | Exclusive. |
| properties | Array | No | List of properties from the schema to include in records. If null, all columns are returned. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/range \
  -d 'start=10000&end=30000&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 range -start 10000 -end 30000
```

200 OK: Records in range
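The inclusive-start / exclusive-end convention makes paging straightforward: consecutive windows of the form [s, s+step) never overlap and never skip a key. A sketch of walking a key range in fixed-size windows; the bounds and step are arbitrary, and the actual curl call is left as a comment:

```shell
# Walk a _m1key range in fixed-size windows. start is inclusive and end is
# exclusive, so adjacent windows share no keys.
START=10000; END=30000; STEP=5000
s=$START
while [ "$s" -lt "$END" ]; do
  e=$((s + STEP))
  [ "$e" -gt "$END" ] && e=$END
  echo "fetch window: start=$s end=$e"
  # curl https://test-m1.minusonedb.com/range -d "start=$s&end=$e" -H "m1-auth-token: $myToken"
  s=$e
done
```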

© 2021-2026 MinusOne, Inc.