
Data Lake

/publish

Load data from files in S3 or GCS into the data lake.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| file | String | Yes | Fully qualified file path, e.g. `s3://yourbucket/path/to/file.json` or `gs://yourgcsbucket/path/to/file.json`. |
| rules | TransformRuleSet | No | |
| format | String | No | One of: txt, csv, json, jsonl. If not present, the filename will be inspected for the format. |
| compression | String | No | One of: gz, zip. If not present, the filename will be inspected. |
| encoding | String | No | Any Java encoding format. Default UTF-8. |
| delimiter | char | No | Character for delimited file formats. Tabs and commas are assumed for txt/csv unless specified. |
| headers | boolean | No | Whether headers are present for delimited formats. Default true. If false, a TransformRuleSet is required. |
| rows | Integer | No | Max rows to load. Negative or omitted loads all rows. |
```bash
curl https://test-m1.minusonedb.com/publish \
  -d "file=s3://m1-public/reddit/us.jsonl.gz" \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 publish -file "s3://m1-public/reddit/us.jsonl.gz"
```

200 OK: Publish progress
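When the filename alone doesn't convey how a file should be parsed, the optional parameters above can be passed explicitly. A minimal sketch of assembling such a form body; the helper name and bucket path are hypothetical, and `rows=0` loads nothing, which is handy for validating access first:

```shell
# Hypothetical helper: assemble the form body for a /publish call that
# overrides format detection. Parameter names come from the table above;
# the bucket path is a placeholder.
build_publish_body() {
  local file="$1" format="$2" headers="$3" rows="$4"
  printf 'file=%s&format=%s&headers=%s&rows=%s' "$file" "$format" "$headers" "$rows"
}

# headers=true so no TransformRuleSet is needed; rows=0 loads no rows
build_publish_body "s3://yourbucket/path/to/file.csv" csv true 0
```

The resulting string is what the curl example above passes via `-d`.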

To load data via /publish you must configure permissions so that the EC2 instance profile role associated with your environment can read the file(s) you are attempting to /publish.

```json
// Example ReadAccess policy for S3 configuration
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::248899197673:role/$instanceProfileRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::$yourBucket",
        "arn:aws:s3:::$yourBucket/*"
      ]
    }
  ]
}
```
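Substituting your own values into this policy can be scripted. The sketch below is hypothetical: `BUCKET` and `ROLE` are placeholders (the role comes from /env/get), and only the account id 248899197673 is taken from the example above.

```shell
# Emit the ReadAccess policy with your bucket and instance profile role
# substituted in. BUCKET and ROLE are placeholders; 248899197673 is the
# fixed m1db AWS account from the example policy.
BUCKET="yourbucket"
ROLE="yourInstanceProfileRole"
POLICY=$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::248899197673:role/$ROLE" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$BUCKET", "arn:aws:s3:::$BUCKET/*"]
    }
  ]
}
EOF
)
printf '%s\n' "$POLICY"
```

Once saved to a file, the document can be attached with `aws s3api put-bucket-policy --bucket "$BUCKET" --policy file://policy.json`.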

Configuration Steps When Publishing from S3

1. Run /env/bucket/register to register the bucket with your environment

2. In your AWS account, configure a bucket policy so that the instance profile role of your environment has read access to your data. You can obtain the instanceProfileRole from /env/get.

3. Validate your bucket access by calling /publish with one of your files and rows=0. If you get an AccessDeniedException, recheck your IAM permissions above.

4. You can now call /publish on all the files you wish to load into your environment.
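The check in step 3 can be scripted. The helper below is hypothetical and the canned responses stand in for real server output; it only illustrates the pattern of treating an AccessDeniedException in the /publish response as an IAM misconfiguration:

```shell
# Hypothetical helper: classify a /publish validation response.
check_publish_response() {
  if printf '%s' "$1" | grep -q "AccessDeniedException"; then
    echo "access denied: recheck the bucket policy and IAM permissions"
  else
    echo "bucket access OK"
  fi
}

check_publish_response '{"error":"AccessDeniedException: s3://yourbucket"}'
check_publish_response 'Publish progress: 0 rows loaded'
```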

```bash
# You can use an existing service account or create one. Skip this step if you
# already have a service account. You can replace the account name and display
# name with anything you like.
gcloud iam service-accounts create m1db-archive-bucket-reader \
  --display-name="m1db archive bucket reader"
# You will need the email address of the account you just created. You can find
# it in the console (under IAM > Service Accounts), or run the below, replacing
# the email filter with the name you gave your service account.
export SA_EMAIL=$(gcloud iam service-accounts list \
  --filter="email:m1db-archive-bucket-reader" \
  --format="value(email)")
# Replace <bucket> with your bucket name
gcloud storage buckets add-iam-policy-binding gs://<bucket> \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.objectViewer"
```

Configuration Steps When Publishing from GCS

1. Create a service account in your Google Cloud account and allow it to read/download files in your bucket(s)

(You can skip this step if you already have a service account with sufficient access to the data you wish to load into your m1db environment)

2. Create a trust relationship between the EC2 instance profile role associated with your environment and your service account

Select a configuration that refers to the project containing your GCS bucket (`gcloud config configurations activate <config>`). Alternatively, you can add `--configuration <config>` to each of the commands below. The account in the `<config>` must have at least these permissions on the relevant project:

  • iam.serviceAccounts.create
  • iam.serviceAccounts.setIamPolicy
  • storage.buckets.setIamPolicy
```bash
# Replace <env> with your environment name
export M1_INSTANCE_PROFILE=$(m1 ops env/get -env <env> | grep instanceProfileRole | sed -e 's/\(.*\): "\(.*\)",/\2/')
# Create a trust relationship between your service account and your
# environment's instance role. Note that this role is mediated by an
# m1db GCP account.
gcloud iam service-accounts add-iam-policy-binding $SA_EMAIL \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$M1_INSTANCE_PROFILE"
```
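The long `--member` value above is mechanical: a fixed prefix identifying m1db's workload identity pool, plus your environment's instance profile role. A sketch of how it is assembled; the role name here is a placeholder:

```shell
# Assemble the --member principal for the workload identity binding.
# Project number 980494489932 and AWS account 248899197673 are the fixed
# m1db values from the command above; the role name is a placeholder.
M1_INSTANCE_PROFILE="yourInstanceProfileRole"
POOL="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool"
MEMBER="$POOL/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$M1_INSTANCE_PROFILE"
echo "$MEMBER"
```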

3. Set the gcs-service-account system property to the email address of your service account.

```bash
# Set the gcs-service-account system property to point to your service account
m1 <env> system -gcs-service-account $SA_EMAIL
```

4. Enable outbound connectivity for your environment

```bash
# Enable outbound connectivity for your environment
m1 ops env/outbound -env <env> -enable true
```

5. Validate your GCS bucket access by calling /publish with one of your files and rows=0. If you get an access error, recheck your configuration steps. Note that it may take a few minutes for your configuration to take effect.

6. You can now call /publish on all the files you wish to load into your environment.

/modify

Modify the data lake with inserts, updates, or deletions. This method is a generalization of /insert, /update, and /delete. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
| delete | String[] | No | Raw ids to delete. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/modify \
  -d 'e=[{"score":"199","downs":"10","author":"Alice"}]&delete=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 modify -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]' -delete '["82270000"]'
```

200 OK: Retrieved records
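Since /modify generalizes /insert and /delete, one request can carry both an entities array and a deletion list. A hypothetical payload builder, useful when scripting bulk changes:

```shell
# Hypothetical helper: build a /modify body carrying both inserts/updates
# (e=) and deletions (delete=) in a single request.
make_modify_body() {
  printf 'e=%s&delete=%s' "$1" "$2"
}

make_modify_body '[{"score":"199","downs":"10","author":"Alice"}]' '["82270000"]'
```

The resulting string is what the curl example above passes via `-d`.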

/insert

Insert raw entities into the lake.

  • rights: publish
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/insert \
  -d 'e=[{"score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 insert -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]'
```

200 OK: Retrieved records

/update

Update documents in the lake by passing in entities associated with their _m1key. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: delete
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| e | String[] | No | Raw parameter entities. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/update \
  -d 'e=[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 update -e '[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]'
```

200 OK: Retrieved records
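The only difference from /insert is that each entity must carry the `_m1key` of the document it replaces. A hypothetical helper making that explicit:

```shell
# Hypothetical helper: wrap updated fields with the _m1key that targets the
# existing document; without _m1key the server cannot match the record.
make_update_entity() {
  printf '[{"_m1key":"%s","score":"%s"}]' "$1" "$2"
}

make_update_entity 82270000 199
```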

/delete

Delete records from the data lake for the specified list of _m1key ids.

  • rights: delete
  • verbs: POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| ids | String[] | No | List of m1key ids to be deleted. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/delete \
  -d 'ids=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 delete -ids '["82270000"]'
```

200 OK: Retrieved records

/next

Retrieve the next _m1key that will be assigned to a document added to the lake (via /publish, for example).

  • rights: admin, publish
  • verbs: GET
  • parameters: none
```bash
curl https://test-m1.minusonedb.com/next \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 next
```

200 OK: Next available key

/get

Retrieve any number of rows from the data lake via the _m1key property.

  • rights: get
  • verbs: GET, POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| ids | long[] | Yes | IDs of records to be retrieved. |
| properties | Array | No | List of properties from the schema to include in records. If null, all columns are returned. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/get \
  -d 'ids=[10000,20000,30000]&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 get -ids "[10000,20000,30000]" -properties '["_m1key","session.id"]'
```

200 OK: Retrieved records

/range

Retrieve all rows from the data lake with _m1key values between start (inclusive) and end (exclusive).

  • rights: get
  • verbs: GET, POST
| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| start | long | Yes | Inclusive. |
| end | long | Yes | Exclusive. |
| properties | Array | No | List of properties from the schema to include in records. If null, all columns are returned. |
  • returns: JSON [{},...]
```bash
curl https://test-m1.minusonedb.com/range \
  -d 'start=10000&end=30000&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```bash
m1 test-m1 range -start 10000 -end 30000
```

200 OK: Records in range
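The inclusive-start / exclusive-end convention makes paging straightforward: consecutive windows of the form [s, s+step) never overlap and never skip a key. A sketch of walking a key range in fixed-size windows; the bounds and step are arbitrary, and the actual curl call is left as a comment:

```shell
# Walk a _m1key range in fixed-size windows. start is inclusive and end is
# exclusive, so adjacent windows share no keys.
START=10000; END=30000; STEP=5000
s=$START
while [ "$s" -lt "$END" ]; do
  e=$((s + STEP))
  [ "$e" -gt "$END" ] && e=$END
  echo "fetch window: start=$s end=$e"
  # curl https://test-m1.minusonedb.com/range -d "start=$s&end=$e" -H "m1-auth-token: $myToken"
  s=$e
done
```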

© 2021-2026 MinusOne, Inc.