Is it possible to run a bq command which queries a dataset in project X and stores the result in another dataset in project Y:
bq query --use_legacy_sql=false --destination_table=project_Y:dataset_1.table_1 'SELECT * FROM `project_X.dataset2.table_2`'
What about the credentials now that I have two projects involved?
I have only set up service account credentials for project_X using gcloud.
Yes. Queries across multiple projects are supported.
The user (or service account) issuing the query will need to have the appropriate permissions on each project (and/or dataset).
You can read more about BigQuery permissions here:
https://cloud.google.com/bigquery/docs/access-control
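For illustration, here is a minimal sketch of the same cross-project query using the google-cloud-bigquery Python client (the client library is my assumption; the project, dataset and table names are taken from the question). The credentials the client picks up must be able to create jobs in project_Y and read the source dataset in project_X:
from google.cloud import bigquery

# The query job runs (and is billed) in project_Y.
client = bigquery.Client(project='project_Y')
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string('project_Y.dataset_1.table_1'),
    write_disposition='WRITE_TRUNCATE',
)
query = 'SELECT * FROM `project_X.dataset2.table_2`'
client.query(query, job_config=job_config).result()  # wait for the job to finish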
I need to list Tabular SSAS (compatibility level 1500) tables from PowerShell.
The Invoke-ASCmd cmdlet in the SQL Server PowerShell module looks promising, but I'm a bit lost in the documentation.
I can see that the following query from examples lists datasources of a tabular instance:
Invoke-ASCmd -Database:"Adventure Works DW 2008R2" -Query:"<Discover xmlns='urn:schemas-microsoft-com:xml-analysis'>
<RequestType>DISCOVER_DATASOURCES</RequestType>
<Restrictions></Restrictions><Properties></Properties>
</Discover>"
It looks like the RequestType parameter is what I'm after; I couldn't find any documentation on it, so I tried guessing DISCOVER_TABLES, LIST_TABLES and TABLES, all of which were rejected.
TMSL (which is what compatibility level 1500 supports, according to this link) has commands for altering and deleting tables, but I cannot find anything on querying or listing them.
Dynamic Management Views sound like a possible solution however I cannot figure out the syntax.
From "Script Administrative Tasks in Analysis Services":
You can create a standalone MDX script file that queries data or system information. For example, Dynamic Management Views (DMV) that expose information about local server operations and server health are accessed via the MDX Select statement.
I found this discussion and tried
Invoke-ASCmd -Server "localhost" -Database:"database" -Query:"SELECT * FROM DBSCHEMA_TABLES"
but I am getting an error:
-1055522771 "Either the user X does not have permission to access the referenced mining model, DBSCHEMA_TABLES, or the object does not exist."
I use this to show all the tables in a tabular model database:
<Discover xmlns='urn:schemas-microsoft-com:xml-analysis'>
<RequestType>TMSCHEMA_TABLES</RequestType>
<Restrictions>
<RestrictionList>
<SystemFlags>0</SystemFlags>
</RestrictionList>
</Restrictions>
<Properties>
<PropertyList>
<CATALOG>YOUR_TABULAR_MODEL_DATABASE_NAME</CATALOG>
</PropertyList>
</Properties>
</Discover>
Hope this helps. For full reference see here or here.
I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing I do is read a table from my Redshift cluster into a Spark DataFrame using the com.databricks.spark.redshift format, using the same IAM role to unload the data from Redshift as I did for the EMR JobFlowRole.
As far as I understand, this runs an UNLOAD command on Redshift to dump the data into the S3 bucket I specify. Spark then loads the newly unloaded data into a DataFrame. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
I then run some transformations and attempt to save the DataFrame in CSV format to the same S3 bucket Redshift unloaded into.
However, when I try to do this, it throws the following error
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I used DefaultAWSCredentialsProviderChain to load the AWS access key ID and secret access key, and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
1. The EC2 instances generated temporary credentials during the EMR bootstrap phase.
2. When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role, which invalidated the credentials the EC2 instance had generated.
3. I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata).
4. It failed because it was trying to use out-of-date credentials.
The solution was to remove the Redshift authorization via aws_iam_role and replace it with the following:
import com.amazonaws.util.EC2MetadataUtils

// IAM_ROLE is the name of the instance-profile role attached to the EMR EC2 nodes
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
It's a long story.
Is it possible to share information (such as credentials) across multiple notebooks in a DSX project, e.g. with environment variables?
For example, a Cloud Foundry application in Bluemix has a control setting where environment variables can be defined; is there a similar concept for a DSX project? (I couldn't see anything in the various project-level settings.)
Separate notebooks have separate runtimes in the background, and at the moment it is not possible to share credentials among notebooks by defining environment variables. But there are helper methods for the most common credential needs in a project; this is called the "Insert to code" method.
For example, if you have an object store associated with your project:
1. Select the "Data" tab in the top bar.
2. Add a file to the object store by browsing or by simple drag-and-drop.
3. Insert the credentials of that object store container into your notebook by selecting the "Insert credentials" option, right beside your file in the right-hand panel.
You can then directly insert those credentials (step 3) in any other notebook in that project.
Besides "Insert to code" there are other helper functions like "Insert SparkR dataframe", "Pandas dataframe" etc. to speed up the analytics process of data scientists. Hope that was a bit helpful.
FYI - I've added a feature request on UserVoice to allow Bluemix services to be bound to a project so that their credentials can be accessed in the same way a Bluemix application accesses credentials. Please vote if you think this would be useful.
Currently, one pattern I use quite a lot is to create a notebook in my project that is used to save credentials to a file on DSX:
! echo '{ "username": "xxxx", "password": "xxxx", ... }' > cloudant_creds.json
That file is now available to all of the notebooks in the project. NOTE: the file is saved on the Spark service file system. If you use the same Spark service in other DSX projects, they will also be able to access the file.
The credentials for Cloudant normally include other fields, such as host; I haven't shown those fields here to keep the example simple, and have indicated that there are more fields with the "...". I normally copy this JSON from the Bluemix service credentials field.
In your other notebooks, you would read the credentials something like this:
import json

with open('cloudant_creds.json') as data_file:
    sourceDB = json.load(data_file)
You can then refer to the credentials like this:
dfReader = sqlContext.read.format("com.cloudant.spark")
dfReader.option("cloudant.host", sourceDB['host'])
if sourceDB.get('username'):
    dfReader.option("cloudant.username", sourceDB['username'])
if sourceDB.get('password'):
    dfReader.option("cloudant.password", sourceDB['password'])
df = dfReader.load(sourceDB['database']).cache()
I want to run some pyspark jobs using Google Dataproc with different project IDs, without success so far. I'm a newbie with pyspark and Google Cloud, but I've followed this example and it runs well (if the BigQuery dataset is either public or belongs to my GCP project, which is ProjectA). The input parameters look like this:
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
projectA = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectA',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'my_table',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
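For context, each element returned by JsonTextBigQueryInputFormat is a (key, JSON string) pair, so a typical next step (this snippet is my own illustration, not part of the original question) is to parse the value side:
import json

# table_data is the RDD of (row key, JSON string) pairs loaded above.
table_rows = table_data.map(lambda record: json.loads(record[1]))
print(table_rows.first())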
But what I need is to run a job against a BigQuery dataset in ProjectB (I have credentials to query it), so I set the input parameters like this:
conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectB',
    'mapred.bq.input.dataset.id': 'the_datasetB',
    'mapred.bq.input.table.id': 'the_tableB',
}
But when I try to load data in from BigQuery, the script keeps running indefinitely. How should I set it up properly?
FYI, after running the example I mentioned before, I can see that two folders (shard-0 and shard-1) are created in Google Cloud Storage and contain the corresponding BigQuery data, but with my job only shard-0 is created, and it's empty.
I talked to my co-worker Dennis and here is his suggestion:
"Hmm, not sure, it should work. They might want to test with "bq" CLI inside the master node to manually try some "bq extract" job of the projectB table into their GCS bucket since that's all the connector is doing under the hood.
If I had to guess I'd suspect they only meant their personal username has the credentials to query projectB, but the default service account of projectA might not have the query permissions. Everything inside Dataproc VMs acts on behalf of the compute service account assigned to the VMs, not the end-user.
They can
gcloud compute instances describe -m
and somewhere in there it lists the service-account email address."
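One way to test the second hypothesis from inside the cluster (a sketch of my own, assuming the google-cloud-bigquery Python client library is installed on the master node; the project, dataset and table names are reused from the question) is to check whether the VM's service account can read the ProjectB table at all:
from google.cloud import bigquery

# This runs as whatever service account the Dataproc VMs use; a 403 here means
# that service account lacks read access on the ProjectB dataset.
client = bigquery.Client(project='projectA')
table = client.get_table('projectB.the_datasetB.the_tableB')
print(table.num_rows)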
I have created an SQLDB service instance and bound it to my application. I have created some tables and need to load data into them. If I write an INSERT statement in Run DDL, I receive a SQL -104 error. How can I run INSERT statements against my SQLDB service instance?
If you need to run your SQL from an application, there are several examples (sample code included) of how to accomplish this at the site listed below:
http://www.ng.bluemix.net/docs/services/SQLDB/index.html#run-a-query-in-java
Additionally, you can execute SQL in the SQL Database Console by navigating to Manage -> Work with Database Objects. More information can be found here:
http://www.ng.bluemix.net/docs/services/SQLDB/index.html#sqldb_005
s.executeUpdate("CREATE TABLE MYLIBRARY.MYTABLE (NAME VARCHAR(20), ID INTEGER)");
s.executeUpdate("INSERT INTO MYLIBRARY.MYTABLE (NAME, ID) VALUES ('BlueMix', 123)");
Full Code
Most people do their initial database population or migrations when they deploy their application. Often these database commands are programming-language specific. The poster didn't include the programming language, but you can accomplish this in two ways:
1. Append a bash script that calls the database scripts you uploaded. This project shows how you can call that bash script from within your manifest file as part of doing a cf push.
2. Some languages offer a file type or service that will automatically be used to populate the database on the initial deploy or when you migrate/sync the db. For example, Python's Django offers "fixtures" files that will automatically take a JSON file and populate your database tables.
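As a rough illustration of the second option (assuming a Django app; the fixture file name initial_data.json and the deploy-time hook are placeholders of mine, not from the original answer), the fixture can be loaded programmatically:
import django
from django.core.management import call_command

# Assumes DJANGO_SETTINGS_MODULE is set for the deployed app.
django.setup()
# Load the JSON fixture into the tables it describes (equivalent to "manage.py loaddata").
call_command("loaddata", "initial_data.json")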