How to Delete a Cosmos DB Vertex Using PySpark

We can read and write data in Cosmos DB using PySpark as follows:
cfg = {
    "spark.cosmos.accountEndpoint": "xx:443/",
    "spark.cosmos.accountKey": "xx==",
    "spark.cosmos.database": "graphdb",
    "spark.cosmos.container": "graph",
    "spark.cosmos.read.customQuery": "SELECT * FROM c WHERE c.label = 'Email'"
}
cosmosDbFormat = "cosmos.oltp"
df = spark.read.format(cosmosDbFormat).options(**cfg).load()
Similarly, is there a way to delete a vertex using PySpark?
Please note that the vertex won't be overwritten, because it is created using a UUID. So a vertex that has been deleted in the original table never gets removed.

To my knowledge, it is not possible to delete a vertex in Cosmos DB using the PySpark connector. You can use the Cosmos Spark connector to read from or write to Cosmos DB. To remove vertices, you can use a Gremlin .drop() step, for example from the Cosmos DB Data Explorer.
See: Azure Cosmos DB Gremlin graph support and compatibility with TinkerPop features
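Since the Spark connector can only read and write, one workaround is to send the Gremlin drop step directly from Python with the gremlinpython client. A minimal sketch, reusing the graphdb database and graph container from the question; the endpoint, key, and label filter are placeholders, not values from the original post:
from gremlin_python.driver import client, serializer

# Placeholder endpoint and key; use your account's Gremlin endpoint, not the SQL endpoint.
gremlin_client = client.Client(
    "wss://<your-account>.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/graphdb/colls/graph",
    password="<your-account-key>",
    message_serializer=serializer.GraphSONSerializersV2d0()
)

# Drop every vertex with label 'Email' (the same set the custom query above reads).
callback = gremlin_client.submitAsync("g.V().hasLabel('Email').drop()")
callback.result().all().result()
gremlin_client.close()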

Related

Cloning Power BI DirectQuery into a regular query

I have a Power BI dataset that someone shared with me. I would like to import it into Power BI Desktop and transform its data.
I used DirectQuery to import the dataset and I managed to create a calculated table:
My_V_Products = CALCULATETABLE(V_Products)
However, when I try using Transform data, I do not see this table. I guess this is because it is not actually a table created from a query but from a DAX expression.
Is there a way to import the entire table using a query, or to convert the data into transformable data?
Only if the dataset is on a Premium capacity. If it is, you can connect to the workspace's XMLA endpoint using the Analysis Services connector and create an Import table with a custom DAX query, such as evaluate V_Products.

Azure Data Lake and exporting SQL query results with PySpark

I am looking to use the databases I have stored in Azure Data Lake. I can run a SQL query in a notebook, for example (with PySpark set as the language):
%%sql
SELECT *
FROM db1.table1
All I am trying to do now is add another notebook / line of code that exports the results of the above SELECT statement as a data frame (and subsequently a CSV).
df = spark.sql("""SELECT * FROM db1.table1""")
df.coalesce(1).write.csv("path/df1.csv")
I would suggest creating a global temp view. You can then use this view in any notebook you want, as long as your cluster is not terminated.
Having said that, you could create the global temp view as below:
df.createOrReplaceGlobalTempView("temp_view")
Please follow the official Databricks documentation below:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-view.html
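Putting the pieces together, a minimal sketch; the table name and output path are the placeholders from the question:
# Notebook 1: run the query and register the result as a global temp view
df = spark.sql("SELECT * FROM db1.table1")
df.createOrReplaceGlobalTempView("temp_view")

# Notebook 2 (same cluster): global temp views live in the global_temp database
df2 = spark.sql("SELECT * FROM global_temp.temp_view")
df2.coalesce(1).write.option("header", "true").csv("path/df1.csv")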

How can AWS Glue (Scala) save a DynamicFrame to an Athena table quickly?

I have to save AWS Glue data to an Athena table in as little time as possible.
I have successfully saved AWS Glue data, i.e. a DynamicFrame, to an Athena table, but for a 17 GB table it takes around 19-20 minutes. With 100 DPUs in use, I think that is too long. Currently I am using this method:
getCatalogSink(database: String, tableName: String, redshiftTmpDir: String = "", transformationContext: String = ""): DataSink
Is there any way to speed up the process, or is the current time reasonable?
Thanks in advance.
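For reference, a sketch of what the equivalent catalog-sink write looks like in Glue's PySpark API; the database and table names are placeholders, and the question's Scala getCatalogSink call plays the same role:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the source table registered in the Glue Data Catalog.
dyf = glueContext.create_dynamic_frame.from_catalog(database="source_db", table_name="source_table")

# Write the DynamicFrame back out through the catalog sink (the Athena-visible table).
glueContext.write_dynamic_frame.from_catalog(
    frame=dyf,
    database="target_db",
    table_name="target_table",
    transformation_ctx="write_sink"
)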

Spark Athena connector

I need to use Athena in Spark, but Spark uses PreparedStatement when using JDBC drivers, and it gives me this exception:
"com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented"
Can you please let me know how I can connect to Athena from Spark?
I don't know how you'd connect to Athena from Spark, but you don't need to - you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark.
There are two parts to Athena:
the Hive Metastore (now called the Glue Data Catalog), which contains the mappings between database and table names and all the underlying files, and
the Presto query engine, which translates your SQL into data operations against those files.
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
You can then query these tables as normal.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html for more.
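For example, once the cluster is attached to the Glue Data Catalog, a table registered in Athena can be queried like any other Spark SQL table; a minimal sketch with placeholder database and table names:
from pyspark.sql import SparkSession

# On EMR with the Glue Data Catalog enabled, Hive support makes the catalog's
# databases and tables visible to Spark SQL.
spark = SparkSession.builder \
    .appName("query-athena-tables") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("SELECT * FROM my_db.my_table")
df.show()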
You can use this JDBC driver: SimbaAthenaJDBC
<dependency>
    <groupId>com.syncron.amazonaws</groupId>
    <artifactId>simba-athena-jdbc-driver</artifactId>
    <version>2.0.2</version>
</dependency>
To use it:
SparkSession spark = SparkSession
    .builder()
    .appName("My Spark Example")
    .getOrCreate();

// Load the Simba Athena JDBC driver.
Class.forName("com.simba.athena.jdbc.Driver");

Properties connectionProperties = new Properties();
connectionProperties.put("User", "AWSAccessKey");
connectionProperties.put("Password", "AWSSecretAccessKey");
connectionProperties.put("S3OutputLocation", "s3://my-bucket/tmp/");
connectionProperties.put("AwsCredentialsProviderClass",
    "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider");
connectionProperties.put("AwsCredentialsProviderArguments", "/my-folder/.athenaCredentials");
connectionProperties.put("driver", "com.simba.athena.jdbc.Driver");

// Each predicate produces one partition (and one Athena query) when reading.
List<String> predicateList =
    Stream
        .of("id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
        .collect(Collectors.toList());
String[] predicates = new String[predicateList.size()];
predicates = predicateList.toArray(predicates);

Dataset<Row> data =
    spark.read()
        .jdbc("jdbc:awsathena://AwsRegion=us-east-1;",
            "my_env.my_table", predicates, connectionProperties);
You can also use this driver in a Flink application:
TypeInformation[] fieldTypes = new TypeInformation[] {
    BasicTypeInfo.STRING_TYPE_INFO,
    BasicTypeInfo.STRING_TYPE_INFO
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat()
    .setDrivername("com.simba.athena.jdbc.Driver")
    .setDBUrl("jdbc:awsathena://AwsRegion=us-east-1;UID=my_access_key;PWD=my_secret_key;S3OutputLocation=s3://my-bucket/tmp/;")
    .setQuery("select id, val_col from my_env.my_table WHERE id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
    .setRowTypeInfo(rowTypeInfo)
    .finish();

DataSet<Row> dbData = env.createInput(jdbcInputFormat, rowTypeInfo);
You can't directly connect Spark to Athena. Athena is simply an implementation of PrestoDB targeting S3. Unlike Presto, Athena cannot target data on HDFS.
However, if you want to use Spark to query data in S3, then you are in luck with HUE, which will let you query data in S3 from Spark on Elastic MapReduce (EMR).
See Also:
Developer Guide for Hadoop User Experience (HUE) on EMR.
The response from Kirk Broadhurst is correct if you want to use the data that Athena registers.
If you want to use the Athena engine itself, there is a library on GitHub that works around the preparedStatement problem.
Note that I didn't succeed in using the library, due to my lack of experience with Maven, etc.
Actually you can use B2W's Spark Athena Driver.
https://github.com/B2W-BIT/athena-spark-driver

ADF - C# Custom Activity

I have a CSV file as input, which I have stored in Azure Blob Storage. I want to read data from the CSV file, perform some transformations on it, and then store the data in an Azure SQL Database. I am trying to use a C# custom activity in Azure Data Factory with a blob as the input dataset and a SQL table as the output dataset. I am following this tutorial (https://azure.microsoft.com/en-us/documentation/articles/data-factory-use-custom-activities/#see-also), but it has both input and output as blobs. Can I get some sample code with a SQL database as output, as I am unable to figure out how to do it? Thanks.
You just need to fetch the connection string of your Azure SQL database from the linked service, and then you can talk to the database. Try this sample code:
// Resolve the output dataset and its Azure SQL linked service.
AzureSqlDatabaseLinkedService sqlInputLinkedService;
AzureSqlTableDataset sqlInputLocation;

Dataset sqlInputDataset = datasets.Where(dataset => dataset.Name == "<Dataset Name>").First();
sqlInputLocation = sqlInputDataset.Properties.TypeProperties as AzureSqlTableDataset;
sqlInputLinkedService = linkedServices
    .Where(linkedService =>
        linkedService.Name == sqlInputDataset.Properties.LinkedServiceName)
    .First()
    .Properties.TypeProperties as AzureSqlDatabaseLinkedService;

// Open a connection to the database using the linked service's connection string.
SqlConnection connection = new SqlConnection(sqlInputLinkedService.ConnectionString);
connection.Open();