I am playing around with Apache Spark with the Azure CosmosDB connectors in Scala and was wondering if anyone had examples or insight on how I would write my DataFrame back to a collection in my CosmosDB. Currently I am able to connect to my one collection and return the data and manipulate it but I want to write the results back to a different collection inside the same database.
I created a writeConfig that contains my EndPoint, MasterKey, Database, and the Collection that I want to write to.
I then tried writing it to the collection using the following line.
This runs fine and does not display any errors but nothing is showing up in my collection.
I went through the documentation I could find at https://github.com/Azure/azure-cosmosdb-spark but did not have much luck with finding any examples of writing data back to the database.
Is there an easier way to write to a DocumentDB/Cosmos DB collection than what I am doing? I am open to any options.
Thanks for any help.
You can save to Cosmos DB directly from a Spark DataFrame just like you had noted. You may not need to use toJSON, for example:
// Import SaveMode so you can Overwrite, Append, ErrorIfExists, Ignore
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
// Create new DataFrame `df` with slightly modified flights information
// i.e. change the delay value to -999
val df = spark.sql("select -999 as delay, distance, origin, date, destination from c limit 5")
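// Save to Cosmos DB (using Append in this case)
// Ensure the baseConfig contains a Read-Write Key
// The key provided in our examples is a Read-Only Key
// .cosmosDB(...) needs the connector implicits in scope:
// import com.microsoft.azure.cosmosdb.spark._ and com.microsoft.azure.cosmosdb.spark.schema._
df.write.mode(SaveMode.Append).cosmosDB(baseConfig)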
As for the documentation, you are correct that the save function should have been better called out. I've created issue #91, "Include in User Guide / sample scripts how to save to Cosmos DB", to address this.
As for the save completing without an error, is your config by any chance using the Read-Only key instead of the Read-Write key? I just created issue #92, "Saving to CosmosDB using read-only key has no error", calling out the same problem.
I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders'). However, I am running into issues with the incremental loads that follow the initial load. In particular, I have not found a way to reliably insert or update records in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
One suggested solution is to create a temporary table and insert from that, but it still results in the above error.
Another post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but when I do this Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.
My company is implementing Azure Data Explorer (ADX) as a backend. They also want to use Databricks for Data Science projects including data exploration. I'm in charge of connecting Databricks to ADX.
I first tried Azure Kusto package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
df = dataframe_from_result_table(RESPONSE.primary_results[0])
Full steps here
Functionally this works well.
But it completely loses the lazy loading feature of both ADX and Databricks/Spark.
I think that is because df is just a Pandas DataFrame. Also, if I convert it to a Hive table, the data is persisted, which we don't want: we need fresh online data, not a local copy.
The next thing I've tried was to have this loaded in a spark dataframe. I've tried the following code (after installing the relevant libraries)
df = spark.read. \
format("com.microsoft.kusto.spark.datasource"). \
option("kustoCluster", KUSTO_URI). \
option("kustoDatabase",KUSTO_DATABASE). \
option("kustoQuery", "some_table_in_adx"). \
option("kustoAadAppId",CLIENT_ID). \
option("kustoAadAppSecret",CLIENT_SECRET). \
option("kustoAadAuthorityID", AAD_TENANT_ID).load()
which again loads the data into the spark dataframe without any issue.
However, performance-wise it is far from the direct query in ADX. A count on a table of 600 thousand records takes under a second in ADX, while it takes more than 20 seconds in the Databricks notebook on a DS3_V2 (14 GB, 4 cores).
Before even considering saveAsTable or createOrReplaceTempView, I wonder why I'm experiencing this performance issue. So my questions are:
Does this connection use lazy loading (I know doing a count is not the right way to check that)?
If not, is there any way to have lazy loading instead of loading the full table into DataFrames before doing operations?
What would happen if I create a Hive table from the Spark DataFrame: will it copy the data, or will it still be a virtual table pointing to ADX?
Thanks for your help
For the Python part -
A Pandas DataFrame is not a Spark DataFrame, so it is not lazily computed. To use the two together, you may use Spark parallelize.
For the Spark ADX connector -
This is indeed lazy loading. Nothing is evaluated until an action is requested, such as the count.
If the count is done through Spark syntax, i.e. spark.read.kusto...count(), all the data is first brought into Spark and the count then runs on it, so 20 seconds sounds legitimate. To compare with a direct query, simply change the value of "kustoQuery" to "some_table_in_adx | count", which runs the count on the ADX side and sends only the final integer result to Spark.
The connector reads data either with a simple Kusto query or with a distributed export command. Setting the readMode option to ForceSingleMode performs a simple Kusto query from the driver node, as explained in the docs.
As Hive tables operate over DataFrames, once you create a DataFrame from a Kusto read, table operations on it are lazily evaluated as well.
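To make the count comparison concrete, here is a minimal Python sketch that builds two sets of reader options: one that pulls the whole table into Spark, and one that appends `| count` so ADX does the counting. The option names are the ones used in the question; the helper names and placeholder values are made up for illustration.

```python
# Option names come from the question; all values below are placeholders.
def kusto_read_options(cluster, database, query, app_id, app_secret, tenant_id):
    """Collect the connector options used in the question into one dict."""
    return {
        "kustoCluster": cluster,
        "kustoDatabase": database,
        "kustoQuery": query,
        "kustoAadAppId": app_id,
        "kustoAadAppSecret": app_secret,
        "kustoAadAuthorityID": tenant_id,
    }


def with_count(table_or_query):
    """Append the Kusto `count` operator so the count runs on the ADX side."""
    return f"{table_or_query} | count"


# Reads every row into Spark; a later .count() runs on the cluster.
full_scan = kusto_read_options(
    "<cluster>", "<database>", "some_table_in_adx", "<app-id>", "<secret>", "<tenant>"
)
# Same options, but ADX returns only the single count value.
pushed_down = dict(full_scan, kustoQuery=with_count("some_table_in_adx"))
print(pushed_down["kustoQuery"])
```

In a notebook, either dict could be passed to the reader with spark.read.format("com.microsoft.kusto.spark.datasource").options(**pushed_down).load(), which keeps the two variants easy to time against each other.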
I'm trying to copy a SQL table from one database to another database in another server using Databricks. I have heard that one method of doing this is by using pyodbc because I need to read the data from a stored procedure; JDBC does not support reading from stored procedures. I want to use code similar to the one below:
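import pyodbc

# Connection details other than the driver are placeholders
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<server>;DATABASE=<database>;UID=<user>;PWD=<password>'
)
conn.autocommit = True
# Example getting records back from stored procedure (could also be a SELECT statement)
cursor = conn.cursor()
execsp = "EXEC GetConfig 'Dev'"
cursor.execute(execsp)
# Get all records
rc = cursor.fetchall()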
The question is, once I get the data into the rc variable using pyodbc, should I bother moving the data into a Databricks Dataframe, or should I just push the data out to my destination?
You may not need to convert the data into a DataFrame; you can simply write it to the destination. But it really depends on the amount of data you're trying to push: if it's a lot, creating the DataFrame may help because it will parallelize the writing (although that may overload the server). If it's not much data, just write it to the destination.
Also, in this case your worker nodes aren't really used, because all processing happens on the driver node. You may consider using a so-called Single Node cluster, but you will need to size the driver node according to your result set.
P.S. You can also look to the alternatives listed in this answer.
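If you do write straight to the destination, fetching and inserting in batches keeps driver memory bounded instead of holding the whole result of fetchall(). The sketch below is only an illustration: it uses sqlite3 as a stand-in for both servers so it runs anywhere, and the `config` table, its columns, and `copy_in_batches` are hypothetical. pyodbc cursors expose the same fetchmany/executemany methods, but the behavior against SQL Server is not tested here.

```python
import sqlite3


def copy_in_batches(src_cursor, dst_conn, insert_sql, batch_size=1000):
    """Stream rows from an already-executed source cursor into the destination."""
    dst_cursor = dst_conn.cursor()
    total = 0
    while True:
        rows = src_cursor.fetchmany(batch_size)
        if not rows:
            break
        # Convert to plain tuples so the rows are valid parameters for any driver.
        dst_cursor.executemany(insert_sql, [tuple(r) for r in rows])
        dst_conn.commit()
        total += len(rows)
    return total


# Stand-in databases; with pyodbc these would be two pyodbc.connect(...) calls.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE config (name TEXT, value TEXT)")
src.executemany(
    "INSERT INTO config VALUES (?, ?)", [(f"k{i}", str(i)) for i in range(2500)]
)
dst.execute("CREATE TABLE config (name TEXT, value TEXT)")

cur = src.cursor()
cur.execute("SELECT name, value FROM config")  # the EXEC statement would go here
copied = copy_in_batches(cur, dst, "INSERT INTO config VALUES (?, ?)")
print(copied)
```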
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
What we've tried/referenced so far:
Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load it into a single table.
Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appear to help with handling the hundreds of tables created by the Glue Crawler.
How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?
I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.
I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]
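import json
from awsglue.transforms import Map

# Your mapping function: serialize every top-level value as a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

# The connection type and format are assumed here; adjust them to your data
old_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ['s3://...']},
    format="json",
)
# Apply mapping function f to all DynamicRecords in DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)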
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.
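To see what the mapping function does to a single record without running a Glue job, here is a plain-Python sketch of the same idea. The sample record is hypothetical and heavily shortened. Note that json.dumps also encodes scalar values, so a plain string such as EventName ends up wrapped in quotes; you may prefer to skip non-nested values.

```python
import json


def stringify_top_level(record):
    """Return a copy of `record` whose top-level values are JSON strings."""
    return {key: json.dumps(value) for key, value in record.items()}


# Hypothetical DynamoDB Streams-style record, shortened for illustration.
sample = {
    "Keys": {"id": {"S": "42"}},
    "NewImage": {"id": {"S": "42"}, "tags": {"L": [{"S": "a"}]}},
    "EventName": "MODIFY",
}

flat = stringify_top_level(sample)
print(flat["EventName"])  # a JSON string, so the quotes are part of the value
print(json.loads(flat["NewImage"])["tags"])  # nested data survives a round trip
```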
This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but might suit your needs. You can define a JSON path for a field or something like that.
Other than that, Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run into various roadblocks over the last three weeks and decided to completely forgo any and all Glue functionality and use only Spark; that way it's both portable and actually works.
One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".
You should add a Glue classifier, preferably $[*].
When you crawl the JSON file in S3, it will read the first line of the file.
You can create a Glue job to load the Data Catalog table for this JSON file into Redshift.
My only problem here is that Redshift Spectrum has problems reading JSON tables in the Data Catalog.
Let me know if you have found a solution.
The procedure I found useful for flattening nested JSON one level at a time:
1. ApplyMapping for the first level as datasource0;
2. Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode;
3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm);
4. Transform df1 back to a DynamicFrame and Relationalize it, dropping the intact columns with dataframe.drop_fields(itct1, itct2, ..., itctm);
5. Join the relationalized table with the intact table on 'id'.
As of 12/20/2018, I was able to manually define a table with first level json fields as columns with type STRING. Then in the glue script the dynamicframe has the column as a string. From there, you can do an Unbox operation of type json on the fields. This will json parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous json schemas from the same input if you can loop through a list of schemas.
However, one word of caution, this is incredibly slow. I think that glue is downloading the source files from s3 during each iteration of the loop. I've been trying to find a way to persist the initial source data but it looks like .toDF derives the schema of the string json fields even if you specify them as glue StringType. I'll add a comment here if I can figure out a solution with better performance.
We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
1. AWS Redshift
2. Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
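As an illustration of the delta selection, here is a small Python sketch (our migrator is written in NodeJS, so this is not the actual code). It builds the query filter a plugin could hand to MongoDB; the field name `updated_at` and the function name are assumptions made for the example.

```python
from datetime import datetime, timezone


def delta_filter(last_migrated, timestamp_field="updated_at"):
    """Build a MongoDB query filter for documents changed since `last_migrated`."""
    if last_migrated is None:
        return {}  # first run: no watermark yet, so migrate everything
    return {timestamp_field: {"$gt": last_migrated}}


last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(delta_filter(None))
print(delta_filter(last_run))
```

With a driver such as pymongo, the result would be passed to collection.find(...), and the index on the timestamp field is what keeps that query cheap.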
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
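A rough Python equivalent of this export step is sketched below, under the assumption that each plugin supplies a transform returning a list of column values. The document fields and function names are invented for the example. The csv module handles quoting if a value ever does contain a tab.

```python
import csv
import io


def write_delimited(documents, transform, out):
    """Write one tab-delimited line per document and return the line count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for doc in documents:
        writer.writerow(transform(doc))
        count += 1
    return count


def employee_to_row(doc):
    """Hypothetical plugin transform: document -> ordered column values."""
    return [doc["_id"], doc["name"], doc.get("is_deleted", False)]


docs = [
    {"_id": "a1", "name": "Smith, Jo | Ops"},  # commas and pipes are harmless here
    {"_id": "b2", "name": "Lee", "is_deleted": True},
]
buffer = io.StringIO()  # stands in for the export file on disk
written = write_delimited(docs, employee_to_row, buffer)
print(buffer.getvalue())
```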
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table, and its records get their ids from the originating MongoDB collection. This allows us to run a delete against the target table (the employees table in our example) for every id that is present in the temp table. If a table doesn't exist, it is created on the fly, based on a schema provided by the plugin. We then insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so a deleted record is simply updated with an is_deleted flag in Redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
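The delete-then-insert step can be sketched as follows. This is only an illustration of the pattern: it runs against sqlite3 so it is self-contained, the column layout is made up, and in Redshift the same two statements would run after the COPY into the temp table.

```python
import sqlite3


def merge_from_staging(conn, target, staging, key="id"):
    """Delete target rows whose key is in staging, then insert all staging rows."""
    # Table names come from trusted plugin configuration, not from user input.
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            f"DELETE FROM {target} WHERE {key} IN (SELECT {key} FROM {staging})"
        )
        conn.execute(f"INSERT INTO {target} SELECT * FROM {staging}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id TEXT, name TEXT, is_deleted INTEGER)")
conn.execute("CREATE TABLE temp_employees (id TEXT, name TEXT, is_deleted INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)", [("a1", "Jo", 0), ("b2", "Lee", 0)]
)
# b2 was soft-deleted at the source and c3 is new since the last migration.
conn.executemany(
    "INSERT INTO temp_employees VALUES (?, ?, ?)", [("b2", "Lee", 1), ("c3", "Kim", 0)]
)

merge_from_staging(conn, "employees", "temp_employees")
rows = conn.execute("SELECT id, name, is_deleted FROM employees ORDER BY id").fetchall()
print(rows)
```

Running both statements in one transaction means readers never see the target table with the old rows deleted but the new ones not yet inserted.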
So in summary, each plugin/migration provides the following to the engine:
- A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved
- A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
- A schema file, which is a SQL file containing the schema for the table in Redshift
Redshift is a data warehousing product and MongoDB is a NoSQL database. Clearly, they are not replacements for each other; they can coexist and serve different purposes. The question is how to save and update records in both places.
You can move all MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, you should modify the program that writes into MongoDB.
Let that program also write to S3 locations. The movement from S3 to Redshift can then be done at regular intervals.
MongoDB being a document storage engine, Apache Solr and Elasticsearch can be considered as possible replacements, but they do not support SQL-type querying capabilities. They basically use a different filtering mechanism. For example, with Solr you might need to use the Dismax Filter.
In the cloud, Amazon CloudSearch and Azure Search would be compelling options to try as well.
You can now use AWS DMS to migrate data to Redshift easily; it can also replicate ongoing changes in near real time.