Combine and save two PySpark DataFrames to s3 using specific format - pyspark

I have a very specific problem in AWS Glue. I have two Glue DataFrames that look something like this:
df1 :
name
----
John
Sam
Sarah
Chris
Bob
and df2:
codes
-----
12x
45f
66g
I now want to save this to S3 (using trusty glue_context.write_dynamic_frame.from_options(...)) as a single JSON file in the following format:
{
"names": ["John", "Sam", "Sarah", "Chris", "Bob"],
"properties": {
"values": ["12x", "45f", "70g"]
}
}
I suspect that I need to combine and convert the DataFrames using a series of transformations before saving the result, but I am stuck. I tried using flatMap() in conjunction with collect(), but to no avail. Is this possible?
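A minimal sketch of one possible approach (a workaround rather than a write_dynamic_frame option, and assuming df1 and df2 are, or have been converted to, Spark DataFrames via toDF()): since both frames are tiny, collect them on the driver and write the combined JSON object to S3 with boto3. The bucket and key names below are placeholders.
import json
import boto3

# Pull the two small columns onto the driver
names = [row["name"] for row in df1.collect()]
codes = [row["codes"] for row in df2.collect()]

payload = {"names": names, "properties": {"values": codes}}

# Write a single JSON object to S3 (placeholder bucket/key)
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="output/combined.json",
    Body=json.dumps(payload).encode("utf-8"),
)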

Related

How to create separate CSV for nested JSON array in Az Data Factory

I have a JSON as below. I can flatten this in ADF using Data Flow, but the subject_details array can hold a lot of values, for which I would like to create a separate CSV.
{
"studentid": 99999,
"schoolid": "100574521",
"name": "BLUE LAY",
"set_id": 53,
"subject_details": [
{
"subject_code": "url_key",
"value": "100574521"
},
{
"subject_code": "band",
"value": "29732"
},
{
"subject_code": "description",
"value": "Summer "
},
{
"subject_code": "options_container",
"value": "container2"
},
{
"subject_code": "has_options",
"value": "0"
},
{
"subject_code": "category_ids",
"value": [
"463",
"630"
]
}
]
}
You can create a separate CSV file from the JSON nested array by using a select transformation in Data flows.
After creating the source for the JSON in the Data flow, add a select to the source from the options. In the select settings you will see all the JSON columns; to get only the nested JSON array into the CSV, delete every column apart from subject_details by clicking the delete icon on each column.
Now, add a sink to the select to save this as a CSV file in blob storage.
Create the respective linked service and dataset for the CSV and add them to the sink.
Then, in the sink settings, select Output to single file from the drop-down of the File name option and give the CSV file name under Output to single file.
Execute this data flow from a pipeline and you will see the CSV in the blob.
If you do this before the flatten, the CSV will contain the raw nested array as-is, so the best practice is to create the CSV for the nested JSON array after flattening it.
Please refer to the Microsoft documentation to learn more about the Flatten transformation in Data flow.
After flattening the JSON, you can use the same select option to create the CSV from it by deleting all columns apart from these two, as above. If you want, you can export the data as CSV from the Data preview of the select itself.
Then add the sink and execute the data flow to get the CSV data in the blob.
In your JSON nested array, value has different data types: a string in the first five rows and an array in the last one. It will be difficult to flatten that, so try to keep the same data type for value (for example, an array for all rows) if you face any issues with the flatten.

Same Spark Dataframe created in 2 different ways gets different execution times in same query

I created the same Spark Dataframe in 2 ways in order to run Spark SQL on it.
1. I read the data from a .csv file straight into a Dataframe in Spark shell using the following command:
val df=spark.read.option("header",true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
2. I created a collection in MongoDB from the same .csv file and then, using the Spark-MongoDB Connector, I imported it as an RDD into Spark, which I then turned into a Dataframe using the following commands (in cmd/spark-shell):
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)
val df = rdd.toDF()
After that I created a view of the dataframe in either case using the following command:
df.createOrReplaceTempView("sales")
Then I ran the same queries on either Dataframe and the execution times were very different. In the following example, the 1st way of creating the Dataframe had a 4-5 times faster execution time than the 2nd one.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
The database has 1 million entries and has the following structure:
id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13
In my case I have to get the Dataframe from MongoDB using the connector, so why is this happening?
Spark is optimized to perform better on Dataframes. In your second approach you are first reading an RDD and then converting it to a Dataframe, which definitely has a cost.
Instead, try to read the data from MongoDB directly as a Dataframe. You can refer to the following syntax:
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/mydb.mycoll").load()
The answer is that in the second case, the extra time is needed in order to transfer the data from mongodb to Spark before executing the query.

tMongoDBInput schema not retrieving JSON array properly

I'm trying to retrieve the below Mongo collection data using Talend Open Studio for Big Data 7.2.
{
  "id": "5b69b66d3dae73000fa39440",
  "data": "Testing",
  "products": [{ "orderid": "1234" }],
  "createDate": "2018-08-07T15:10:37.570Z",
  "updateDate": "2018-08-09T16:09:46.621Z"
}
I'm able to get id and data, but when I try to get products I'm unable to get the data. On top of that, I'm getting products as below:
[Document{{"orderid"="1234"}}]
This is blocking me from parsing it as JSON. Can someone help? I think it's a basic mistake, but as I said, I'm new to Talend OS for Big Data.
If anyone has already parsed this, can you please share the schema to be defined for the products array list in Talend, and how you parsed id, data and products?
I tried using extract JSON fields and the Mongo schema from the repository, but still no luck.
It looks like it's a Mongo driver (3.5.x) issue. Here are more details:
https://community.talend.com/t5/Design-and-Development/TmongoDbInput-schema-not-retreiving-properly/m-p/202163#M110456

How to use spark and mongo with play to calculate prediction?

I am using Play, Scala and MongoDB (Salat).
I have the following database structure:
[{
"id":mongoId,
"name":"abc",
"utilization":20,
"timestamp":1416668402352
},
{
"id":mongoId,
"name":"abc",
"utilization":30,
"timestamp":1415684102290
},
{
"id":mongoId,
"name":"abc",
"utilization":90,
"timestamp":1415684402210
},
{
"id":mongoId,
"name":"abc",
"utilization":40,
"timestamp":1415684702188
},
{
"id":mongoId,
"name":"abc",
"utilization":35,
"timestamp":1415684702780
}]
Using the above data, I want to calculate the utilization for the current timestamp (by applying a statistical algorithm).
To calculate it I am using Spark. I have added the following dependencies to the build.sbt of the Play framework.
I have following questions.
1) How can I calculate the current utilization? (using Spark's MLlib)
2) Is it possible to query a Mongo collection to get only some of the fields using Spark?
There is a project named Deep-Spark that takes care of integrating Spark with MongoDB (and other datastores like Cassandra, Aerospike, etc.).
https://github.com/Stratio/deep-spark
You can check how to use it here:
https://github.com/Stratio/deep-spark/blob/master/deep-examples/src/main/java/com/stratio/deep/examples/java/ReadingCellFromMongoDB.java
It is a very simple way to start working with MongoDB and Spark.
Sorry, I cannot help you with MLlib, but surely somebody will add something useful.
Disclaimer: I am currently working at Stratio.
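Since the answer above leaves the MLlib part open, here is a rough sketch (in PySpark rather than the asker's Scala/Salat stack) of one way to fit a simple linear trend of utilization over time and evaluate it at the current timestamp. The Mongo URI, database and collection names are placeholders, and the "mongo" format name assumes the mongo-spark-connector 3.x series.
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Placeholder URI/collection; the short "mongo" format name is a 3.x connector assumption
spark = (SparkSession.builder
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/myDb.utilization")
         .getOrCreate())

# Question 2: read the collection and keep only the fields you need
df = spark.read.format("mongo").load().select("timestamp", "utilization")

# Question 1: fit utilization as a linear function of timestamp
assembler = VectorAssembler(inputCols=["timestamp"], outputCol="features")
train = assembler.transform(df).withColumnRenamed("utilization", "label")
model = LinearRegression().fit(train)

# Evaluate the model at the current timestamp (milliseconds, as in the documents)
now_ms = int(time.time() * 1000)
current = assembler.transform(spark.createDataFrame([(now_ms,)], ["timestamp"]))
model.transform(current).select("prediction").show()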

Use Mongoexport to export a collection in multiple files

I'm trying to export all the data from one of my collections, but the collection exceeds 16 MB.
So when I try to re-import it, Mongo fails since the import limit is 16 MB.
Is there a way to ask for the export in multiple files? I can't find this information in the docs.
Thank you.
Depending on the data in your collection, one possible solution might be to use the --query <JSON>, -q <JSON> flag to create several files (see the mongoexport documentation). For example, if your collection stores college student documents, e.g.:
{ _id: ObjectId("5237258211f41a0c647c47b1"),
name: "Jane Doe",
age: 19,
grade: "sophomore" },
{ _id: ObjectId("5237258211f41a0c647c47b2"),
name: "John Smith",
age: 20,
grade: "junior" },
...
You might, for example, decide to query on grade, running mongoexport four times to create four files (freshman, sophomore, junior, senior). If each file were under 16 MB, this would solve your problem.
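For example (hypothetical database, collection and file names; --db, --collection, --query and --out are standard mongoexport flags):
mongoexport --db school --collection students --query '{"grade": "freshman"}' --out freshman.json
mongoexport --db school --collection students --query '{"grade": "sophomore"}' --out sophomore.json
mongoexport --db school --collection students --query '{"grade": "junior"}' --out junior.json
mongoexport --db school --collection students --query '{"grade": "senior"}' --out senior.json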
If this doesn't answer your question, please provide the commands you're using to import and export. :)