Pyspark with AWS Glue join 1-N relation into a JSON array - pyspark

I don't know how to join 1-N relations on AWS Glue and export a JSON file like:
{"id": 123, "name": "John Doe", "profiles": [ {"id": 1111, "channel": "twitter"}, {"id": 2222, "channel": "twitter"}, {"id": 3333, "channel": "instagram"} ]}
{"id": 345, "name": "Test", "profiles": []}
The profiles JSON array should be built from the other tables, and I would also like to include the channel column.
The 3 tables that I have on AWS Glue data catalog are:
person_json
{"id": 123,"nanme": "John Doe"}
{"id": 345,"nanme": "Test"}
instagram_json
{"id": 3333, "person_id": 123}
{"id": 3333, "person_id": null}
twitter_json
{"id": 1111, "person_id": 123}
{"id": 2222, "person_id": 123}
This is the script I have so far:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# catalog: database and table names
db_name = "test_database"
tbl_person = "person_json"
tbl_instagram = "instagram_json"
tbl_twitter = "twitter_json"
# Create dynamic frames from the source tables
person = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_person)
instagram = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_instagram)
twitter = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_twitter)
# Join the frames
joined_instagram = Join.apply(person, instagram, 'id', 'person_id').drop_fields(['person_id'])
joined_all = Join.apply(joined_instagram, twitter, 'id', 'person_id').drop_fields(['person_id'])
# Writing output to S3
output_s3_path = "s3://xxx/xxx/person.json"
output = joined_all.toDF().repartition(1)
output.write.mode("overwrite").json(output_s3_path)
How should the script be changed in order to achieve the desired output?
Thanks

from pyspark.sql.functions import collect_set, lit, struct
...
# Convert the DynamicFrames to DataFrames and tag each profile row with its channel
person = person.toDF()
instagram = instagram.toDF().withColumn('channel', lit('instagram'))
instagram = instagram.withColumn('profile', struct('id', 'channel'))
twitter = twitter.toDF().withColumn('channel', lit('twitter'))
twitter = twitter.withColumn('profile', struct('id', 'channel'))
# Stack both profile sources and collect one array of profile structs per person
profiles = instagram.union(twitter)
profiles = profiles.groupBy('person_id').agg(collect_set('profile').alias('profiles'))
# Left outer join so that persons without any profile are kept
joined_all = person.join(profiles, person.id == profiles.person_id, 'left_outer').drop('channel', 'person_id')
joined_all.show(n=2, truncate=False)
+---+--------+-----------------------------------------------------+
|id |name |profiles |
+---+--------+-----------------------------------------------------+
|123|John Doe|[[1111, twitter], [2222, twitter], [3333, instagram]]|
|345|Test |null |
+---+--------+-----------------------------------------------------+
.show() doesn't display the field names inside the structs in the profiles column; use collect() to see the full structure:
print(joined_all.collect())
[Row(id=123, name='John Doe', profiles=[Row(id=1111, channel='twitter'), Row(id=2222, channel='twitter'), Row(id=3333, channel='instagram')]), Row(id=345, name='Test', profiles=None)]
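One remaining difference from the desired output: the left join above leaves profiles as null for the person with no profiles, and Spark's JSON writer then omits the field entirely instead of emitting "profiles": []. A hedged sketch of one way around this, reusing the DataFrames built above (all_profiles is a name introduced here for illustration; output_s3_path is the path from the question): aggregate after the left join, since collect_set skips nulls and therefore produces an empty array.
from pyspark.sql.functions import collect_set
# all_profiles: the union of the per-channel DataFrames built above,
# reduced to the columns needed for aggregation
all_profiles = instagram.select('person_id', 'profile').union(twitter.select('person_id', 'profile'))
# Left-join first, then aggregate: collect_set ignores nulls, so a person
# with no matching profile rows gets an empty profiles array instead of null
result = (person.join(all_profiles, person['id'] == all_profiles['person_id'], 'left_outer')
                .groupBy('id', 'name')
                .agg(collect_set('profile').alias('profiles')))
# Write a single JSON-lines file to S3, as in the original script
result.repartition(1).write.mode('overwrite').json(output_s3_path)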

Related

Partition in dataframe pyspark

I have a dataframe:
data = [{"ID": 'asriыjewfsldflsar2','val':5},
{"ID": 'dsgвarwetreg','val':89},
{"ID": 'ewrсt43gdfb','val':36},
{"ID": 'q23м4534tgsdg','val':58},
{"ID": '34tя5erbdsfgv','val':52},
{"ID": '43t2ghnaef','val':123},
{"ID": '436tываhgfbds','val':457},
{"ID": '435t5вч3htrhnbszfdhb','val':54},
{"ID": '35yteвrhfdhbxfdbn','val':1},
{"ID": '345ghаывserh','val':1},
{"ID": 'asrijываewfsldflsar2','val':5},
{"ID": 'dsgarываwetreg','val':89},
{"ID": 'ewrt433gdfb','val':36},
{"ID": 'q2345выа34tgsdg','val':58},
{"ID": '34t5eоrbdsfgv','val':52},
{"ID": '43tghолnaef','val':123},
{"ID": '436thапgfbds','val':457},
{"ID": '435t5укн3htrhnbszfdhb','val':54},
{"ID": '35ytк3erhfdhbxfdbn','val':1},
{"ID": '345g244hserh','val':1}
]
df = spark.createDataFrame(data)
I want to split the rows into 4 groups. I used to be able to do this with row_number():
.withColumn('part', F.row_number().over(Window.orderBy(F.lit(1))) % n)
But unfortunately this method does not suit me, because I have a large dataframe that will not fit into memory. I tried to use the hash function, but I think I'm doing it wrong:
df2 = df.withColumn('hashed_name', (F.hash('ID') % N))\
.withColumn('divide',F.floor(F.col('hashed_name')/13))\
.sort('divide')
Is there another way to split the rows besides row_number()?
You can use partitionBy() when saving the dataframe in delta format:
df.write.format("delta").mode("overwrite").partitionBy("ColumnName").save("path_to_save_the_dataframe")
Hope this helps!
Hi, you can use coalesce() to force an exact number of partitions, and later use the partition number in future queries:
df1 = df.coalesce(4)
df1.createOrReplaceTempView('df')
espsql = "select x.*, spark_partition_id() as part from df x"
df_new = spark.sql(espsql)
df_new.createOrReplaceTempView('df_new')
newsql = "select distinct part from df_new"
spark.sql(newsql).take(5)
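Coming back to the original question of splitting the rows into a fixed number of groups without a window sort: hashing the ID, as attempted in the question, should be enough once the sign is handled, because hash() can return negative values. A minimal sketch, assuming 4 groups:
from pyspark.sql import functions as F
N = 4  # number of groups
# hash() can return negative values, so take abs() before the modulo
# to get bucket ids in the range 0..N-1
df_bucketed = df.withColumn('part', F.abs(F.hash('ID')) % N)
# sanity check: rows per bucket (sizes are only approximately equal,
# unlike the exact split row_number() would give)
df_bucketed.groupBy('part').count().show()
Unlike the row_number() approach, this does not force all rows through the single partition that Window.orderBy(F.lit(1)) requires, so it should scale to dataframes that do not fit in memory.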

postgres 11.6 - Creating array of JSON Objects from JSON array

I have the following schema here: http://sqlfiddle.com/#!17/5c73a/1
I want to create a query where the results will be something like this:
id | tags
---+------------------------------------------------------------------------------------------------
1  | [{"id": "id", "title": "first"}, {"id": "id", "title": "second"}, {"id": "id", "title": "third"}]
2  | [{"id": "id", "title": "fourth"}, {"id": "id", "title": "fifth"}, {"id": "id", "title": "sixth"}]
The idea is to build an array with one object per element of the tags array; the important part is the title value.
You need to unnest the array and then aggregate it back:
select t.id, jsonb_agg(jsonb_build_object('id', 'id', 'title', tg.title))
from things t
cross join jsonb_array_elements(tags) as tg(title)
group by t.id;

Convert nested json to dataframe in scala spark

I want to create a dataframe out of a JSON document, but only for one given key. Its value is a list of nested JSON objects. I tried flattening, but I think there should be a simpler workaround since I only need one key of the JSON to convert into a dataframe.
I have json like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:
val json = Json.parse("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me this error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
The result I am expecting:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+
You need to explode your array like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.json(
spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)
df.select(explode($"metadata").as("metadata"))
.select("metadata.*")
.show(false)
Output :
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+

Dataframe columns do not keep their order and columns with null values are excluded while writing to a Cosmos DB collection

I tried to copy data into a Cosmos DB collection from a dataframe in Spark.
The data is written into Cosmos DB, but with two issues:
The order of the columns in the dataframe is not maintained in Cosmos DB.
Columns with null values are not written to Cosmos DB; they are excluded entirely.
Below is the data available in dataframe:
+-------+------+--------+---------+---------+-------+
| NUM_ID| TIME| SIG1| SIG2| SIG3| SIG4|
+-------+------+--------+---------+---------+-------+
|X00030 | 13000|35.79893| 139.9061| 48.32786| null|
|X00095 | 75000| null| null| null|5860505|
|X00074 | 43000|    null|  8.75037|  98.9562|8014505|
+-------+------+--------+---------+---------+-------+
Below is the Spark code used to copy the dataframe into Cosmos DB.
val finalSignals = spark.sql("""SELECT * FROM db.tableName""")
val toCosmosDF = finalSignals.withColumn("NUM_ID", trim(col("NUM_ID"))).withColumn("SIG1", round(col("SIG1"),5)).select("NUM_ID","TIME","SIG1","SIG2","SIG3","SIG4")
//write DF into COSMOSDB
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
val writeConfig = Config(Map(
"Endpoint" -> "xxxxxxxx",
"Masterkey" -> "xxxxxxxxxxx",
"Database" -> "xxxxxxxxx",
"Collection" -> "xxxxxxxxx",
"preferredRegions" -> "xxxxxxxxx",
"Upsert" -> "true"
))
toCosmosDF.write.mode(SaveMode.Append).cosmosDB(writeConfig)
Below is the data written into Cosmos DB:
{
"SIG3": 48.32786,
"SIG2": 139.9061,
"TIME": 13000,
"NUM_ID": "X00030",
"id": "xxxxxxxxxxxx2a",
"SIG1": 35.79893,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"TIME": 75000,
"NUM_ID": "X00095",
"id": "xxxxxxxxxxxx2a",
"_rid": "xxxxxxxxxxxx",
"SIG4": 5860505,
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"SIG3": 98.9562,
"SIG2": 8.75037,
"TIME": 43000,
"NUM_ID": "X00074",
"id": "xxxxxxxxxxxx2a",
"SIG4": 8014505,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
The entries for columns with null values in the dataframe are missing from the Cosmos DB documents.
The data written to Cosmos DB does not keep the column order of the dataframe.
How can these two issues be resolved?

jsPDF not defined Ionic 3

I'm trying to use the jspdf-autotable module which I installed via
npm install jspdf jspdf-autotable
To use the module in my Ionic component I did the following:
declare let jsPDF;
I then proceed with some sample code taken from the jspdf-autotable repo in my component:
createReport() {
let columns = ["ID", "Name", "Age", "City"]
let data = [
[1, "Jonathan", 25, "Gothenburg"],
[2, "Simon", 23, "Gothenburg"],
[3, "Hanna", 21, "Stockholm"]
]
let doc = new jsPDF('p', 'pt');
doc.autoTable(columns, data);
doc.save("table.pdf");
}
Upon calling createReport(), however, I get the following error message: ReferenceError: jsPDF is not defined
How can I correctly import jspdf-autotable? Any help would be highly appreciated.
You need to import the plugin and declare jsPDF as a global variable in your component.
import * as jsPDF from 'jspdf'
declare var jsPDF: any;
@Component({
selector: 'app-root',
templateUrl: './app.component.html',
styleUrls: ['./app.component.css']
})
This ended up working for me:
import * as jsPDF from 'jspdf'
import 'jspdf-autotable'
For some reason only specifying rows and columns as follows would work:
var columns = [
{title: "ID", dataKey: "id"},
{title: "Name", dataKey: "name"},
{title: "Country", dataKey: "country"},
...
];
var rows = [
{"id": 1, "name": "Shaw", "country": "Tanzania", ...},
{"id": 2, "name": "Nelson", "country": "Kazakhstan", ...},
{"id": 3, "name": "Garcia", "country": "Madagascar", ...},
...
];