How to unpivot COLUMNS into ROWS in an AWS Glue / PySpark script - pyspark

I have a large nested JSON document for each year (say 2018, 2017), which has data aggregated by month (Jan-Dec) and day (1-31).
{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
This gives me a table with one column per JSON leaf element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, there will be a lot of columns in the table, one set for each day of each month. I want to simplify the table by converting the columns into rows, to get the table below.
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
My searches have not turned up the right answer. Is there a way in AWS Glue/PySpark, or any other way, to unpivot the column-based table into a row-based table? Can it be done in Athena?

I implemented a solution similar to the snippet below.
dataFrame = datasource0.toDF()
tableDataArray = []  ## to hold rows as [year, month, dd, u, n]
rowArrayCount = 0
for row in dataFrame.rdd.toLocalIterator():
    for colName in dataFrame.schema.names:
        if not colName.endswith('.u'):
            continue  # handle each day once, via its '.u' column
        keyArray = colName.split('.')  # e.g. ['2018', 'Jan', '1', 'u']
        uValue = row[colName]
        nValue = row[colName[:-1] + 'n']  # the matching '.n' column
        rowDataArray = []
        rowDataArray.insert(0, str(keyArray[0]))  # year
        rowDataArray.insert(1, str(keyArray[1]))  # month
        rowDataArray.insert(2, str(keyArray[2]))  # dd
        rowDataArray.insert(3, str(uValue))       # u
        rowDataArray.insert(4, str(nValue))       # n
        tableDataArray.insert(rowArrayCount, rowDataArray)
        rowArrayCount += 1
from pyspark.sql import Row  # needed for the Row(...) constructor below

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0], month=rowDataArray[1], dd=rowDataArray[2], u=rowDataArray[3], n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)
datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
In the above, newRowDF can also be created as below if data types have to be enforced:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

columns = [StructField('year', StringType(), True), StructField('month', IntegerType(), ....]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    newRowDF = spark.createDataFrame([rowDataArray], schema)  # wrap the row in a list
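The loop above pulls every row to the driver and builds the result one union at a time, which gets slow as the number of day columns grows. Purely as an illustrative sketch (not part of the original answer), the same unpivot can be done in one distributed pass by deriving year/month/dd from the dotted column names; the toy frame below just mirrors the question, and spark is assumed to be the SparkSession (glueContext.spark_session in a Glue job):

from pyspark.sql import functions as F

# toy frame mirroring the relationalized layout from the question
df = spark.createDataFrame(
    [(1, 2, 4, 7, 3, 2, 4, 5)],
    ["2018.Jan.1.u", "2018.Jan.1.n", "2018.Jan.2.u", "2018.Jan.2.n",
     "2018.Feb.1.u", "2018.Feb.1.n", "2018.Feb.4.u", "2018.Feb.4.n"])

# every "year.month.day" prefix becomes one output row
prefixes = sorted({c.rsplit(".", 1)[0] for c in df.columns})

day_structs = [
    F.struct(
        F.lit(p.split(".")[0]).alias("year"),
        F.lit(p.split(".")[1]).alias("month"),
        F.lit(p.split(".")[2]).alias("dd"),
        F.col("`%s.u`" % p).alias("u"),  # backticks because the names contain dots
        F.col("`%s.n`" % p).alias("n"),
    )
    for p in prefixes
]

unpivotDF = df.select(F.explode(F.array(*day_structs)).alias("r")).select("r.*")
unpivotDF.show()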

Here are the steps to successfully unpivot your dataset using AWS Glue with PySpark.
We need to add an additional import statement to the existing boilerplate import statements:
from pyspark.sql.functions import expr
If our data is in a DynamicFrame, we need to convert it to a Spark DataFrame, for example:
df_customer_sales = dyf_customer_sales.toDF()
Use the stack method to unpivot our dataset, passing the number of columns we want to unpivot followed by each column:
unpivotExpr = "stack(4, 'january', january, 'february', february, 'march', march, 'april', april) as (month, total_sales)"
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
So, using an example dataset, the dataframe now has one row per item_type and month, with that month's value in total_sales.
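Since the screenshot of the result is not reproduced here, a minimal self-contained sketch with a made-up df_customer_sales (table contents and values are hypothetical) shows what the stack expression produces:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# made-up wide table: one sales column per month
df_customer_sales = spark.createDataFrame(
    [("widget", 100, 80, 120, 90), ("gadget", 50, 60, 70, 40)],
    ["item_type", "january", "february", "march", "april"])

unpivotExpr = ("stack(4, 'january', january, 'february', february, "
               "'march', march, 'april', april) as (month, total_sales)")
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
unPivotDF.show()
# roughly:
# +---------+--------+-----------+
# |item_type|   month|total_sales|
# +---------+--------+-----------+
# |   widget| january|        100|
# |   widget|february|         80|
# |   widget|   march|        120|
# ...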
If my explanation is not clear, I made a YouTube tutorial walkthrough of the solution: https://youtu.be/Nf78KMhNc3M

Related

How to iterate over a JSON object in a transaction for multiple inserts in database using pg-promise

Apologies if my title is not clear; I'll explain the question further here.
What I would like to do is to have multiple inserts based on a JSON array that I (the backend) will be receiving from the frontend. The JSON object has the following data:
//Sample JSON
{
  // Some other data here to insert
  ...
  "quests": [
    {
      "player_id": [1, 2, 3],
      "task_id": [11, 12]
    },
    {
      "player_id": [4, 5, 6],
      "task_id": [13, 14, 15]
    }
  ]
}
Based on this JSON, this is my expected output once it has been processed by the backend and inserted into the quests table:
//quests table (Output)
----------------------------
id | player_id | task_id |
----------------------------
1 | 1 | 11 |
2 | 1 | 12 |
3 | 2 | 11 |
4 | 2 | 12 |
5 | 3 | 11 |
6 | 3 | 12 |
7 | 4 | 13 |
8 | 4 | 14 |
9 | 4 | 15 |
10 | 5 | 13 |
11 | 5 | 14 |
12 | 5 | 15 |
13 | 6 | 13 |
14 | 6 | 14 |
15 | 6 | 15 |
// Not sure if useful info, but I will be using the player_id as a join later on.
-- My current progress --
What I currently have (and have tried) is to do multiple inserts by iterating over each JSON object.
//The previous JSON response I accept:
{
  "quests": [
    {
      "player_id": 1,
      "task_id": 11
    },
    {
      "player_id": 1,
      "task_id": 12
    },
    {
      "player_id": 6,
      "task_id": 15
    }
  ]
}
// My current backend code
db.tx(async t => {
    const q1 // some queries
    ....
    const q3 = await t.none(
        `INSERT INTO quests (
            player_id, task_id)
         SELECT player_id, task_id FROM
         json_to_recordset($1::json)
         AS x(player_id int, task_id int)`, [
        JSON.stringify(quests)
    ]);
    return t.batch([q1, q2, q3]);
}).then(data => {
    // Success
}).catch(error => {
    // Fail
});
It works, but I don't think it's good to have such a long request body, which is why I'm wondering if it's possible to iterate over the arrays inside the object instead.
If more information is needed, I'll edit this post again.
Thank you in advance!
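No answer is recorded here, but purely to illustrate the expansion being asked for (sketched in Python rather than the asker's Node/pg-promise backend, using the quests payload from the question), the player_id × task_id cross product could be built like this before inserting:

from itertools import product

# payload shaped like the question's "quests" array
quests = [
    {"player_id": [1, 2, 3], "task_id": [11, 12]},
    {"player_id": [4, 5, 6], "task_id": [13, 14, 15]},
]

# expand each quest object into one row per (player_id, task_id) pair
rows = [
    {"player_id": p, "task_id": t}
    for quest in quests
    for p, t in product(quest["player_id"], quest["task_id"])
]

for row in rows:
    print(row)  # 3*2 + 3*3 = 15 rows, matching the expected quests table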

How to find similar rows by matching column values in Spark?

So I have a data set like
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data into a DataFrame like this:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a| att-b| att-c| att-d| att-e| att-f| att-g| att-h| att-i| att-j| customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the compareColumns function, which compares the columns of the two dataframes (userDF and flattenedDF) and returns a new DF like the sample output below.
How do I do that? That is, compare each row and column in flattenedDF with userDF and increment a count when they match, e.g. att-a with att-a, att-b with att-b.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
  dataFrame.filter($"customer" === customerID).toDF()
}

def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  val userDF = dataFrame.transform(getCustomer(customerID))
  userDF.printSchema()
  userDF
}
Sample Output:
+-------------+------------------+
| customer    | similarity_score |
+-------------+------------------+
| customer-1  | -1               |  <- the reference customer itself, so it is ignored with '-1'
| customer-12 | 2                |
| customer-3  | 2                |
| customer-44 | 5                |
| customer-5  | 1                |
| customer-6  | 10               |
+-------------+------------------+
Thanks
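No answer is recorded here either; purely as an illustration (sketched in PySpark rather than the question's Scala, and assuming flattenedDF is the flattened frame shown above), the match count against a reference customer could look roughly like this:

from pyspark.sql import functions as F

att_cols = ["att-a", "att-b", "att-c", "att-d", "att-e",
            "att-f", "att-g", "att-h", "att-i", "att-j"]

# reference row for the customer everyone else is compared against
ref = flattenedDF.filter(F.col("customer") == "customer-1").first()

# similarity = number of attribute columns whose value matches the reference
match_count = sum(
    F.when(F.col("`%s`" % c) == ref[c], 1).otherwise(0) for c in att_cols
)

similarity = (flattenedDF
    .select("customer", match_count.alias("similarity_score"))
    .withColumn("similarity_score",
                F.when(F.col("customer") == "customer-1", F.lit(-1))
                 .otherwise(F.col("similarity_score"))))
similarity.show()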

How do I explode a nested Struct in Spark using Scala

I am creating a dataframe using
val snDump = table_raw
  .applyMapping(mappings = Seq(
      ("event_id", "string", "eventid", "string"),
      ("lot-number", "string", "lotnumber", "string"),
      ("serial-number", "string", "serialnumber", "string"),
      ("event-time", "bigint", "eventtime", "bigint"),
      ("companyid", "string", "companyid", "string")),
    caseSensitive = false, transformationContext = "sn")
  .toDF()
  .groupBy(col("eventid"), col("lotnumber"), col("companyid"))
  .agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
snDump.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it and put the data into 2 columns in my table. For that, what I am doing is:
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can only be used once, and I get an error in the select. So how do I use explode (or something else) to achieve what I am trying to get below?
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of Stack Overflow threads but none of them helped me. It is possible that such a question has already been answered, but my understanding of Scala is limited, which might have kept me from understanding the answer. If this is a duplicate, someone could direct me to the correct answer. Any help is appreciated.
First, explode the array into a temporary struct column, then unpack it:
val serialNumberEvents = snDump
  .withColumn("tmp", explode(col("snetlist")))
  .select(
    col("eventid"),
    col("lotnumber"),
    col("companyid"),
    // unpack the struct
    col("tmp.serialnumber").as("serialN"),
    col("tmp.eventtime").as("eventT")
  )
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather than Array(Struct). You can simply get this by creating an array instead of a struct out of your columns.
Another approach, if you need to explode twice, is as follows; this is a different example, but it demonstrates the point:
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See "Is there an efficient way to join two large Datasets with (deeper) nested array field?" for a slightly different context, but the same approach applies.
This answer is in response to your assertion that you can only explode once.

How to define / change MLDataValue.ValueType for a column in MLDataTable

I am loading an MLDataTable from a given .csv file. The data type for each column is inferred automatically depending on the content of the input file.
I need predictable, explicit types when I process the table later.
How can I enforce a certain type when loading a file, or alternatively change the type in a second step?
Simplified Example:
import Foundation
import CreateML
// file.csv:
//
// value1,value2
// 1.5,1
let table = try MLDataTable(contentsOf:URL(fileURLWithPath:"/path/to/file.csv"))
print(table.columnTypes)
// actual output:
// ["value2": Int, "value1": Double] <--- type for value2 is 'Int'
//
// wanted output:
// ["value2": Double, "value1": Double] <--- how can I make it 'Double'?
Use MLDataColumn's map(to:) method to derive a new column from the existing one with the desired underlying type:
let squaresArrayInt = (1...5).map{$0 * $0}
var table = try! MLDataTable(dictionary: ["Ints" : squaresArrayInt])
print(table)
let squaresColumnDouble = table["Ints"].map(to: Double.self)
table.addColumn(squaresColumnDouble, named: "Doubles")
print(table)
Produces the following output:
Columns:
Ints integer
Rows: 5
Data:
+----------------+
| Ints |
+----------------+
| 1 |
| 4 |
| 9 |
| 16 |
| 25 |
+----------------+
[5 rows x 1 columns]
Columns:
Ints integer
Doubles float
Rows: 5
Data:
+----------------+----------------+
| Ints | Doubles |
+----------------+----------------+
| 1 | 1 |
| 4 | 4 |
| 9 | 9 |
| 16 | 16 |
| 25 | 25 |
+----------------+----------------+
[5 rows x 2 columns]

How to rank the data set having multiple columns in Scala?

I have a data set like this, which I am fetching from a CSV file, but how do I store it in Scala to do the processing?
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234| 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
Answering your question, here is the solution; this code reads a CSV and orders by the third column:
object CSVDemo extends App {
  println("recent, freq, monitor")
  val bufferedSource = io.Source.fromFile("./data.csv")
  val list: Array[Array[String]] = (bufferedSource.getLines map { line => line.split(",").map(_.trim) }).toArray
  val newList = list.sortBy(_(2))
  newList map { line => println(line.mkString(" ")) }
  bufferedSource.close
}
You read the file and parse it into an Array[Array[String]], then you order by the third column and print.
Here I am using the list and trying to normalize one column at a time and then concatenate them. Is there any other way to iterate column-wise and normalize them? Sorry, my coding is very basic.
val col1 = newList.map(line => line.head.toInt)
val mi = col1.min
val ma = col1.max
println("minimum value of first column is " + mi)
println("maximum value of first column is " + ma)
// min-max scale for the first column (convert to Double to avoid integer division)
val scale = col1.map(x => (x - mi).toDouble / (ma - mi))
println("Here is the normalized range of the first column of the data")
scale.foreach(println)
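Purely as an illustration of iterating column-wise (sketched in Python rather than the question's Scala, using made-up numbers from the sample table), min-max normalization of every column could look like this:

# rows of plain numbers (mirroring recent / Freq / Monitor)
rows = [
    [1, 1234, 199090],
    [4, 2553, 198613],
    [6, 3232, 199090],
]

cols = list(zip(*rows))  # transpose so we can iterate column-wise

# min-max normalize every column to [0, 1]
# (a constant column would need a guard against division by zero)
norm_cols = [
    [(x - min(c)) / (max(c) - min(c)) for x in c]
    for c in cols
]

norm_rows = [list(r) for r in zip(*norm_cols)]  # transpose back to rows
print(norm_rows)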