How to convert unstructured RDD into dataframe without defining schema in Pyspark? - pyspark

I got this following RDD after kafka streaming. I want to convert it into dataframe without defining Schema.
[
{u'Enrolment_Date': u'2008-01-01', u'Freq': 78},
{u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'},
{u'Freq': 70, u'Group': u'Recorded Data'},
{u'Enrolment_Date': u'2008-04-01', u'Freq': 96}
]

You can use an OrderedDict to convert a RDD containing key value pairs to a dataframe. However, in your case not all keys are present in each row, and therefore you need to populate these first with None values. See the solution below
#Define test data
data = [{u'Enrolment_Date': u'2008-01-01', u'Freq': 78}, {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'}, {u'Freq': 70, u'Group': u'Recorded Data'}, {u'Enrolment_Date': u'2008-04-01', u'Freq': 96}]
rdd = sc.parallelize(data)
from pyspark.sql import Row
from collections import OrderedDict
#Determine all the keys in the input data
schema = rdd.flatMap(lambda x: x.keys()).distinct().collect()
#Add missing keys with a None value
rdd_complete= rdd.map(lambda r:{x:r.get(x) for x in schema})
#Use an OrderedDict to convert your data to a dataframe
#This ensures data ends up in the right column
df = rdd_complete.map(lambda r: Row(**OrderedDict(sorted(r.items())))).toDF()
df.show()
This gives as output:
+--------------+----+-------------+
|Enrolment_Date|Freq| Group|
+--------------+----+-------------+
| 2008-01-01| 78| null|
| 2008-02-01|null|Recorded Data|
| null| 70|Recorded Data|
| 2008-04-01| 96| null|
+--------------+----+-------------+

Related

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using #udf or #pandas_udf

I do try to compute .dot product between 2 columns of a give dataframe,
SparseVectors has this ability in spark already so I try to execute this in an easy & scalable way without converting to RDDs or to
DenseVectors but i'm stuck, spent past 3 days to try find out of an
approach and does fail, doesn't return computation for passed 2 vector
columns from dataframe and looking for guidance on this matter,
please, because something I'm missing here and not sure what is root cause ...
For separate vectors and rdd vectors works this approach but does fail
to work when passing dataframe column vectors, to replicate the flow
and issues please see below, ideally would be this computation to happen in parallel since real work data is with billions or more rows (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row
df = spark.createDataFrame(
[
[["a","b","c"], SparseVector(4527, {0:0.6363067860791387, 1:1.0888040725098247, 31:4.371858972705023}),SparseVector(4527, {0:0.6363067860791387, 1:2.0888040725098247, 31:4.371858972705023})],
[["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
], ["word", "i", "j"])
# # daframe content
df.show()
+---------+--------------------+--------------------+
| word| i| j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
#udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
if v1 is not None and v2 is not None:
return float(v1.dot(v2))
# # calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))
# # output after udf
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j| dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| null|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| null|
+---------+--------------------+--------------------+----------+
Rewriting udf as lambda does work on spark 2.4.5. Posting in case
anyone is interested in this approach for PySpark dataframes:
# # rewrite udf as lambda function:
sim_cos = F.udf(lambda x,y : float(x.dot(y)), FloatType())
# # executing udf on dataframe
df = df.withColumn("similarity", sim_cos(col("i"),col("j")))
# # end result
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+

How to read multiple nested json objects in one file extract by pyspark to dataframe in Azure databricks?

I have .log file in ADLS which contain multiple nested Json objects as follows
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
Need to read the above multiple nested Json objects in pyspark and convert to dataframe as follows
EventType Timestamp Data.[Id] ..... [Data.Properties.CorrId] [Data.Properties. ActionId]
3735091736 2019-03-19 event-c2 ..... d69b7489 d0e2c3fd
3735091737 2019-03-18 event-d2 ..... f69b7489 d0f2c3fd
3735091738 2019-03-17 event-e2 ..... f69b7489 d0d2c3fd
For above I am using ADLS,Pyspark in Azure DataBricks.
Does anyone know a general way to deal with above problem? Thanks!
You can read it into an RDD first. It will be read as a list of strings
You need to convert the json string into a native python datatype using
json.loads()
Then you can convert the RDD into a dataframe, and it can infer the schema directly using toDF()
Using the answer from Flatten Spark Dataframe column of map/dictionary into multiple columns, you can explode the Data column into multiple columns. Given your Id column is going to be unique. Note that, explode would return key, value columns for each entry in the map type.
You can repeat the 4th point to explode the properties column.
Solution:
import json
rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', "Timestamp", F.explode('Data'))\
.groupBy('Id', 'EventType', "Timestamp").pivot('key').agg(F.first('value'))
# There is a duplicate Id column and might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data by following code.
from pyspark.sql.functions import *
DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'),col('EventType'),col('Timestamp'),col('Data.Id'),col('Data.Level'),col('Data.MessageTemplate'),
col('Data.Properties.CorrId'),col('Data.Properties.ActionId'))\
.show()```
***Result***
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+

spark scala cartesian product of each element in a column

I have a dataframe which is like :
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
Desired output is that:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and i think that converting df to rdd and then flatMap with cartesian product are ideal for the problem. However i could not combine them together.
Thanks,
It looks like you are trying to do combination rather than cartesian. Please check my understanding.
This is in PySpark but the only python thing is the UDF, the rest is just DataFrame operations.
process is
Create dataframe
define UDF to get all pairs of combinations ignoring order
use UDF to convert array into array of pairs of structs, one for each element of the combination
explode the results to get rows of pair of structs
select each struct and original column 1 into desired result columns
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame([
("a", ["p1", "p2", "p3"]),
("b", ["p1", "p4"])
],
["col1", "col2"]
)
# define and register udf that takes an array and returns an array of struct of two strings
#udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
return combinations(x, 2)
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+

How can I build a CoordinateMatrix in Spark using a DataFrame?

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below, as training data:
|--------------|--------------|--------------|
| userId | itemId | rating |
|--------------|--------------|--------------|
Now, I would like to create a sparse matrix, to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value in the matrix will be zero. Thus, in the end, most values will be zero.
But how can I achieve this, using a CoordinateMatrix? I'm saying CoordinateMatrix because I'm using Spark 2.1.1, with python, and in the documentation, I saw that a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items and the ratings would be the values in the matrix?
A CoordinateMatrix is just a wrapper for an RDD of MatrixEntrys. A MatrixEntry is just a wrapper over a (long, long, float) tuple. Pyspark allows you to create a CoordinateMatrix from an RDD of such tuples. If the userId and itemId fields are both IntegerTypes and the rating is something like a FloatType, then creating the desired matrix is very straightforward.
from pyspark.mllib.linalg.distributed import CoordinateMatrix
cmat=CoordinateMatrix(df.rdd.map(tuple))
It is only slightly more complicated if you have StringTypes for the userId and itemId fields. You would need to index those strings first and then pass the indices to the CoordinateMatrix.
With Spark 2.4.0, I am showing the whole example that I hope to meet your need.
Create dataframe using dictionary and pandas:
my_dict = {
'userId': [1,2,3,4,5,6],
'itemId': [101,102,103,104,105,106],
'rating': [5.7, 8.8, 7.9, 9.1, 6.6, 8.3]
}
import pandas as pd
pd_df = pd.DataFrame(my_dict)
df = spark.createDataFrame(pd_df)
See the dataframe:
df.show()
+------+------+------+
|userId|itemId|rating|
+------+------+------+
| 1| 101| 5.7|
| 2| 102| 8.8|
| 3| 103| 7.9|
| 4| 104| 9.1|
| 5| 105| 6.6|
| 6| 106| 8.3|
+------+------+------+
Create CoordinateMatrix from dataframe:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coorRDD = df.rdd.map(lambda x: MatrixEntry(x[0], x[1], x[2]))
coorMatrix = CoordinateMatrix(coorRDD)
Now see the data type of result:
type(coorMatrix)
pyspark.mllib.linalg.distributed.CoordinateMatrix

spark (Scala) dataframe filtering (FIR)

Let say I have a dataframe ( stored in scala val as df) which contains the data from a csv:
time,temperature
0,65
1,67
2,62
3,59
which I have no problem reading this from file as a spark dataframe in scala language.
I would like to add a filtered column (by filter I meant signal processing moving average filtering), (say I want to do (T[n]+T[n-1])/2.0):
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, say for the first row, I want 32.5 instead of (65+0)/2.0. I wrote it to clarify the expected 2-time-step filtering operation output)
So how to achieve this? I am not familiar with spark dataframe operation which combine rows iteratively along column...
Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use window function as a input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with TimestampType column but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries but general solution can expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
(1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1) +
lit(0.2) * lag($"temperature", 2)
It is still possible without partitionBy clause but will be extremely inefficient. If this is the case you won't be able to use DataFrames. Instead you can use sliding over RDD (see for example Operate on neighbor elements in RDD in Spark). There is also spark-timeseries package you may find useful.