Collapse a Spark DataFrame - Scala

I am using Spark 1.5 with Scala and I am trying to transform an input DataFrame of name/value pairs into a new DataFrame in which each distinct Name becomes a column and the corresponding Values fill its rows.
Input DataFrame:
ID  Name     Value
1   Country  US
2   Country  US
2   State    NY
3   Country  UK
4   Country  India
4   State    MH
5   Country  US
5   State    NJ
5   County   Hudson
Transposed DataFrame:
ID  Country  State  County
1   US       NULL   NULL
2   US       NY     NULL
3   UK       NULL   NULL
4   India    MH     NULL
5   US       NJ     Hudson
It seems like pivot would help in this use case, but it's not supported in Spark 1.5.x.
Any pointers/help?

This is really ugly data, but you can always filter and join:
val names = Seq("Country", "State", "County")

names.map(name =>
  df.where($"Name" === name).select($"ID", $"Value".alias(name))
).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))
map creates a list of three DataFrames, where each one contains only the records for a single name. Next we simply reduce this list using a left outer join. Putting it all together, you get something like this:
(left-outer-join
  (left-outer-join
    (where df (=== name "Country"))
    (where df (=== name "State")))
  (where df (=== name "County")))
Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first:
Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames
How to pivot DataFrame?
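For reference, a minimal sketch of the pivot-with-first approach on Spark 1.6+ (assuming the same df and column names as above):

import org.apache.spark.sql.functions.first

df.groupBy("ID")
  .pivot("Name", Seq("Country", "State", "County"))
  .agg(first("Value"))

Passing the expected values to pivot avoids an extra pass over the data just to discover them.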

Related

Splitting a string column into multiple columns based on key=value items using Spark Scala

I have a DataFrame in which one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 of them here
for illustration (_age, _city, _sal, _tag).
id  name  properties
0   A     {_age=10, _city=A, _sal=1000}
1   B     {_age=20, _city=B, _sal=3000, tag=XYZ}
2   C     {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to do this with a Spark Scala DataFrame.
The expected output is:
id  name  _age  _city  _sal  tag
0   A     10    A      1000
1   B     20    B      3000  XYZ
2   C           BC           ABC
Short answer:
df
  .select(
    col("id"),
    col("name"),
    col("properties.*"),
    ..
  )
Try this:
val s = df.withColumn("dummy",
  explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
val result = s.drop("properties")
  .withColumn("col1", trim(split($"dummy", "=")(0)))   // key, trimmed to drop the leading space after each comma
  .withColumn("col1-value", split($"dummy", "=")(1))   // value
  .drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show
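Note that the properties.* shorthand in the short answer only applies once properties has been parsed into a struct or map column. If the key set is known up front (as in the question), a sketch along the following lines avoids the explode/pivot round trip; it assumes your Spark version exposes the built-in str_to_map SQL function and that values never contain commas:

import org.apache.spark.sql.functions.{col, expr}

// Strip the braces and spaces, then parse the "key=value" pairs into a map column
val withMap = df.withColumn(
  "props",
  expr("str_to_map(regexp_replace(properties, '[{} ]', ''), ',', '=')"))

// Pull out the keys listed in the question; rows missing a key get null
val result = withMap.select(
  col("id"),
  col("name"),
  col("props").getItem("_age").as("_age"),
  col("props").getItem("_city").as("_city"),
  col("props").getItem("_sal").as("_sal"),
  col("props").getItem("tag").as("tag"))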

Convert a DataFrame to RDD and Split the RDD into the same number of Columns as DataFrame Dynamically

I am trying to convert a DataFrame into an RDD and split it into a specific number of columns, based on the number of columns in the DataFrame, dynamically and elegantly, i.e.
This is sample data from a Hive table named employee:
Id   Name  Age  State    City
123  Bob   34   Texas    Dallas
456  Stan  26   Florida  Tampa
val temp_df = spark.sql("Select * from employee")
val temp2_rdd = temp_df.rdd.map(x => (x(0), x(1), x(2), x(3)))
I am looking to generate temp2_rdd dynamically based on the number of columns in the table.
It should not be hard coded as I did.
Since the maximum size of a tuple in Scala is 22, is there any other collection that can hold the row data efficiently?
Coding Language : Spark Scala
Please advise.
Instead of extracting and transforming each element by index, you can use the toSeq method of the Row object.
val temp_df = spark.sql("Select * from employee")
// RDD[List[Any]]
val temp2_rdd = temp_df.rdd.map(_.toSeq.toList)
// RDD[List[String]]
val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)
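If you also want to keep the column names attached to the values rather than relying on positions, Row exposes getValuesMap. A small sketch, assuming the rows carry their schema (they do when they come from a DataFrame):

// RDD[Map[String, Any]] - one (columnName -> value) map per row
val temp4_rdd = temp_df.rdd.map(row => row.getValuesMap[Any](row.schema.fieldNames))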

How do I define a schema for a dataframe by deriving it from a table where the custom schema is provided?

I have a table in an RDBMS which I'm reading into a DataFrame (DF1):
1 employee_id
2 employee_name
3 salary
4 designation
And I have a DataFrame (DF2) with the following:
_c0 _c1 _c2 _c3
101 monali 70000 developer
102 Amy 70000 developer
103 neha 65000 tester
How do I define the schema for DF2 from DF1? I want DF2 to have the schema that is defined in the table above.
expected output:
employee_id employee_name salary designation
101 monali 70000 developer
102 Amy 70000 developer
103 neha 65000 tester
I want to make it parameterized.
You can create a function mapColumnNames that takes two parameters: the DataFrame containing the column names (which I call the columns dataframe) and the DataFrame whose columns you want to rename (which I call the data dataframe).
This function first retrieves the name and id of each column in the columns dataframe as a list of tuples. It then iterates over this list of tuples, applying the withColumnRenamed method to the data dataframe on each iteration.
You can then call this function mapColumnNames with DF1 as the columns dataframe and DF2 as the data dataframe.
Below is the complete code:
def mapColumnNames(columns: DataFrame, data: DataFrame): DataFrame = {
  // Collect (zero-based position, name) pairs from the columns dataframe
  val columnNames = columns.collect().map(x => (x.getInt(0) - 1, x.getString(1)))

  // Rename _c0, _c1, ... in the data dataframe to the collected names
  columnNames.foldLeft(data)((data, columnName) => {
    data.withColumnRenamed(s"_c${columnName._1}", columnName._2)
  })
}
val output = mapColumnNames(DF1, DF2)
It wasn't clear what schema your DF1 holds, so I used an index reference of 1 to fetch the columns:
val columns = df1.select($"1").collect()
Otherwise, we can get all the columns associated with the first dataframe
val columns = df1.schema.fieldNames.map(col(_))
and then use select with the fetched columns for our new DataFrame:
val newDF = df2.select(columns: _*)
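If DF1 simply lists column positions and names (as in the question), another way to parameterize the renaming is to collect the names in order and rename all of DF2's columns in one call with toDF. A sketch, assuming DF1's first column is the 1-based position and its second column is the target name:

// Collect the column names from DF1, ordered by the position column
val names = DF1.orderBy(DF1.columns.head).collect().map(_.getString(1))

// Rename every column of DF2 at once; names.length must equal DF2.columns.length
val renamedDF = DF2.toDF(names: _*)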

Comparing the values of columns in two dataframes

I have two dataframes; one has unique values of id and the other can have multiple rows for different ids.
This is dataframe df1:
id | dt | speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:48:23 4 [8,-1,0,22,1,1,1]
df2:
id | dt | speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:54:23 4 [9,0,0,22,1,1,1]
358899055773504 2018-07-31 18:58:34 0 [9,0,-1,22,0,1,0]
358899055773504 2018-07-31 18:28:34 0 [9,0,-1,22,0,1,0]
358899055773505 2018-07-31 18:38:23 4 [8,-1,0,22,1,1,1]
My aim is to compare the second dataframe with the first and update the values in the first dataframe, but only if the dt value for a particular id in df2 is greater than the one in df1; if that greater-than condition is satisfied, the other fields should be compared as well.
You need to join the two dataframes together to make any comparison of their columns.
What you can do is first join the dataframes and then perform all the filtering to get a new dataframe with all the rows that should be updated:
val diffDf = df1.as("a").join(df2.as("b"), Seq("id"))
.filter($"b.dt" > $"a.dt")
.filter(...) // Any other filter required
.select($"id", $"b.dt", $"b.speed", $"b.stats")
Note: In some situations it would be required to do a groupBy(id) or use a window function, since there should only be one final row per id in the diffDf dataframe. This can be done as follows (the example here selects the row with the maximum speed, but it depends on the actual requirements):
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"id").orderBy($"speed".desc)
val diffDf2 = diffDf.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn")
More in-depth information about different approaches can be seen here: How to max value and keep all columns (for max records per group)?.
To replace the old rows with the same id in the df1 dataframe, combine the dataframes with an outer join and coalesce:
val df = df1.as("a").join(diffDf.as("b"), Seq("id"), "outer")
.select(
$"id",
coalesce($"b.dt", $"a.dt").as("dt"),
coalesce($"b.speed", $"a.speed").as("speed"),
coalesce($"b.stats", $"a.stats").as("stats")
)
coalesce works by first trying to take the value from the diffDf (b) dataframe. If that value is null, it will take the value from df1 (a).
Result when only using the time filter with the provided example input dataframes:
+---------------+-------------------+-----+-----------------+
| id| dt|speed| stats|
+---------------+-------------------+-----+-----------------+
|358899055773504|2018-07-31 18:58:34| 0|[9,0,-1,22,0,1,0]|
|358899055773505|2018-07-31 18:54:23| 4| [9,0,0,22,1,1,1]|
+---------------+-------------------+-----+-----------------+
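For quick experimentation, the example dataframes above can be built like this (a sketch that keeps dt as a string, which still compares correctly in this format, and assumes spark.implicits._ is in scope):

import spark.implicits._

val df1 = Seq(
  ("358899055773504", "2018-07-31 18:38:34", 0, Seq(9, -1, -1, 13, 0, 1, 0)),
  ("358899055773505", "2018-07-31 18:48:23", 4, Seq(8, -1, 0, 22, 1, 1, 1))
).toDF("id", "dt", "speed", "stats")

val df2 = Seq(
  ("358899055773504", "2018-07-31 18:38:34", 0, Seq(9, -1, -1, 13, 0, 1, 0)),
  ("358899055773505", "2018-07-31 18:54:23", 4, Seq(9, 0, 0, 22, 1, 1, 1)),
  ("358899055773504", "2018-07-31 18:58:34", 0, Seq(9, 0, -1, 22, 0, 1, 0)),
  ("358899055773504", "2018-07-31 18:28:34", 0, Seq(9, 0, -1, 22, 0, 1, 0)),
  ("358899055773505", "2018-07-31 18:38:23", 4, Seq(8, -1, 0, 22, 1, 1, 1))
).toDF("id", "dt", "speed", "stats")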

Spark: Computing correlations of a DataFrame with missing values

I currently have a DataFrame of doubles with approximately 20% of the data being null values. I want to calculate the Pearson correlation of one column with every other column and return the column IDs of the top 10 columns in the DataFrame.
I want to filter out nulls using pairwise deletion, similar to R's pairwise.complete.obs option in its Pearson correlation function. That is, if one of the two vectors in any correlation calculation has a null at an index, I want to remove that row from both vectors.
I currently do the following:
val df = ... //my DataFrame
val cols = df.columns
df.registerTempTable("dataset")
val target = "Row1"
val mapped = cols.map { colId =>
  val results = sqlContext.sql(
    s"SELECT ${target}, ${colId} FROM dataset WHERE (${colId} IS NOT NULL AND ${target} IS NOT NULL)")
  (results.stat.corr(colId, target), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
This runs very slowly, as every single map iteration is its own job. Is there a way to do this efficiently, perhaps using Statistics.corr in MLlib, as per this SO question (Spark 1.6 Pearson Correlation)?
There are "na" functions on DataFrame: DataFrameNaFunctions API
They work in the same way DataFramStatFunctions do.
You can drop the rows containing a null in either of your two dataframe columns with the following syntax:
myDataFrame.na.drop("any", target, colId)
if you want to drop rows containing null any of the columns then it is:
myDataFrame.na.drop("any")
By limiting the dataframe to the two columns you care about first, you can use the second method and avoid verbose!
As such your code would become:
val df = ??? // my DataFrame
val cols = df.columns
val target = "Row1"

val mapped = cols.map { colId =>
  val resultDF = df.select(target, colId).na.drop("any")
  (resultDF.stat.corr(target, colId), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
Hope this helps you.
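As a small optional variation on the code above, filtering the target column out of cols first keeps the self-correlation out of the results, so take(10) returns ten other columns directly:

val mapped = cols.filter(_ != target).map { colId =>
  val resultDF = df.select(target, colId).na.drop("any")
  (resultDF.stat.corr(target, colId), colId)
}.sortWith(_._1 > _._1).take(10).map(_._2)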