Transform rows to multiple rows in Spark Scala - scala

I have a problem where I need to transform one row to multiple rows. This is based on a different mapping that I have. I have tried to provide an example below.
Suppose I have a parquet file with the below schema
ColA, ColB, ColC, Size, User
I need to aggregate the above data into multiple rows based on a lookup map. Suppose I have a static map
ColA, ColB, Sum(Size)
ColB, ColC, Distinct (User)
ColA, ColC, Sum(Size)
This means that one row in the input RDD needs to be transformed into 3 aggregated rows. I believe an RDD with flatMapToPair is the way to go, but I am not sure how to go about this.
I am also OK to concat the columns into one key, something like ColA_ColB etc.
For creating multiple aggregates from the same data, I have started with something like this
val keyData: PairFunction[Row, String, Long] = new PairFunction[Row, String, Long]() {
  override def call(x: Row) = {
    (x.getString(1), x.getLong(5))
  }
}
val ip15M = spark.read.parquet("a.parquet").toJavaRDD
val pairs = ip15M.mapToPair(keyData)
java.util.List[(String, Long)] = [(ios,22), (ios,23), (ios,10), (ios,37), (ios,26), (web,52), (web,1)]
I believe I need to do flatMapToPair instead of mapToPair. Along similar lines, I tried
val FlatMapData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
  override def call(x: Row) = {
    (x.getString(1), x.getLong(5))
  }
}
but it gives the error:
Expression of type (String, Long) doesn't conform to expected type util.Iterator[(String, Long)]
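For context, PairFlatMapFunction.call has to return a java.util.Iterator of pairs rather than a single pair, which is what the error points at. A minimal sketch of the flat-map variant, assuming ColA, ColB, ColC, Size, User sit at positions 0-4 and Size is a long (the key prefixes are only illustrative tags for the three aggregates):
import java.util.Arrays
import org.apache.spark.api.java.function.PairFlatMapFunction
import org.apache.spark.sql.Row

val flatMapData: PairFlatMapFunction[Row, String, Long] =
  new PairFlatMapFunction[Row, String, Long]() {
    override def call(x: Row): java.util.Iterator[(String, Long)] = {
      // Emit one (key, value) pair for each aggregate this row contributes to.
      Arrays.asList(
        ("ColA_ColB|" + x.getString(0) + "|" + x.getString(1), x.getLong(3)), // sum(Size)
        ("ColA_ColC|" + x.getString(0) + "|" + x.getString(2), x.getLong(3)), // sum(Size)
        ("ColB_ColC|" + x.getString(1) + "|" + x.getString(2), x.getLong(3))  // the distinct(User) aggregate would need the User value instead
      ).iterator()
    }
  }

val flatPairs = ip15M.flatMapToPair(flatMapData)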
Any help is appreciated. Please let me know if I need to add any more details.

Should the outcome have only 3 columns, i.e. col1, col2, col3 (the aggregation outcome)?
Is the second aggregate a distinct count of users? (I assume yes.)
If so, you can basically create 3 data frames and then union them.
Something in the way of (assuming the parquet data is registered as a temporary view named data):
val df1 = spark.sql("select colA as col1, colB as col2, sum(Size) as colAgg from data group by colA, colB")
val df2 = spark.sql("select colB as col1, colC as col2, count(distinct User) as colAgg from data group by colB, colC")
val df3 = spark.sql("select colA as col1, colC as col2, sum(Size) as colAgg from data group by colA, colC")
df1.union(df2).union(df3)
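If registering a view is not desired, the same three aggregates can be built directly with the DataFrame API. A minimal sketch, with the column and file names assumed from the question:
import org.apache.spark.sql.functions._

val df = spark.read.parquet("a.parquet")

// One grouped aggregate per entry in the lookup map, renamed to a common schema.
val df1 = df.groupBy("ColA", "ColB").agg(sum("Size").as("colAgg"))
  .select(col("ColA").as("col1"), col("ColB").as("col2"), col("colAgg"))
val df2 = df.groupBy("ColB", "ColC").agg(countDistinct("User").as("colAgg"))
  .select(col("ColB").as("col1"), col("ColC").as("col2"), col("colAgg"))
val df3 = df.groupBy("ColA", "ColC").agg(sum("Size").as("colAgg"))
  .select(col("ColA").as("col1"), col("ColC").as("col2"), col("colAgg"))

val combined = df1.union(df2).union(df3)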

Related

Spark Enhance Join between Terabytes of Datasets

I have five Hive tables; assume their names are A, B, C, D, and E. In each table there is a customer_id column as the key for joins between them. Each table contains between 100 and 600 columns, and all of them are stored as Parquet.
Example of one table below:
CREATE TABLE table_a
(
  customer_id BIGINT,
  col_1 STRING,
  col_2 STRING,
  col_3 STRING,
  .
  .
  col_600 STRING
)
STORED AS PARQUET;
I need to achieve two things:
Join all of them together in the most optimal way using Spark Scala. I tried to sortByKey before the join, but there is still a performance bottleneck. I tried to repartition by key before the join, but the performance is still not good. I tried to increase the parallelism for Spark to 6000 with many executors, but was not able to achieve good results.
After the join I need to apply a separate function to some of these columns.
A sample of the join I tried is below:
val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")
val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
I think the direct join is not the optimal approach in this case. You can achieve this task in the following, simpler way.
First, create a case class, for example FeatureData, with two fields: case class FeatureData(customer_id: Long, featureValue: Map[String, String])
Second, map each table to that case class, i.e. the customer_id key plus a Map of [feature_name, feature_value].
Third, union all of the mapped datasets and groupByKey, so that all records with the same key come together.
Done this way, the union will be faster than the join, but it needs more work.
After that, you will have a dataset of (key, map), and you can apply the transformation per key to the Map(feature_name) values.
A simple example of the implementation follows:
Map each dataset to the case class first, then union all of them. After that, groupByKey, then map and reduce.
case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))

// map dsB .. dsE the same way, then union them all
val unionDataSet = dsAMapped union dsBMapped

val grouped = unionDataSet.groupByKey(_.customer_id)
  .mapGroups {
    case (eid, featureIter) =>
      val featuresMapped: Map[String, String] =
        featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
      FeatureMappedData(eid, featuresMapped)
  }
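After that, the per-column transformation mentioned above can be applied in one more map. A sketch, where applyFeatureLogic is a hypothetical placeholder for whatever function you actually need:
// Hypothetical placeholder for the real per-feature transformation.
def applyFeatureLogic(features: Map[String, String]): Map[String, String] =
  features.map { case (name, value) => name -> value.trim }

val transformed = grouped.map(fd => FeatureMappedData(fd.customer_id, applyFeatureLogic(fd.feature)))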

Dataframe : GroupBy by list of column names [duplicate]

This question already has an answer here:
Scala-Spark Dynamically call groupby and agg with parameter values
(1 answer)
Closed 4 years ago.
I have a Dataframe with multiple columns and a List of column names.
I want to process my Dataframe by grouping it according to my list.
Here is an example of what I am trying to do :
val tagList = List("col1","col3","col5")
var tagsForGroupBy = tagList(0)
if(tagList.length>1){
for(i <- 1 to tagList.length-1){
tagsForGroupBy = tagsForGroupBy+","+tags(i)
}
}
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy("col0",tagsForGroupBy)
I understand why it does not work, but I don't know how to make it work.
What is the best solution to do that ?
EDIT :
Here is a more complete example of what I am doing (including SCouto's solution):
I have my tagList that contains some column names ("col3","col5"). I also want to include "col0" and "col1" in my groupBy, independently of my list.
After my groupBy and my aggregations, I want to select all columns used for group By and the new columns from aggregation.
val tagList = List("col3","col5")
val tmpListForGroup = new ListBuffer[String]()
val tmpListForSelect = new ListBuffer[String]()
tmpListForGroup +=tagList (0)
tmpListForSelect +=tagList (0)
for(i <- 1 to tagList .length-1){
tmpListForGroup +=(tagList (i))
tmpListForSelect +=(tagList (i))
}
tmpListForGroup +="col0"
tmpListForGroup +="col1"
tmpListForSelect +="aggValue1"
tmpListForSelect +="aggValue2"
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy(tmpListForGroup.head,tmpListForGroup.tail:_*)
.agg(
[aggFunction].as("aggValue1"),
[aggFunction].as("aggValue1"))
)
.select(tmpListForSelect .head,tmpListForSelect .tail:_*)
This code does exactly what I want, but it looks very ugly and complicated for something that (I think) should be simple.
Is there another solution for that ?
When sending column names as Strings, groupBy takes one column name as its first parameter and a sequence of them as the second:
def groupBy(col1: String, cols: String*)
So you need to send two arguments and convert the second one to a sequence.
This will work fine for you (use the original tagList rather than the concatenated String):
df.groupBy(tagList.head, tagList.tail: _*)
Or, if you want to separate col0 from the list as in your example:
df.groupBy("col0", tagList: _*)

get multiple columns within a map: rdd

I have a DF that I'm explicitly converting into an RDD, and I'm trying to fetch each column's value. I am not able to fetch each of them within a map. Below is what I've tried:
val df = sql("Select col1, col2, col3, col4, col5 from tableName").rdd
The resulting df is of type org.apache.spark.rdd.RDD[org.apache.spark.sql.Row].
Now I'm trying to access each element of this RDD via:
val dfrdd = df.map{x => x.get(0); x.getAs[String](1); x.get(3)}
The issue is that the above statement returns only the result of the last expression in the map, i.e. the data from x.get(3). Can someone let me know what I'm doing wrong?
The last expression is always what the map returns; in your case only x.get(3) is returned.
To return multiple values you can return tuples as below
val dfrdd = df.map{x => (x.get(0), x.getAs[String](1), x.get(3))}
Hope this helped!
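If only those three columns are needed downstream, another option (a sketch using the same query) is to narrow the DataFrame before dropping to an RDD, which avoids the positional get calls altogether:
val dfrdd = sql("Select col1, col2, col3, col4, col5 from tableName")
  .select("col1", "col2", "col4")
  .rdd
Each Row in the resulting RDD then carries only those three columns.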

how to select all columns that start with a common label

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can do like this to select specific columns:
df.select("colA", "colB", "colE")
but how to select, say "colA", "colB" and all the colF-* columns at once? Is there a way like in Pandas?
The process can be broken down into the following steps:
First grab the column names with df.columns,
then filter down to just the column names you want .filter(_.startsWith("colF")). This gives you an array of Strings.
But that select overload takes select(String, String*); luckily there is also select(Column*), so convert the Strings into Columns with .map(df(_)),
and finally turn the Array of Columns into a var arg with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
In PySpark, use colRegex to select the columns starting with colF.
With the sample:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
Apply:
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()
The result is:
colA, colB, colF-0, colF-1, colF-2
I wrote a function that does that. Read the comments to see how it works.
/**
  * Given a sequence of prefixes, select suitable columns from a [[DataFrame]]
  * @param columnPrefixes Sequence of prefixes
  * @param dF Incoming [[DataFrame]]
  * @return [[DataFrame]] with the prefixed columns selected
  */
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
  // Find out if a given column name matches any of the provided prefixes
  def colNameStartsWith: String => Boolean = (colName: String) =>
    columnPrefixes.map(prefix => colName.startsWith(prefix)).reduce(_ || _)
  // Filter the columns list by checking against the given prefix sequence
  val columns = dF.columns.filter(colNameStartsWith)
  // Select the filtered columns
  dF.select(columns.head, columns.tail: _*)
}
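A usage sketch, assuming the column names from the question's schema:
// Keeps colA, colB and every colF-* column
val selected = selectPrefixedColumns(Seq("colA", "colB", "colF"), df)
selected.show()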

How to preserve order of columns in cassandra

I have two tables in Cassandra:
CREATE TABLE table1 (
  name text PRIMARY KEY,
  grade text,
  labid list<int>);

CREATE TABLE table2 (
  name text PRIMARY KEY,
  deptid list<int>,
  grade text);
for example:
val result: RDD[(String, String, List[Int])] = myFunction()
result.saveToCassandra(keyspace, table1)
It is working fine.
but in case of using below line:
result.saveToCassandra(keyspace, table2)
I am getting this error: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object test_data of type class java.lang.String to List[AnyRef]
Is there any solution using SomeColumns that satisfies both tables [we don't know which table will be written to]? e.g.:
result.saveToCassandra(keyspace, table, SomeColumns(....))?
By default the mapping only cares about position, not column name, so if your C* table has a different column order you will get incorrect writes (or, as here, a type conversion error). The solution, as you said, is to use SomeColumns.
val columns = dataFrame.schema.map(_.name: ColumnRef)
dataFrame.rdd.saveToCassandra(keyspaceName, tableName, SomeColumns(columns: _*))
Now the dataframe columns will be written to c* using their name, not position.
Your arguments should be in a different order because the tables expect different column types:
val result: RDD[(String, String, List[Int])] = myFunction()
val reordered: RDD[(String, List[Int], String)] = result.map(r => (r._1, r._3, r._2))
reordered.saveToCassandra(keyspace, table2)
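If you would rather keep the RDD in its original (name, grade, list) order, another sketch is to spell out the target columns per table with SomeColumns, relying on the connector mapping tuple components to the listed columns in order (column names taken from the table definitions above):
import com.datastax.spark.connector._

// Tuple order is (name, grade, list); name the target columns in that same order.
result.saveToCassandra(keyspace, "table1", SomeColumns("name", "grade", "labid"))
result.saveToCassandra(keyspace, "table2", SomeColumns("name", "grade", "deptid"))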