How to group by one column in an RDD in PySpark?

The RDD in PySpark consists of lists with four elements each:
[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....
I want to group by the ids in the first column and aggregate the other three columns, for example => [id2, [['acx',1,90], ['bbb',77,10], ...]]
How can I achieve this?

spark.version
# u'2.2.0'
rdd = sc.parallelize((['id1', 'aaa', 12, 87],
                      ['id2', 'acx', 1, 90],
                      ['id3', 'bbb', 77, 10],
                      ['id2', 'bbb', 77, 10]))
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()
# result:
[('id2', [['acx', 1, 90], ['bbb', 77, 10]]),
('id3', [['bbb', 77, 10]]),
('id1', [['aaa', 12, 87]])]
Or, if you strictly prefer lists (no tuples), you can add one more map operation after mapValues:
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(lambda x: list(x)).collect()
# result:
[['id2', [['acx', 1, 90], ['bbb', 77, 10]]],
['id3', [['bbb', 77, 10]]],
['id1', [['aaa', 12, 87]]]]

Related

Values of a Dataframe Column into an Array in Scala Spark

Say I have the dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code fails to preserve the order when the dataframe has multiple partitions:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
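One commonly suggested workaround, shown here only as a sketch reusing df1 from above: tag each row with its original position via zipWithIndex before any repartitioning, then sort by that tag after collecting.
// Sketch only: zipWithIndex assigns indexes in the RDD's current order,
// so apply it before any repartitioning.
val indexed = df1.select("children")
  .rdd
  .zipWithIndex()                                    // (Row, originalPosition)
  .map { case (row, idx) => (idx, row.getString(0)) }

// ... any repartitioning can happen here without losing the tag ...
val orderedList: Array[String] = indexed
  .collect()
  .sortBy { case (idx, _) => idx }                   // restore the original order
  .map { case (_, value) => value }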

Spark dataframe: finding employees who have a salary higher than the average salary of the organization

I am trying to run test Spark/Scala code that finds employees with a salary higher than the average salary, using the test data and the Spark dataframe below. But it fails while executing with:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What would be the correct syntax to achieve this?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps, since an aggregate like avg cannot be evaluated directly inside a filter condition: first compute the average as a plain value, then use it in the filter.
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+
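An equivalent sketch (assuming the same dataDF20 as above) that pulls the average back as a typed Double before filtering:
import org.apache.spark.sql.functions.avg

// Compute the overall average on the driver as a Double, then filter against it.
val avgSal: Double = dataDF20.agg(avg($"sal")).first().getDouble(0)
dataDF20.filter($"sal" > avgSal).show()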

Fetch columns based on list in Spark

I have a list List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13) and a dataframe that reads input from a text file with no headers. I want to fetch the columns mentioned in my list from that dataframe (inputFile). My input file has more than 20 columns, but I want to fetch only the columns mentioned in my list.
val inputFile = spark.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", "|")
.load("C:\\demo.txt")
You can get the required columns using the following:
val fetchIndex = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
val fetchCols = inputFile.columns.zipWithIndex
.filter { case (colName, idx) => fetchIndex.contains(idx) }
.map(x => col(x._1) )
inputFile.select( fetchCols : _* )
Basically, zipWithIndex pairs each element of the collection with its index, so filtering on those indexes and mapping to col gives you an array of Column objects. On a sample dataframe it looks like this:
df.columns.zipWithIndex.filter { case (colName, idx) => fetchIndex.contains(idx) }.map(x => col(x._1))
res8: Array[org.apache.spark.sql.Column] = Array(companyid, event, date_time)
And then you can just use the splat operator to pass the generated array as varargs to the select function.
You can use the following steps to select the columns whose indexes are defined in a list.
First, get the column names:
val names = df.schema.fieldNames
And you have the list of column indexes:
val list = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
Now you can pick out the column names at the indexes in the list:
val selectCols = list.map(x => names(x))
The last step is to select only those columns:
import org.apache.spark.sql.functions.col
val selectedDataFrame = df.select(selectCols.map(col): _*)
You should now have a dataframe containing only the columns at the indexes mentioned in the list.
Note: the indexes in the list must not exceed the highest column index present in the dataframe.
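As a small safeguard (just a sketch reusing names, list, and df from above), you can drop any out-of-range indexes before selecting instead of letting the job fail:
// Keep only indexes that actually exist among the dataframe's columns.
val safeCols = list.filter(_ < names.length).map(x => col(names(x)))
val safeDF = df.select(safeCols: _*)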

How can I extract rows from a table in Scala corresponding to indexes present in a list?

Using Scala, I loaded a Hive table into Spark.
This table has 3 columns and 10,000 rows. I also have this list:
List[Int] = List(43, 48, 353, 413, 645, 674, 764, 873, 1018, 1170, 1206, 1626)
I have to extract all the rows (with all the columns) from the table corresponding to elements present in the given list. How can I do that?
I need the final output in data frame format.
val l = List(43, 48, 353, 413, 645, 674, 764, 873, 1018, 1170, 1206, 1626)
val table: Array[Array[T]] = ???
table.map(col => l.map(index => col(index)))
This will return the columns of the table, restricted to the rows you mentioned. Beware that it will throw an exception if you do not have at least 1627 rows.
This assumes that your table is an Array whose elements are the columns. If it is an Array[Row], you need the analogue:
l.map(index => table(index))
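Since the final output is needed in data frame format, a hedged sketch along these lines should also work, assuming df is the DataFrame loaded from the Hive table and spark is the active session:
// Tag every row with its position, keep only the indexes in the list,
// then rebuild a DataFrame with the original schema.
val wanted = l.map(_.toLong).toSet
val filteredRows = df.rdd
  .zipWithIndex()                                   // (Row, rowIndex)
  .filter { case (_, idx) => wanted.contains(idx) } // keep only the requested rows
  .map { case (row, _) => row }
val result = spark.createDataFrame(filteredRows, df.schema)
result.show()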

Spark SQL - Generate array of arrays from the sql function

I want to create an array of arrays. This is my data table:
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
I want to create an array from the integer column "age". For that I use "collect_list":
sqlContext.sql("SELECT collect_list(age) as age from df").show
But now I want to generate an array containing multiple arrays as created above:
sqlContext.sql("SELECT collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
But this does not work. Or should I use the function org.apache.spark.sql.functions.array? Any ideas?
OK, this is simpler than it looks. Let's take the same data you are working with and go step by step from there.
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
sqlContext.sql("select collect_list(age) as age from df").show
// +----------------+
// | age|
// +----------------+
// |[21, 26, 52, 31]|
// +----------------+
sqlContext.sql("select collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
As the error message says:
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Exactly one argument is expected..; line 1 pos 52 [...]
collect_list takes just one argument. Let's check the documentation.
It indeed takes exactly one argument! But going further in the documentation of the functions object, you may have noticed that the array function lets you create a new array column out of a Column or a repeated Column parameter. So let's use that:
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df").show(false)
The array function indeed creates a single column from the columns produced beforehand by collect_list on both age and salary:
// +-------------------------------------------------------------------+
// |arrayInt |
// +-------------------------------------------------------------------+
// |[WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450)]|
// +-------------------------------------------------------------------+
Where do we go from here?
You have to remember that a Row from a DataFrame is just another collection wrapped in a Row.
So the first thing to do is work on that collection: how do we flatten a WrappedArray[WrappedArray[Int]]?
Scala makes this easy: you just need to use .flatten.
import scala.collection.mutable.WrappedArray
val firstRow: WrappedArray[WrappedArray[Int]] =
  sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df")
    .first.get(0).asInstanceOf[WrappedArray[WrappedArray[Int]]]
// firstRow: scala.collection.mutable.WrappedArray[scala.collection.mutable.WrappedArray[Int]] =
//   WrappedArray(WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450))
firstRow.flatten
// res27: scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(21, 26, 52, 31, 905, 1130, 1890, 1450)
Now let's wrap it in a UDF so we can use it on the DataFrame:
def flatten(array: WrappedArray[WrappedArray[Int]]) = array.flatten
sqlContext.udf.register("flatten", flatten(_: WrappedArray[WrappedArray[Int]]))
Since we registered the UDF, we can now use it inside the sqlContext:
sqlContext.sql("select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from df").show(false)
// +---------------------------------------+
// |arrayInt |
// +---------------------------------------+
// |[21, 26, 52, 31, 905, 1130, 1890, 1450]|
// +---------------------------------------+
I hope this helps!
Let's create the DataFrame the same way as above.
// A case class for our sample table
import org.apache.spark.sql.functions._
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = spark.createDataFrame(x)
Here we can use the array_union function to achieve the desired result. array_union returns the union of the elements of the two input arrays, without duplicates (which makes no difference for this sample data). This function is available since Spark 2.4.0.
// Scala Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
// Pyspark Ref : https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_union
df.select(collect_list("age").as("age"), collect_list("salary").as("salary"))
.withColumn("new_col", array_union($"age", $"salary")).show(truncate=false)
// Output
+----------------+-----------------------+---------------------------------------+
|age |salary |new_col |
+----------------+-----------------------+---------------------------------------+
|[21, 26, 52, 31]|[905, 1130, 1890, 1450]|[21, 26, 52, 31, 905, 1130, 1890, 1450]|
+----------------+-----------------------+---------------------------------------+
I hope this helps.
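One caveat: array_union performs a set union and therefore drops duplicate values. If duplicates need to be preserved, concat (also available since Spark 2.4.0) simply appends the two arrays; a sketch with the same df as above:
// concat keeps duplicates, unlike array_union.
df.select(collect_list("age").as("age"), collect_list("salary").as("salary"))
  .withColumn("new_col", concat($"age", $"salary"))
  .show(truncate = false)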