How can I extract rows from table in SCALA corresponding to indexes present in a list.? - scala

Using Scala, I loaded a hive table into spark
This table has 3 columns and 10,000 rows. I also have this list:
List[Int] = List(43, 48, 353, 413, 645, 674, 764, 873, 1018, 1170, 1206, 1626)
I have to extract all the rows (with all the columns) from the table corresponding to elements present in the given list. How can I do that?
I need the final output in data frame format.

val l = List(43, 48, 353, 413, 645, 674, 764, 873, 1018, 1170, 1206, 1626)
val table: Array[Array[T]] = ???
table.map(col => l.map(index => col(index)))
This will return the columns of table, but only with the rows you mentioned. Beware that this will throw some exception if you do not have at least 1627 rows.
This is assuming that your table is an Array whose elements are the columns. If it is an Array[Row], you need to do the analog:
l.map(index => table(index))

Related

Values of a Dataframe Column into an Array in Scala Spark

Say, I have dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of column "children" into an separate array in the same order.
the below code works for smaller dataset with only one partition
val list1 = df.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code fails when we have partitions
val partitioned = df.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
is there way, that I can get all the values of a column into an array without changing an order?

Work on some partition from multiple

I have a text file which has around 4264k records. I am splitting the records in 5 partitions and want to select first partition for processing. How can I achieve that?
rdd = sc.textFile("file:///user/somelocation/a.txt", 5)
How can the first partition be selected for further processing?
If I understood your question correctly, you want to see/process data of a particular partition.
In below example I extracted partition 2 data.
val rd = sc.parallelize((1 to 100), 4) // created an rdd of 1 to 100 numbers in 4 partitions
rd.mapPartitionsWithIndex((index, iter) => Iterator((index,iter.toList)), true) // mapping partitions with its index
.filter(x => x._1 == 2) // filtering only 2 nd partition
.collect.foreach(println)
Output:
(2,List(51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75)) // partition number followed by List of values in the partition

How to group by one column in rdd in pyspark?

The rdd in pyspark are consist of four elements in every list :
[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....
I want to group by the ids in the first columns, and get the aggregate result of the other three columns: for example => [id2,[['acx',1,90], ['bbb',77,10]...]]
How can I realize it ?
spark.version
# u'2.2.0'
rdd = sc.parallelize((['id1', 'aaa',12,87],
['id2', 'acx',1,90],
['id3', 'bbb',77,10],
['id2', 'bbb',77,10]))
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()
# result:
[('id2', [['acx', 1, 90], ['bbb', 77, 10]]),
('id3', [['bbb', 77, 10]]),
('id1', [['aaa', 12, 87]])]
or, if you prefer lists strictly, you can add one more map operation after mapValues:
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(lambda x: list(x)).collect()
# result:
[['id2', [['acx', 1, 90], ['bbb', 77, 10]]],
['id3', [['bbb', 77, 10]]],
['id1', [['aaa', 12, 87]]]]

How to convert Dataframes from avro to GenericRecord in scala

I am stuck in converting the avro dataframes to GenericRecord/ByteArray where I surfed in google which they provide me solution the other way round.
Have anyone tried converting AVRO RDD/Dataframes to GenericRecord or ByteArray in scala?
I used this command to read my avro files.
spark.read.avro("/app/q.avro")
It returns me dataframes like this.
res0: org.apache.spark.sql.DataFrame = [recordType: string, recordVersion: string ... 6 more fields]
So How to convert sql.DataFrame to GenericRecord/ByteArray?
After creating a dataframe:
val df=spark.read.avro("/app/q.avro")
You can convert it into either an rdd or a list of strings.
val listOfStrings=df.rdd.collect.toList
Now, you can convert the list of strings into byteArray, like this:
scala> var lst=List("scala","Java","Python","JavaScript")
lst: List[String] = List(scala, Java, Python, JavaScript)
scala> lst.map(_.getBytes).toArray
res5: Array[Array[Byte]] = Array(Array(115, 99, 97, 108, 97), Array(74, 97, 118, 97), Array(80, 121, 116, 104, 111, 110), Array(74, 97, 118, 97, 83, 99, 114, 105, 112, 116))

Spark SQL - Generate array of arrays from the sql function

I want to create an array of arrays. This is my data table:
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
I want to create an array of integer column "age". For that I use "collect_list":
sqlContext.sql("SELECT collect_list(age) as age from df").show
But now I want to generate an array containing multiple arrays as created above:
sqlContext.sql("SELECT collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
But this does not work , or use the function org.apache.spark.sql.functions.array. Any ideas?
Ok, things can't get more simple. Let's consider the same data you are working on and go step by step from there
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
sqlContext.sql("select collect_list(age) as age from df").show
// +----------------+
// | age|
// +----------------+
// |[21, 26, 52, 31]|
// +----------------+
sqlContext.sql("select collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
As the error message says :
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Exactly one argument is expected..; line 1 pos 52 [...]
collest_list takes just one argument. Let's check the documentation here.
It actually takes one argument ! But let's go further in the documentation of the functions object. You seem to have noticed that the array function allows you to create a new array column out of a Column or a repeated Column parameter. So let's use that :
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df").show(false)
The array function create indeed a column from the column list create before-hand by collect_list on both age and salary :
// +-------------------------------------------------------------------+
// |arrayInt |
// +-------------------------------------------------------------------+
// |[WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450)]|
// +-------------------------------------------------------------------+
Where do we go from here ?
You have to remember that a Row from a DataFrame is just another collection wrapped by a Row.
The first thing I'll do is work on that collection. So How do we flatten a WrappedArray[WrappedArray[Int]] ?
Scala is kind of magical you just need to use .flatten
import scala.collection.mutable.WrappedArray
val firstRow: mutable.WrappedArray[mutable.WrappedArray[Int]] =
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df")
.first.get(0).asInstanceOf[WrappedArray[WrappedArray[Int]]]
// res26: scala.collection.mutable.WrappedArray[scala.collection.mutable.WrappedArray[Int]] =
// WrappedArray(WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450))
firstRow.flatten
// res27: scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(21, 26, 52, 31, 905, 1130, 1890, 1450)
Now let's wrap it in a UDF so we can use it on the DataFrame :
def flatten(array: WrappedArray[WrappedArray[Int]]) = array.flatten
sqlContext.udf.register("flatten", flatten(_: WrappedArray[WrappedArray[Int]]))
Since we registered the UDF, we can now use it inside the sqlContext :
sqlContext.sql("select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from df").show(false)
// +---------------------------------------+
// |arrayInt |
// +---------------------------------------+
// |[21, 26, 52, 31, 905, 1130, 1890, 1450]|
// +---------------------------------------+
I hope this helps !
Let's create the DataFrame the way have created above.
// A case class for our sample table
import org.apache.spark.sql.functions._
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = spark.createDataFrame(x)
Here we can use array_union function to achieve the desired result. array_unionfunction will return the union of all elements from the input arrays. This function is available since spark 2.4.0
// Scala Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
// Pyspark Ref : https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_union
df.select(collect_list("age").as("age"), collect_list("salary").as("salary"))
.withColumn("new_col", array_union($"age", $"salary")).show(truncate=false)
// Output
+----------------+-----------------------+---------------------------------------+
|age |salary |new_col |
+----------------+-----------------------+---------------------------------------+
|[21, 26, 52, 31]|[905, 1130, 1890, 1450]|[21, 26, 52, 31, 905, 1130, 1890, 1450]|
+----------------+-----------------------+---------------------------------------+
I hope this helps.