Get first element in array in PySpark

I want to add two new columns holding the first and second values of the Services array,
but I'm getting the error:
Field name should be String Literal, but it's 0;
production_target_datasource_df.withColumn("newcol",production_target_datasource_df["Services"].getItem(0))
+------------------+--------------------+
|               cid|            Services|
+------------------+--------------------+
|845124826013182686|     [112931, serv1]|
|845124826013182686|     [146936, serv1]|
|845124826013182686|      [32718, serv2]|
|845124826013182686|      [28839, serv2]|
|845124826013182686|       [8710, serv2]|
|845124826013182686|    [2093140, serv3]|
+------------------+--------------------+

You don't have to use .getItem(0); production_target_datasource_df["Services"][0] would be enough.
# Constructing your table:
from pyspark.sql import Row
df = sc.parallelize([Row(cid=1, Services=["2", "serv1"]),
                     Row(cid=1, Services=["3", "serv1"]),
                     Row(cid=1, Services=["4", "serv2"])]).toDF()
df.show()
+---+----------+
|cid| Services|
+---+----------+
| 1|[2, serv1]|
| 1|[3, serv1]|
| 1|[4, serv2]|
+---+----------+
# Adding the two columns:
new_df = df.withColumn("first_element", df.Services[0])
new_df = new_df.withColumn("second_element", df.Services[1])
new_df.show()
+---+----------+-------------+--------------+
|cid| Services|first_element|second_element|
+---+----------+-------------+--------------+
| 1|[2, serv1]| 2| serv1|
| 1|[3, serv1]| 3| serv1|
| 1|[4, serv2]| 4| serv2|
+---+----------+-------------+--------------+
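The same thing can also be done in a single select, if you prefer not to chain withColumn calls; this is just a sketch of the alternative using the df built above:
# Equivalent in one pass, naming the new columns with alias()
new_df = df.select(
    "cid",
    "Services",
    df.Services[0].alias("first_element"),
    df.Services[1].alias("second_element"),
)
new_df.show()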

As the error is saying, you need to pass a string, not a 0.
Then you wonder: what string should I pass?
If you follow #pault's advice and run printSchema, you will see which keys correspond to the values in your list.
The documentation of getItem can also help you figure this out.
Another way to know what to pass is to simply pass any string. You could type:
production_target_datasource_df.withColumn("newcol",production_target_datasource_df["Services"].getItem('0'))
and the logs will tell you what keys were expected.
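For example, if Services turns out to be a struct rather than an array, printSchema will list its field names, and those names are the strings getItem (or getField) expects. A minimal sketch, assuming hypothetical field names id and name:
production_target_datasource_df.printSchema()
# root
#  |-- cid: string (nullable = true)
#  |-- Services: struct (nullable = true)
#  |    |-- id: string (nullable = true)      <- hypothetical field names
#  |    |-- name: string (nullable = true)
production_target_datasource_df.withColumn(
    "newcol", production_target_datasource_df["Services"].getItem("id")
)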
Hope this helps ;)

Related

Split single String column to multiple columns in Spark-Scala

I have a dataframe as:
+----+--------------------------+
|city|Types |
+----+--------------------------+
|BNG |school |
|HYD |school,restaurant |
|MUM |school,restaurant,hospital|
+----+--------------------------+
I want to split the Types column into multiple columns on ','.
The problem is that the column size is not fixed, so I'm not sure how to do it.
I saw a related question for pyspark, but I want to do it in spark-scala, not pyspark.
Any help is appreciated.
Thanks in advance.
One way to address the irregular size of the column is to tweak the representation.
For example:
val data = Seq(("BNG", "school"),("HYD", "school,res"),("MUM", "school,res,hos")).toDF("city","types")
+----+--------------+
|city| types|
+----+--------------+
| BNG| school|
| HYD| school,res|
| MUM|school,res,hos|
+----+--------------+
data.withColumn("isSchool", array_contains(split(col("types"), ","), "school"))
    .withColumn("isRes", array_contains(split(col("types"), ","), "res"))
    .withColumn("isHos", array_contains(split(col("types"), ","), "hos"))
+----+--------------+--------+-----+-----+
|city| types|isSchool|isRes|isHos|
+----+--------------+--------+-----+-----+
| BNG| school| true|false|false|
| HYD| school,res| true| true|false|
| MUM|school,res,hos| true| true| true|
+----+--------------+--------+-----+-----+

how to access values from an array column in a scala dataframe

I have a dataframe containing a Scala array of (index, value) tuples like the following; index has values from 1 to 4:
id | units_flag_tuples
id1 | [(3,2.0), (4,6.0)]
id2 | [(1,10.0), (2,2.0), (3,5.0)]
I would like to access the values from the array and put them into columns based on the index (unit1, unit2, unit3, unit4):
id  | unit1 | unit2 | unit3 | unit4
id1 | null | null | 2.0 | 6.0
id2 | 10.0 | 2.0 | 5.0 | null
here is the code:
df
.withColumn("unit1", col("units_flag_tuples").find(_._1 == '1').get._2 )
.withColumn("unit2", col("units_flag_tuples").find(_._1 == '2').get._2 )
.withColumn("unit3", col("units_flag_tuples").find(_._1 == '3').get._2 )
.withColumn("unit4", col("units_flag_tuples").find(_._1 == '4').get._2 )
Here is the error message I am getting:
error: value find is not a member of org.apache.spark.sql.Column
How can I resolve this error, or is there a better way to do it?
Here is a different approach: I used the map_from_entries function to turn the array of tuples into a map, and then pick each column by its key from that map.
val df = Seq(("id1", Seq((3,2.0), (4,6.0))), ("id2", Seq((1,10.0), (2,2.0), (3,5.0)))).toDF("id", "units_flag_tuples")
df.show(false)
df.withColumn("map", map_from_entries(col("units_flag_tuples")))
.withColumn("unit1", col("map.1"))
.withColumn("unit2", col("map.2"))
.withColumn("unit3", col("map.3"))
.withColumn("unit4", col("map.4"))
.drop("map", "units_flag_tuples").show
The result is:
+---+-----+-----+-----+-----+
| id|unit1|unit2|unit3|unit4|
+---+-----+-----+-----+-----+
|id1| null| null| 2.0| 6.0|
|id2| 10.0| 2.0| 5.0| null|
+---+-----+-----+-----+-----+

How to check whether the whole column in a pyspark dataframe contains a value using Expr

In pyspark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo code below:
df = df.withColumn("Result", expr(if any of the rows in column1 contain the value of colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
This is an application of expr(). To figure out how to use expr(), just look up the corresponding SQL syntax; it will mostly work inside expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+
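If you would rather stay in the DataFrame API than embed SQL in expr(), the same row-wise check can be written with Column.contains; a minimal sketch producing the same columnA_exists flag:
from pyspark.sql.functions import col, lower, when

df = df.withColumn(
    'columnA_exists',
    when(lower(col('columnB')).contains(lower(col('columnA'))), 1).otherwise(0)
)
df.show()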

How to convert numerical values to a categorical variable using pyspark

I have a pyspark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100.
1-10 - group1  <== the column values from 1 to 10 should contain group1 as value
11-20 - group2
.
.
.
91-100 - group10
How can I achieve this using a pyspark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We need to find the integral part of Var/10 and add 1 to it, because the group indexing starts from 1 rather than 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to put it inside lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
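One thing to watch: with floor(Var/10) + 1, values that are exact multiples of 10 fall into the next group (for example Var = 10 becomes group2 and Var = 100 would become group11). If the 1-10, 11-20, ..., 91-100 boundaries from the question are strict, shifting by one before dividing handles it; a small variant of the same idea, applied to the original numeric Var column:
from pyspark.sql.functions import col, floor, lit, concat

# floor((Var - 1) / 10) + 1 keeps 10 in group1 and 100 in group10
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))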

how to convert rows into columns in spark dataframe, scala

Is there any way to transpose dataframe rows into columns?
I have the following structure as input:
val inputDF = Seq(("pid1", "enc1", "bat"),
                  ("pid1", "enc2", ""),
                  ("pid1", "enc3", ""),
                  ("pid3", "enc1", "cat"),
                  ("pid3", "enc2", "")
                 ).toDF("MemberID", "EncounterID", "entry")
inputDF.show:
+--------+-----------+-----+
|MemberID|EncounterID|entry|
+--------+-----------+-----+
| pid1| enc1| bat|
| pid1| enc2| |
| pid1| enc3| |
| pid3| enc1| cat|
| pid3| enc2| |
+--------+-----------+-----+
expected result:
+--------+----------+----------+----------+-----+
|MemberID|Encounter1|Encounter2|Encounter3|entry|
+--------+----------+----------+----------+-----+
| pid1| enc1| enc2| enc3| bat|
| pid3| enc1| enc2| null| cat|
+--------+----------+----------+----------+-----+
Please suggest if there is any optimized direct API available for transposing rows into columns.
My input data size is quite huge, so I won't be able to perform actions like collect, as that would pull all the data onto the driver.
I am using Spark 2.x.
I am not sure that what you need is what you actually asked. Yet, just in case, here is an idea:
val entries = inputDF.where('entry isNotNull)
.where('entry !== "")
.select("MemberID", "entry").distinct
val df = inputDF.groupBy("MemberID")
.agg(collect_list("EncounterID") as "encounterList")
.join(entries, Seq("MemberID"))
df.show
+--------+------------------+-----+
|MemberID|     encounterList|entry|
+--------+------------------+-----+
|    pid1|[enc2, enc1, enc3]|  bat|
|    pid3|      [enc2, enc1]|  cat|
+--------+------------------+-----+
The order of the list is not deterministic but you may sort it and then extract new columns from it with .withColumn("Encounter1", sort_array($"encounterList")(0))...
Other idea
In case what you want is to put the value of entry in the corresponding "Encounter" column, you can use a pivot:
inputDF
.groupBy("MemberID")
.pivot("EncounterID", Seq("enc1", "enc2", "enc3"))
.agg(first("entry")).show
+--------+----+----+----+
|MemberID|enc1|enc2|enc3|
+--------+----+----+----+
| pid1| bat| | |
| pid3| cat| | |
+--------+----+----+----+
Adding Seq("enc1", "enc2", "enc3") is optional, but since you know the content of the column, it will speed up the computation.