Capture and write string inside of dataframe using foreach row - scala

I'm trying to build a string for each row of a dataframe by substituting in values from specific fields, and then write that string out, using Scala. Because the job is deployed on a cluster, I'm not able to capture any records. Can anyone provide a solution?
Assuming TEST_DB.finalresult has 2 fields input1 and input2:
val finalResult = spark.sql("select * from TEST_DB.finalresult")
finalResult.foreach { row =>
  val param1 = row.getAs[String]("input1")
  val param2 = row.getAs[String]("input2")
  val string = "new values of param1 and param2 are -> " + param1 + "," + param2
  // how to append the modified string to a csv file continuously for each microbatch in hdfs ??
}

In your code you create the string you want, but it is never saved anywhere, so you can't see the result. Note also that foreach runs on the executors, so nothing computed inside it is returned to the driver.
You could open the csv file in each foreach execution and append the new string, but I'd like to propose a different solution.
If you can, prefer Spark's built-in functions, since they are (usually) better optimised and handle null inputs more predictably. You can achieve the same with:
import org.apache.spark.sql.functions.{lit, concat, col}
val modifiedFinalResult = finalResult.select(
  concat(
    lit("new values of param1 and param2 are -> "),
    col("input1"),
    lit(","),
    col("input2")
  ).alias("string")
)
The variable modifiedFinalResult holds a Spark dataframe with a single column named string, containing the same output as the string variable in your code. You can then save the dataframe directly as csv; repartition(1) makes Spark write a single part file inside the output directory:
modifiedFinalResult.repartition(1).write.format("csv").save("path/to/your/csv/output")
PS: Also a suggestion for the future, try to avoid naming variables after data types.
UPDATE: Fixed the empty rows issue by using concat_ws instead of concat and applying coalesce to each field. It seems that a null in any of the values was turning the entire concatenated string into null. With that change, this solution works for now!
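The difference described in the update can be pictured with a small plain-Python model. This is not Spark code: the two functions only imitate the null handling of Spark's concat (any null input makes the result null) and concat_ws (null inputs are skipped), and the sample row is made up for illustration.

```python
from typing import Optional


def concat(*parts: Optional[str]) -> Optional[str]:
    """Model of concat: any null (None) input makes the whole result null."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def concat_ws(sep: str, *parts: Optional[str]) -> str:
    """Model of concat_ws: null (None) inputs are skipped, the rest are joined."""
    return sep.join(p for p in parts if p is not None)


prefix = "new values of param1 and param2 are -> "
row = {"input1": "a", "input2": None}  # hypothetical row with a null field

# With concat, the single null wipes out the whole string.
with_concat = concat(prefix, row["input1"], ",", row["input2"])

# With concat_ws, the null field is simply left out.
with_concat_ws = prefix + concat_ws(",", row["input1"], row["input2"])
```

Here with_concat ends up as None while with_concat_ws still carries the prefix and the non-null field, which matches the empty-row symptom described above.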

Related

Spark-Scala: Get Dataframe Variable by concatenating two String Variables

I have a scenario where I need to form a dataframe name from two string variables, which is easy to do by concatenating them.
Example: "df_" + "part1324"
The above code returns a String. I want it to be a DataFrame variable on which I can perform further operations.
A Map can be used to assign names to DataFrames:
val df = List(("df_value")).toDF()
val stringVariable = "part1324"
// assign name to dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// get dataframe by name
namedDataFrames("df_part1324").show(false)
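The same lookup-by-name idea, with the names generated in a loop, can be sketched in plain Python. Lists of tuples stand in for DataFrames here, and the second suffix is hypothetical; only the pattern of building the name and storing the object under it is the point.

```python
# Suffixes from which the names are built; "part5678" is a made-up second entry.
suffixes = ["part1324", "part5678"]

# Maps a generated name such as "df_part1324" to its (stand-in) dataframe.
named_frames = {}
for suffix in suffixes:
    # A list of tuples stands in for a real DataFrame in this sketch.
    named_frames["df_" + suffix] = [("df_value", suffix)]

# Look a frame up by its concatenated name.
frame = named_frames["df_" + "part1324"]
```

A dictionary keeps the names as data, so they can be computed at runtime, which ordinary variable names cannot be.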
Your question is confusing. What do you mean by a dataframe variable? Concatenating two strings will always return a String. To create a dataframe, you need to use one of the methods available for creating one.
As per your example, a val of type DataFrame cannot be equal to the String "df_part1324"; to use it as a dataframe, you need to do something like the following:
val df_part1324 = sc.range(1000).toDF("number")
where sc is your SparkSession variable.
If you need to generate this variable dynamically, put the statement that creates the dataframe inside the logic that generates the name, such as a loop.
Please rewrite your question (with a code snippet that reproduces the issue) if you are trying to achieve something else, or accept the answer if this resolves it.

PySpark Parsing nested array of struct

I would like to parse a PySpark SQL dataframe column with the format below and get the value of a specific key.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100MB. I tried explode as well, but it gives a separate row for each array element, whereas I need only the value of a specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get a specific key values
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the name rank udf template
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
For Spark 2.4.0+, you can use Spark SQL's filter() function to find the first array element that matches key == searchkey and then retrieve its value. Below is a Spark SQL snippet template (with searchkey as a variable) for the first part.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function and assign the value (a StructType) to a temporary column f1, then use the coalesce() function to retrieve the first non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
    .selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value),'---') AS event_category") \
    .show()
Let me know if you have any problem running the above code.
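To make the intended result easier to check on a small sample, here is a plain-Python model of the same two steps: take the first element whose key matches, then fall back through string_value, int_value and '---'. It is only a sketch of the logic, not Spark code; the sample event_params list and the helper names are made up. It also assumes, as seems to be the case with default (non-ANSI) settings, that indexing an empty filter result yields null, so a missing key ends up as '---'.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL coalesce()."""
    return next((v for v in values if v is not None), None)


def lookup(event_params, searchkey):
    """Model of filter(event_params, x -> x.key == searchkey)[0] plus coalesce."""
    # First element whose key matches, or None when nothing matches.
    f1 = next((x for x in event_params if x["key"] == searchkey), None)
    if f1 is None:
        return "---"
    value = f1["value"]
    # string(f1.value.int_value): cast to string, keeping null as null.
    int_as_string = None if value["int_value"] is None else str(value["int_value"])
    return coalesce(value["string_value"], int_as_string, "---")


# Hypothetical sample matching array<struct<key, value<int_value, string_value>>>.
event_params = [
    {"key": "category", "value": {"int_value": None, "string_value": "sports"}},
    {"key": "rank", "value": {"int_value": "7", "string_value": None}},
    {"key": "empty", "value": {"int_value": None, "string_value": None}},
]

category = lookup(event_params, "category")
rank = lookup(event_params, "rank")
```

One difference worth checking against your data: the UDF in the question returns nothing (null) when the key is absent, whereas this coalesce-based version yields '---' in that case.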

IndexOutOfBoundsException when writing dataframe into CSV

I'm trying to read an existing file into a DataFrame and then union that DataFrame with a new one I have already created. Both have the same columns and share the same schema.
Also, I cannot give meaningful names to the variables or share any more data because of restrictions.
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done, I get the "start_ts" (a timestamp in epoch format) that is duplicated in the union of the dataframes above (df4 and dfExist), and I also strip some characters I don't want:
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this duplicated "start_ts" to filter the DataFrame into 2 new DataFrames: the items with the duplicated timestamp, and the items without it.
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save the avg_value and the Number_of_val values in 2 different lists
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I do some maths with the values in the lists (THIS IS WHERE I'M GETTING ISSUES)
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then I save the duplicated timestamp (df7), the avg and the number of values into a list as a single item. This list is converted into a DataFrame, which I then transform into a new DataFrame with the columns as they are supposed to be.
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
  $"_tmp".getItem(0).as("start_ts"),
  $"_tmp".getItem(1).as("avg_value"),
  $"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally I union the DataFrame without duplicates with the new DataFrame, which holds the duplicated timestamp, the average of the duplicated avg values and the sum of the number of values.
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in Spark it gives me the error. I supposed it was related to empty lists (since the error is raised when doing maths with the values of the lists), but if I delete the line where I write to CSV, the code runs perfectly. I also saved the lists and the results of the calculations into files, and they are not empty.
My supposition is that the file is being deleted before it is read (because of how Spark distributes tasks between workers), which would leave the list empty and cause this error when I do maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? Also, I've only been using Scala/Spark for 1 month, so any code recommendations would be welcome as well.
Thanks beforehand.
This error is caused by the data. One of your lists does not contain as many elements as expected, so when you refer to that index, the list throws this error.
It was a problem related to reading files. I added a check (df.rdd.isEmpty), and whenever the DF was empty I was getting this error. I wrapped the logic in an if/else statement that checks whether the DF is empty, and now it works fine.
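One way to picture that guard is the plain-Python sketch below. It assumes the intent is a count-weighted mean of the two duplicate rows (the question's expression subtracts the two products, and its divisor seems to apply + to two strings before toInt, which would concatenate them rather than add them). The function name is hypothetical.

```python
def merge_duplicate(avgs, counts):
    """Combine two (average, count) pairs into one.

    Returns None instead of raising when fewer than two values are
    available, which is the situation that produced the index error.
    """
    if len(avgs) < 2 or len(counts) < 2:
        return None
    total = counts[0] + counts[1]
    # Count-weighted mean of the two averages (assumed intent).
    new_avg = (avgs[0] * counts[0] + avgs[1] * counts[1]) / total
    return new_avg, total


# Two duplicate rows: average 2.0 over 1 value and average 4.0 over 3 values.
merged = merge_duplicate([2.0, 4.0], [1, 3])

# Nothing was read from the file, so there is nothing to merge.
nothing = merge_duplicate([], [])
```

Checking the lengths before indexing keeps the empty-input case on an explicit branch, in the same spirit as the isEmpty check described above.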

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Both of them didn't work.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression is evaluated once on the driver, not per record. You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
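The question asks for this across all n columns. A plain-Python model of that per-record rule, applied to every column, might look like the following. It is not Spark code: rows are modelled as dicts, and columnB and the helper names are hypothetical. In Spark the same effect would presumably come from applying the when expression once for each name in the dataframe's columns.

```python
def when_blank_then_null(value):
    """Per-value rule: an empty string becomes None, anything else is kept."""
    return None if value == "" else value


def blank_to_null_all_columns(rows):
    """Apply the rule to every column of every row."""
    return [
        {name: when_blank_then_null(value) for name, value in row.items()}
        for row in rows
    ]


# Hypothetical two-column sample with blanks in different columns.
rows = [
    {"columnA": "", "columnB": "x"},
    {"columnA": "y", "columnB": ""},
]
cleaned = blank_to_null_all_columns(rows)
```

The rule is evaluated once per value, which is the per-record behaviour that the driver-side if-else in the question could not provide.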

Removing Blank Strings from a Spark Dataframe

Attempting to remove rows in which a Spark dataframe column contains blank strings. Originally did val df2 = df1.na.drop() but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on DSL. (Importing spark.implicit_ isn't working.)
Removing things from a dataframe requires filter().
newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just wants to convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. a blank string. I'm using a filter that matches blank values and then inverting it with "!".
I am also new to Spark, so I don't know whether the code below is more complex than necessary, but it works.
Here we create a UDF that converts blank values to null.
sqlContext.udf().register("convertToNull",(String abc) -> (abc.trim().length() > 0 ? abc : null),DataTypes.StringType);
After the above code you can use "convertToNull" (which works on strings) in the select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: You can use the same approach in Scala.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
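The convert-then-drop approach can be checked on a tiny sample with a plain-Python model. This is a sketch of the logic only, not Spark code: rows are dicts using the C0 to C3 column names from the snippet above, the sample values are made up, and the helper guards against None input, which the lambda above does not appear to do.

```python
def convert_to_null(value):
    """Return None for None, empty or whitespace-only strings; keep anything else."""
    if value is None:
        return None
    return value if len(value.strip()) > 0 else None


def drop_rows_with_nulls(rows):
    """Model of na().drop(): keep only rows in which no column is None."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical sample: the first row has a whitespace-only C1.
rows = [
    {"C0": "1", "C1": "  ", "C2": "a", "C3": "b"},
    {"C0": "2", "C1": "theft", "C2": "c", "C3": "d"},
]

# As in the selectExpr above, only C1 goes through the conversion.
converted = [{**row, "C1": convert_to_null(row["C1"])} for row in rows]
kept = drop_rows_with_nulls(converted)
```

Because the value is trimmed before the length check, whitespace-only strings are dropped as well, which a plain comparison against "" would miss.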