Spark-Scala: Get Dataframe Variable by concatenating two String Variables - scala

I have a scenario where I need to form a dataframe name from two string variables, which is easy enough to do by concatenation.
Example: "df_" + "part1324"
The above returns a String variable. I want this to be a DataFrame variable through which I can perform further operations on the data frame.

A Map can be used to assign names to DataFrames:
import spark.implicits._  // needed for toDF
val df = List("df_value").toDF()
val stringVariable = "part1324"
// assign a name to the dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// look the dataframe up by name
namedDataFrames("df_part1324").show(false)

Your question is confusing. What do you mean by a dataframe variable? Concatenating two strings will always return a String. To create a dataframe, you need to use one of the methods available for creating dataframes.
A val df: DataFrame cannot be equal to "df_part1234" (a String) as in your example; to use it as a dataframe you need to do something like the following:
val df_part1234 = spark.range(1000).toDF("number"), where spark is your SparkSession variable.
If you need to generate this variable dynamically, place the statement that creates the dataframe inside the logic that generates the name (e.g. a loop).
Please rewrite your question if you are trying to achieve something else (along with a code snippet to reproduce the issue), or accept the answer if the issue is now clear.
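A minimal sketch of that dynamic approach, combining the loop idea with the Map answer above (the part suffixes and the placeholder load logic are assumptions, not from the original question):
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// hypothetical list of part suffixes; replace with however the names are produced
val partSuffixes = Seq("part1324", "part1325")

// build each dataframe inside the loop and key it by the concatenated name
val framesByName: Map[String, DataFrame] = partSuffixes.map { suffix =>
  val df = Seq(suffix).toDF("value") // placeholder for the real load logic
  ("df_" + suffix) -> df
}.toMap

// retrieve a "dataframe variable" by its constructed name
framesByName("df_part1324").show(false)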

Related

How to drop first row from parquet file?

I have a parquet file which contains two columns (id, feature). The file consists of 14348 rows.
How do I drop the first row (id, feature) from the file?
Code:
val df = spark.read.format("parquet").load("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
val header = df.first()
val data = df.filter(row => row != header)
data.show()
If you are trying to "ignore" the schema defined in the file, it is implicitly done once you read your file, using spark like:
spark.read.format("parquet").load(your_file)
If you are trying to only skip the first row of your DF and you already know the id, you can do: val filteredDF = originalDF.filter(s"id != '${excludeID}' "). If you don't know the id, you can use monotonically_increasing_id to tag it and then filter, similar to: filter spark dataframe based on maximum value of a column
You can drop the first row based on its id if you know it; otherwise go for the indexing approach, i.e., assign row numbers and delete the first row, as in the sketch below.
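A minimal sketch of that indexing approach, assuming the parquet file has already been loaded into df as in the question:
import org.apache.spark.sql.functions.{col, min, monotonically_increasing_id}

// tag every row with a monotonically increasing id (unique, but not consecutive)
val indexed = df.withColumn("_row_id", monotonically_increasing_id())

// the smallest id corresponds to the first row that was read
val firstId = indexed.agg(min("_row_id")).head().getLong(0)

// keep everything except that first row, then drop the helper column
val withoutFirstRow = indexed.filter(col("_row_id") =!= firstId).drop("_row_id")
withoutFirstRow.show()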
I'm using Spark 2.4.0, and you could use the header option to the DataFrameReader call like so -
spark.read.format("csv").option("header", true).load(<path_to_file>)
A reference for the other DataFrameReader options is here.

Capture and write string inside of dataframe using foreach row

I am trying to capture and write a string value, after substituting in contents obtained from specific fields of each row of a dataframe, using Scala. But since it is deployed on a cluster, I am not able to capture any records. Can anyone provide a solution?
Assuming TEST_DB.finalresult has 2 fields input1 and input2:
val finalResult = spark.sql("select * from TEST_DB.finalresult")
finalResult.foreach { row =>
  val param1 = row.getAs[String]("input1")
  val param2 = row.getAs[String]("input2")
  val string = """new values of param1 and param2 are -> """ + param1 + """,""" + param2
  // how to append the modified string to a csv file continuously for each microbatch in hdfs ??
}
In your code you create the wanted string variable but it is not being saved anywhere, hence you can't see the result.
You can potentially in each foreach execution open up the wanted csv file and append the new string, but I'd like to propose a different solution.
If you can, try to always use built-in functionality of Spark, since it is (usually) more optimised and better in handling null inputs. You can achieve the same by:
import org.apache.spark.sql.functions.{lit, concat, col}

val modifiedFinalResult = finalResult.select(
  concat(
    lit("new values of param1 and param2 are -> "),
    col("input1"),
    lit(","),
    col("input2")
  ).alias("string")
)
In the variable modifiedFinalResult you will have a Spark dataframe with a single column named string, which represents exactly the same output as the variable string in your code. Afterwards you can save the dataframe directly as a single csv file (using the repartition functionality):
modifiedFinalResult.repartition(1).write.format("csv").save("path/to/your/csv/output")
PS: Also a suggestion for the future, try to avoid naming variables after data types.
UPDATE: Fixed the empty rows issue by using concat_ws instead of concat and applying coalesce to each field. It seems some null values were turning the entire concatenated string into null after the transformation. Nevertheless, this solution works for now!
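For reference, a minimal sketch of the null-safe variant described in the update (same assumed input1/input2 columns as above):
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}

val modifiedFinalResult = finalResult.select(
  concat_ws("",                                   // concat_ws already skips nulls
    lit("new values of param1 and param2 are -> "),
    coalesce(col("input1"), lit("")),             // coalesce additionally guards each field
    lit(","),
    coalesce(col("input2"), lit(""))
  ).alias("string")
)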

change a dataframe row value with dynamic number of columns spark scala

I have a dataframe (containing 10 columns) for which I want to change the value of each row (for the last column only). I have written the following code for this:
val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r => {
  Row(r.get(0), r.get(1),
      r.get(2), r.get(3),
      r.get(4), r.get(5),
      r.get(6), r.get(7),
      r.get(8), decrementCounter(r))
}), WRADF.schema)
I want to change the value of the 10th column only (for which I wrote the decrementCounter() function). But the above code only runs for dataframes with 10 columns. I don't know how to convert this code so that it can run for a different dataframe (with a different number of columns). Any help will be appreciated.
Don't do something like this. Define a udf instead:
import org.apache.spark.sql.functions.udf
val decrementCounter = udf((x: T) => ...) // adjust the type and body to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))
I think UDF will be a better choice because it can be applied using the Column name itself.
For more on udf you can take a look here : https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code just use this :
import org.apache.spark.sql.functions.udf
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
What it does is apply the decrementCounter function to each and every value of the column columnName.
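For the original question (a dynamic number of columns), a minimal sketch that applies the udf to whatever the last column happens to be; the Long counter type is an assumption, and the decrement body is a placeholder for your own decrementCounter logic:
import org.apache.spark.sql.functions.{col, udf}

// assumed: the counter logic only needs the value of the last column itself
val decrementCounterUDF = udf((x: Long) => x - 1)

// pick the last column by name, so the same code works for any column count
val lastCol = WRADF.columns.last
val newDF = WRADF.withColumn(lastCol, decrementCounterUDF(col(lastCol)))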
I hope this helps, cheers !

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side Dataframe instructions: your if-else expression would be evaluated once on the driver (and not per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which merely compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
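Since the question asks about all n columns, a minimal sketch that folds the same expression over every column (assuming the empty-string check makes sense for each column, e.g. they are string-typed):
import org.apache.spark.sql.functions.{col, lit, when}

// apply the same when/otherwise replacement to every column of the dataframe
val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}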

Aggregate a Spark data frame using an array of column names, retaining the names

I would like to aggregate a Spark data frame using an array of column names as input, and at the same time retain the original names of the columns.
df.groupBy($"id").sum(colNames:_*)
This works but fails to preserve the names. Inspired by the answer found here I unsuccessfully tried this:
df.groupBy($"id").agg(sum(colNames:_*).alias(colNames:_*))
error: no `: _*' annotation allowed here
It works to take a single element like
df.groupBy($"id").agg(sum(colNames(2)).alias(colNames(2)))
How can I make this happen for the entire array?
Just provide a sequence of columns with aliases:
val colNames: Seq[String] = ???
val exprs = colNames.map(c => sum(c).alias(c))
df.groupBy($"id").agg(exprs.head, exprs.tail: _*)