I'm applying the same methods to multiple dataframes in Spark Scala, how can I parallelize this?

Hi how's it going? I'm currently looping through all of my dataframes and running essentially the same queries/filters on them. Would there be a way to run this more effectively, in parallel? Here's sample code...
for (db <- list_of_dbs) {
  var df1 = spark.read
    .format("csv")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path + db + ".csv")
    .withColumn("name_of_data", lit(db))

  if (db != "rules") {
    val conversion = mappingDF
      .filter(col("col1").isNotNull and col("name") === db)
  }

  // etc etc...
}
Is there a way to run this on all of the dataframes at once, essentially getting rid of the for loop?

You can union all of your dataframes and then apply the filters/queries to the result:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#unionByName(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]
You can also use the csv method of DataFrameReader to load multiple CSVs at the same time (recommended):
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame
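For illustration, a minimal sketch of the multi-path approach, assuming all the files share a schema; path, list_of_dbs and mappingDF are the names from the question, and the regular expression used to recover the source name, as well as expressing the per-db mapping lookup as a join, are assumptions:

import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

// Read every CSV in a single call; Spark scans the files in parallel across the cluster.
val paths = list_of_dbs.map(db => path + db + ".csv")
val allDfs = spark.read
  .format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(paths: _*)
  // Recover the source name from the file path instead of tagging each frame with lit(db).
  .withColumn("name_of_data", regexp_extract(input_file_name(), "([^/]+)\\.csv$", 1))

// Apply the shared logic once on the combined frame.
val conversion = allDfs
  .filter(col("name_of_data") =!= "rules")
  .join(mappingDF.filter(col("col1").isNotNull), col("name_of_data") === mappingDF("name"))

This keeps a single DataFrame, so the subsequent filters and joins run as one distributed job instead of one job per file.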

Related

How to access the last two characters of each cell of a Spark DataFrame to do some calculations on their values using Scala

I am using Spark with Scala. After loading the data into a Spark DataFrame, I want to access each cell of the DataFrame to do some calculations. The code is the following:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
  .option("dbtable", "table")
  .option("user", "orcl")
  .option("password", "*******")
  .load()

val row_cnt = df.count()
for (i <- 0 to row_cnt.toInt) {
  val t = df.select("CST_NUM").take(i)
  println("start_date:", t)
}
The output is like this:
(start_date:,[Lorg.apache.spark.sql.Row;@38ae69f7)
I can access the value of each cell using foreach, but I cannot return it to another function.
Would you please guide me on how to access the value of each cell of a Spark DataFrame?
Any help is really appreciated.
You need to learn how to work with Spark efficiently - right now your code isn't very optimal. I recommend reading the first chapters of the Learning Spark, 2nd edition book to understand how to work with Spark - it's freely downloadable from Databricks' site.
Regarding your case, you need to change your code to do a single .select instead of doing it in a loop, and then you can return the data to the caller. How you do that depends on the amount of data you need to return - usually you should return the DataFrame itself (maybe only a subset of columns) and let callers transform the data as they need; that way you can take advantage of distributed computation.
If you have a small dataset (hundreds/thousands of rows), then you can materialize it as a Scala/Java collection and return that to the caller. For example, this could be done as follows:
val t = df.select("CST_NUM")
val coll = t.collect().map(_.getInt(0))
In this case we select only one column from the DataFrame (CST_NUM), use .collect to bring all rows to the driver node, and then extract the column value from each Row object. I've used .getInt for that, but you can use whatever Row API function matches the type of your column: .getLong, .getString, etc.
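Tying this back to the title of the question, here is a minimal sketch of both variants for taking the last two characters of each value, assuming CST_NUM is (or can be cast to) a string; the column alias and variable names are illustrative:

import org.apache.spark.sql.functions.{col, substring}

// Distributed variant: compute the last two characters of CST_NUM without collecting anything.
val lastTwo = df.select(substring(col("CST_NUM").cast("string"), -2, 2).as("cst_suffix"))

// Driver-side variant for small results: collect and work with plain Scala values.
val suffixes: Array[String] = lastTwo.collect().map(_.getString(0))

The first variant stays distributed, so any further calculations on the suffix can be expressed as column operations as well.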

spark.read doesn't work inside a Scala UDF function

I am trying to use spark.read to get a file's row count inside my UDF, but when I execute it, the program hangs at that point.
I am calling a UDF in withColumn of a DataFrame. The UDF has to read a file and return its count, but it is not working. I am passing a variable value to the UDF function. When I remove the spark.read code and simply return a number, it works, but spark.read does not work through the UDF.
def prepareRowCountfromParquet(jobmaster_pa: String)(implicit spark: SparkSession): Int = {
  print("The variable value is " + jobmaster_pa)
  print("the count is " + spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt)
  spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt
}

val SRCROWCNT = udf(prepareRowCountfromParquet _)

df.withColumn("SRC_COUNT", SRCROWCNT(lit(keyPrefix)))
The SRC_COUNT column should contain the number of lines in the file.
UDFs cannot use the SparkContext, as it exists only on the driver and isn't serializable.
Generally speaking, you need to read all the CSVs, calculate the counts using a groupBy, and then do a left join back to the df.
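A minimal sketch of that approach, assuming all of the source files live under a common basePath and that keyPrefix matches the file path reported by input_file_name(); both of those names and the join key are assumptions, not part of the original code:

import org.apache.spark.sql.functions.{count, input_file_name, lit}

// Read every source file once, count rows per file, and join the counts back
// instead of calling spark.read inside a UDF.
val counts = spark.read
  .format("csv")
  .option("header", "true")
  .load(basePath + "/*")
  .withColumn("src_file", input_file_name())
  .groupBy("src_file")
  .agg(count(lit(1)).as("SRC_COUNT"))

val result = df
  .withColumn("src_file", lit(keyPrefix)) // same key the UDF call was using
  .join(counts, Seq("src_file"), "left")
  .drop("src_file")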

Spark Scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the df.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy('a)
  .count()
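If the intent of the original reduceByKey(_ + _) was to sum the third column per key rather than to count rows, a hedged variant of the same pattern (the column names here are illustrative, and it assumes spark.implicits._ is in scope, as .toDF already requires):

import org.apache.spark.sql.functions.sum

val summed = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // key and the value to be summed, as in the original snippet
  .toDF("key", "value")
  .groupBy('key)
  .agg(sum('value).as("total"))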

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files, each of which has only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me a null pointer exception.
But if I collect dF1 outside the UDF and use the list inside it, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the null pointer exception here.
Thanks.
Please check this question about accessing a DataFrame inside the transformation of another DataFrame. That is exactly what you are doing with your UDF, and it is not possible in Spark. The solution is either to use a join, or to collect outside the transformation and broadcast the result.
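A minimal sketch of both suggestions, assuming a SparkSession named spark and that the ID column is a string (adjust the getter to your column type):

import org.apache.spark.sql.functions.udf

// Broadcast variant: collect the small ID list once on the driver, ship it to the
// executors explicitly, and reference only the broadcast value inside the UDF.
val ids = dF1.collect().map(_.getString(0)).toSet
val idsBc = spark.sparkContext.broadcast(ids)

val hasIdUdf = udf((x: String) => idsBc.value.contains(x))
val dfFinal = dF2.withColumn("hasId", hasIdUdf($"PID"))

// Join variant (no UDF at all): a left join plus a null check gives the same flag.
// Deduplicate dF1 first if IDs can repeat, to avoid duplicating rows of dF2.
val dF1Renamed = dF1.withColumnRenamed("ID", "ID_match")
val dfFinal2 = dF2
  .join(dF1Renamed, dF2("PID") === dF1Renamed("ID_match"), "left")
  .withColumn("hasId", dF1Renamed("ID_match").isNotNull)
  .drop("ID_match")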

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a DataFrame, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it, as it needs to hold the entire data on a single system, which is not feasible on the production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives an error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], and that's why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected the entire dataset into a single JVM's memory. So using parallelize is not the issue.
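And if the goal of the original snippet was an RDD of tuples rather than RDD[Row], the same idea applies without any collect; the sketch below assumes both columns are strings:

// Keep the data distributed and map Row objects to tuples on the executors.
val bidRdd = spark.sql("select distinct DeviceId, ButtonName from stb")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))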