Spark: Dynamic generation of the query based on the fields in an S3 file - Scala

Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields could be different in each monthly run. Based on this data in S3, we load the data into a table and manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starter I'm presenting the simpler version of the use case.
Approach:
Given the schema-less nature of the data, the number of fields in the S3 file can differ in each run with the addition/deletion of a few fields, which requires manual changes to the SQL every time. I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrame? The S3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from S3 as that is taken care of by the DataFrame. The issue is how we can generate the DataFrame API/Spark SQL code to handle them.
I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL, but I don't think that avoids manually changing the Spark SQL when a new field is added in S3 in the next run. What is the best way to dynamically generate the SQL, or are there better ways to handle the issue?
Usecase-1:
First-run
dataframe: customer, month_1_count (here the dataframe directly points to S3, which has only the required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer, month_1_count, month_2_count (here the dataframe directly points to S3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could provide a direction so that I can explore further.

It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you learn:
DataFrames store their column names in a list attribute: dataframe.columns
Functions can be applied to lists to create new lists, as in "column_search"
The agg function accepts a dictionary of column-to-aggregation mappings, which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you call an action like show(). This means creating intermediate dataframes just to use one piece of the dataframe, like the column list, as I do here is not costly, even though it may seem inefficient if you're used to SQL.
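Since the original question asks for Scala, here is a rough Scala equivalent of the same idea, a minimal sketch assuming the dataframe read from S3 is called originalDf and that every column to be summed contains "month" in its name:
import org.apache.spark.sql.functions.sum

// pick out the columns whose names contain "month"
val monthColumns = originalDf.columns.filter(_.contains("month"))

// build one sum(...) expression per matching column
val aggregations = monthColumns.map(c => sum(c).as(s"sum_$c"))

// agg takes a first expression plus varargs, hence the head/tail split
val groupedDf = originalDf
  .groupBy("customer")
  .agg(aggregations.head, aggregations.tail: _*)

groupedDf.show()
If a later run adds month_3_count, the filter picks it up and the aggregation adjusts without any change to the code.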

Related

How to insert the data from a delta table into a variable in order to apply drools rules on them

I am using Spark with Scala, getting streaming data from Event Hubs and then storing it in a delta table. In order to apply drools rules on the data, I need to pass it through variables... I am stuck at the point where I have to get the data from the delta table into a variable.
It really depends on what data you need to pass to those drools rules, and what you need to return. You can either:
Use a user-defined function (UDF) - you define a function that will receive one or more parameters (the column values of specific rows); a small sketch follows at the end of this answer.
Use the map function of the Dataset / DataFrame class to process the whole Row.
Delta Tables can be read into DataFrames. A variable can be assigned to point to the DataFrame.
val df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
val transformedDf = df.transform(firstTransform).transform(secondTransform)
Hope this helps point you in the right direction.
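For illustration, a minimal sketch of the UDF approach, where applyDroolsRule and someInputColumn are hypothetical placeholders for your actual rule invocation and schema:
import org.apache.spark.sql.functions.{col, udf}

// placeholder standing in for the real drools call
def applyDroolsRule(value: String): String =
  if (value != null && value.nonEmpty) "accepted" else "rejected"

val droolsUdf = udf(applyDroolsRule _)

// each row's column value is passed to the function; the result lands in a new column
val resultDf = df.withColumn("ruleResult", droolsUdf(col("someInputColumn")))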

Is there a way to export csv or other files in Spark 3.0.1 using Scala with a name different than part*?

I have created a cube on two dimensions in Spark using Scala. The data comes from two different dataframes, named "borrowersTable" and "loansTable". They have been registered with "createOrReplaceTempView" so that it is possible to run SQL queries on them. The goal was to create the cube on two dimensions (gender and department), summing up the total number of book loans for a library. With the command
val cube = spark.sql("""
  select borrowersTable.department, borrowersTable.gender, count(loansTable.bibno)
  from borrowersTable, loansTable
  where borrowersTable.bid = loansTable.bid
  group by borrowersTable.gender, borrowersTable.department with cube
""")
I create the cube, which produces the expected result. Then, using the command
cube.write.format("csv").save("file:///....../data/cube")
Spark creates a folder named cube which contains 34 files named part*.csv, each with columns for department, gender, and the count of loans (one row per grouping).
The goal here is to create files taking the names of the first two columns (attributes) in this way: for GroupBy (Attr1, Attr2) the file should be named Attr1_Attr2.
e.g. For (Economics, M) the file should be named Economics_M. For (Mathematics, null) it should be Mathematics_null and so on. Any help would be appreciated.
When you call df.write.format("...").save("..."), each Spark executor saves the partitions it holds into its own part* file. This is the mechanism for storing and loading big files and you cannot change it. However, you can try the following alternatives, whichever works better in your case:
partitionBy:
cube
  .write
  .partitionBy("department", "gender")
  .format("csv")
  .save("file:///....../data/cube")
This will create subfolders with names like department=Physics/gender=M, still containing part* files inside. This structure can later be loaded back into Spark and used for efficient filtering and joins on the partition columns.
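For example, the partitioned output can be read straight back, with Spark's partition discovery turning the directory names back into department and gender columns (a small illustrative sketch):
import org.apache.spark.sql.functions.col

val reloaded = spark.read
  .format("csv")
  .load("file:///....../data/cube")

// partition columns come back as regular columns, so you can filter on them
reloaded.filter(col("department") === "Physics" && col("gender") === "M").show()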
collect:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.Row

cube
  .collect()
  .foreach {
    case Row(department: String, gender: String, _) =>
      // just a simple way to write CSV, you can use any CSV lib here as well
      Files.write(
        Paths.get(s"${department}_$gender.csv"),
        s"$department,$gender".getBytes(StandardCharsets.UTF_8))
  }
If you call collect() you receive your data frame on the driver side as an Array[Row], and then you can do with it whatever you want. The important limitation of this approach is that your data frame has to fit into the driver's memory.

How can I resolve table names to Parquet on the fly?

I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logicalPlan of your DataFrame with
val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical
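Building on that, a minimal sketch of an analyzer-style rule that swaps unresolved table names for the logical plans of Parquet DataFrames; ResolveParquetTables and pathFor are hypothetical names, and how the rule gets registered depends on your Spark version:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// pathFor encodes your own table-name-to-Parquet correspondence
class ResolveParquetTables(sqlContext: SQLContext, pathFor: String => String)
    extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case u: UnresolvedRelation =>
      // read the Parquet data and splice its logical plan into the query
      sqlContext.read.parquet(pathFor(u.tableName)).queryExecution.logical
  }
}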

Is it inefficient to manually iterate Spark SQL data frames and create column values?

In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intense calculations that involve keeping moving averages and recording information as you go through each row (and updating it as you go). I've done a mock-up with a simple Python script and it works, and I am currently looking to translate it to a Scala Spark script that could be run on a larger data set.
The issue is that, for this to be highly efficient with Spark SQL, it seems preferable to use the built-in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems to be a very thought-intensive process, so I'm wondering what the downsides will be if I just manually create the new column values by iterating through each row, keeping track of variables and inserting the column value at the end.
You can convert an RDD into a dataframe. Then use map on the dataframe and process each row as you wish. If you need to add a new column, you can use withColumn; however, that only allows one column to be added and it happens for the entire dataframe. If you want more columns to be added, then inside the map method (a consolidated sketch of steps a-e follows below):
a. Gather the new values based on your calculations.
b. Add these new column values to the main RDD as below:
val newColumns: Seq[Any] = Seq(newcol1,newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here row is the reference to the current row in the map method.
c. Create the new schema as below:
val newColumnsStructType = StructType(Seq(StructField("newcolName1", IntegerType), StructField("newColName2", IntegerType)))
d. Append it to the old schema:
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
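Putting steps a-e together, a minimal end-to-end sketch, assuming you want to keep all original columns (so no .init) and append two hypothetical integer columns whose placeholder calculations stand in for your moving-average logic:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// a./b. compute the new values per row and append them to the existing values
val newRDD = mainDataFrame.rdd.map { row =>
  val newcol1 = 1 // placeholder calculation
  val newcol2 = 2 // placeholder calculation
  Row.fromSeq(row.toSeq ++ Seq(newcol1, newcol2))
}

// c./d. extend the old schema with the new fields
val newSchema = StructType(mainDataFrame.schema ++ Seq(
  StructField("newcolName1", IntegerType),
  StructField("newColName2", IntegerType)))

// e. build the new dataframe
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)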

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive.
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read and check a condition against; if that condition is true, I grab the variables I need in order to query for the random rows to fill in. What I did was run a query on that condition, table.where(value<number), then make it an array using take(num rows). Then, since all of these rows contain the information I need on which random rows are needed from the large Hive table, I iterate through the array.
When I do the query I use ORDER BY RAND() (using sqlContext). I created a var Hive table (to be mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows).
I have tried many different ways to do this, but am not sure of the best way to avoid heavy CPU and temporary disk use. I know that DataFrames aren't intended for incremental adds.
One thing I have thought of trying now is to create a CSV file, write the random rows to that file incrementally in the loop, then when the loop is finished, load the CSV file as a table, and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table in Hive, defining the location, and then let Spark write the output as CSV to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
And then from Spark, with the aid of https://github.com/databricks/spark-csv, write the dataframe to CSV files, appending to the existing ones:
df.write.format("com.databricks.spark.csv").mode(SaveMode.Append).save("/SOME/HDFS/LOCATION/")
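One detail to watch: the external table above declares ';' as its field terminator, so the writer should emit the same delimiter. A minimal sketch, assuming the DataFrameWriter API shown above (the question mentions Spark 1.3, where the older df.save(...) overloads would be used instead) and a HiveContext named sqlContext:
import org.apache.spark.sql.SaveMode

// write with the delimiter the Hive DDL declares
df.write
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")
  .mode(SaveMode.Append)
  .save("/SOME/HDFS/LOCATION/")

// the external table picks the new files up on the next query
sqlContext.sql("SELECT count(*) FROM test").show()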