How to access the last two characters of each cell of a Spark DataFrame to do some calculations on its value using Scala

I am using Spark with Scala. After loading the data into a Spark DataFrame, I want to access each cell of the DataFrame to do some calculations. The code is the following:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
  .option("dbtable", "table")
  .option("user", "orcl")
  .option("password", "*******")
  .load()
val row_cnt = df.count()
for (i <- 0 to row_cnt.toInt) {
  val t = df.select("CST_NUM").take(i)
  println("start_date:", t)
}
The output is like this:
(start_date:,[Lorg.apache.spark.sql.Row;@38ae69f7)
I can access the value of each cell using foreach, but I cannot return it to another function.
Would you please guide me on how to access the value of each cell of a Spark DataFrame?
Any help is really appreciated.

You need to learn how to work with Spark efficiently; right now your code isn't very optimal. I recommend reading the first chapters of the Learning Spark, 2nd edition book to understand how to work with Spark; it's freely downloadable from Databricks' site.
Regarding your case, you need to change your code to do a single .select instead of doing it inside the loop, and then you can return your data to the caller. How you return it depends on the amount of data: usually you should return the DataFrame itself (maybe only a subset of columns) and let callers transform the data as they need; that way you can take advantage of distributed computation.
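For example, a minimal sketch of such a helper (the function name and column choice are only illustrative) could look like this:
import org.apache.spark.sql.DataFrame

// Hypothetical helper: return a DataFrame with only the needed columns;
// the caller keeps working with a distributed, lazily evaluated dataset.
def customerNumbers(df: DataFrame): DataFrame =
  df.select("CST_NUM")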
If you have a small dataset (hundreds/thousands of rows), then you can materialize it as a Scala/Java collection and return that to the caller. For example, this could be done as follows:
val t = df.select("CST_NUM")
val coll = t.collect().map(_.getInt(0))
In this case, we're selecting only one column from our DataFrame (CST_NUM), then using .collect to bring all rows to the driver node, and then extracting the column value from each Row object. I've used .getInt for that, but you can use something else from the Row API; pick the function that matches the type of your column (.getLong, .getString, etc.).
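Since the original question is about the last two characters of each value, here is a hedged sketch of how that could be computed inside Spark before collecting anything (this assumes CST_NUM can be cast to a string; the cst_suffix alias is only illustrative):
import org.apache.spark.sql.functions.{col, substring}

// substring with a negative position counts from the end of the string,
// so this keeps only the last two characters of each CST_NUM value.
val lastTwo = df.select(
  substring(col("CST_NUM").cast("string"), -2, 2).as("cst_suffix")
)

// Only collect if the result comfortably fits on the driver.
val suffixes: Array[String] = lastTwo.collect().map(_.getString(0))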

Related

How to append an index column to a spark data frame using spark streaming in scala?

I am using something like this:
df.withColumn("idx", monotonically_increasing_id())
But I get an exception as it is NOT SUPPORTED:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
Any ideas how to add an index or row-number column to a Spark streaming DataFrame in Scala?
Full stacktrace: https://justpaste.it/5bdqr
There are a few operations that cannot exist anywhere in a streaming plan of Spark Structured Streaming, and unfortunately monotonically_increasing_id() is one of them. To double-check this fact, transformed1 in the snippet below fails with the same error as in your question (the check itself lives in Spark's UnsupportedOperationChecker, referenced in your stack trace):
import org.apache.spark.sql.functions._

val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")

val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")

// Fails: monotonically_increasing_id() is not supported on streaming Datasets
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream.format("parquet")
  .option("path", "/tmp/out2").option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()

import org.apache.spark.sql.expressions.Window

// Also fails: non-time-based windows are not supported on streaming Datasets
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")
val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream.format("parquet")
  .option("path", "/tmp/out2").option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()
I also tried to add indexing with a Window over a column in the DataFrame (transformed2 in the snippet above); it also failed, but with a different error:
"Non-time-based windows are not supported on streaming DataFrames/Datasets"
You can find all the unsupported-operator checks for Spark Streaming in the UnsupportedOperationChecker source; it seems the traditional ways of adding an index column in batch Spark don't work in Spark Streaming.

I'm applying the same methods to multiple dataframes in spark scala, how can I parallelize this?

Hi, how's it going? I'm currently looping through all of my DataFrames and running essentially the same queries/filters on them. Would there be a way to run this more efficiently, in parallel? Here's some sample code:
for (db <- list_of_dbs) {
  var df1 = spark.read
    .format("csv")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path + db + ".csv")
    .withColumn("name_of_data", lit(db))
  if (db != "rules") {
    val conversion = mappingDF
      .filter(col("col1").isNotNull and col("name") === db)
  }
  etc etc...
Is there a way to run this on all DataFrames at once, essentially getting rid of the for loop?
You can union all your DataFrames and then apply the filters/queries to the result:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#unionByName(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]
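For example, a hedged sketch assuming the CSVs share the same columns and that spark, path, and list_of_dbs are as in your snippet:
import org.apache.spark.sql.functions.{col, lit}

// Load each db once, tag it with its name, then union everything so the
// shared filters/queries run over a single distributed DataFrame.
val perDb = list_of_dbs.map { db =>
  spark.read
    .format("csv")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path + db + ".csv")
    .withColumn("name_of_data", lit(db))
}

val all = perDb.reduce(_ unionByName _)

// Shared logic is now written once, e.g. excluding the "rules" data:
val withoutRules = all.filter(col("name_of_data") =!= "rules")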
You can also use the csv method of DataFrameReader to load multiple CSV files at the same time (recommended):
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame
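A hedged sketch of that approach, again assuming the CSVs share a schema; note that input_file_name() records the full file path rather than the bare db name:
import org.apache.spark.sql.functions.input_file_name

// Read every CSV in a single call; input_file_name() tags each row with the
// file it came from, replacing the per-db loop.
val paths = list_of_dbs.map(db => path + db + ".csv")

val all = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(paths: _*)
  .withColumn("name_of_data", input_file_name())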

Spark Efficiently Extract Large Query Result From Relational DB

(Spark beginner here) I wrote a Spark job to extract data from an Oracle database and write it to Parquet files. It works, but I am not satisfied with the batching solution I used. Is there a more efficient way?
I need to extract data from queries, not simply tables. I used a straightforward solution: take 1000 IDs (the maximum number of expressions allowed in an IN list in Oracle) and build the query as a string. Then I pass it to Spark and extract the data to Parquet. This works, but I wonder whether there are better, more efficient/idiomatic ways to do this. For instance, I am doing all the batching myself, whereas Spark was built to split and distribute work.
My code works on a small data set. I will have to scale it up by a factor of 100, and I would like to know the most idiomatic way of doing this.
val query =
s"""
|(SELECT tbl.*
|FROM $tableName tbl
|WHERE tbl.id IN (${idList.mkString("'", "','", "'")})
|) $tableName
""".stripMargin
private def queryToDf(query: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", query)
    .option("user", appConfig.dbUsername)
    .option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
}
Using Spark 2.4.0, Scala 2.12, Oracle DB.
This would probably work better if you just let Spark do all the work of distributing the data that is loaded and deciding how to process it. Here you are applying the filter before loading the data; I haven't worked with a JDBC source before, but I would assume the query is executed by the database before Spark loads the data.
So the solution is to pass the heavy work of filtering the data to Spark, by setting the dbtable property to the actual table name and expressing the filter on the Spark side:
val idFilter =
  s"id IN (${idList.mkString("'", "','", "'")})"

private def queryToDf(tableName: String, idFilter: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", tableName)
    .option("user", appConfig.dbUsername)
    .option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
    .where(idFilter) // the IN predicate is now evaluated by Spark instead of being embedded in the dbtable query
}
I have not tested this, though, so there may be some mistakes, but the general approach shouldn't be that different.
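As a further untested sketch, the same filter can also be written with the Column API instead of a SQL string; fullTable below stands for the DataFrame returned by the JDBC load above:
import org.apache.spark.sql.functions.col

// Same IN predicate, expressed with the Column API instead of a SQL string.
val filtered = fullTable.where(col("id").isin(idList: _*))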

spark OOM on simple read and write

I'm using Spark to read from a Postgres table and dump it to Google Cloud Storage as JSON. The table is quite big, many hundreds of GBs. The code is relatively straightforward (please see below), but it fails with an OOM. It seems like Spark is trying to read the entire table into memory before starting to write it. Is this true? How can I change the behavior so that it reads and writes in a streaming fashion?
Thanks.
SparkSession sparkSession = SparkSession
    .builder()
    .appName("01-Getting-Started")
    .getOrCreate();

Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", properties);
dataset.write().mode(SaveMode.Append).json("gs://some/path");
There are a couple of overloaded DataFrameReader.jdbc() methods that are useful for splitting up JDBC data on input.
jdbc(String url, String table, String[] predicates, java.util.Properties connectionProperties) - the resulting DataFrame will have one partition for each predicate given, e.g.
String[] preds = {"state='Alabama'", "state='Alaska'", "state='Arkansas'", ...};
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", preds, properties);
jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions, java.util.Properties connectionProperties) - Spark will divide the data based on a numeric column columnName into numPartitions partitions between lowerBound and upperBound inclusive, e.g.:
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", "<idColumn>", 1, 1000, 100, properties);
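For reference, the same column-based partitioning can be expressed through the option-based reader API; this is only an illustrative sketch in Scala (matching the rest of this page), with placeholder values and an assumed SparkSession named spark:
// Spark issues numPartitions parallel queries over ranges of partitionColumn,
// so no single task has to hold the whole table in memory.
val dataset = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://<ip>:<port>/<db>")
  .option("dbtable", "<table>")
  .option("partitionColumn", "<idColumn>")
  .option("lowerBound", "1")
  .option("upperBound", "1000")
  .option("numPartitions", "100")
  .option("user", "<user>")
  .option("password", "<password>")
  .load()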

spark scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
import spark.implicits._

val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy('a)
  .count()
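If the goal is actually the reduceByKey(_ + _) sum from the original snippet rather than a row count, a hedged sketch of the DataFrame equivalent (using the column names above) would be:
import org.apache.spark.sql.functions.sum

// groupBy + sum is the DataFrame counterpart of reduceByKey(_ + _):
// it sums column "b" for each distinct key in column "a".
val totals = df
  .groupBy("a")
  .agg(sum("b").as("total"))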