Spark-shell running a SELECT for a dataframe - scala

I've created a Spark Scala script to load a file with customer information. Then I created a case class to map the records and show them as a table. My script is below:
//spark context
sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // needed for toDF(); spark-shell imports this automatically

//Define the class to map customers coming from the data input
case class customer(cusid: Int, name: String, city: String, province: String, postalcode: String)

//load the file info
val customer_file = sc.textFile("file:///home/ingenieroandresangel/scalascripts/customer.txt")
val customer_rdd = customer_file.map(_.split(",")).map(p => customer(p(0).toInt, p(1), p(2), p(3), p(4)))
val customerdf = customer_rdd.toDF()
Now I need to run Spark SQL queries to get back just one column of my DataFrame, for example the name column:
print(customerdf.select("name"))
Nevertheless, the results are not as expected: instead of the rows of the name column, the print only shows the DataFrame itself (i.e. its schema), not the data.
Question: how should I run the select so that I get back just the name column of my DataFrame? Thanks.

The result is correct: select is only a transformation, so nothing is computed yet.
If you save the result to a Parquet or CSV file, you will see the data and can confirm that the column has been selected.
Meanwhile, you can see the result on the screen by doing:
val selecteddf = customerdf.select("name")
selecteddf.show(false)
which will show the first 20 rows of the name column (passing false disables truncation of long values).
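For example, to confirm on disk what the select produced, you could write the result out and read it back (a minimal sketch; the output directory is an assumed placeholder):
// Write the selected column to disk (the directory below is just a placeholder)
selecteddf.write.parquet("file:///home/ingenieroandresangel/scalascripts/names_parquet")
// Read it back and display the rows to confirm only the name column was kept
sqlContext.read.parquet("file:///home/ingenieroandresangel/scalascripts/names_parquet").show()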

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result were:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that I could fix it with a map, but I am trying to create a generic method that accepts any schema, so I cannot add a map that does the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
Apparently the method's return type has the correct signature, but when the job actually runs it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performant way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Under the hood, columns is implemented in terms of def schema, which is defined on Dataset; you can see the definition of def columns in the Spark source.
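Putting the pieces together, a sketch might look like the following (url, table and connectionProperties are placeholders for your actual JDBC settings, not values from the question):
import java.util.Properties

// Placeholders for your actual PostgreSQL connection details
val url = "jdbc:postgresql://host:5432/db"
val table = "my_table"
val connectionProperties = new Properties()

// Use the Dataset's column names to drive the select on the JDBC DataFrame
val desiredColumns: Array[String] = yearsDS.columns
val jdbcDf = spark.read
  .jdbc(url, table, connectionProperties)
  .select(desiredColumns.head, desiredColumns.tail: _*)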
Maybe you got a DataFrame, not a Dataset.
Try using as to convert the DataFrame into a Dataset, like this:
val year = Year(1, 1, 1)
val years = Array(year, year).toList
import spark.implicits._

val df = spark.sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]

println(df.collect().toList)
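For completeness, here is a minimal sketch of the generic method without the method-level type parameter that shadows the class-level one, assuming the caller already has a typed Dataset (for example obtained via .as[Year]):
import org.apache.spark.sql.Dataset

// Only the class-level A is used; the method no longer declares its own A
class SchemeList[A] {
  def set(ds: Dataset[A]): List[A] = ds.collect().toList
}

// Usage sketch: convert the DataFrame to a Dataset[Year] first, then collect
// val years: Dataset[Year] = df.as[Year]
// val yearList: List[Year] = new SchemeList[Year].set(years)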

Unsupported operation exception from spark: Schema for type org.apache.spark.sql.types.DataType is not supported

Spark Streaming:
I am receiving a DataFrame that consists of two columns. The first column is of string type and contains a JSON string, and the second column holds the schema for each value in the first column.
Batch: 0
-------------------------------------------
+--------------------+--------------------+
| value| schema|
+--------------------+--------------------+
|{"event_time_abc...|`event_time_abc...|
+--------------------+--------------------+
The table is stored in the val input (an immutable variable). I am using the DataType.fromDDL function to convert the string value into a parsed JSON column in the following way:
val out= input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
where ddl is a predefined function, DataType.fromDDL(_: String): DataType in Spark (Scala), but I have registered it as a UDF so that I can use it on a whole column instead of on a single string. I have done it in the following way:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
and here is the final transformation on both columns, value and schema, of the input table:
val out = input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
However, I get an exception from the registration at this line:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
The error is:
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
If I use:
val out = input.select(from_json(col("value").cast("string"),DataType.fromDDL("`" + "event_time_human2"+"`" +" STRING")).alias("value"))
then it works, but as you can see I am only passing a manually typed string (taken from the schema column) to the function DataType.fromDDL(_: String): DataType.
So how can I apply this function to the whole column without registration, or is there any other way to register the function?
EDIT: the first argument of from_json requires a column while the second argument requires a schema, not a column. Hence, I guess a manual approach is required to parse each value field with its corresponding schema field. After some investigation I found out that DataFrames do not support DataType.
Since a bounty has been set on this question, I would like to provide additional information regarding the data and schema. The schema is defined in DDL (a string) and can be parsed with the fromDDL function. The value is a simple JSON string that will be parsed with the schema we derive using the fromDDL function.
The basic idea is that each value has its own schema and needs to be parsed with the corresponding schema. A new column should be created where the result will be stored.
Data:
Here is one example of the data:
value = {"event_time_human2":"09:45:00 +0200 09/27/2021"}
schema = "`event_time_human2` STRING"
There is no need to convert it to a proper time format; a plain string is fine.
This is in a streaming context, so not all approaches work.
Schemas are applied and validated before runtime, that is, before the Spark code is executed on the executors. Parsed schemas must be part of the execution plan, therefore schema parsing can't be executed dynamically as you intended. This is the reason you see the exception
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
only for the UDF. Consequently, DataType.fromDDL should be used only inside driver code and not in runtime/executor code, i.e. the code inside your UDF function. Inside the UDF, Spark has already transformed the imported data by applying the schemas that you specified on the driver side. This is why you can't use DataType.fromDDL directly in your UDF: it is essentially useless there. All of the above means that inside UDF functions we can only use primitive Scala/Java types and some wrappers provided by the Spark API, e.g. WrappedArray.
An alternative could be to collect all the schemas on the driver and then build a map with a (schema, dataframe) pair for each schema.
Keep in mind that collecting data to the driver is an expensive operation, and it only makes sense if you have a reasonable number of unique schemas, i.e. at most a few thousand. Also, applying these schemas to each dataset needs to be done sequentially in the driver, which is also quite expensive, therefore the suggested solution will only work efficiently if the number of unique schemas is limited.
Up to this point, your code could look like this:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

val df = Seq(
  ("""{"event_time_human2":"09:45:00 +0200 09/27/2021", "name":"Pinelopi"}""", "`event_time_human2` STRING, name STRING"),
  ("""{"first_name":"Pin", "last_name":"Halen"}""", "first_name STRING, last_name STRING"),
  ("""{"code":993, "full_name":"Sofia Loren"}""", "code INT, full_name STRING")
).toDF("value", "schema")

val schemaDf = df.select("schema").distinct()

val dfBySchema = schemaDf.collect().map { row =>
  val schemaValue = row.getString(0)
  val ddl = StructType.fromDDL(schemaValue)
  val filteredDf = df.where($"schema" === schemaValue)
    .withColumn("value", from_json($"value", ddl))

  (schemaValue, filteredDf)
}.toMap

// Map(
//   `event_time_human2` STRING, name STRING -> [value: struct<event_time_human2: string, name: string>, schema: string],
//   first_name STRING, last_name STRING -> [value: struct<first_name: string, last_name: string>, schema: string],
//   code INT, full_name STRING -> [value: struct<code: int, full_name: string>, schema: string]
// )
Explanation: first we gather each unique schema with schemaDf.collect(). Then we iterate through the schemas and filter the initial df based on the current schema, using from_json to convert the current string value column with that specific schema.
Note that we can't have one common column with different data types; this is why we create a separate df for each schema instead of one final df.
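Since input is a streaming DataFrame, you cannot call collect() on it directly; a rough, untested sketch (assuming Spark 2.4+ so that foreachBatch is available) would apply the same per-schema logic to each micro-batch:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// For every micro-batch, parse each value with its corresponding schema
val processBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  val schemas = batchDf.select("schema").distinct().collect().map(_.getString(0))
  schemas.foreach { schemaValue =>
    val ddl = StructType.fromDDL(schemaValue)
    val parsed = batchDf
      .where(batchDf("schema") === schemaValue)
      .withColumn("value", from_json(batchDf("value").cast("string"), ddl))
    parsed.show(false) // replace with writing to your actual sink
  }
}

val query = input.writeStream.foreachBatch(processBatch).start()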

Saving and Overwriting a file in Spark Scala

I have a text file where the first column is a table name and the second column is a date. The two columns are delimited by a space. The data looks as follows:
employee.txt
organization 4-15-2018
employee 5-15-2018
My requirement is to read the file, update the date column based on the business logic, and save/overwrite the file. Below is my code:
object Employee {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("employeedata")
    val sc = new SparkContext(conf)

    var input = sc.textFile("D:\\employee\\employee.txt")
      .map(line => line.split(' '))
      .map(kvPair => (kvPair(0), kvPair(1)))
      .collectAsMap()

    //Do some operations
    //Do iteration and update the hashmap as follows
    val finalMap = input + (tableName -> updatedDate)

    sc.stop()
  }
}
How to save/overwrite(if exists) the finalMap in the above scenario?
My requirement is to read the file and update the date column based on the business logic and save/overwrite the file.
Never do something like this directly. Always:
1. Write the data to temporary storage first.
2. Delete the original using standard file system tools.
3. Rename the temporary output using standard file system tools.
An attempt to overwrite data in place will, with high probability, result in partial or complete data loss. A minimal sketch of this pattern follows below.
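For example, assuming finalMap holds the updated (table, date) pairs and reusing the paths from the question (note that saveAsTextFile writes a directory, not a single file):
import org.apache.hadoop.fs.{FileSystem, Path}

val finalPath = "D:\\employee\\employee.txt"
val tmpPath = "D:\\employee\\employee_tmp"

// 1. Write the updated data to a temporary location first
sc.parallelize(finalMap.toSeq)
  .map { case (table, date) => s"$table $date" }
  .saveAsTextFile(tmpPath)

// 2. Delete the original, then 3. rename the temporary output
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(finalPath), true)
fs.rename(new Path(tmpPath), new Path(finalPath))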

Return Temporary Spark SQL Table in Scala

First I convert a CSV file to a Spark DataFrame using
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/usr/people.csv")
After that, typing df in the shell returns:
res30: org.apache.spark.sql.DataFrame = [name: string, age: string, gender: string, deptID: string, salary: string]
Then I use df.registerTempTable("people") to register df as a Spark SQL temporary table.
But after that, when I type people expecting to get the table back, I instead get:
<console>:33: error: not found: value people
Is it because people is a temporary table?
Thanks
When you register a temp table using the registerTempTable command, it becomes available inside your SQLContext.
This means that the following is incorrect and will give you the error you are getting:
scala> people.show
<console>:33: error: not found: value people
To use the temp table, you'll need to call it with your sqlContext. Example :
scala> sqlContext.sql("select * from people")
Note: df.registerTempTable("df") will register a temporary table named df, corresponding to the DataFrame df you apply the method on.
So persisting df won't persist the table but the DataFrame, even though the SQLContext will be using that DataFrame.
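If you would rather get a DataFrame handle back by name instead of writing SQL, sqlContext.table should also work (a small sketch):
// Retrieve the registered temp table as a DataFrame by its name
val peopleDf = sqlContext.table("people")
peopleDf.show()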
The above answer is right for Zeppelin too. If you want to run println to see the data, you have to collect it back to the driver to see the output.
val querystrings = sqlContext.sql("""
  select visitorDMA, visitorIpAddress, visitorState, allRequestKV
  from {redacted}
  limit 1000""")

querystrings.collect.foreach(entry => {
  print(entry.getString(3).toString() + "\n")
})

How to add source file name to each row in Spark?

I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.
I've seen others ask a similar question, but all their answers used wholeTextFile, whereas I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show
but that's a dataframe.
I am not sure how to add it as a column to each row (or whether that result is ordered the same as the initial data, though I assume it is), nor how to do this as a general solution for all input types.
Another solution I just found is to add the file name as one of the columns of the DataFrame:
import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename", input_file_name())
Ref:
spark load data and add filename as dataframe column
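The same approach should work for the CSV and JSON readers mentioned in the question, assuming Spark 1.6+ where input_file_name is available; for example with the Spark-CSV package (the path is a placeholder):
import org.apache.spark.sql.functions.input_file_name

// Placeholder path; the json and parquet readers work the same way
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/blah/csvdir")
  .withColumn("filename", input_file_name())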
When you create an RDD from a text file, you probably want to map the data into a case class, so you can add the input source in that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
If you do not want to mix "business data" with meta data:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
and if you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
Given a list of file names in a value files (standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
  val md = InputSourceMetaData(f.getPath.toString, f.getLen)

  sc.textFile(md.path).map { l =>
    val tokens = l.split(",")
    PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
  }
}
Now you can reduce the list of RDDs into a single one; the function passed to reduce concatenates all the RDDs:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test if this distributes/performs well with large files.
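If the list of files is large, an alternative sketch (assuming rdds is a standard Seq of RDDs, as above) is SparkContext.union, which combines them in one call instead of building a chain of pairwise unions:
// Union all per-file RDDs in a single call
val rdd = sc.union(rdds)
rdd.collect().foreach(println)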