Apache Spark Scala - data analysis - error

I am new to Apache Spark and Scala and still learning both. I have loaded a dataset and am trying to analyze it. However, when I try a basic aggregation such as max, min or average, I get this error:
error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
Could anyone please shed some light on this? I am running Spark on the cloudlab of an organization.
Code:
// Reading in the csv file
val df = sc.textFile("/user/Spark/PortbankRTD.csv").map(x => x.split(","))
// Select Max of Age
df.select(max($"age")).show()
Error:
<console>:40: error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
df.select(max($"age")).show()
Please let me know if you need any more information.
Thanks

Following up on my comment: the textFile method returns an RDD[String] (an RDD[Array[String]] once you map split over it), and select is a method on DataFrame, not on RDD. You will need to convert your RDD into a DataFrame. You can do this in a number of ways. One example is
import spark.implicits._
val rdd = sc.textFile("/user/Spark/PortbankRTD.csv")
val df = rdd.toDF()
There are also built-in readers for many types of input files:
spark.read.csv("/user/Spark/PortbankRTD.csv")
returns a DataFrame immediately.
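To make the built-in reader route concrete, here is a minimal sketch for Spark 2.x. It assumes the CSV has a header row that contains a column named age (the question does not show the file, so both are assumptions), and the helper names loadCsv and ageStats are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, max, min}

// Hypothetical helper: read a CSV file into a DataFrame.
// Assumes the first line is a header; inferSchema asks Spark to detect numeric columns.
def loadCsv(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path)

// Hypothetical helper: max, min and average of an assumed "age" column.
def ageStats(df: DataFrame): DataFrame =
  df.select(
    max(col("age")).as("max_age"),
    min(col("age")).as("min_age"),
    avg(col("age")).as("avg_age"))
```

In spark-shell, where spark is already defined, ageStats(loadCsv(spark, "/user/Spark/PortbankRTD.csv")).show() would print the three values. Without the header option the columns would be named _c0, _c1 and so on, so $"age" would not resolve.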

Related

How do I use a from_json() dataframe in Spark?

I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct (or array<struct<...>>) column, which means the result is a nested object. If you give it a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
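As a rough sketch of the plain-struct case, the parsed column can also be expanded into top-level columns with "parsed.*", which gives a DataFrame that can be selected from by field name. The schema below and the helper name flattenBody are hypothetical stand-ins, since the question does not show the real jsonSchema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema, only for illustration; substitute the real jsonSchema.
val exampleSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// Hypothetical helper: parse the "body" column as JSON, then lift the
// struct's fields to top-level columns so they can be selected directly.
def flattenBody(df: DataFrame, schema: StructType): DataFrame =
  df.select(from_json(col("body").cast("string"), schema).as("parsed"))
    .select("parsed.*")
```

With this shape, flattenBody(oldDF, jsonSchema).select("name") would resolve, assuming the schema really has a field called name.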

How to parse a csv string into a Spark dataframe using scala?

I would like to convert a RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw this post: Can I read a CSV represented as a string into Apache Spark using spark-csv.
However it's not exactly what I need and I can't figure out a way to modify this piece of code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, each record first needs to be converted to a tuple representing the columns in the dataframe. In this case, this will be an RDD[(String, String, String, Int)] since there are four columns (the last column, age, is converted to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name|    account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330|   NY| 34|
|Kate|3333-544444|   LA| 32|
|Abby|4444-234324|   MA| 56|
+----+-----------+-----+---+
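One caveat with the pattern match above is that a record with the wrong number of fields, or a non-numeric age, would fail the job with a MatchError or NumberFormatException. A possible defensive variant of the string-to-tuple step, sketched here in plain Scala so it can be tried without a cluster, is shown below; the function name parseRecord and the malformed sample rows are hypothetical.

```scala
import scala.util.Try

// Hypothetical helper: turn one CSV line into a typed tuple,
// returning None for rows that do not have four fields or a numeric age.
def parseRecord(line: String): Option[(String, String, String, Int)] =
  line.split(",").map(_.trim) match {
    case Array(name, account, state, age) =>
      Try(age.toInt).toOption.map(a => (name, account, state, a))
    case _ => None
  }

// Two valid rows from the question plus two made-up malformed rows.
val sample = Seq(
  "Mike,2222-003330,NY,34",
  "Kate,3333-544444,LA,32",
  "too,few,fields",
  "Abby,4444-234324,MA,not-a-number")

// Malformed rows are dropped, valid rows become tuples.
val parsed = sample.flatMap(line => parseRecord(line).toList)
parsed.foreach(println)
```

On an RDD the same function could be applied as rdd.flatMap(line => parseRecord(line).toList).toDF(header.split(","): _*), at the cost of silently skipping bad rows.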

Scala repartition cannot resolve symbol

I am trying to save my dataframe as a parquet file with one partition per day, so I am partitioning on the date column. I also want to write one file per partition, so I am using repartition($"date"), but I keep getting errors.
I get "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3://file-path/")
I get "Type mismatch, expected Column, actual String" when I use:
DF.repartition("date")
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use a date type column in repartition? What's wrong here?
To use the $ symbol in place of col(), you first need to import spark.implicits._. Here spark is an instance of SparkSession, so the import must come after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS() respectively.
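The "expected Column, actual String" error from the question suggests that repartition is being given a plain string where a Column is required. A small sketch that avoids the $ syntax altogether by using col() is shown below; the helper name writeOneFilePerDay and its path parameter are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: shuffle rows so that all rows of one date end up in the
// same in-memory partition, then write them under date=<value> directories.
def writeOneFilePerDay(df: DataFrame, path: String): Unit =
  df.repartition(col("date")) // repartition takes Column arguments here
    .write
    .mode("append")
    .partitionBy("date")      // partitionBy takes column names as strings
    .parquet(path)
```

Because col() comes from org.apache.spark.sql.functions, this variant does not depend on import spark.implicits._ being in scope.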

Making RDD operations on sqlContext

I am working through an Apache Spark tutorial, using the Cassandra database, Spark 2.0, and Python.
I am trying to do an RDD operation on the result of a SQL query, following this guide:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html
It says: #The results of SQL queries are RDDs and support all the normal RDD operations.
I currently have this code:
sqlContext = SQLContext(sc)
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'").show(20, False)
df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="wordcount", keyspace="demo")\
    .load()
df.select("word")
df.createOrReplaceTempView("tweets")
usernames = results.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)
AttributeError: 'NoneType' object has no attribute 'map'
If the variable results is a result of sql Query, why am I getting this error? Can anyone please explain this to me.
Everything else works fine and the tables print; the only time I get an error is when I try to do an RDD operation.
Please bear in mind that sc is an existing Spark context.
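It's because show() only prints the content and returns None, so results is None rather than a DataFrame. Assign the query result first and call show() separately:
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)
Note also that in Spark 2.0 a PySpark DataFrame has no map method of its own, so the RDD operation has to go through results.rdd:
usernames = results.rdd.map(lambda p: "User: " + p.word)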

Spark DataFrame filtering: retain element belonging to a list

I am using Spark 1.5.1 with Scala on Zeppelin notebook.
I have a DataFrame with a column called userID with Long type.
In total I have about 4 million rows and 200,000 unique userID.
I have also a list of 50,000 userID to exclude.
I can easily build the list of userID to retain.
What is the best way to delete all the rows that belong to the users to exclude?
Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?
I saw this post and applied its solution (see the code below), but the execution is slow. I am running Spark 1.5.1 on my local machine, which has a decent 16 GB of RAM, and the initial DataFrame fits in memory.
Here is the code that I am applying:
import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))
In the code above:
the initialDataFrame has 3885068 rows, each row has 5 columns, one of these columns called userID and it contains Long values.
The listOfUsersToKeep is an Array[Long] and it contains 150,000 Long userID.
I wonder if there is a more efficient solution than the one I am using.
Thanks
You can either use join:
val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")
val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")
or a broadcast variable and a UDF:
import org.apache.spark.sql.functions.udf
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
It should also be possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
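The question also asks about the opposite direction, dropping the rows of the users to exclude. On newer Spark versions this can be sketched as a broadcast anti join, so the shorter exclusion list is used directly. This assumes Spark 2.0 or later, where the "left_anti" join type is available (it is not in the 1.5.1 release the question uses), and the helper name dropExcluded is hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper: keep only the rows of df whose userID does NOT appear
// in excluded. Both DataFrames are assumed to have a column named "userID".
// Requires Spark 2.0+ for the "left_anti" join type.
def dropExcluded(df: DataFrame, excluded: DataFrame): DataFrame =
  df.join(broadcast(excluded), Seq("userID"), "left_anti")
```

A DataFrame of excluded IDs could be built the same way as usersToKeep above, for example from a list of Long values with toDF("userID").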