How to assign values from a CSV to individual variables using Scala

I'd like first to apologize if this is not the proper way of asking a question, but it's my first one.
I have a CSV with 2 columns, one with names and another one with values.
I have imported the CSV as a Scala DataFrame into Spark and I want to assign all the values to individual variables (getting the variable names from the column name).
The DataFrame looks like this (there are more variables and the total number of variables can change):
+--------------------+-----+
|                name|value|
+--------------------+-----+
|           period_id|  188|
|            homeWork| 0.75|
|             minDays|    7|
...
I am able to assign the value of one individual row to a variable using the code below, but I'd like to do it for all the records automatically.
val period_id = vbles.filter(vbles("name").equalTo("period_id")).select("value").first().getString(0).toDouble
The idea I had was to iterate through all the rows in the DataFrame and run the code above each time, something like
for (valName <- name) {
  val valName = vbles.filter(vbles("name").equalTo(valName)).select("value").first().getString(0).toDouble
}
I have tried some ways of iterating through the rows of the DataFrame but I haven't been successful.
What I'd like to get is as if I'd done this manually:
val period_id = 188
val homeWork = 0.75
val minDays = 7
...
I suppose there are smarter ways of doing this, but if I could just iterate through the rows that would be fine; any solution that works is welcome.
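A minimal sketch of the direction I'm imagining (assuming it's fine to collect this small name/value DataFrame to the driver and that every value can be treated as a Double), collecting everything into a Map keyed by name instead of separate vals, since val names can't be created dynamically in Scala:
// Collect the small name/value DataFrame into a Map on the driver
val valuesByName: Map[String, Double] = vbles
  .select("name", "value")
  .collect()
  .map(row => row.getString(0) -> row.getString(1).toDouble)
  .toMap

// Then look values up by name instead of using individual variables
val period_id = valuesByName("period_id") // 188.0
val homeWork = valuesByName("homeWork")   // 0.75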
Thanks a lot

Related

Is there anyway to get schema from the parquet files being queried?

So, I have parquet files separated into folders by date, something like:
root_folder
|_date=20210101
|_ file_A.parquet
|_date=20210102
|_ file_B.parquet
file_A has 2 columns (X, Y); file_B has 3 columns (X, Y, Z).
But when I query the date 20210102 using SparkSession, it uses the schema from the topmost folder, which is 20210101, and when I try to query column Z it doesn't exist.
I've tried the mergeSchema=true option, but it doesn't fit my use case because I need to treat the files that have column Z differently, and I'm checking whether column Z exists using DataFrame.columns.
Is there any workaround for this? I need to get the schema from only the partition I query.
If computational cost is not a concern, you can solve this problem by reading the entire dataset into Spark, filtering to the date you are looking for, and then dropping the column if it is entirely null.
This performs a pass over the data just to figure out whether the column should be dropped, which is not great. Luckily .where and .count parallelize pretty well, so if you have enough compute it might be okay.
import org.apache.spark.sql.functions.col

val base = spark.read
  .option("mergeSchema", true)
  .parquet("root_folder/")
  .where(col("date") === "20210101")

// Drop Z only when it is entirely null for this date
val df = if (base.where(col("Z").isNotNull).count == 0) base.drop("Z") else base

df.schema // Should only have X, Y
If you want to generalize this into a function that drops all empty columns, you can compute the .isNotNull count for all columns in 1 pass.
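A rough sketch of that generalisation (an assumption on my part, not tested against your data): count the non-null values of every column in one aggregation, then drop the columns whose count is zero.
import org.apache.spark.sql.functions.{col, count, when}

// Count the non-null values of every column in a single pass
val nonNullCounts = base
  .select(base.columns.map(c => count(when(col(c).isNotNull, true)).alias(c)): _*)
  .first()

// Drop every column that is entirely null in this slice of the data
val emptyCols = base.columns.filter(c => nonNullCounts.getAs[Long](c) == 0L)
val cleaned = base.drop(emptyCols: _*)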

Update multiple column values of a dataframe unconditionally with different variables

I have a dataframe with some 10 columns. I have selected 4 of these 10 columns and cleaned their values (by calling an external API and using its response). I would like to create a new dataframe now (as the old one cannot be updated), update these 4 columns with their cleaned values (as returned by the API), and keep the other 6 as is.
I tried exploring .na.replace and .withColumn, but they both work on some condition for the columns.
val newdf = df.withColumn("col1", when(col("col1") === "XYZ", cleanedcol1)
  .otherwise(col("col1")))
And
val newdf = df.na.replace("col1", Map("col1" -> cleanedcol1))
The first snippet matches the col1 value against "XYZ" and then replaces it. I want an unconditional change.
The second one actually looks for the string "col1" in the col1 column and hence does not replace anything.
What is the optimum approach to achieve this? The source of the df is Kafka and hence traffic would be fast.
You can make an unconditional change with withColumn; just write
val newdf = df.withColumn("col1", newColumnValue)
Where newColumnValue is the new value you want to set for the column.
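Applied to the four columns from the question, that just becomes a chain of withColumn calls. A small sketch (the cleanValue UDF and its body are assumptions for illustration; in practice it would wrap whatever the external cleaning API returns):
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical UDF standing in for the external cleaning API
val cleanValue = udf((raw: String) => raw.trim.toUpperCase)

// Unconditionally replace the four cleaned columns, keep the other six untouched
val newdf = df
  .withColumn("col1", cleanValue(col("col1")))
  .withColumn("col2", cleanValue(col("col2")))
  .withColumn("col3", cleanValue(col("col3")))
  .withColumn("col4", cleanValue(col("col4")))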

How to generate huge numbers of random numbers at runtime in spark with immutable dataframe

I have a problem where I need to generate millions of unique random numbers for an application running in Spark. Since dataframes are immutable, every time I add a generated number I do a union with the existing dataframe, which in turn creates a new dataframe. With millions of numbers required, this might cause performance issues. Is there any mutable data structure that can be used for this requirement?
I have tried this with dataframes, doing a union with the existing dataframe.
You can use the following code to generate a dataframe having millions of unique random numbers.
import scala.util.Random
import spark.implicits._ // assumes a SparkSession named spark; needed for .toDF outside the spark-shell

val df = Random.shuffle(1 to 1000000).toDF
df.show(20)
I tried generating a dataframe with 1 million unique random numbers and it took hardly 1-2 seconds.
+------+
| value|
+------+
|204913|
|882174|
|407676|
|913166|
|236148|
|788069|
|176180|
|819827|
|779280|
| 63172|
| 3797|
|962902|
|775383|
|583273|
|172932|
|429650|
|225793|
|849386|
|403140|
|622971|
+------+
only showing top 20 rows
The dataframe I created looked like the output above. Hope this caters to your requirement.
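If building and shuffling the whole sequence on the driver ever becomes a bottleneck, one possible alternative (just a sketch, not benchmarked) is to generate the numbers on the executors with spark.range and randomise their order there; the numbers stay unique because they come from a range:
import org.apache.spark.sql.functions.rand

// Generate 1 million unique numbers on the executors and shuffle their order
val df = spark.range(1, 1000001).orderBy(rand()).toDF("value")
df.show(20)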

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too big it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will return 10 elements, but the output doesn't look very readable.
So the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all these cases you won't get a fair sample of the data, as the first 10 rows will be picked. So to truly pick rows randomly from the dataframe you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on DataFrame for the details.
The first will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
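And if the goal is a single scalar value in a variable rather than the whole column, a small sketch (assuming the column is a String and the DataFrame has at least one row):
// Take just the first value of the "name" column into a plain variable
val firstName: String = df.select("name").as[String].head()

// Or without the Dataset conversion
val firstName2: String = df.select("name").first().getString(0)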

Using groupBy in Spark and getting back to a DataFrame

I have a difficulty when working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where the event is thing to be stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
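For the second step in the question (using those unique ids to pull all of their events out of another DataFrame), a left-semi join is one way to do it. A sketch, assuming a second DataFrame called otherEvents that also has a machine_id column:
// Keep only the rows of otherEvents whose machine_id appears in machineId
val eventsForThoseMachines = otherEvents.join(
  machineId,
  otherEvents("machine_id") === machineId("machine_id"),
  "leftsemi")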
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
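For example (a sketch with count standing in for whatever aggregate you actually need), this comes back as a DataFrame because agg is applied to the grouped data:
import org.apache.spark.sql.functions.count

val machineEventCounts = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("*").alias("n_events")) // GroupedData -> DataFrame again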
A group by in Spark followed by an aggregation and then a select statement will return a data frame. For your example it should be something like:
import org.apache.spark.sql.functions.max

val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")