Question regarding the best choice for specifying a schema in Spark, and how to delete/remove a column from an RDD in Spark - scala

I have two datasets: NYC taxi data and weather data. The weather data has a huge number of columns, around 100, of which I need only 5-10, and I want them typed instead of read as strings, hence I need a schema. I know two ways to do this:
1. RDD -> Rows -> supply a schema, then convert to a DataFrame
2. DataFrame with inferSchema (the documentation says it makes a double pass over the data)
Is inferSchema a good choice for the 100-column dataset, so that I don't have to write a StructType or case class for 100 columns?
The taxi data has a billion+ records and around 60 columns, of which I need only around 10. What would be a suitable choice for this dataset? Writing a schema for all 60 columns?
Second question: as I mentioned, I don't need all the columns, so I'm dropping them. From the documentation and the Internet I learned how to do this with a DataFrame, simply using the select function.
But in case I have to write a schema, which to my knowledge is only possible using an RDD: how do I remove/drop columns from an RDD?
Ideally, with either an RDD or a DataFrame, I'd like to drop the columns first and then specify a schema. Is this possible?
I know it's a lot of questions, but I'm a newbie with Spark; all of this popped up in my mind and I want to do it the right way.
Thanks

You don't need an RDD to achieve this; it's really simple. Just load your data into a DataFrame, then select and cast the columns you want.
scala> val df = Seq("1","2","3").toDF("c1")
scala> df.show()
+---+
| c1|
+---+
|  1|
|  2|
|  3|
+---+
scala> df.printSchema()
root
|-- c1: string (nullable = true)
scala> val newDF = df.select('c1.cast("int"))
scala> newDF.printSchema()
root
|-- c1: integer (nullable = true)
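Applied to the weather/taxi question above, a minimal sketch (the file path and column names are hypothetical; the idea is to read everything as strings and then select and cast only the columns you need):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Read without inferSchema: every column comes in as a string
val weatherRaw = spark.read.option("header", "true").csv("/path/to/weather.csv")

// Keep only the needed columns and give them proper types
val weather = weatherRaw.select(
  $"station_id",                       // hypothetical column names
  $"temperature".cast("double"),
  $"precipitation".cast("double"),
  $"observation_date".cast("date")
)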

Related

How to generate huge numbers of random numbers at runtime in spark with immutable dataframe

I have a problem where I need to generate millions of unique random numbers for an application running in Spark. Since dataframes are immutable, every time I add a generated number I do a union with the existing dataframe, which in turn creates a new dataframe. With millions of numbers required, this might cause performance issues. Is there any mutable data structure that can be used for this requirement?
I have tried with dataframes, doing a union with the existing dataframe.
You can use the following code to generate a dataframe having millions of unique random numbers.
import scala.util.Random
import spark.implicits._ // already in scope in spark-shell; needed in a standalone app for toDF

// Shuffle 1..1000000 on the driver and turn the result into a single-column DataFrame
val df = Random.shuffle(1 to 1000000).toDF
df.show(20)
I tried generating a dataframe with 1 million unique random numbers and it took hardly 1-2 seconds.
+------+
| value|
+------+
|204913|
|882174|
|407676|
|913166|
|236148|
|788069|
|176180|
|819827|
|779280|
| 63172|
| 3797|
|962902|
|775383|
|583273|
|172932|
|429650|
|225793|
|849386|
|403140|
|622971|
+------+
only showing top 20 rows
The dataframe I created looked something like the above. I hope this caters to your requirement.
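If the numbers should be generated on the cluster instead of shuffled on the driver, one possible alternative (a sketch, not part of the original answer) is to start from spark.range, which already produces unique values, and randomize their order:
import org.apache.spark.sql.functions.rand

// 1,000,000 unique ids, returned in random order
val randomDF = spark.range(1, 1000001).orderBy(rand()).toDF("value")
randomDF.show(20)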

How to parse a csv string into a Spark dataframe using scala?

I would like to convert an RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post:
Can I read a CSV represented as a string into Apache Spark using spark-csv
However, it's not exactly what I need and I can't figure out a way to modify that code to work in my case.
Your help is greatly appreciated.
The easiest way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
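A minimal sketch of that first approach (the file path is hypothetical; the schema matches the four columns from the question):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("account", StringType, nullable = true),
  StructField("state", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// The file has no header row, so the schema supplies the column names
val peopleDF = spark.read.schema(schema).csv("/path/to/people.csv")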
When the data already exists in an RDD you can use toDF() to convert it to a dataframe. This function also accepts column names as input. To use it, first import the Spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns of the dataframe. In this case, that will be an RDD[(String, String, String, Int)], since there are four columns (the last column, age, is changed to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
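For a quick self-contained test, the sample records from the question can be placed in rdd with parallelize (a sketch):
val rdd = spark.sparkContext.parallelize(Seq(
  "Mike,2222-003330,NY,34",
  "Kate,3333-544444,LA,32",
  "Abby,4444-234324,MA,56"
))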

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns.
scala> show_times.printSchema
root
|-- account: string (nullable = true)
|-- channel: string (nullable = true)
|-- show_name: string (nullable = true)
|-- total_time_watched: integer (nullable = true)
This is data about how long a customer has watched a particular show. I'm supposed to categorize the customers for each show based on total time watched.
The dataset has 133 million rows in total with 192 distinct show_names.
For each individual show I'm supposed to bin the customer into 3 categories (1,2,3).
I use Spark MLlib's QuantileDiscretizer.
Currently I loop through every show and run QuantileDiscretizer sequentially, one show at a time.
What I'd like to have in the end is for the following sample input to get the sample output.
Sample Input:
account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320
Sample Output:
account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3
Is there a more efficient and distributed way to do this using some groupBy-like operation, instead of looping through each show_name and binning one after the other?
I know nothing about QuantileDiscretizer, but I think you're mostly concerned with the dataset to apply QuantileDiscretizer to. I think you want to figure out how to split your input dataset into smaller datasets per show_name (you said there are 192 distinct show_name values in the input dataset).
Solution 1: Partition Parquet Dataset
I've noticed that you use parquet as the input format. My understanding of the format is very limited, but I've noticed that people use partitioning schemes to split large datasets into smaller chunks that they can then process however they like (per some partitioning scheme).
In your case the partitioning scheme could include show_name.
That would make your case trivial, as the splitting would be done at write time (aka not my problem anymore).
See How to save a partitioned parquet file in Spark 2.1?
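A sketch of what that write could look like (showTimes stands in for the question's DataFrame; the output path is made up):
// Write the data partitioned by show_name; each show lands in its own directory
showTimes.write.partitionBy("show_name").parquet("/data/show_times_by_show")

// Reading back with a filter on show_name only scans the matching partition
val oneShow = spark.read.parquet("/data/show_times_by_show")
  .where("show_name = 'show1'")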
Solution 2: Scala's Future
Given your iterative solution, you could wrap every iteration into a Future that you'd submit to process in parallel.
Spark SQL's SparkSession (and Spark Core's SparkContext) are thread-safe.
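A minimal sketch of that idea, assuming a hypothetical binOneShow function that runs QuantileDiscretizer on a single show's subset:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

// Hypothetical: fits QuantileDiscretizer on one show's rows and returns the binned result
def binOneShow(subset: DataFrame): DataFrame = ???

val showNames = showTimes.select("show_name").distinct.collect.map(_.getString(0))

// One Spark job per show, submitted concurrently; SparkSession is thread-safe
val futures = showNames.toSeq.map { name =>
  Future(binOneShow(showTimes.where(s"show_name = '$name'")))
}
val binnedPerShow = Await.result(Future.sequence(futures), Duration.Inf)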
Solution 3: Dataset's filter and union operators
I would think twice before following this solution since it puts a burden on your shoulders which I think could easily be sorted out by solution 1.
Given you've got one large 133-million-row parquet file, I'd first build the 192 datasets per show_name using the filter operator (as you did to build show_rdd, whose name is misleading since it's a DataFrame, not an RDD) and union (again, as you did).
See Dataset API.
Solution 4: Use Window Functions
That's something I think could work, but didn't check it out myself.
You could use window functions (see WindowSpec and Column's over operator).
Window functions would give you partitioning (windows), while over would somehow apply QuantileDiscretizer to a window/partition. That would however require "destructuring" QuantileDiscretizer into an Estimator to train a model and then somehow applying the resulting model to each window again.
I think it's doable, but haven't done it myself. Sorry.
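For reference, if three roughly equal-sized buckets per show are acceptable, ntile over a window partitioned by show_name produces exactly the sample output above; note this sketch bypasses QuantileDiscretizer entirely, so it illustrates the window-function idea rather than the ML approach:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

// One window per show, ordered by watch time; ntile(3) assigns buckets 1..3
val w = Window.partitionBy("show_name").orderBy("total_time_watched")
val binned = showTimes.withColumn("Time_watched_bin", ntile(3).over(w))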
This is an older question; however, I'm answering it to help anyone in the same situation in the future.
It can be achieved using a pandas UDF. Both the input and the output of a grouped-map pandas UDF are dataframes. We need to provide the schema of the output dataframe in the annotation, as shown in the code sample below, which achieves the required result.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf: pandas dataframe holding all rows for one show_name group
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)
df will have a new column 'Time_watched_bin' with the bucket information.

Understanding some basics of Spark SQL

I'm following http://spark.apache.org/docs/latest/sql-programming-guide.html
After typing:
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
I have some questions that I didn't see the answers to.
First, what is the $-notation?
As in
df.select($"name", $"age" + 1).show()
Second, can I get the data from just the 2nd row (when I don't know what the data in the second row is)?
Third, how would you read in a color image with spark sql?
Fourth, I'm still not sure what the difference is between a Dataset and a DataFrame in Spark. The variable df is a dataframe, so could I change "Michael" to the integer 5? Could I do that in a dataset?
$ is not an annotation. It is a method call (a shortcut for new ColumnName("name")).
You wouldn't. Spark SQL has no notion of row indexing.
You wouldn't. You can use the low-level RDD API with specific input formats (like the ones from the HIPI project) and then convert.
Difference between DataSet API and DataFrame
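A small sketch expanding on the first point: the $ interpolator becomes available after importing the SparkSession implicits, and $"name" is equivalent to referring to the column directly:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._   // brings the $ string interpolator into scope

val df = spark.read.json("examples/src/main/resources/people.json")

// $"age" expands to new ColumnName("age"); the two selects below are equivalent
df.select($"name", $"age" + 1).show()
df.select(df("name"), df("age") + 1).show()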
1) For question 1, the $ sign is used as a shortcut for selecting a column and applying functions on top of it. For example:
df.select($"id".isNull).show
which can be otherwise written as
df.select(col("id").isNull)
2) Spark does not have indexing, but for prototyping you can use df.take(10)(i), where i is the element you want. Note: the behaviour could be different each time, as the underlying data is partitioned.

How to filter jbehave examples table rows based on scenario meta data

Is there a way we can filter jbehave examples table rows at runtime using the scenario meta data? For example:
Scenario: my scenario title
Meta:
#id 1
Examples:
|Meta:|col1|col2|
|id 1 |val1|val2|
|id 2| val |val |
|id 1| val |val |
When we run this scenario, it should iterate only over the 1st and 3rd rows, based on the meta data set on the scenario.
What I am trying to do is externalize data across scenarios/stories and use filtered data rows applicable to a particular scenario.
I found some similar topics on meta filtering, but nothing specific to this.
Appreciate any help. Thanks
A meta character # must be used in the example table, in this way:
Scenario: some scenario
Meta: #id
Given I pass value '1'
Examples:
|Meta:|col1|col2|
|#id 1|val1|val2|
|#id 2| val|val |
|#id 1| val|val |
Then you need to define the filter in the configuration, for example:
configuredEmbedder().useMetaFilters(Arrays.asList("+id 1"));
More on this topic can be found here:
http://jbehave.org/reference/stable/meta-filtering.html