I'm new to Spark and I'm trying to read a pipe-delimited file. My file looks like this:
user1|acct01|A|Fairfax|VA
user1|acct02|B|Gettysburg|PA
user1|acct03|C|York|PA
user2|acct21|A|Reston|VA
user2|acct42|C|Fairfax|VA
user3|acct66|A|Reston|VA
and I do the following in scala:
scala> case class Accounts (usr: String, acct: String, prodCd: String, city: String, state: String)
defined class Accounts
scala> val accts = sc.textFile("accts.csv").map(_.split("|")).map(
| a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4)))
| )
I then try to group the key-value pairs by key, but I'm not sure I'm doing this right. Is this how I do it?
scala> accts.groupByKey(2)
res0: org.apache.spark.rdd.RDD[(String, Iterable[Accounts])] = ShuffledRDD[4] at groupByKey at <console>:26
I thought the (2) would give me the first two results back, but nothing seems to come back at the console.
If I run a distinct, I get an error too:
scala> accts.distinct(1).collect(1)
<console>:26: error: type mismatch;
found : Int(1)
required: PartialFunction[(String, Accounts),?]
accts.distinct(1).collect(1)
EDIT:
Essentially I'm trying to get to a nested key-value mapping. For example, user1 would look like this:
user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
trying to learn this step by step so thought I'd break it down into chunks to understand...
Since you've already gone through the process of defining a schema, I think you might have better luck putting your data into a DataFrame. First off, you need to change the split call to use single quotes, so that the pipe is passed as a Char rather than a String. (See this question). Also, you can get rid of the a(0) key at the beginning. Then, converting to a DataFrame is trivial. (Note that DataFrames are available on Spark 1.3+.)
val accts = sc.textFile("/tmp/accts.csv").map(_.split('|')).map(a => Accounts(a(0), a(1), a(2), a(3), a(4)))
val df = accts.toDF()
Now df.show produces:
+-----+------+------+----------+-----+
| usr| acct|prodCd| city|state|
+-----+------+------+----------+-----+
|user1|acct01| A| Fairfax| VA|
|user1|acct02| B|Gettysburg| PA|
|user1|acct03| C| York| PA|
|user2|acct21| A| Reston| VA|
|user2|acct42| C| Fairfax| VA|
|user3|acct66| A| Reston| VA|
+-----+------+------+----------+-----+
It should be easier for you to work with the data. For example, to get a list of the unique users:
df.select("usr").distinct.collect()
produces
res42: Array[org.apache.spark.sql.Row] = Array([user1], [user2], [user3])
For more details, check out the docs.
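If the end goal is the nested per-user mapping from the question's edit, the grouping logic can be sketched on plain Scala collections before moving it to Spark. This is only an illustration: AcctInfo, NestedAccounts and nest are made-up names, and the input is assumed to be an in-memory Seq of lines rather than an RDD or DataFrame.

```scala
// Hypothetical names for illustration only: AcctInfo, NestedAccounts, nest.
final case class AcctInfo(prodCd: String, city: String, state: String)

object NestedAccounts {
  // Parse "usr|acct|prodCd|city|state" lines and nest them as usr -> acct -> details.
  def nest(lines: Seq[String]): Map[String, Map[String, AcctInfo]] =
    lines
      .map(_.split('|'))                                     // Char overload: literal pipe, no regex
      .map(a => (a(0), (a(1), AcctInfo(a(2), a(3), a(4)))))  // key each record by user
      .groupBy(_._1)                                         // collect all records per user
      .map { case (usr, rows) => usr -> rows.map(_._2).toMap } // acct -> details per user
}
```

On a pair RDD the same shape would presumably be groupByKey() followed by mapValues(_.toMap), with the usual caveat that groupByKey pulls all values for a key onto one executor.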
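Three observations that may help you understand the problem:
1) groupByKey(2) does not return the first 2 results; the parameter 2 is the number of partitions for the resulting RDD. See docs. To look at the first two elements, use take(2) instead.
2) collect does not take an Int parameter. Called with no arguments it returns all elements; the one-argument overload expects a PartialFunction, which is what the error message says. See docs.
3) split takes 2 types of parameters, Char or String. The String version treats its argument as a regex, so "|" needs escaping if it is intended as a literal.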
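The split behaviour can be checked outside Spark, because it is the ordinary String method. The sketch below compares the unescaped String form with three ways of splitting on a literal pipe; PipeSplit and its method names are hypothetical and exist only for this illustration.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, for illustration only.
object PipeSplit {
  // String overload: the argument is a regex, and "|" is an alternation of two
  // empty patterns, so the line is broken up between characters, not on pipes.
  def unescaped(line: String): Array[String] = line.split("|")

  // Char overload (Scala's StringOps): the pipe is taken literally.
  def byChar(line: String): Array[String] = line.split('|')

  // String overload with the pipe escaped in the regex.
  def escaped(line: String): Array[String] = line.split("\\|")

  // String overload with the pipe quoted as a literal pattern.
  def quoted(line: String): Array[String] = line.split(Pattern.quote("|"))
}
```

With the sample line user1|acct01|A|Fairfax|VA, the last three variants yield the five fields, while the unescaped one does not, which is why a(1), a(2) and so on held single characters in the original attempt.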
I am curious as to why this will not work in Spark Scala on a dataframe:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
It works with a UDF, but not as written above. It seems to come down to Column versus String: locate needs a String, so how do I convert a column to a String in order to pass it in? This feels awkward, as if something is missing.
My understanding is that df("search_string") produces a String.
But the error I get is:
command-679436134936072:15: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
Understanding what's going wrong
I'm not sure which version of Spark you're on, but the locate method has the same function signature on both Spark 3.3.1 (the current latest version) and Spark 2.4.5 (the version running in my local Spark shell).
This function signature is the following:
def locate(substr: String, str: Column, pos: Int): Column
So substr can't be a Column, it needs to be a String. In your case, you were using df("search_string"). This actually calls the apply method with the following function signature:
def apply(colName: String): Column
So it makes sense that you're having a problem since the locate function needs a String.
Trying to fix your issue
If I correctly understood, you want to be able to locate a substring from one column inside of a string in another column without UDFs. You can use a map on a Dataset to do that. Something like this:
import spark.implicits._
case class MyTest (A:String, B: String)
val df = Seq(
MyTest("with", "potatoes with meat"),
MyTest("with", "pasta with cream"),
MyTest("food", "tasty food"),
MyTest("notInThere", "don't forget some nice drinks")
).toDF("A", "B").as[MyTest]
val output = df.map{
case MyTest(a,b) => (a, b, b indexOf a)
}
output.show(false)
+----------+-----------------------------+---+
|_1 |_2 |_3 |
+----------+-----------------------------+---+
|with |potatoes with meat |9 |
|with |pasta with cream |6 |
|food |tasty food |6 |
|notInThere|don't forget some nice drinks|-1 |
+----------+-----------------------------+---+
Once you're inside of a map operation of a strongly typed Dataset, you have the Scala language at your disposal.
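One thing to keep in mind when swapping locate for indexOf: the _3 column above holds indexOf results, which are 0-based with -1 for "not found", whereas locate is documented as 1-based and returns 0 when the substring is missing. If you need locate-style numbers inside the map, a small adapter along these lines should do; LocateLike and locateLike are made-up names, and this is a sketch of the convention, not Spark's implementation.

```scala
// Hypothetical helper, for illustration only.
object LocateLike {
  // 1-based position of substr in str, searching from the 1-based position pos;
  // 0 means "not found". A pos below 1 is treated as "not found" here (assumption).
  def locateLike(substr: String, str: String, pos: Int = 1): Int =
    if (pos < 1) 0
    else str.indexOf(substr, pos - 1) + 1 // indexOf is 0-based and -1 on a miss
}
```

In the map above this would replace b indexOf a with LocateLike.locateLike(a, b). If you would rather stay in the untyped API, the SQL form of locate, reached through expr("locate(search_string, hit_songs, 1)"), should accept column expressions for both arguments as far as I know.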
Hope this helps!
I've got some data in spark, result: DataFrame = ..., where two integer columns are of interest; week and year. The values of these columns are identical for all rows.
I want to extract these two integer values, and pass them as parameters to create a WeekYear:
case class WeekYear(week: Int, year: Int)
Below is my current solution, but I'm thinking there must be a more elegant way to do this. How can this be done without the intermediate step of creating temp?
val temp = result
.select("week", "year")
.first
.toSeq
.map(_.toString.toInt)
val resultWeekYear = WeekYear(temp(0), temp(1))
The best way to use a case class with DataFrames is to let Spark convert the DataFrame to a Dataset with the .as[T] method. As long as your case class has fields that match all of the column names, it should work very easily.
import spark.implicits._ // encoders for toDF and as[WeekYear]; already in scope in spark-shell
case class WeekYear(week: Int, year: Int)
val df = spark.createDataset(Seq((1, 1), (2, 2), (3, 3))).toDF("week", "year")
val ds = df.as[WeekYear]
ds.show()
Which provides a Dataset[WeekYear] that looks like this:
+----+----+
|week|year|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
+----+----+
You can utilize some more complicated nested classes, but you have to start working with Encoders for that, so that spark knows how to convert back and forth.
Spark does some implicit conversions, so ds may still look like a DataFrame, but it is actually a strongly typed Dataset[WeekYear] rather than a Dataset[Row] with arbitrary columns. You operate on it much as you would on an RDD. Then just take the first element with .first and you'll already have the type you need.
val resultWeekYear = ds.first
Given the two Spark Datasets below, flights and capitals, what would be the most efficient way to return a combined (i.e. "joined") result without first converting to a DataFrame or writing out all the columns by name in a .select() method? I know, for example, that I can access either side of the tuple with e.g. .map(x => x._1), or use the * operator with:
result.select("_1.*","_2.*")
But the latter may result in duplicate column names and I'm hoping for a cleaner solution.
Thank you for your help.
case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)
val flights = Seq(
(55, "New York"),
(3, "Georgia"),
(12, "Oregon")
).toDF("tripNumber","destination").as[Flights]
val capitals = Seq(
("New York", "Albany"),
("Georgia", "Atlanta"),
("Oregon", "Salem")
).toDF("state","capital").as[Capitals]
val result = flights.joinWith(capitals,flights.col("destination")===capitals.col("state"))
There are 2 options, but you will have to use join instead of joinWith:
1) Drop one of the join columns. This is one of the nicest parts of the Dataset API, since there is no need to repeat the projection columns in a select: val result = flights.join(capitals,flights("destination")===capitals("state")).drop(capitals("state"))
2) Rename the join column to be the same in both datasets and use a slightly different way of specifying the join: val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))
Output (from the second option, which puts the join column first):
result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
| New York| 55| Albany|
| Georgia| 3|Atlanta|
| Oregon| 12| Salem|
+-----------+----------+-------+
Still a beginner in Scala and Spark, I think I'm just being brainless here. I have two RDDs, one of the type :-
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
Other of the type :-
(String, String, String) = (r316079113_serv60i,fb_100007609418328,-795000)
As it can be seen, the first two columns of the two RDDs are of the same format. Basically they are ID's, one is 'tid' and the other is 'uid'.
The question is this :
Is there a method by which I can compare the two RDDs in such a manner that the tid and uid are matched in both and all the data for the same matching ids is displayed in a single row without any repetitions?
Eg : If I get a match of tid and uid between the two RDDs
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
(String, String, String) = (" v67430612_serv78i"," fb_201906266952256",-795000)
Then the output is:-
((" v67430612_serv78i"," fb_201906266952256",-795000),1)
The IDs in the two RDDs are not in any fixed order. They are random i.e the same uid and tid serial number may not correspond in both the RDDs.
Also, how will the solution change if the first RDD type stays the same but the second RDD changes to type :-
((String, String, String), Int) = ((daily_reward_android_5.76,fb_193055751144610,81000),1)
I have to do this without the use of Spark SQL.
I would suggest converting your RDDs to DataFrames and using a join, since that is the easiest route.
Your first dataframe should be
+------------------+-------------------+-----+
|tid |uid |count|
+------------------+-------------------+-----+
| v67430612_serv78i| fb_201906266952256|1 |
+------------------+-------------------+-----+
The second dataframe should be
+------------------+-------------------+-------+
|tid |uid |amount |
+------------------+-------------------+-------+
| v67430612_serv78i| fb_201906266952256|-795000|
+------------------+-------------------+-------+
Then getting the final output is just an inner join:
df2.join(df1, Seq("tid", "uid"))
which gives the following output:
+------------------+-------------------+-------+-----+
|tid |uid |amount |count|
+------------------+-------------------+-------+-----+
| v67430612_serv78i| fb_201906266952256|-795000|1 |
+------------------+-------------------+-------+-----+
Edited
If you want to do it without DataFrames/Spark SQL, RDDs have a join too, but you will first have to re-key the second RDD so that both sides are keyed by (tid, uid):
rdd2.map(x => ((x._1, x._2), x._3)).join(rdd1).map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
This will work only if you have rdd1 and rdd2 as defined in your question as ((" v67430612_serv78i"," fb_201906266952256"),1) and (" v67430612_serv78i"," fb_201906266952256",-795000) respectively.
You should get the final output as
(( v67430612_serv78i, fb_201906266952256,-795000),1)
Make sure that you trim leading and trailing spaces from the values. This ensures that both RDDs have the same key values when joining; otherwise you might get an empty result.
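The trim-then-join logic can be tried on plain Scala collections before running it on RDDs. The sketch below is not Spark code: it assumes the two inputs are ordinary Seqs with the shapes from the question, and TrimmedJoin and joinOnIds are made-up names. It trims the IDs, keys both sides by (tid, uid), and pairs up matching rows, which is the same shape as the map/join/map chain above.

```scala
// Hypothetical helper, for illustration only.
object TrimmedJoin {
  type Key = (String, String) // (tid, uid)

  def joinOnIds(
      counts: Seq[((String, String), Int)],
      amounts: Seq[(String, String, String)]
  ): Seq[((String, String, String), Int)] = {
    // Trim the IDs so " v67430612_serv78i" and "v67430612_serv78i" compare equal,
    // then index the counts by their (tid, uid) key.
    val countsByKey: Map[Key, Seq[Int]] =
      counts
        .map { case ((tid, uid), n) => ((tid.trim, uid.trim), n) }
        .groupBy(_._1)
        .map { case (key, rows) => key -> rows.map(_._2) }

    // Re-key each triple by (tid, uid) and pair it with every matching count;
    // triples without a partner are dropped, as in an inner join.
    amounts.flatMap { case (tid, uid, amount) =>
      val key = (tid.trim, uid.trim)
      countsByKey.getOrElse(key, Seq.empty).map(n => ((key._1, key._2, amount), n))
    }
  }
}
```

For the ((tid, uid, amount), n) variant of the second RDD, only the re-keying step should need to change, for example to x => ((x._1._1, x._1._2), (x._1._3, x._2)), with the final map adjusted to the new value shape.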
I am trying to compare count of 2 different queries/tables. Is it possible to perform this operation in Scala(Spark SQL)?
Here is my code:
val parquetFile1 = sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2016/2016_000_njars_09665_mbr_addr.20161222031015221601.parquet")
val parquetFile2 =sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2017/part-r-00000-70ce4958-57fe-487f-a45b-d73b7ef20289.snappy.parquet")
parquetFile1.registerTempTable("parquetFile1")
parquetFile2.registerTempTable("parquetFile2")
scala> var first_table_count=sqlContext.sql("select count(*) from parquetFile1")
first_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> var second_table_count=sqlContext.sql("select count(*) from parquetFile1 where LINE1_ADDR is NULL and LINE2_ADDR is NULL")
second_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> first_table_count.show()
+------+
| _c0|
+------+
|119928|
+------+
scala> second_table_count.show()
+---+
|_c0|
+---+
|617|
+---+
I am trying to get the difference between the values of these two queries, but I get an error.
scala> first_table_count - second_table_count
<console>:30: error: value - is not a member of org.apache.spark.sql.DataFrame
first_table_count - second_table_count
whereas if I do a normal subtraction, it works:
scala> 2 - 1
res7: Int = 1
It seems I have to do some data conversion, but I have not been able to find an appropriate solution.
The count query does not return a Long value; the result is wrapped inside a DataFrame (here one with a single bigint column, as the REPL output shows).
You can try this:
val difference = first_table_count.first.getLong(0) - second_table_count.first.getLong(0)
Also, a - operator is not defined on DataFrame, which is what the error message is telling you.
You need something like the following to do the conversion:
first_table_count.first.getLong(0)
And here is why you need it:
A DataFrame represents a tabular data structure. So although your SQL seems to return a single value, it actually returns a table containing a single row, and the row contains a single column. Hence we use the above code to extract the first column (index 0) of the first row.