Joining two HDFS files in Spark - Scala

I want to join two files from HDFS using the Spark shell.
Both files are tab separated, and I want to join on the second column.
I tried the code below, but it is not giving any output.
val ny_daily= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))
val ny_daily_split = ny_daily.map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
val ny_dividend= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))
val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))
val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))
enKeyValuePair1.join(enKeyValuePair)
But I am not finding any information on how to join files on a particular column.
Please suggest.

I am not finding any information on how to join files on a particular column
RDDs are joined on their keys, so you decided the column to join on when you wrote:
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
...
val enKeyValuePair1 = ny_daily_split.map(line => (line(0).substring(0, 4), line(3).toInt))
Your RDDs will be joined on the values coming from line(0).substring(0, 5) and line(0).substring(0, 4).
You can find the join function (and many other useful functions) here, and the Spark Programming Guide is a great reference for understanding how Spark works.
I tried the code below, but it is not giving any output
In order to see the output, you have to ask Spark to print it:
enKeyValuePair1.join(enKeyValuePair).foreach(println)
Note: to load data from files you should use sc.textFile(); sc.parallelize() is only used to make RDDs out of Scala collections.
The following code should do the job:
val ny_daily_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily").map(line =>line.split('\t'))
val ny_dividend_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends").map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => line(0).substring(0, 5) -> line(3).toInt)
val enKeyValuePair1 = ny_dividend_split.map(line => line(0).substring(0, 4) -> line(3).toInt)
enKeyValuePair1.join(enKeyValuePair).foreach(println)
By the way, you mentioned that you want to join on the second column, but you are actually keying on line(0); is this intended?
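If the second column really is the intended key, a hedged variant of the pairing step (assuming the value you want is still the fourth tab-separated field) would be:
// hedged sketch: key on the second tab-separated column (index 1) instead of a prefix of column 0
val dailyByCol2 = ny_daily_split.map(line => line(1) -> line(3).toInt)
val dividendByCol2 = ny_dividend_split.map(line => line(1) -> line(3).toInt)
dividendByCol2.join(dailyByCol2).foreach(println)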
Hope this helps!

Related

Subquery vs Dataframe filter function in spark

I am running the below Spark SQL with a subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same query the DataFrame way, as below. Let's assume we read the employee table and the department table as DataFrames named empDF and DepDF respectively:
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
Of these two approaches, which one gives better performance and why? Please help me understand these scenarios in Spark Scala.
I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than this parameter's default of 10 MB.
This SQL query generates this plan:
As you can see, Spark is performing an SMJ (sort-merge join), which will be the case most of the time for datasets bigger than 10 MB. This requires the data to be shuffled and then sorted, so it is quite a heavy operation.
Now let's check option 2 (the first lines of code are the same as previously):
val depidList = df2.map(x => x.getString(1)).collect().toList
val empdf2 = df1.filter(col("dep_id").isin(depidList: _*))
empdf2.count
For this option the plan is different. You obviously don't have the join, but there are two separate SQL jobs: the first reads the DepDF dataset and collects one column as a list, and in the second that list is used to filter the data in the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick with the sub-query, which is going to resolve to a join. You can also use a join directly on your DataFrames with the Spark DataFrame API, as sketched below.
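For instance, a hedged sketch of the direct-join alternative (using the DataFrame names empDF and DepDF assumed in the question) is a left-semi join, which matches the filtering semantics of the IN subquery:
// keep only employee rows whose dep_id appears in DepDF; this is what the IN subquery expresses
val filtered = empDF.join(DepDF, Seq("dep_id"), "left_semi")
filtered.count()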

In Scala, what is the correct way to filter a Spark Cassandra RDD by a List[String]?

I have a list of IDs in string format; this list can be roughly 20,000 IDs long:
var timelineIds = source.map(a => a.timelineid);
timelineIds = timelineIds.distinct.cache; // distinct list, we need this for later
var timelineIdsString = timelineIds.map(a => a.asInstanceOf[String]).collect.toList;
When I use this list against one of my cassandra tables it works just fine, no matter the size of timelineIdsString:
var timelineHistorySource = sc.cassandraTable[Timeline]("acd", "timeline_history_bytimelineid")
.select("ownerid", "userid", "timelineid", "timelinetype", "starttime", "endtime", "attributes", "states")
if (constrain)
timelineHistorySource = timelineHistorySource.where("timelineid IN ?", timelineIdsString)
When I do it against another of my tables, it never completes when I have over 1000 ids in the List:
var dispositionSource = sc.cassandraTable[DispositionSource]("acd","dispositions_bytimelineid")
.select("ownerid","dispositionid","month","timelineid","createddate","createduserid")
if(constrain)
dispositionSource = dispositionSource.where("timelineid IN ?", timelineIdsString);
Both Cassandra tables have timelineid as the key, so I know that it's correct. This code works fine as long as timelineIds is a small list.
Is there a better way to filter from cassandra RDD? Is it the size of the IN clause causing it to choke?
Instead of performing the join at the Spark level, it's better to perform the join using Cassandra itself - in this case you'll read from Cassandra only the necessary data (given that the join key is a partition or primary key). For RDDs this can be done with the .joinWithCassandraTable function (doc):
import com.datastax.spark.connector._
val toJoin = sc.parallelize(1 until 5).map(x => Tuple1(x.toInt))
val joined = toJoin.joinWithCassandraTable("test", "jtest1")
.on(SomeColumns("pk"))
scala> joined.toDebugString
res21: String =
(8) CassandraJoinRDD[150] at RDD at CassandraRDD.scala:18 []
| ParallelCollectionRDD[147] at parallelize at <console>:33 []
For DataFrames there is the so-called direct join that is available since SCC 2.5 (see announcement) - you need to pass some configs to enable it, see the docs:
import spark.implicits._
import org.apache.spark.sql.cassandra._
val cassdata = spark.read.cassandraFormat("jtest1", "test").load
val toJoin = spark.range(1, 5).select($"id".cast("int").as("id"))
val joined = toJoin.join(cassdata, cassdata("pk") === toJoin("id"))
scala> joined.explain
== Physical Plan ==
Cassandra Direct Join [pk = id#2] test.jtest1 - Reading (pk, c1, c2, v) Pushed {}
+- *(1) Project [cast(id#0L as int) AS id#2]
+- *(1) Range (1, 5, step=1, splits=8)
I have quite a long & detailed blog post about joins with Cassandra - check it for more details.
Alternatively, you can keep the IDs as a DataFrame, timelineIds, and inner join the table with it on timelineid. Then drop the unnecessary column (timelineIds.timelineid) from the resulting DataFrame.
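A hedged sketch of that approach (assuming a SparkSession named spark, the keyspace and table names from the question, and the Spark Cassandra Connector on the classpath):
import spark.implicits._
import org.apache.spark.sql.cassandra._

// turn the collected ID list into a single-column DataFrame
val idsDF = timelineIdsString.toDF("timelineid")
// read the Cassandra table, join on the key column, and drop the duplicate column
val dispositionsDF = spark.read.cassandraFormat("dispositions_bytimelineid", "acd").load
val filtered = dispositionsDF
  .join(idsDF, dispositionsDF("timelineid") === idsDF("timelineid"))
  .drop(idsDF("timelineid"))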

Spark Scala - How to apply transformation logic on a generic set of columns defined in a file

I am using Spark Scala version 1.6.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a few of the columns.
I tried running sample code, but I have hardcoded the column numbers (see the attached pic).
I also want to write a UDF which could take any set of columns, apply a transformation such as replacing a special character, and give the output.
I would appreciate any suggestions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf
val rdd1 = sc.textFile("../inp2.txt")
val rdd2 = rdd1.map(line => line.split("\t"))
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF
val replaceUDF = udf{s: String => s.replace(".", "")}
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names, then use that list to build the DataFrame schema:
import scala.io.Source
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// this reads the schema file and creates a list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")

// read the text file as an RDD of Rows and convert it to a DataFrame with those column names
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
val rdd1 = sc.textFile("../inp2.txt")
val rowRdd = rdd1.map(line => Row(line.split("\t"): _*))
val df = sqlContext.createDataFrame(rowRdd, schema)
This creates a DataFrame whose column names come from the separate schema file.
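For the second part of the question (applying a transformation such as removing a special character across an arbitrary set of columns), here is a hedged sketch, assuming the target columns are strings and using the built-in regexp_replace function; the helper name and the column list are hypothetical:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.regexp_replace

// hypothetical helper: strip "." from every column named in colsToClean
def cleanColumns(df: DataFrame, colsToClean: Seq[String]): DataFrame =
  colsToClean.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, regexp_replace(acc(c), "\\.", ""))
  }

// usage: apply to a subset of the columns read from the schema file
val cleaned = cleanColumns(df, Seq(columnNames(0), columnNames(1)))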
Hope this helps!

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()
  .show()
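If the goal was actually the per-key sum from the original reduceByKey(_ + _) rather than a row count, a hedged variant using the same assumed column names "a" and "b":
import org.apache.spark.sql.functions.sum

df
  .groupBy("a")
  .agg(sum("b").as("total")) // equivalent to reduceByKey(_ + _) on (key, value) pairs
  .show()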

Finding the union of RDDs which may not exist

I'm trying to get the union of a few RDDs. The RDDs are being read in via SparkContext.textFile, but some may not exist on the file system.
val rdd1 = Try(Repository.fetch(data1Path))
val rdd2 = Try(Repository.fetch(data2Path))
val rdd3 = Try(Repository.fetch(data3Path))
val rdd4 = Try(Repository.fetch(data4Path))
val all = Seq(rdd1, rdd2, rdd3, rdd4)
val union = sc.union(all.map {case Success(r) => r})
val results = union.filter(some-filter-logic).collect
However, due to lazy evaluation, all those Try statements evaluate to Success regardless of whether the files are present, and I end up with a FileNotFoundException when collect is called.
Is there a way around this?
You can run a loop that checks whether each file exists, create the RDDs only for the files that do, and take their union, as sketched below.
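A hedged sketch of that idea, using the Hadoop FileSystem API to test for existence before building each RDD (the data*Path names are the ones from the question):
import org.apache.hadoop.fs.{FileSystem, Path}

// keep only the paths that actually exist, then union the RDDs built from them
val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = Seq(data1Path, data2Path, data3Path, data4Path)
val existingPaths = paths.filter(p => fs.exists(new Path(p)))
val rdds = existingPaths.map(p => sc.textFile(p))
val union = if (rdds.nonEmpty) sc.union(rdds) else sc.emptyRDD[String]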
OR
you can use the wholeTextFiles API to read all the files present in one directory as (filename, content) key-value pairs.
val rdd=sc.wholeTextFiles(path, minPartitions)
Even if any of the files is empty, it will not cause any issues.
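If you then need line-level records like the ones sc.textFile would have produced, a hedged follow-up (assuming plain text content) is:
// wholeTextFiles yields (path, fileContent) pairs; split each file's content back into lines
val lines = rdd.flatMap { case (_, content) => content.split("\n") }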