Import TSV file in Spark - Scala

I am new to Spark, so forgive me for asking a basic question. I'm trying to import my TSV file into Spark, but I'm not sure if it's working.
val lines = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val split_lines = lines.map(_.split("\t"))
split_lines.first()
Everything seems to be working fine. Is there a way I can see if the tsv file has loaded properly?
SAMPLE DATA (tab separators shown here as spaces):
hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda

So far, your code looks good to me. If you print that first line to the console, do you see the expected data?
To explore the Spark API, the best thing to do is to use the spark-shell, a Scala REPL enriched with Spark specifics that builds a default SparkContext for you.
It will let you explore the data a lot more easily.
Here's an example loading a ~65k-line CSV file. A similar use case to what you're doing, I guess:
$> <spark_dir>/bin/spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT
/_/
scala> val lines=sc.textFile("/home/user/playground/ts-data.csv")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> val csv=lines.map(_.split(";"))
csv: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
scala> csv.count
(... spark processing ...)
res0: Long = 67538
// let's have a look at the first record
scala> csv.first
14/06/01 12:22:17 INFO SparkContext: Starting job: first at <console>:17
14/06/01 12:22:17 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)
14/06/01 12:22:17 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)
14/06/01 12:22:17 INFO DAGScheduler: Parents of final stage: List()
14/06/01 12:22:17 INFO DAGScheduler: Missing parents: List()
14/06/01 12:22:17 INFO DAGScheduler: Computing the requested partition locally
14/06/01 12:22:17 INFO HadoopRDD: Input split: file:/home/user/playground/ts-data.csv:0+1932934
14/06/01 12:22:17 INFO SparkContext: Job finished: first at <console>:17, took 0.003210457 s
res1: Array[String] = Array(20140127, 0000df, d063b4, ***, ***-Service,app180000m,49)
// groupby id - count unique
scala> csv.groupBy(_(4)).count
(... Spark processing ...)
res2: Long = 37668
// records per day
csv.map(record => record(0)->1).reduceByKey(_+_).collect
(... more Spark processing ...)
res8: Array[(String, Int)] = Array((20140117,1854), (20140120,2028), (20140124,3398), (20140131,6084), (20140122,5076), (20140128,8310), (20140123,8476), (20140127,1932), (20140130,8482), (20140129,8488), (20140118,5100), (20140109,3488), (20140110,4822))
* Edit using data added to the question *
val rawData="""hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda"""
// split lines
val data = rawData.split("\n")
val rdd = sc.parallelize(data)
// Split using space as separator
val byId = rdd.map(_.split(" ")).groupBy(_(1))
byId.count
res11: Long = 3
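For the TSV file from the question itself, a minimal way to sanity-check the load (a sketch, assuming the same /home/cloudera/Desktop/Test/test.tsv path and 4-column layout shown in the sample) is to split on tabs and inspect a few records and the record count:
// Load the TSV and split each line on tabs
val tsvLines = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val splitLines = tsvLines.map(_.split("\t"))
// Quick sanity checks: how many records, and what do the first few look like?
println(splitLines.count())
splitLines.take(3).foreach(a => println(a.mkString(" | ")))
// Check that every record has the expected number of columns (4 in the sample)
val badRows = splitLines.filter(_.length != 4).count()
println(s"Rows with an unexpected column count: $badRows")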

Related

How to check the number of partitions of a Spark DataFrame without incurring the cost of .rdd

There are a number of questions about how to obtain the number of partitions of an RDD and/or a DataFrame; the answers invariably are:
rdd.getNumPartitions
or
df.rdd.getNumPartitions
Unfortunately that is an expensive operation on a DataFrame because
df.rdd
requires conversion from the DataFrame to an RDD. This is on the order of the time it takes to run
df.count
I am writing logic that optionally repartitions or coalesces a DataFrame, based on whether the current number of partitions is within a range of acceptable values or instead below or above it.
def repartition(inDf: DataFrame, minPartitions: Option[Int],
    maxPartitions: Option[Int]): DataFrame = {
  val inputPartitions = inDf.rdd.getNumPartitions // EXPENSIVE!
  val outDf = minPartitions.flatMap { minp =>
    if (inputPartitions < minp) {
      info(s"Repartition the input from $inputPartitions to $minp partitions..")
      Option(inDf.repartition(minp))
    } else {
      None
    }
  }.getOrElse(maxPartitions.map { maxp =>
    if (inputPartitions > maxp) {
      info(s"Coalesce the input from $inputPartitions to $maxp partitions..")
      inDf.coalesce(maxp)
    } else inDf
  }.getOrElse(inDf))
  outDf
}
But we cannot afford to incur the cost of rdd.getNumPartitions for every DataFrame in this manner.
Is there any way to obtain this information, e.g. by querying the online/temporary catalog for the registered table?
Update: The Spark GUI showed the DataFrame.rdd operation as taking as long as the longest SQL in the job. I will re-run the job and attach the screenshot here shortly.
The following is just a test case: it uses a small fraction of the data size of the production job. The longest SQL takes only five minutes, and this one is on its way to spending that amount of time as well (note that the SQL is not helped out here: it also has to execute subsequently, which effectively doubles the cumulative execution time).
We can see that the .rdd operation at DataFrameUtils line 30 (shown in the snippet above) takes 5.1 min, and yet the save operation still took 5.2 min later, i.e. we did not save any time on the subsequent save by doing the .rdd.
There is no inherent cost of the rdd component in rdd.getNumPartitions, because the returned RDD is never evaluated.
You can easily determine this empirically, either by using a debugger (I'll leave this as an exercise for the reader), or by establishing that no jobs are triggered in the base case scenario:
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val ds = spark.read.text("README.md")
ds: org.apache.spark.sql.DataFrame = [value: string]
scala> ds.rdd.getNumPartitions
res0: Int = 1
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty // Check if there are any known jobs
res1: Boolean = true
but it might not be enough to convince you. So let's approach this in a more systematic way:
rdd returns a MapPartitionsRDD (ds as defined above):
scala> ds.rdd.getClass
res2: Class[_ <: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = class org.apache.spark.rdd.MapPartitionsRDD
RDD.getNumPartitions invokes RDD.partitions.
In the non-checkpointed scenario, RDD.partitions invokes getPartitions (feel free to trace the checkpoint path as well).
RDD.getPartitions is abstract.
So the actual implementation used in this case is MapPartitionsRDD.getPartitions, which simply delegates the call to the parent.
There are only MapPartitionsRDDs between rdd and the source:
scala> ds.rdd.toDebugString
res3: String =
(1) MapPartitionsRDD[3] at rdd at <console>:26 []
| MapPartitionsRDD[2] at rdd at <console>:26 []
| MapPartitionsRDD[1] at rdd at <console>:26 []
| FileScanRDD[0] at rdd at <console>:26 []
Similarly, if the Dataset contained an exchange, we would follow the parents to the nearest shuffle:
scala> ds.orderBy("value").rdd.toDebugString
res4: String =
(67) MapPartitionsRDD[13] at rdd at <console>:26 []
| MapPartitionsRDD[12] at rdd at <console>:26 []
| MapPartitionsRDD[11] at rdd at <console>:26 []
| ShuffledRowRDD[10] at rdd at <console>:26 []
+-(1) MapPartitionsRDD[9] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| FileScanRDD[4] at rdd at <console>:26 []
Note that this case is particularly interesting, because we actually triggered a job:
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty
res5: Boolean = false
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null)
res6: Array[Int] = Array(0)
That's because we've encountered a scenario where the partitions cannot be determined statically (see Number of dataframe partitions after sorting? and Why does sortBy transformation trigger a Spark job?).
In such a scenario, getNumPartitions will also trigger a job:
scala> ds.orderBy("value").rdd.getNumPartitions
res7: Int = 67
scala> spark.sparkContext.statusTracker.getJobIdsForGroup(null) // Note new job id
res8: Array[Int] = Array(1, 0)
However, this doesn't mean that the observed cost is somehow related to the .rdd call. Instead it is an intrinsic cost of finding partitions in cases where there is no static formula (some Hadoop input formats, for example, where a full scan of the data is required).
Please note that the points made here shouldn't be extrapolated to other applications of Dataset.rdd. For example, ds.rdd.count would indeed be expensive and wasteful.
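For instance, with the same ds as above, a minimal illustration of the difference (a sketch, not a benchmark):
// Stays in the Dataset API: Catalyst can answer this without building an RDD[Row]
ds.count()
// Converts every row to an RDD[Row] first and only then counts: expensive and wasteful
ds.rdd.count()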
In my experience df.rdd.getNumPartitions is very fast; I have never seen it take more than a second or so.
Alternatively, you could also try
val numPartitions: Long = df
.select(org.apache.spark.sql.functions.spark_partition_id()).distinct().count()
which would avoid using .rdd
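As a quick check of both approaches (a minimal sketch, assuming a spark-shell session where spark is the usual SparkSession), both should report the same number for a DataFrame whose partitioning is known statically:
import org.apache.spark.sql.functions.spark_partition_id
// Hypothetical example: a DataFrame created with a known partition count of 12
val df = spark.range(0L, 1000000L, 1L, 12).toDF("id")
// Walks the RDD lineage; in this non-shuffled case no job is triggered
val viaRdd = df.rdd.getNumPartitions
// Stays in the DataFrame API but runs a full job; also misses partitions holding no rows
val viaPartitionId = df.select(spark_partition_id()).distinct().count()
println(s"via rdd: $viaRdd, via spark_partition_id: $viaPartitionId") // both print 12 here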

Why does Spark increment the RDD ID by 2 instead of 1 when reading in text files?

I noticed something interesting when working with the spark-shell and I'm curious why this is happening. I load a text file into Spark using the basic syntax, and then simply repeat the command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
My guess is that loading a text file creates an intermediate RDD, because creating an RDD from parallelize() clearly only increments the RDD counter by 1 (before it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I call myreviews.partitions.size, I get that my RDD is partitioned into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
Because a single method call can create more than one intermediate RDD. This becomes obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you see, the lineage consists of two RDDs.
The first one is a HadoopRDD which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions)
The second one is a MapPartitionsRDD and corresponds to the subsequent map, which drops the keys (offsets) and converts Text to String:
.map(pair => pair._2.toString).setName(path)
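A minimal way to see the two IDs being consumed per call (a sketch, assuming a fresh spark-shell with a local README.md and no other RDDs created yet) is to compare the bracketed IDs in two consecutive debug strings:
// First call: HadoopRDD[0] + MapPartitionsRDD[1]
println(sc.textFile("README.md").toDebugString)
// Second call: HadoopRDD[2] + MapPartitionsRDD[3],
// so the MapPartitionsRDD ID you see in the REPL jumps by 2, not 1
println(sc.textFile("README.md").toDebugString)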

Spark RDD default number of partitions

Version: Spark 1.6.2, Scala 2.10
I am executing below commands In the spark-shell.
I am trying to see the number of partitions that Spark is creating by default.
val rdd1 = sc.parallelize(1 to 10)
println(rdd1.getNumPartitions) // ==> Result is 4
//Creating rdd for the local file test1.txt. It is not HDFS.
//File content is just one word "Hello"
val rdd2 = sc.textFile("C:/test1.txt")
println(rdd2.getNumPartitions) // ==> Result is 2
As per the Apache Spark documentation, spark.default.parallelism is the number of cores on my laptop (which has a 2-core processor).
My question is: rdd2 seems to give the correct result of 2 partitions, as stated in the documentation. But why does rdd1 give 4 partitions?
The minimum number of partitions is actually a lower bound set by the SparkContext. Since Spark uses Hadoop under the hood, the Hadoop InputFormat will still determine the behaviour by default.
The first case should reflect defaultParallelism, as mentioned here, which may differ depending on settings and hardware (number of cores, etc.).
So unless you provide the number of slices, that first case would be defined by the number described by sc.defaultParallelism:
scala> sc.defaultParallelism
res0: Int = 6
scala> sc.parallelize(1 to 100).partitions.size
res1: Int = 6
As for the second case, with sc.textFile, the number of slices by default is the minimum number of partitions, which is equal to 2, as you can see in this section of code.
Thus, you should consider the following:
sc.parallelize will take numSlices or defaultParallelism.
sc.textFile will take the maximum between minPartitions and the number of splits computed based on hadoop input split size divided by the block size.
sc.textFile calls sc.hadoopFile, which creates a HadoopRDD that uses InputFormat.getSplits under the hood [Ref. InputFormat documentation].
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException: Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple.
Parameters: job - job configuration. numSplits - the desired number of splits, a hint. Returns: an array of InputSplits for the job. Throws: IOException.
Example:
Let's create some dummy text files:
fallocate -l 241m bigfile.txt
fallocate -l 4G hugefile.txt
This will create 2 files, respectively, of size 241MB and 4GB.
We can see what happens when we read each of the files:
scala> val rdd = sc.textFile("bigfile.txt")
// rdd: org.apache.spark.rdd.RDD[String] = bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27
scala> rdd.getNumPartitions
// res0: Int = 8
scala> val rdd2 = sc.textFile("hugefile.txt")
// rdd2: org.apache.spark.rdd.RDD[String] = hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27
scala> rdd2.getNumPartitions
// res1: Int = 128
Both of them are actually HadoopRDDs:
scala> rdd.toDebugString
// res2: String =
// (8) bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27 []
// | bigfile.txt HadoopRDD[0] at textFile at <console>:27 []
scala> rdd2.toDebugString
// res3: String =
// (128) hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27 []
// | hugefile.txt HadoopRDD[2] at textFile at <console>:27 []
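If the defaults don't suit you, both entry points accept an explicit partition hint. A minimal sketch (reusing the local file from the question; the exact resulting counts are assumptions on my side):
// Explicit number of slices for parallelize (overrides defaultParallelism)
val rddA = sc.parallelize(1 to 10, 8)
println(rddA.getNumPartitions) // 8
// Explicit minPartitions for textFile: a lower bound, not an exact count,
// since the InputFormat may still produce more splits for large files
val rddB = sc.textFile("C:/test1.txt", 5)
println(rddB.getNumPartitions) // at least 5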

Spark UI DAG stage disconnected

I ran the following job in the spark-shell:
val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i)).persist
d.join(d.reduceByKey(_ + _)).collect
The Spark UI shows three stages. Stages 4 and 5 correspond to the computation of d, and stage 6 corresponds to the computation of the collect action. Since d is persisted, I would expect only two stages. However, stage 5 is present but not connected to any other stage.
So I tried running the same computation without using persist, and the DAG looks identical, except without the green dots indicating the RDD has been persisted.
I would expect the output of stage 11 to be connected to the input of stage 12, but it is not.
Looking at the stage descriptions, the stages seem to indicate that d is being persisted, because stage 5 has input, but I am still confused as to why stage 5 even exists.
The input RDD is cached and the cached part is not recomputed.
This can be validated with a simple test:
import org.apache.spark.SparkContext

def f(sc: SparkContext) = {
  val counter = sc.longAccumulator("counter")
  val rdd = sc.parallelize(0 until 100).map(i => {
    counter.add(1L)
    (i % 10, i)
  }).persist

  rdd.join(rdd.reduceByKey(_ + _)).foreach(_ => ())
  counter.value
}

assert(f(spark.sparkContext) == 100)
Caching doesn't remove stages from the DAG.
If data is cached, the corresponding stages can be marked as skipped, but they are still part of the DAG. Lineage can be truncated using checkpoints, but that is not the same thing, and it doesn't remove stages from the visualization.
Input stages contain more than cached computations.
Spark stages group together operations which can be chained without performing a shuffle.
While part of the input stage is cached, it doesn't cover all the operations required to prepare the shuffle files. This is why you don't see skipped tasks.
The rest (detachment) is just a limitation of the graph visualization.
If you repartition data first:
import org.apache.spark.HashPartitioner
val d = sc.parallelize(0 until 1000000)
  .map(i => (i % 100000, i))
  .partitionBy(new HashPartitioner(20))
d.join(d.reduceByKey(_ + _)).collect
you'll get the DAG you're most likely looking for:
Adding to user6910411's detailed answer: due to the lazy evaluation of RDDs, an RDD is not persisted in memory until the first action runs and computes the whole DAG. So when you run collect() the first time, the RDD "d" gets persisted in memory for the first time, but nothing is read from memory. If you run collect() a second time, the cached RDD is read.
Also, if you do a toDebugString on the final RDD, it shows the below output:
scala> d.join(d.reduceByKey(_ + _)).toDebugString
res5: String =
(4) MapPartitionsRDD[19] at join at <console>:27 []
| MapPartitionsRDD[18] at join at <console>:27 []
| CoGroupedRDD[17] at join at <console>:27 []
+-(4) MapPartitionsRDD[15] at map at <console>:24 []
| | ParallelCollectionRDD[14] at parallelize at <console>:24 []
| ShuffledRDD[16] at reduceByKey at <console>:27 []
+-(4) MapPartitionsRDD[15] at map at <console>:24 []
| ParallelCollectionRDD[14] at parallelize at <console>:24 []
A rough graphical representation of the above is shown in the attached RDD stages diagram.
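To observe the lazy-persistence point empirically (a minimal sketch, assuming a spark-shell session), the debug string only reports cached partitions after the first action has materialized them:
val d = sc.parallelize(0 until 1000000).map(i => (i % 100000, i)).persist
// Before any action: the RDD is only marked for persistence, no blocks are cached yet
println(d.toDebugString)
d.join(d.reduceByKey(_ + _)).collect()
// After the first action: the debug string now lists CachedPartitions for d
println(d.toDebugString)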

Co-occurrence graph RpcTimeoutException in Apache Spark

I have a file that maps from documentId to entities and I extract document co-occurrences. The entities RDD looks like this:
//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])]
To extract relationships between entities and their frequency within each document, I use the following code:
def hashId(str: String) = {
  Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  // flatMap at SampleGraph.scala:62
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }

val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
Each edge stores the relationship frequency per document in a Map. When I'm trying to write the edges to file:
edges.saveAsTextFile(outputFile + "_edges")
I receive the following errors after some time:
15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My spark configuration looks like this:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")
I have already increased the executor memory and, after some research on the internet, refactored a previous reduceByKey construct into aggregateByKey. The error stays the same. Can someone help me?