Pyspark command moves to new line - pyspark

why is that the below query executed in pyspark moves to next line instead of executing the current command
spark 1.6 in cloudera VM 5.X
Created a rdd by name fprdd ( this command executed fine)
pair1rdd = fprdd.map(lambda x : (x[2] ,(x[0],x[1]))
Tried even selecting the command by Shift + Enter too ( same in vain)
hitting enter takes to the new line
Can anyone help me with possible solution .

In order to get the answer to this question, you should go thru what is lazy evaluation in spark.
All transformations in Spark are lazy, in that they do not compute
their results right away. Instead, they just remember the
transformations applied to some base dataset (e.g. a file). The
transformations are only computed when an action requires a result to
be returned to the driver program. This design enables Spark to run
more efficiently. For example, we can realize that a dataset created
through map will be used in a reduce and return only the result of the
reduce to the driver, rather than the larger mapped dataset.

Related

Spark dataframe seems to be recomputed twice

SOLVED: I solved the issue, that was due to a very stupid, silly, idiotic mistake in one of the first passages of the flow.
Basically, I was computing a dataframe that was written to a Hive table; this dataframe then needed to be used to create the temporaryDF after many passages, but I was originally querying the table from scratch instead of using a copy of the dataframe to-be-written in the table. The mistake lies in the fact that the just-computed dataframes was missing previous partitions (due to the specific logic of the flow), whereas next computations to create temporaryDF needed also at least two previous partitions. I don't know why, I can't remember when, I decided to cache the just-computed one, thus losing information and getting an empty one under Oozie (in Spark-Shell I was always using at least three partitions, due to manually updating the table after some time - each new partition came every 15min). I was probably in a late night working sprint and my brain decided it was worthy to mess it up.
I upvoted and accepted #thebluephantom answer because he is very right within the specific circumstance I was describing.
Original:
I'm having a strange behaviour using Spark-Submit with Spark v.2.2.0.2.6.4.105-1 (using Scala) in Hadoop 2 under an Oozie workflow vs using Spark-Shell.
I have a Hive table that contains records that keep track of some processes every 15 minutes. The table is overwritten every time with new records or 'old' records that still satisfy the logic of the processes of interest.
I keep track of 'the age' of the records through a column that I will here call times_investigated, which ranges from 1 to 9.
I create a temporary dataframe, let's call it temporayDF, that contains both the old and the new entries (both the types need to be present to run useful computations). This temporayDF is then split between the new entries and the old ones, based on $"times_investigated" === 1 and $"times_investigated > 1" (or =!= 1).
Then, the processed entries are merged with a union in a final dataframe that is then written into the original Hive table.
// Before, I run the query on the 'old' Hive table and the logic over old and new entries.
// I now have a temporary dataframe
val temporaryDF = previousOtherDF
.withColumn("original_col_new", conditions)
.withColumn("original_other_col_new", otherConditions)
.withColumn("times_investigated_new", nvl($"times_investigated" + 1, 1))
.select(
previousColumns,
$"original_col_new".as("original_col"),
$"original_other_col_new".as("original_other_col"),
$"times_investigated_new".as("times_investigated"))
.cache
// Now I need to split the temporayDF in 2 to run some other logic on the new entries.
val newEntriesDF = temporaryDF
.filter($"times_investigated" === 1)
.join(neededDF, conditions, "leftouter")
.join(otherNeededDF, conditions, "leftouter")
.groupBy(cols)
.agg(min(colOne),
max(colTwo),
min(colThree),
max(colFour))
.withColumn("original_col_five_new",
when(conditions).otherwise(somethingElse))
.withColumn("original_col_six_new",
when(conditions).otherwise(somethingElse))
.select(orderedColumns)
val oldEntriesDF = temporaryDF.filter($"times_investigated" > 1)
val finalTableDF = oldEntriesDF.union(newEntriesDF)
// Now I write the table
finalTableDF.createOrReplaceTempView(tempFinalTableDF)
sql("""INSERT OVERWRITE TABLE $finalTableDF
SELECT * FROM tempFinalTableDF """)
// I would then need to re-use the newly-computed table to process further information...
The Problem:
The Hive table does not present the new entries with times_investigated = 1. It just processes the old ones, so, after those 9 times an entry can stay inside the table, it gets completely empty.
I run some tests within Spark-Shell and everything worked perfectly for many iterations, even manually writing the Hive table from the shell produced the expected results in the Hive table, but when I launched the workflow under Oozie, the strange behavior appeared again.
What I noticed within Spark-Shell is that, after writing the Hive table, if I went to compute a temporaryDF.show(), the new entries would be updated to $"times_investigated" = 2!
I tried to create a copy of temporaryDF to work on separate dataframes with the new and the old entries, but also this copyOfTemporaryDF gets updated after writing the Hive table.
It seems that this re-computation is happening before writing the Hive table under Oozie.
I know that I can compute the operations in a different manner, but I need to find a quick temporary fix on the current flow if possible.
Above all, I would love to understand what is happening under the hood, in order to avoid getting myself in such a circumstance later on.
Do you guys have any clue and/or advice?
I tried caching the intermediate dataframes, but without success.
P.S. Sorry for the probably bad coding practices
EDIT. More context: the temporaryDF comes from other intermediate dataframes, used just once to compute this on of interest. The last passages that create temporaryDF are withColumn operations, where $"times_investigated" is updated with a custom nvl function (that works exactly like the SQL one) and never gave problems in older versions of the flow (see below for the passages).
Edit2: I also tried to merge the operations on new and old entries in one long chained series, so that temopraryDF is actually the final dataframe to be written in the Hive table, but the new entries with times_investigated = 1 are still not considered (yet I have no issues via Spark-Shell and .showing the dataframe after writing to table makes it re-compute, so the times investigated are +1).
Use .cache otherwise you will get re-computation. You should do this for the appropriate dataframe or RDD if the RDD or DF is to be used multiple times in a single Spark App - not even Action dependent, sometimes you get "skipped stages".
val temporaryDF = previousOperations...cache()
2 vals use temporaryDF and without caching the recomputations will be as you see, and they may well give different results. That should be cached.
Of course if a Worker dies, or the partition evicted, some recomputing is needed.
.cache may not be ideal for datasets larger than available cluster memory. Each partition that is evicted will be rebuilt from source and that is a costly affair.
Also, using suitable partitioning and iterating a few times be better than persisting / caching; but it all depends.

Map on dataframe takes too long [duplicate]

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried to put cache() with the map call but that still doesn't do the trick. My map method actually uploads results to HDFS. So, its not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
TL;DR:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is both an action and has a clean semantics. What is also important, unlike map, it doesn't imply referential transparency.

pyspark dataframe writing results

I am working on a project where i need to read 12 files average file size is 3 gb. I read them with RDD and create dataframe with spark.createDataFrame. Now I need to process 30 Sql queries on the dataframe most them need output of previous one like depend on each other so i save all my intermediate state in dataframe and create temp view for that dataframe.
The program takes only 2 minutes for execute part but the problem is while writing them to csv file or show the results or calling count() function takes too much time. I have tries re-partition thing but still it is taking to much time.
1.What could be the solution?
2.Why it is taking too much time to write even all processing taking small amount of time?
I solved above problem with persist and cache in pyspark.
Spark is a lazy programming language. Two types of Spark RDD operations are- Transformations and Actions. A Transformation is a function that produces new RDD from the existing RDDs but when we want to work with the actual dataset, at that point Action is performed. When the action is triggered after the result, new RDD is not formed like transformation.
Every time i do some operation it was just transforming, so if i call that particular dataframe it will it parent query every time since spark is lazy,so adding persist stopped calling parent query multiple time. It saved lots of processing time.

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know it's some of the ideas behind the Spark Framework. But I couldn't answer it to myself by reading different tutorials.. (maybe the wrong ones).
I joined two DataFrames to a new one (nDF). Now I know, it's not yet proceeded, as long I say show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
..and so on, it would each time take a long time, since the original DataFrames are big. Couldn't I bring/copy the data to this new one. So I could solve these new actions as quick as on the original sets? (First I thought it's 'collect', but it only returns a Array, no DataFrame)
This is a classic scenario. When you join 2 Dataframes spark doesn't do any operation as it evaluates lazily when an action called on the resulting dataframe . Action mean show, count, print etc.
Now when show, count is being called on nDF, spark is evaluating the resultant dataframe every time i.e once when you called show, then when count is being called and so on. This means internally it is performing map/reduce every time an action is called on the resultant dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call the count/show it will evaluate the nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However the fist evaluation might be little slower also you need to using little more executor memory.
If the memory available to you is good with respect to the size of your dataset, what you're probably looking for is df.cache(). If the size of your dataset is too much, consider using df.persist() as it allows different levels of persistence.
Hope this is what you're looking for. Cheers

Is it possible to resume a failed Apache Spark job?

I am trying to run a Spark job over data from multiple Cassandra tables which are grouped as part of the job. I am trying to get an end to end run with a huge data set 13m data points and it has failed over multiple points. As I fix those failures and move ahead, I encounter the next problem which I fix and restart the job again. Is there a way to speed up the testing cycle on real data so that I can restart/resume a previously failed job from a specific checkpoint?
You can checkpoint your RDDs to disk at various midpoints, which would let you restart from there if necessary. You would have to save the intermediates as a sequence file or text file, and do a little work to make sure everything goes to and from disk cleanly.
I find it more useful to start up the spark-shell and build my data flow in there. If you can identify a subset of your data which is representative, even better. Once you get into the REPL you can create RDDs, check the first value or take(100) and print them to stdout, count various result data sets, and so on. The REPL is what makes spark 10x more productive than hadoop for me.
Once I have built, in the REPL, a flow of transformations and actions that gets me the result that I need, then I can form it into a scala file and refactor that to be clean; extract functions that can be reused and unit tested, tune the parallelism, whatever.
I often find myself going back into the REPL when I need to extend my data flow, so I copy and paste code from my scala file to get to a good starting point, and experiment with the extension from there.