I want to execute something on each node using PySpark, something like this:
rdd = sqlContext.read.parquet("...").rdd

def f(i):
    import sys, socket
    return [(socket.gethostname(), sys.version)]

vv = rdd.mapPartitions(f).collect()
but I don't see why I should have to load a file just for that.
How do I do that?
You can use sc.parallelize(range(num_executors), num_executors) or something like that if you just want any old RDD.
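Putting that together, a minimal sketch (assuming a running SparkContext sc; num_executors is a value you choose, e.g. to match your --num-executors setting):

import sys
import socket

num_executors = 4  # assumption: set this to match your cluster configuration

# An RDD with one partition per executor slot; no input file is needed.
rdd = sc.parallelize(range(num_executors), num_executors)

def f(iterator):
    # Runs once per partition, i.e. roughly once per executor slot.
    return [(socket.gethostname(), sys.version)]

vv = rdd.mapPartitions(f).collect()
print(vv)  # list of (hostname, Python version) pairs reported by the workers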
I have a function that generates different queries, executes them, and writes data into different tables. I want to parallelize this.
Here is an example:
def build_and_execute_sql(item):
    gen_sql = 'insert overwrite table schema.table_d_{} select * from ...'.format(item)
    spark.sql(gen_sql)
sc = spark.sparkContext
lst = ['products', 'orders', 'deliveries']
rdd = sc.parallelize(lst)
rdd.foreach(build_and_execute_sql)
When I execute this, it fails with no specific error. My goal is to execute these queries in parallel.
I have about 12 such queries that are generated and executed.
I tried to play around with rdd.foreach(build_and_execute_sql).collect(), but nothing really works.
Any pointers? I'm wondering why foreach would fail.
I'm familiar with multiprocessing, but wondering if there is a clean way to do it in pyspark itself.
You can try Python multiprocessing instead. foreach fails because build_and_execute_sql runs inside executor tasks, and the SparkSession (spark.sql) can only be used on the driver, not on the executors. Since each generated query is itself a distributed Spark job, it is enough to submit the queries concurrently from the driver, e.g. with a thread pool:
from multiprocessing.pool import ThreadPool

lst = ['products', 'orders', 'deliveries']
with ThreadPool(3) as p:
    p.map(build_and_execute_sql, lst)
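Since the original foreach version failed "with no specific error", it can also help to wrap each call so per-table failures are reported rather than swallowed. A minimal sketch, where run_safely is a hypothetical helper (not part of any library):

from multiprocessing.pool import ThreadPool

def run_safely(item):
    # Hypothetical wrapper: capture per-table failures instead of losing them.
    try:
        build_and_execute_sql(item)
        return (item, None)
    except Exception as exc:
        return (item, str(exc))

lst = ['products', 'orders', 'deliveries']
with ThreadPool(len(lst)) as pool:
    results = pool.map(run_safely, lst)

for table, error in results:
    print(table, 'ok' if error is None else error)

Threads (rather than processes) are sufficient here because the heavy lifting happens on the Spark executors; the driver threads only submit the jobs.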
I created a library for updating the descriptions of the columns of an input dataset. The function takes three parameters (input_dataset, output_dataset, config file) and eventually writes back the descriptions of the output dataset. We now want to import this library across various use cases. How do I handle the cases where we are writing a Spark transformation, i.e. taking inputs through transform_df, since there we can't assign the output to an output variable? In that situation, how can I call my description library function? How should I proceed in Palantir Foundry? Any suggestions?
This isn't currently supported with the @transform_df decorator; you'll have to use the @transform decorator for now.
The reasoning comes from recognizing the need for broader access to metadata APIs, which the @transform decorator already allows. It seemed more in line with this pattern to keep that functionality there, since the @transform_df decorator is inherently higher-level.
You can always simply move your transformation from...
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ...
    return df
...to...
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
...in which only 6 lines of code need be changed.
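For the original question about calling the description library, one possible shape under the @transform decorator is sketched below. The helper name update_column_descriptions and its signature are assumptions taken from the question (input_dataset, output_dataset, config file), not a real Foundry API:

from transforms.api import transform, Input, Output
# from my_description_lib import update_column_descriptions  # hypothetical library

@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
    # With @transform you still hold the input/output objects after writing,
    # so the (hypothetical) description helper could be called here:
    # update_column_descriptions(my_input, my_output, "config.yml")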
I have an external table on a GCS bucket, and to do some compaction logic I want to determine the full path on which the table was created.
val tableName="stock_ticks_cow_part"
val primaryKey="key"
val versionPartition="version"
val datePartition="dt"
val datePartitionCol=new org.apache.spark.sql.ColumnName(datePartition)
import spark.implicits._
val compactionTable = spark.table(tableName)
  .withColumnRenamed(versionPartition, "compaction_version")
  .withColumnRenamed(datePartition, "date_key")
compactionTable. <code for determining the path>
Let me know if anyone knows how to determine the table path in Scala.
I think you can use .inputFiles, which:
Returns a best-effort snapshot of the files that compose this Dataset
Be aware that this returns an Array[String], so you should loop through it to get all the information you're looking for.
So actually just call
compactionTable.inputFiles
and look at each element of the Array
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier

// schema is the database the table lives in
lazy val catalog = spark.sessionState.catalog
lazy val tblMetadata = catalog.getTableMetadata(new TableIdentifier(tableName, Some(schema)))
lazy val s3location: String = tblMetadata.location.getPath
You can use SQL commands SHOW CREATE TABLE <tablename> or DESCRIBE FORMATTED <tablename>. Both should return the location of the external table, but they need some logic to extract this path...
See also How to get the value of the location for a Hive table using a Spark object?
Use the DESCRIBE FORMATTED SQL command and collect the path back to the driver.
In Scala:
val location = spark.sql("DESCRIBE FORMATTED table_name").filter("col_name = 'Location'").select("data_type").head().getString(0)
The same in Python:
location = spark.sql("DESCRIBE FORMATTED table_name").filter("col_name = 'Location'").select("data_type").head()[0]
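If you need this in more than one place, one way to wrap it in PySpark (a small sketch; returning None when the Location row is absent is an assumption, since the DESCRIBE FORMATTED layout varies slightly across Spark/Hive versions):

def table_location(spark, table_name):
    # Best-effort lookup of a table's storage path via DESCRIBE FORMATTED.
    rows = (spark.sql("DESCRIBE FORMATTED {}".format(table_name))
                 .filter("col_name = 'Location'")
                 .select("data_type")
                 .collect())
    return rows[0][0] if rows else None

print(table_location(spark, "stock_ticks_cow_part"))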
This is a follow-up question to
Pyspark filter operation on Dstream
How does one design a job that keeps a count of how many error/warning messages have come through in, say, a day or an hour?
What I have tried:
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e: e.foreach(counts))
    errors.pprint()

    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about it, and the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error/dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity; another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword, 0) + 1  // add the keyword if not present and increase the counter
    print dictionary  // or send this dictionary elsewhere
The last part, sending or printing the dictionary, requires switching out of the Spark streaming context. Can someone explain the concept please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e: e.foreachPartition(_process))
This will get your call to print to work (though it is worth noting that the print statement executes on the workers and not the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
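As a rough sketch of how that fits the error stream above (the checkpoint path and the single "error" key are assumptions for illustration; the stateful transformation must be set up before ssc.start()):

# Stateful operations require a checkpoint directory.
ssc.checkpoint("/tmp/streaming-checkpoint")  # assumption: any writable path

def update_count(new_values, running_count):
    # new_values: per-batch counts for the key; running_count: state so far.
    return sum(new_values) + (running_count or 0)

pairs = errors.map(lambda line: ("error", 1))
running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()  # prints the running total per key every batch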
I have developed a Hadoop-based solution that processes a binary file. This uses the classic Hadoop MR technique. The binary file is about 10 GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a custom InputFormat and a custom RecordReader in Hadoop that return a key (IntWritable) and a value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now, I would like to port this code to Spark. I am a starter in Spark and could run simple examples (word count, the pi example), but I could not find a straightforward example for processing binary files in Spark. I see two solutions for this use case. In the first, avoid using the custom input format and record reader: find a method (approach) in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic. If this is not possible, I would like to re-use the custom input format and custom reader using methods such as the Hadoop API, HadoopRDD, etc. My problem: I do not know whether the first approach is possible or not. If it is, can anyone please provide some pointers that contain examples? I was trying the second approach but have been highly unsuccessful. Here is the code snippet I used:
package org {

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.hadoop.io.{IntWritable, BytesWritable}
  // import of the custom RandomAccessInputFormat omitted here

  object Driver {
    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)

      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me where I am going wrong here? I think I am not using the API the right way, but I failed to find any documentation/usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
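For comparison, the same distinction in a tiny PySpark sketch (illustrative only, not tied to the custom input format above):

rdd = sc.parallelize([1, 2, 3, 4])

# map + collect(): results are computed on the executors and
# brought back to the driver as a local Python list.
squares = rdd.map(lambda x: x * x).collect()
print(squares)  # [1, 4, 9, 16]

# foreach(): runs purely for side effects on the executors and returns
# nothing, so on a cluster any output appears in the executor logs.
def show(x):
    print(x)

rdd.foreach(show)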
I have made some progress on this issue. I am now using the function below, which does the job:
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
classOf[IntWritable],
classOf[BytesWritable],
job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, I landed up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)