How to parallelize a function the correct way - pyspark

I currently have a setup where I have a very large pandas DataFrame "df" that can be filtered into separate batches: df_batch_1 = df[df['column'] == <filter 1>], df_batch_2 = df[df['column'] == <filter 2>], etc. In addition, I have a very "heavy" function heavy_function(df_batch) that must do some heavy computation on each batch df_batch_1, df_batch_2, etc. My plan is to do this "heavy lifting" on a Databricks (PySpark) cluster, since it runs too slowly on a regular computer.
So far, I am running this by using threads on a Databricks cluster, like this:

import threading

filters = [<filter 1>, <filter 2>, ...]
threads = [threading.Thread(target=heavy_function, args=(df[df['column'] == f],)) for f in filters]
for t in threads:
    t.start()
I have been told that this is an anti-pattern and that I should find another way of doing this. I hope you can help point me in the right direction. What is the right "pysparkian" way of doing this?
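For reference, I have seen groupBy(...).applyInPandas (Spark 3.0+) suggested for this kind of workload, something like the sketch below (assuming heavy_function takes and returns a pandas DataFrame with the same columns, and that the data is already available to Spark), but I am not sure whether this is the recommended approach:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# In practice the data would be read directly into Spark rather than
# converted from a huge pandas DataFrame on the driver.
sdf = spark.createDataFrame(df)

# Spark groups the rows by 'column' (one group per filter value) and runs
# heavy_function on each group as a pandas DataFrame on the executors.
result = sdf.groupBy("column").applyInPandas(heavy_function, schema=sdf.schema)
result.write.parquet("/path/to/output")  # hypothetical output path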
Any help is appreciated!
Best regards,
DK

Related

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my Cloud Dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a Parallel Do (ParDo) where it takes each element from a PCollection, sends it to a worker node, and applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B... but if side input B is very big, it won't fit into the worker's memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, this is a sign of not enough memory.
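To make the picture concrete, here is a rough Beam (Python SDK) sketch of that side-input pattern; the file paths and the key extraction are made up, but it shows why every worker ends up holding the entire reference set in memory:

import apache_beam as beam

with beam.Pipeline() as p:
    main = p | "ReadMain" >> beam.io.ReadFromText("gs://bucket/file_a.csv")
    ref = p | "ReadRef" >> beam.io.ReadFromText("gs://bucket/file_b.csv")

    # The reference data is materialized as a dict side input; every worker
    # receives a full copy, so a very large reference set can exhaust memory.
    ref_dict = beam.pvalue.AsDict(
        ref | "ToKV" >> beam.Map(lambda line: tuple(line.split(",", 1))))

    joined = main | "Join" >> beam.Map(
        lambda line, lookup: (line, lookup.get(line.split(",")[0])),
        lookup=ref_dict)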
You can try doing this: suppose you have two CSV files to read in and process; file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of join, it will very quickly outgrow the worker memory and puke. If you can, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the systems engineers suggested: split one file up into many small chunks and then feed them to the recipe iteratively against the other, larger file.
Another option to test is specifying the compute type for the workers. You can grow the compute type iteratively to see if the job finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon; they don't make it easy to ask them questions, that's for sure.

How to split a large data frame and use the smaller parts to do multiple broadcast joins in Spark?

Let's say we have two very large data frames - A and B. Now, I understand that if I use the same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join might be faster with reduced shuffling (the only shuffling that will happen will be when the partitioner changes on A and B).
I wanted to try something different though - I want to try a broadcast join, like so: let's say B is smaller than A, so we pick B to broadcast, but B is still a very big data frame. So what we want to do is make multiple data frames out of B and then broadcast each of them to be joined against A.
Has anyone tried this?
To split one data frame into many, I am only seeing the randomSplit method, but that doesn't look like a great option.
Any other better way to accomplish this task?
Thanks!
Has anyone tried this?
Yes, someone already tried that. In particular GoDataDriven. You can find details below:
presentation - https://databricks.com/session/working-skewed-data-iterative-broadcast
code - https://github.com/godatadriven/iterative-broadcast-join
They claim pretty good results for skewed data; however, there are three problems you have to consider when doing this yourself:
There is no split in Spark. You have to filter data multiple times or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?) to imitate "splitting".
A huge advantage of broadcasting is the reduction in the amount of transferred data, but if the broadcast side is large, the amount of data transferred can actually increase significantly (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark).
Each "join" increases the complexity of the execution plan, and with a long series of transformations things can get really slow on the driver side.
randomSplit method but that doesn't look so great an option.
It is actually not a bad one.
Any other better way to accomplish this task?
You may try to filter by partition id.
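For illustration, a rough PySpark sketch of that idea (A, B and the join column "key" are made-up names; caching B after tagging it helps keep the partition ids stable across the repeated filters):

from functools import reduce
from pyspark.sql.functions import spark_partition_id, broadcast

# Tag each row of B with its partition id, then treat every partition as one
# "chunk" that is small enough to broadcast on its own.
b = B.withColumn("pid", spark_partition_id()).cache()
num_chunks = b.rdd.getNumPartitions()

pieces = []
for i in range(num_chunks):
    chunk = b.filter(b["pid"] == i).drop("pid")
    pieces.append(A.join(broadcast(chunk), "key"))

# For an inner join, the union of the per-chunk joins equals the full join,
# because every row of B lands in exactly one chunk.
result = reduce(lambda x, y: x.unionByName(y), pieces)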

How to make MapReduce work with HDFS

This might sound like a stupid question.
I can write MR code that takes its input from and writes its output to HDFS locations, and then I really don't need to worry about the parallel computing power of Hadoop/MR (please correct me if I am wrong here).
However, if my input is not an HDFS location - say I am taking MongoDB data as input - mongodb://localhost:27017/mongo_hadoop.messages - and running my mappers and reducers and storing the data back in MongoDB, how will HDFS come into the picture? I mean, how can I be sure that a 1 GB (or any size) big file is first distributed on HDFS and that the parallel computing is then done on it?
Is it that this direct URI will not distribute the data, and I need to take the BSON file instead, load it onto HDFS and then give the HDFS path as input to MR? Or is the framework smart enough to do this by itself?
I am sorry if the above question is too stupid or doesn't make any sense at all. I am really new to big data but very much excited to dive into this domain.
Thanks.
You are describing DBInputFormat. This is an input format that reads the splits from an external database. HDFS only gets involved in setting up the job, but not in the actual input. There is also a DBOutputFormat. With an input format like DBInputFormat the splits are logical, e.g. key ranges.
Read Database Access with Apache Hadoop for a detailed explanation.
Sorry, I am not sure about MongoDB.
If you just want to know how splitting happens when the data source is a table, then this is my answer for when MapReduce works with HBase.
We use TableInputFormat to use an HBase table in a MapReduce job.
From http://hbase.apache.org/book.html#hbase.mapreduce.classpath:
7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
7.7.2. Custom Splitters
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
This is a good question, not stupid.
1.
"mongodb://localhost:27017/mongo_hadoop.messages and running my mappers and reducers and storing the data back to mongodb, how will HDFS come into picture. "
In this situation, you needn't consider HDFS at all; you don't need to do anything related to HDFS. It is just like writing a multi-threaded application where each thread writes data to MongoDB.
In fact, HDFS is independent of MapReduce, and MapReduce is independent of HDFS, so you can use them separately or together as you wish.
2.
If you want to use a database as input/output for MapReduce, you should consider DBInputFormat, but that's another question.
At the moment, Hadoop's DBInputFormat only supports JDBC. I'm not sure whether there is a MongoDB version of DBInputFormat. Maybe you can search for it or implement it yourself.

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark Streaming for some real-time work, but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be JSON records coming from Kafka that contain an id. I want to translate that id into a name using a lookup dataset. The lookup dataset resides in MongoDB, but I want to be able to cache it inside the Spark process, as the dataset changes very rarely (once every couple of hours). I don't want to hit Mongo for every input record or reload all the records in every Spark batch, but I need to be able to update the data held in Spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular, I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. First you want a program that caches the main dataset into memory, then every couple of hours does an optimized join-with-tiny to update the main dataset. Now, apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny by taking the periodic bi-hourly dataset, turning it into a HashMap, and then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the map, which would be copied once per task - a much greater cost). Then you take your main dataset and update it with the new records using the broadcast map. You can then periodically (nightly) save to HDFS or something.
So here is some scruffy pseudo code to elucidate:
var mainDataSet: RDD[(KeyType, DataType)] = sc.textFile("/path/to/main/dataset")
  .map(parseJsonAndGetTheKey).cache()

everyTwoHoursDo {
  val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
    .map(parseJsonAndGetTheKey).collect().toMap

  // Broadcast the small update set so it is shipped once per node, not once per task
  val newDataBC = sc.broadcast(newData)

  val mainDataSetNew =
    mainDataSet.map { case (key, oldValue) =>
      (key, newDataBC.value.get(key)
        .map(newDataValue => update(oldValue, newDataValue))
        .getOrElse(oldValue))
    }.cache()

  mainDataSetNew.someAction() // to force execution

  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a sub-map. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is that your dataset is already in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting some kind of thing like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple work-flow (a rough code sketch follows below):
For each batch of data:
1. Convert the batch of JSON data to a DataFrame (b_df).
2. Read the lookup dataset from MongoDB as a DataFrame (m_df), then cache it: m_df.cache()
3. Join the data using b_df.join(m_df, "join_field")
4. Perform your required aggregation and then write to a data source.
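For concreteness, here is a rough PySpark sketch of that workflow using Structured Streaming's foreachBatch (the original question is about the older DStream API, but the shape is the same). The Kafka and MongoDB connector options, the JSON schema and the field names are all placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Lookup data: read once and cache; re-read on whatever schedule suits you.
# The exact format/options depend on the MongoDB connector version.
m_df = spark.read.format("mongodb").load()
m_df.cache()

# Hypothetical schema for the incoming JSON records.
schema = StructType([StructField("join_field", StringType()),
                     StructField("payload", StringType())])

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")
          .option("subscribe", "events")
          .load())

parsed = (stream.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("d"))
          .select("d.*"))

def process_batch(b_df, batch_id):
    # Join each micro-batch against the cached lookup, aggregate, and write out.
    joined = b_df.join(m_df, "join_field")
    (joined.groupBy("join_field").count()
           .write.mode("append").parquet("/path/to/output"))

query = parsed.writeStream.foreachBatch(process_batch).start()
query.awaitTermination()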

Producing ngram frequencies for a large dataset

I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX is suitable for my purposes.
What's the best and most cost efficient way to do it in the next day or so?
My thoughts are:
PostgreSQL using regex to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have regex to do the splitting (major accuracy isn't required here)
MapReduce with Hadoop
MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about
My experience with Hadoop consists of calculating Pi on three EC2 instances very very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce.
PostgreSQL I fear will take a long, long time, as it's not easily parallelisable.
Any other ways to do it? What's my best bet for getting it done in the next couple days?
MapReduce will work just fine, and you could probably do most of the input-output shuffling with Pig.
See
http://arxiv.org/abs/1207.4371
for some algorithms.
Of course, to make sure you get a running start, you don't actually need to use MapReduce for this task; just split the input yourself, write the simplest fast program to calculate the n-grams of a single input file, and aggregate the n-gram frequencies later.
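As a concrete example of that "simplest fast program" idea, here is a minimal single-file n-gram counter in plain Python (the tokenization is deliberately naive, in line with the "major accuracy isn't required" note in the question; adjust N as needed):

import sys
from collections import Counter

N = 2  # bigrams
counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        tokens = line.lower().split()
        counts.update(tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1))

# Print the 20 most frequent n-grams with their counts.
for ngram, freq in counts.most_common(20):
    print(" ".join(ngram), freq)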
Hadoop gives you two good things which, in my opinion, are the main ones: parallel task execution (map-only jobs) and distributed sort (the shuffling between map and reduce).
For the n-grams, it looks like you need both - parallel tasks (mappers) to emit the n-grams, and shuffling to count the occurrences of each n-gram.
So I think Hadoop is an ideal solution here.
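A minimal Hadoop Streaming sketch of exactly that split - the mapper emits n-grams, the shuffle groups them by key, and the reducer sums the counts (treat the script names and details as illustrative):

# mapper.py
import sys

N = 2
for line in sys.stdin:
    tokens = line.lower().split()
    for i in range(len(tokens) - N + 1):
        # key = the n-gram, value = 1, separated by a tab for the shuffle
        print(" ".join(tokens[i:i + N]) + "\t1")

# reducer.py (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print(current + "\t" + str(total))

You would run the two scripts with the hadoop-streaming jar, passing them as the mapper and reducer over your split input files.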