Spark: partitioning based on the data size - scala

Working on some Spark job I faced with the problem of data skew.
I working on the spatial data and it looks like this:
<key - id of area; value - some meta data that also contains a size and link to binary file>.
The main problem that data is not distributed well. As it is spatial data for instance areas that contain big cities are very large. But areas with fields and countryside is to small.
So some Spark partitions for one RDD have size 400kb and other > 1gb. As a result long running last tasks.
Key contains only numbers (like 45949539) and for this moment all data will be partitioned by hash partitioner using this key - completely randomly.
My next transformations are "flatmap" an "reduce by key". (Yes, long running last tasks are here)
<id1, data1> -> [<id1, data1>, <id2, data2> ....]
While flatmap I produce all my internal data from binary file. So size of the file I know (As I said before - from meta data).
But I need some specific partitioner that can mix large and small areas within one partition.
Is it real to do in Spark?
This job is a some kind of legacy and I can use RDD only.

Related

Scala - Write data to file with row limit

I have an RDD with 30Million rows of data, Is there a way to save this into files of 1M each.
I think their is no direct way of doing it. one thing you can do is collect() your rdd and get the iterator from it and save it using normal file save using what scala provides. Something like this
val arrayValue = yourRdd.collect();
//Iterate the array and put it in file if it reaches the limit .
Note: This approach is not recommended if your data size id huge because collect() will bring all the records of RDD to driver code(Master).
You can do rdd.repartition(30). This will ensure that your data is about equally partitioned into 30 partitions and that should give you partitions which have roughly 1 Mil rows each.
Then you do simple rdd.saveAsTextFile(<path>) and Spark will create as many files as partitions under <path>. Or if you want more control over how and where your data is saved, you can do rdd.foreachPartition(f: Iterator[T] => Unit) and handle the logic of actually dealing with rows and saving then as you see fit within the function f passed to the foreachPartition. (Note that foreachPartition will run on each of your executor nodes and will not bring the data back to driver, which of course is a desirable thing).

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be json records coming from Kafka that contain an Id, I want to translate that id into a name using a lookup dataset. The lookup dataset resides in Mongo Db but I want to be able to cache it inside the spark process as the dataset changes very rarely (once every couple of hours) so I don't want to hit mongo for every input record or reload all the records in every spark batch but I need to be able to update the data held in spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular I've wondered is it possible to actually implement a database DB in Spark of sorts.
Well the answer is kind of yes. First you want a program that first caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny, by taking the periodic bi-hourly dataset and turning it into a HashMap then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and add on the new records using the broadcasted map. You can then periodically (nightly) save to hdfs or something.
So here is some scruffy pseudo code to elucidate:
var mainDataSet: RDD[KeyType, DataType] = sc.textFile("/path/to/main/dataset")
.map(parseJsonAndGetTheKey).cache()
everyTwoHoursDo {
val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
.map(parseJsonAndGetTheKey).toarray().toMap
broadcast(newData)
val mainDataSetNew =
mainDataSet.map((key, oldValue) => (key,
newData.get(key).map(newDataValue =>
update(oldValue, newDataValue))
.getOrElse(oldValue)))
.cache()
mainDataSetNew.someAction() // to force execution
mainDataSet.unpersist()
mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is your dataset is already ready in memory for some analysis / machine learning. The down side is your kinda reinventing the wheel a bit. It might be a better idea to look at using Cassandra as Datastax is partnering with Databricks (people who make Spark) and might end up supporting some kind of thing like this out of box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df). Then cache, m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.

Custom scalding tap (or Spark equivalent)

I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, with a custom file format.
What I would like to do is more or less the following:
start from a distributed list of records, such as a Scalding pipe or similar
group items by some computed function
make so that items belonging to the same group reside on the same server
on each group, apply a transformation - that involves sorting - and write the result on disk. In fact I need to write a bunch of MapFile - which are essentially sorted SequenceFile, plus an index.
I would like to implement the above with Scalding, but I am not sure how to do the last step.
While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
I recognize it is a bad idea to sort very large data, and this is the reason even on a single server I plan to split data into chunks.
Is there any way to do something like that with Scalding? Possibly I would be ok with using Cascading directly, or really an other pipeline framework, such as Spark.
Using Scalding (and the underlying Map/Reduce) you will need to use the TotalOrderPartitioner, which does pre-sampling to create appropriate buckets/splits of the input data.
Using Spark will speed up due to the faster access paths to the disk data. However it will still require shuffles to disk/hdfs so it will not be like orders of magnitude better.
In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:
val allData = sc.hadoopRdd(paths)
val partitionedRdd = sc.partitionBy(new RangePartitioner(numPartitions, allData)
val groupedRdd = partitionedRdd.groupByKey(..).
// apply further transforms..

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.

Sequential Row IDs in Column Oriented DBs (HBase, Cassandra)?

I've seen two contradictory pieces of advice when it comes to designing row IDs in HBase, (specifically, but I think it applies to Cassandra as well.)
Group keys that you'll be aggregating together often to take advantage of data locality. (White, Hadoop: The Definitive Guide and I recall seeing it on the HBase site, but can't find it...)
Spread keys around so that work can be distributed across multiple machines (Twitter, Pig, and HBase at Twitter slide 14)
I'm guessing which one is optimal can depend on your use case, but does anyone have any experience with either strategy?
In HBase, a table is partitioned into regions by dividing up the key space, which is sorted lexicographically. Each region of the table belongs to a single region server, so all reads and writes are handled by that server (which allows for a strong consistency guarantee). This means that if all of your reads or writes are concentrated on a small range of your keyspace, that you will only be able to scale to what a single region server can handle. For example, if your data is a time series and keyed by the timestamp, then all writes are going to the last region in the table, and you will be constrained to writing at the rate that a single server can handle.
On the other hand, if you can choose your keys such that any given query only needs to scan a small range of rows, but that the overall set of reads and writes are spread across your keyspace, then the total load will be distributed and scale nicely, but you can still enjoy the locality benefits for your query.