How to save data in PySpark to a single file on Amazon EMR

I use the following code to save data to the local disk:
receiptR.write.format('com.databricks.spark.csv').save('file:/mnt/dump/gp')
But I get the following directory structure:
[hadoop@ip-172-31-16-209 ~]$ cd /mnt/dump
[hadoop@ip-172-31-16-209 dump]$ ls -R
.:
gp
./gp:
_temporary
./gp/_temporary:
0
./gp/_temporary/0:
task_201610061116_0000_m_000000 _temporary
./gp/_temporary/0/task_201610061116_0000_m_000000:
part-00000
How can I save the data in the following structure?
/mnt/dump/gp/
part-00000

The files are separated out one per partition. So if you were to view your data on its own, you'd see this.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 4) # as "4" partitions
rdd.collect()
--> [1, 2, 3, 4, 5, 6, 7, 8, 9]
and if you view it with partitions visible:
rdd.glom().collect()
--> [[1, 2], [3, 4], [5, 6], [7, 8, 9]]
So when you save it, it will save the files broken into 4 pieces.
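For instance, saving that 4-partition RDD writes one part file per partition (a quick sketch; the local output path is just an illustration):

rdd.saveAsTextFile('file:///tmp/rdd_demo')
# /tmp/rdd_demo will then typically contain one file per partition:
#   part-00000, part-00001, part-00002, part-00003 (plus a _SUCCESS marker)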
As others have suggested in similar questions (e.g. how to make saveAsTextFile NOT split output into multiple file?), you can coalesce the dataset down to a single partition and then save:
coalesce(1,true).saveAsTextFile("s3://myBucket/path/to/file.txt")
However, a warning: the reason Spark spreads data across multiple partitions in the first place is that, for very large datasets, each node only has to deal with a small piece of the data. When you coalesce down to 1 partition, you force the entire dataset onto a single node. If you don't have the memory available for that, you'll get into trouble. Source: NullPointerException in Spark RDD map when submitted as a spark job
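Applied to the DataFrame from the question, a minimal sketch (reusing receiptR, the CSV format and the output path from the question, and assuming the data is small enough to fit on one node) looks like this:

# coalesce to a single partition so only one part file is written
receiptR.coalesce(1) \
    .write \
    .format('com.databricks.spark.csv') \
    .save('file:/mnt/dump/gp')

The output is still a directory (gp/) containing a single part-00000 file plus a success marker; if you need one plain file with a specific name, move or rename the part file afterwards.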

Related

How to partition dataframes which are partitionBy and joined with other keys?

I have a lot of dataframes with customer data and the history for different legal entities, simplified:
val myData = spark.createDataFrame(Seq(
  (1, 1, "a lot of Data", "2010-01-01-10.00.00"),
  (1, 1, "a lot of Data", "2010-01-20-10.31.00"),
  (1, 1, "a lot of Data", "2019-06-16-12.00.00"),
  (2, 5, "a lot of Data", "2010-01-01-10.00.00"),
  (2, 6, "a lot of Data", "2010-01-01-10.00.00"),
  (3, 7, "a lot of Data", "2010-01-01-10.00.00")))
  .toDF("legalentity", "customernumber", "anydata", "changetimestamp")
These dataframes are stored as parquet files and exposed as external Hive tables.
The change timestamp is transformed into a "valid from" / "valid to" range by views, like this:
CREATE VIEW myview
AS SELECT
  legalentity, customernumber, anydata,
  changetimestamp AS valid_from,
  COALESCE(LEAD(changetimestamp) OVER (PARTITION BY legalentity, customernumber ORDER BY changetimestamp ASC), '9999-12-31-00.00.00') AS valid_to
(this is simplified; some timestamp transformations are needed inside)
There are a lot of joins between the dataframes / hive tables later.
These dataframes are stored this way:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .write
  .format("parquet_format")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
For legal reasons the data of different legal entities must be stored in different HDFS paths; that is done by the partitionBy clause, which creates a separate folder for each legal entity.
There are small and big legal entities, some with a huge number of customers and others with only a few.
The number of shuffle partitions is averaged over all legal entities, which is fine.
Problems:
No additional columns can be used to partition the dataframe:
If we try to speed things up by repartitioning on more columns than the partitionBy clause used for writing, like:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .repartition(col("legalentity"), col("customernumber"))
  .write
  .format("parquet_format")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
then the full number of shuffle partitions is used inside every legal entity folder.
That results in: number of partitions = (number of legal entities) * (number of shuffle partitions).
Too many partitions:
There are small and big dataframes / tables. All get the same number of shuffle partitions, so small dataframes have partition sizes of 3 MB or less.
If we use different numbers of partitions for each table so that the file size gets close to 128 MB, everything is slowed down.
We get new data every day, which we just append; for that we don't use the number of shuffle partitions, we repartition(1).
Sometimes we have to reload everything to compact all these partitions, but at least our processes are not slowed down by the new daily data.

How to split one input stream into multiple topics and guarantee simultaneous consumption

I want to create a simple sensor-data application with Apache Kafka. My question is very basic and concerns the core concepts of Apache Kafka; I'm a beginner with it.
Here is my requirement:
I get sensor data as a byte array with different readings inside.
For example, the array consists of three entries (Temperature 1, Temperature 2 and Voltage). Here is an example with 4 arrays and their values. Each array arrives at a defined timestamp.
Array 1: [ 1, 2, 3 ]
Array 2: [ 4, 5, 6 ]
Array 3: [ 7, 8, 9 ]
Array 4: [ 10, 11, 12 ]
Now I want to read these arrays and want to produce messages for three topics:
topic-temp1
topic-temp2
topic-voltage
The order of producing is:
Read array 1
produce message to topic-temp1 (value=1)
produce message to topic-temp2 (value=2)
produce message to topic-voltage (value=3)
Read array 2
produce message to topic-temp1 (value=4)
produce message to topic-temp2 (value=5)
produce message to topic-voltage (value=6)
Read array 3
produce message to topic-temp1 (value=7)
produce message to topic-temp2 (value=8)
produce message to topic-voltage (value=9)
... Read array n ...
After that I have 3 Topics with different data inside:
topic-temp1: 1, 4, 7, 10
topic-temp2: 2, 5, 8, 11
topic-voltage: 3, 6, 9, 12
Now to my question:
I want to create a software application that consumes these 3 topics. I want to display 3 graphs (temp1, temp2, voltage) in one diagram. The y-axis is the signal value and the x-axis is the timestamp.
How can I guarantee that I get the consumed values for the same timestamp? Only then can I overlay the graphs:
1,2,3
4,5,6
7,8,9
10,11,12
Should I use the Kafka Streams API, with one input stream topic (byte array) and three output stream topics? How can I ensure that these three values are produced together and consumed together?
Or should I use the plain consumer API and access the data via the offset, since the offset should be the same for the entries (1,2,3), (4,5,6), ..., because I produced them in that order?
Thank you in advance!
I suggest you use one topic of sensor readings with a payload containing the sensor name (or preferably a UUID), so you know which sensor sent the data, and the data it generates, as one whole message.
Otherwise, joining data purely by timestamp doesn't seem that fail-proof.
Your message key can be the UUID/name, and you can scale that to hundreds of partitions.
You could binary-encode the data you're sending, but I will use a JSON string for illustration:
{
  "sensor_id": "some unique name",
  "temperatures": [1, 2],
  "voltage": 3
}
If you want three topics out of that, you can very easily create three output topics using Kafka Streams or KSQL.
Otherwise, go ahead and create individual topics, but add the ID/name so you can join on it, using time windows on the order of seconds or minutes, rather than trying to adjust for lag where one event is just microseconds off and you cannot join the messages.
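As a rough sketch of the single-topic approach (assumptions: the kafka-python client, a broker at localhost:9092, and the hypothetical topic name sensor-readings), each incoming array becomes one message keyed by the sensor id:

import json
from kafka import KafkaProducer  # assumption: kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',                       # assumed broker address
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

def publish_reading(sensor_id, array, timestamp):
    # the whole array (temp1, temp2, voltage) goes out as one message,
    # so the three values can never be consumed apart from each other
    reading = {
        'sensor_id': sensor_id,
        'timestamp': timestamp,
        'temperatures': array[:2],
        'voltage': array[2],
    }
    producer.send('sensor-readings', key=sensor_id, value=reading)

publish_reading('sensor-1', [1, 2, 3], 1000)
publish_reading('sensor-1', [4, 5, 6], 2000)
producer.flush()

A consumer of sensor-readings then gets temp1, temp2 and voltage for the same timestamp in a single record, which sidesteps the alignment problem entirely.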

Can't write ordered data to parquet in spark

I am working with Apache Spark to generate parquet files. I can partition them by date with no problems, but internally I can not seem to lay out the data in the correct order.
The order seems to get lost during processing, which means the parquet metadata is not right (specifically I want to ensure that the parquet row groups are reflecting sorted order so that queries specific to my use case can filter efficiently via the metadata).
Consider the following example:
// note: hbase source is a registered temp table generated from hbase
val transformed = sqlContext.sql(s"SELECT id, sampleTime, ... , toDate(sampleTime) as date FROM hbaseSource")
// Repartition the input set by the date column (in my source there should be 2 distinct dates)
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.coalesce(1).write.partitionBy("date").parquet(s"/outputFiles")
With this approach, I do get the right parquet partition structure (by date). And even better, for each date partition, I see a single large parquet file.
/outputFiles/date=2018-01-01/part-00000-4f14286c-6e2c-464a-bd96-612178868263.snappy.parquet
However, when I query the file I see the contents out of order. To be specific, "out of order" looks more like several sorted data-frame partitions having been merged into the file.
The parquet row-group metadata shows that the sorted fields are actually overlapping (a specific id, for example, could be located in many row groups):
id: :[min: 54, max: 65012, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 827, max: 65470, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 1629, max: 61412, num_nulls: 0]
I want the data to be properly ordered inside each file so the metadata min/max in each row group are non-overlapping.
For example, this is the pattern I want to see:
RG 0: id: :[min: 54, max: 100, num_nulls: 0]
RG 1: id: :[min: 100, max: 200, num_nulls: 0]
... where RG = "row group". If I wanted id = 75, the query could find it in one row group.
I have tried many variations of the above code, for example with and without coalesce (I know coalesce is bad, but my idea was to use it to prevent shuffling). I have also tried sort instead of sortWithinPartitions (sort should create a totally ordered sort, but results in many partitions). For example:
val sorted = transformed.repartition($"date").sort("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
This gives me 200 files, which is too many, and they are still not sorted correctly. I can reduce the file count by adjusting the shuffle partition setting, but I would have expected the sort order to be preserved during the write (I was under the impression that writes did not shuffle the input). The order I see is as follows (other fields omitted for brevity):
+----------+----------------+
|id| sampleTime|
+----------+----------------+
| 56868|1514840220000000|
| 57834|1514785180000000|
| 56868|1514840220000000|
| 57834|1514785180000000|
| 56868|1514840220000000|
This looks like interleaved sorted partitions. So I think repartition buys me nothing here, and sort seems incapable of preserving order through the write step.
I've read that what I want to do should be possible. I've even tried the approach outlined in the presentation "Parquet performance tuning: the missing guide" by Ryan Blue (unfortunately it is behind the O'Reilly paywall). That involves using insertInto. In that case, Spark seemed to use an old version of parquet-mr which corrupted the metadata, and I am not sure how to upgrade it.
I am not sure what I am doing wrong. My feeling is that I am misunderstanding the way repartition($"date") and sort work and/or interact.
I would appreciate any ideas. Apologies for the essay. :)
edit:
Also note that if I do a show(n) on transformed.sort("id", "sampleTime") the data is sorted correctly. So it seems like the problem occurs during the write stage. As noted above, it does seem like the output of the sort is shuffled during the write.
The problem is that when saving to a file format, Spark requires a certain ordering, and if that ordering is not satisfied, Spark will sort the data itself during the save according to the requirement and will discard your sort. To be more specific, Spark requires this ordering (taken directly from the Spark source code of Spark 2.4.4):
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
where partitionColumns are the columns by which you partition the data. You are not using bucketing, so bucketIdExpression and sortColumns are not relevant in this example and the requiredOrdering will be only the partitionColumns. So if this is your code:
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check whether the data is sorted by date, which it is not, so Spark will discard your sort and will sort it by date itself. On the other hand, if you instead do it like this:
val sorted = transformed.repartition($"date").sortWithinPartitions("date", "id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check again whether the data is sorted by date, and this time it is (the requirement is satisfied), so Spark will preserve this order and will introduce no more sorts while saving the data. So I believe this way it should work.
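One way to verify the fix is to inspect the row-group statistics of a written file. Below is a small sketch using pyarrow (assumptions: pyarrow is available, and the path is a placeholder for one of the actual part files under /outputFiles); it prints the id min/max per row group, which should now be non-overlapping:

import pyarrow.parquet as pq

# hypothetical path: point this at one of the written part files
pf = pq.ParquetFile('/outputFiles/date=2018-01-01/part-00000.snappy.parquet')
meta = pf.metadata
for rg_index in range(meta.num_row_groups):
    rg = meta.row_group(rg_index)
    for col_index in range(rg.num_columns):
        col = rg.column(col_index)
        if col.path_in_schema == 'id' and col.statistics is not None:
            stats = col.statistics
            print(f'RG {rg_index}: id min={stats.min} max={stats.max}')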
Just an idea: sort after the coalesce, i.e. ".coalesce(1).sortWithinPartitions()". Also, the expected result looks strange: why is ordered data in parquet required? Sorting after reading looks more appropriate.

Stream processing architecture: future events affect past results

I'm new to stream processing (kafka streams / flink / storm / spark / etc.) and trying to figure out the best way to go about handling a real world problem, represented here by a toy example. We are tied to Kafka for our pubsub/data ingestion, but have no particular attachment in terms of stream processor framework/approach.
Theoretically, suppose I have a source emitting floating point values sporadically. Also at any given point there is a multiplier M that should be applied to this source's values; but M can change, and critically, I may only find out about the change much later - possibly not even "in change order."
I am thinking of representing this in Kafka as
"Values": (timestamp, floating point value) - the values from the source, tagged with their emission time.
"Multipliers": (timestamp, floating point multiplier) - indicates M changed to this floating point multiplier at this timestamp.
I would then be tempted to create an output topic, say "Results", using a standard stream processing framework, that joins the two streams, and merely multiplies each value in Values with the current multiplier determined by Multipliers.
However, based on my understanding this is not going to work, because new events posted to Multipliers can have an arbitrarily large impact on results already written to the Results stream. Conceptually, I would like something like a Results stream that is current as of the last event posted to Multipliers, applied against all values in Values, but which can be "recalculated" as further Values or Multipliers events come in.
What are some techniques for achieving/architecting this with kafka and major stream processors?
Example:
Initially,
Values = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 2.2)]
Multipliers = [(1, 1.0)]
Results = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 2.2)]
Later,
Values = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 2.2)]
Multipliers = [(1, 1.0), (4, 2.0)]
Results = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 4.4)]
Finally, after yet another event posted to Multipliers (and also a new Value emitted too):
Values = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 2.2), (7, 5.0)]
Multipliers = [(1, 1.0), (4, 2.0), (2, 3.0)]
Results = [(1, 2.4), (2, 10.8), (3, 3.0), (5, 4.4), (7, 10.0)]
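To make the intended semantics concrete, here is a small plain-Python sketch (not tied to any particular stream processor) that recomputes Results from scratch: each value is multiplied by the multiplier with the greatest timestamp that is less than or equal to the value's timestamp (a default multiplier of 1.0 for values preceding all multipliers is my assumption):

import bisect

def recompute_results(values, multipliers):
    # values: list of (timestamp, value); multipliers: list of (timestamp, multiplier), possibly out of order
    mults = sorted(multipliers)                    # order multipliers by timestamp
    mult_ts = [ts for ts, _ in mults]
    results = []
    for ts, v in values:
        i = bisect.bisect_right(mult_ts, ts) - 1   # last multiplier whose timestamp is <= ts
        m = mults[i][1] if i >= 0 else 1.0         # assumed default when no multiplier applies yet
        results.append((ts, v * m))
    return results

values = [(1, 2.4), (2, 3.6), (3, 1.0), (5, 2.2), (7, 5.0)]
multipliers = [(1, 1.0), (4, 2.0), (2, 3.0)]
print(recompute_results(values, multipliers))
# matches the final Results above (3.6 * 3.0 prints as 10.799999999999999 due to float rounding)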
I am only familiar with Spark. For this to work as you describe, you want to selectively "update" previous results as new multiplier values are received, while applying the latest applicable multiplier to new values that have not yet had one applied. AFAIK, Spark by itself won't let you do this with streaming (you need to cache and update old results, and you also need to know which multiplier to use for new values). However, you could code the logic such that you write your "results" topic to a regular DB table; when you receive a new multiplier, all subsequent events in the Values dataframe would just use that value, and you would do a one-time check to find values in the results table that now need to be updated to use the new multiplier, and simply update those rows in the DB table.
Your results consumer has to be able to deal with inserts and updates. You can use Spark with any DB that has a connector to achieve this.
Alternatively, you can use SnappyData, which turns Apache Spark into a mutable compute + data platform. Using Snappy, you would have Values and Multipliers as regular streaming dataframes, and Results as a dataframe set up as a replicated table in SnappyData. When you process a new entry in the multiplier stream, you would update all results stored in the results table. This is perhaps the easiest way to accomplish what you are trying to do.

Secondary sorting by using join in Spark?

In Spark, I want to sort an RDD by two different fields. For example, in the given example here, I want to sort the elements by fieldA first, and within that, sort by fieldB (Secondary sorting). Is the method employed in the given example good enough? I have tested my code and it works. But is this a reliable way of doing it?
// x is of type (key, fieldA) and y of type (key, fieldB)
val a = x.sortBy(_._2)
// b will be of type (key, (fieldB, fieldA))
val b = y.join(x).sortBy(_._2._1)
So, I want an output that looks like the following, for example.
fieldA, fieldB
2, 10
2, 11
2, 13
7, 5
7, 7
7, 8
9, 3
9, 10
9, 10
But is this a reliable way of doing it?
It is not reliable. It depends on the assumption that, during the shuffle, data is processed in the order defined by the order of the partitions. This may happen, but there is no guarantee that it will.
In other words, shuffle-based sorting is not stable. In general, there are methods that can achieve the desired result without performing a full shuffle twice, but they are quite low-level and, for optimal performance, require a custom Partitioner.
You can use sortBy in the following way:
y.join(x).sortBy(r => (r._2._2, r._2._1))
Both sorts happen in a single pass.
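For reference, the same idea in PySpark (a sketch; it assumes x and y are pair RDDs shaped as in the question, built here from toy data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

x = sc.parallelize([('k1', 2), ('k2', 7), ('k3', 9)])    # (key, fieldA)
y = sc.parallelize([('k1', 10), ('k2', 5), ('k3', 3)])   # (key, fieldB)

# join gives (key, (fieldB, fieldA)); sort by (fieldA, fieldB) in one pass
b = y.join(x).sortBy(lambda kv: (kv[1][1], kv[1][0]))
print(b.collect())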