I want to keep track of duplicate rows by adding a version field to the incoming dataset in Spark. Is there a way I can do that?
I'm using DataFrames and I have come across the terms predicate pushdown and partitioning. I couldn't fully understand them; if possible, can you give an example of both?
Without predicate pushdown, when filtering query results a consumer of the parquet-mr API in Spark fetches all records from the API and then evaluates each record against the predicates of the filtering condition. Suppose, for example, that you are joining two large Parquet tables:
SELECT * FROM museum m JOIN painting p ON p.museumid = m.id WHERE p.width > 120 AND p.height > 150
However, this requires assembling all records in memory, even non-matching ones. With predicate pushdown, these conditions are passed to the parquet-mr library instead, which evaluates the predicates at a lower level and discards non-matching records without assembling them first. So the data is filtered based on the condition before it reaches memory for the join.
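As a rough PySpark sketch of the same idea (this assumes a Spark 2.x SparkSession; the paths are placeholders, and the column names come from the query above), you can check whether the predicates were actually pushed down by looking for a PushedFilters entry on the Parquet scan in the physical plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

museum = spark.read.parquet("/data/museum")      # placeholder path
painting = spark.read.parquet("/data/painting")  # placeholder path

# The width/height comparisons can be handed to the Parquet reader,
# so non-matching rows are dropped before the join.
result = (painting
          .filter((painting["width"] > 120) & (painting["height"] > 150))
          .join(museum, painting["museumid"] == museum["id"]))

# Look for something like
#   PushedFilters: [GreaterThan(width,120), GreaterThan(height,150)]
# in the Parquet scan node of the physical plan.
result.explain(True)
```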
Partitioning in Spark is how the data is logically divided. When you read files from S3, ADLS, or HDFS (e.g. gzip files), Spark creates a partition per file. Later, when you process the data, the number of DataFrame partitions changes according to the shuffle partition setting. So I think partitioning and predicate pushdown have only one thing in common: when fetching data from columnar files with a filter applied, the data that arrives in each Spark partition is smaller than the data lying on the file system.
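For the partitioning side, a minimal sketch (again Spark 2.x with a placeholder path; the exact counts depend on your files and settings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.read.parquet("/data/painting")   # placeholder path
print(df.rdd.getNumPartitions())            # roughly one partition per file/split

# After a shuffle (join, groupBy, ...) the partition count follows
# spark.sql.shuffle.partitions (default 200), unless adaptive execution changes it.
spark.conf.set("spark.sql.shuffle.partitions", "50")
grouped = df.groupBy("museumid").count()
print(grouped.rdd.getNumPartitions())
```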
Hope this helps
I'm currently developing a Spark application that uses DataFrames to compute aggregates on specific columns from a Hive table.
Aside from using the count() function on DataFrames/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a DataFrame?
I just need to know whether I need to override a specific function or something along those lines.
Any replies would be appreciated. I'm currently using Apache Spark 1.6.
Thank you.
Aside from using the count() function on DataFrames/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a DataFrame?
Nope. Since an RDD may have an arbitrarily complex execution plan, involving JDBC table queries, file scans, etc., there's no a priori way to determine its size short of counting.
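That said, if you go on to aggregate the same DataFrame after counting it, a small mitigation (a sketch in Spark 1.6 style; the table and column names are placeholders) is to persist it so the count and the aggregation don't each recompute the full plan:

```python
from pyspark import SparkContext, StorageLevel
from pyspark.sql import HiveContext

sc = SparkContext(appName="count-demo")
sqlContext = HiveContext(sc)

df = sqlContext.table("my_hive_table")        # placeholder Hive table
df.persist(StorageLevel.MEMORY_AND_DISK)

records_processed = df.count()                # first action: materializes and caches the data
aggregates = df.groupBy("some_key").sum("some_value")  # reuses the cached data
aggregates.show()
```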
I have an RDD which I need to store in MongoDB.
I tried using rdd.map to write each row of the RDD to MongoDB with pymongo, but I encountered a pickle error, as it seems that pickling the pymongo client and shipping it to the workers is not supported.
Hence, I do an rdd.collect() to bring the RDD to the driver and write it to MongoDB from there.
Is it possible to collect each partition of the RDD iteratively instead? That would minimize the chance of running out of memory on the driver.
Yes, it is possible. You can use RDD.toLocalIterator(). You should remember though that it doesn't come for free. Each partition will require a separate job so you should consider persisting your data before you use it.
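A minimal PySpark sketch of that approach, assuming pymongo is installed and the RDD contains dict-like records (the connection string, database, and collection names are placeholders):

```python
from itertools import islice

from pymongo import MongoClient
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-to-mongo")
rdd = sc.parallelize([{"value": i} for i in range(100000)])  # stand-in for your real RDD

# Persist first: toLocalIterator() runs a separate job per partition, so
# without persistence the upstream lineage would be recomputed each time.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

client = MongoClient("mongodb://localhost:27017")            # lives only on the driver
collection = client["mydb"]["mycollection"]

records = rdd.toLocalIterator()          # streams one partition at a time to the driver
while True:
    batch = list(islice(records, 1000))  # write in modest batches
    if not batch:
        break
    collection.insert_many(batch)
```

This keeps only about one partition's worth of rows (plus the current batch) on the driver at a time, which is the advantage over a full collect().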
I've got a fairly big RDD with 400 fields coming from a Kafka Spark stream. I need to create another RDD or Map by selecting some fields from the initial RDD when I transform the stream, and eventually write it to Elasticsearch.
I know my fields by field name but don't know the field index.
How do I project the specific fields by field name to a new Map?
Assuming each field is delimited by '#', you can determine the index of each field from the first row or a header file and store that mapping in some data structure. Subsequently, you can use this structure to look up the fields you need and create the new maps.
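A small sketch of that idea in PySpark (the header line, the '#' delimiter, and the field names are assumptions based on the question):

```python
from pyspark import SparkContext

sc = SparkContext(appName="project-by-name")

# Header describing the fields in order, e.g. from a header file or the first record.
header = "id#name#price#quantity"                 # placeholder for your 400-field header
index_of = {name: i for i, name in enumerate(header.split("#"))}

wanted = ["id", "price"]                          # the fields you want to project

lines = sc.parallelize(["1#widget#9.99#3",        # stand-in for the stream records
                        "2#gadget#4.50#7"])

def project(line):
    parts = line.split("#")
    return {field: parts[index_of[field]] for field in wanted}

projected = lines.map(project)
print(projected.collect())   # [{'id': '1', 'price': '9.99'}, {'id': '2', 'price': '4.50'}]
```

With a DStream you could apply the same project function inside map (or transform) on each batch before writing to Elasticsearch.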
You can use the Apache Avro format to pre-process the data. That would allow you to access the data by field name and would not require knowledge of field indexes within the string. The following link provides a great starting point for integrating Avro with Kafka and Spark.
http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
I am working on a cron job which needs to query Postgres on a daily basis. The table is huge, around a trillion records. On average I would expect to retrieve about a billion records per execution. I couldn't find any documentation on using cursors or pagination for Slick 2.1.0. An easy approach I can think of is to get the count first and loop through using drop and take. Is there a better, more efficient way to do this?
Map-reduce using Akka and postgresql-async: first get the count, then distribute offset+limit queries to actors, map the data as needed, and reduce the result into Elasticsearch or another store?