Slow performance reading parquet files in S3 with scala in Spark

I save a partitioned Parquet dataset to an S3 bucket from a DataFrame in Scala:
data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")
When I read this partitioned dataset back, I'm experiencing very slow performance. I'm just doing a simple group by:
val load_df = sqlContext.read.parquet(s"s3n://...").cache()
I also tried:
load_df.registerTempTable("dataframe")
Any advice? Am I doing something wrong?

It depends on what you mean by "very slow performance".
If you have too many files in your date partitions, it will take some time to read them.
Try to reduce the granularity of the partitioning.

You should use the S3A driver for better performance. This may be as simple as changing your URL protocol to s3a://, or you may also need to add the hadoop-aws and aws-sdk jars to your classpath.
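As a small illustration of the protocol change, here is a sketch of a helper that rewrites an s3n:// URL to s3a://. The function name to_s3a and the bucket name my-bucket are made up for this example; whether the rewritten URL actually works still depends on the hadoop-aws and AWS SDK jars being on the classpath.

```python
def to_s3a(url):
    """Rewrite an s3n:// URL to use the s3a:// scheme; leave other URLs alone."""
    old_scheme = "s3n://"
    if url.startswith(old_scheme):
        return "s3a://" + url[len(old_scheme):]
    return url


# "my-bucket" is a hypothetical bucket name used only for illustration.
print(to_s3a("s3n://my-bucket/events/date=2016-01-01"))
```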

Related

Should we avoid partitionBy when writing files to S3 in spark?

The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have 3 ref_id values, 4 dates, and 500 camera_id values. If I write Parquet as below (using partitionBy), I will get 3 x 4 x 500 = 6000 files uploaded to S3. This is much slower than writing just a couple of files to the top level of the bucket (with no multi-level prefix).
What is the best practice? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table.
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
If your problem is too many files, which seems to be the case, you need to repartition your RDD/DataFrame before you write it. Each RDD/DataFrame partition will generate one file in every folder it has rows for.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As an alternative to repartition you can also use coalesce.
If (after repartitioning to 1) the files are too small, you need to reduce the directory structure. The Parquet documentation recommends row groups between 512 MB and 1 GB, so files should be of a comparable size:
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row
group might need to be read, we want it to completely fit on one HDFS
block.
If your files are only a few KB or MB, you have a serious problem: files that small will badly hurt performance.
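To make the file-count arithmetic concrete, here is a rough estimate in plain Python (no Spark needed). It assumes the worst case, where every DataFrame partition holds rows for every folder. The function name and the 6 GiB total size are hypothetical, chosen only for illustration.

```python
def estimate_output_files(partition_value_counts, dataframe_partitions, total_bytes):
    """Return (max_files, avg_file_bytes) for a partitionBy write.

    Upper bound: each DataFrame partition writes one file into every
    folder it has rows for, so files <= folders * DataFrame partitions.
    """
    folders = 1
    for count in partition_value_counts:
        folders *= count
    max_files = folders * dataframe_partitions
    return max_files, total_bytes / max_files


# 3 ref_ids x 4 dates x 500 camera_ids, written after repartition(1).
# The 6 GiB dataset size is an assumed figure, not taken from the question.
files, avg_bytes = estimate_output_files([3, 4, 500], 1, 6 * 1024**3)
print(files, "files, about", round(avg_bytes / 1024**2, 2), "MiB each")

# Without repartition(1), e.g. with 200 DataFrame partitions, the bound grows.
print(estimate_output_files([3, 4, 500], 200, 6 * 1024**3)[0], "files at most")
```

Even with a single DataFrame partition, the assumed 6 GiB spread over 6000 folders gives files of about 1 MiB, far below the recommended size, which is why reducing the directory structure matters here.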

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my future queries. Also, I must write the data in some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or using another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If you're using unsorted partitioning your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition($"partnerID", $"year", $"month", $"day") // optionally pass a partition count as the first argument
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition
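The effect of repartitioning by the partition columns can be sketched with a small simulation in plain Python. This is not Spark code: the "tasks" are just lists of partition keys, and the names count_output_files and repartition_by_key are invented for this illustration. It assumes the worst case for unsorted data, where every task holds rows for every key.

```python
def count_output_files(task_rows):
    """One file is written per (task, partition key) pair that has rows."""
    return sum(len(set(keys)) for keys in task_rows)


def repartition_by_key(task_rows, num_tasks):
    """Send every row to the task selected by hashing its partition key."""
    shuffled = [[] for _ in range(num_tasks)]
    for keys in task_rows:
        for key in keys:
            shuffled[hash(key) % num_tasks].append(key)
    return shuffled


# 5 partners x 10 days = 50 output folders (hypothetical values).
keys = [("partner%d" % p, 2016, 1, day) for p in range(5) for day in range(1, 11)]
# Unsorted input: each of the 8 tasks holds rows for every key.
tasks = [list(keys) for _ in range(8)]

before = count_output_files(tasks)
after = count_output_files(repartition_by_key(tasks, 8))
print(before, "files before,", after, "files after")
```

After the shuffle, all rows for a given key sit in one task, so each output folder receives a single file instead of one per task.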

How to make MapReduce work with HDFS

This might sound like some stupid question.
I might write a MR code that can take input and output as HDFS locations and then I really don't need to worry about the parallel computing power of hadoop/MR. (Please correct me if I am wrong here).
However if my input is not an HDFS location say I am taking a MongoDB data as input - mongodb://localhost:27017/mongo_hadoop.messages and running my mappers and reducers and storing the data back to mongodb, how will HDFS come into picture. I mean how can I be sure that the 1 GB or any sized big file is first being distributed on HDFS and then parallel computing is being done on it?
Is it that this direct URI will not distribute the data and I need to take the BSON file instead, load it up on HDFS and then give the HDFS path as Input to MR or the framework is smart enough to do this by itself?
I am sorry if the above question is too stupid or not making any sense at all. I am really new to big data but very much excited to dive into this domain.
Thanks.
You are describing DBInputFormat. This is an input format that reads its splits from an external database. HDFS only gets involved in setting up the job, not in the actual input. There is also a DBOutputFormat. With an input like DBInputFormat the splits are logical, e.g. key ranges.
Read Database Access with Apache Hadoop for a detailed explanation.
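To illustrate what "logical splits" means, here is a sketch in plain Python that divides a numeric key range into contiguous sub-ranges, one per map task. This is not Hadoop's actual implementation; the function name key_range_splits is invented for this example, and it assumes integer keys.

```python
def key_range_splits(min_key, max_key, num_splits):
    """Split the inclusive range [min_key, max_key] into contiguous sub-ranges."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_splits)
    splits = []
    start = min_key
    for i in range(num_splits):
        # The first `extra` splits take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more splits requested than keys available
        splits.append((start, start + size - 1))
        start += size
    return splits


# Each tuple could become the WHERE clause bounds for one mapper.
print(key_range_splits(1, 10, 3))  # prints [(1, 4), (5, 7), (8, 10)]
```

Each mapper then queries only its own range, so the data is read in parallel without ever being copied to HDFS first.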
Sorry, I am not sure about MongoDB.
If you just want to know how splitting happens when the data source is a table, here is how it works when MapReduce reads from HBase.
We use TableInputFormat to read an HBase table in a MapReduce job.
From http://hbase.apache.org/book.html#hbase.mapreduce.classpath
7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
7.7.2. Custom Splitters
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
This is a good question, not stupid.
1.
"mongodb://localhost:27017/mongo_hadoop.messages and running my mappers and reducers and storing the data back to mongodb, how will HDFS come into picture. "
In this situation, you don't need to consider HDFS or do anything related to it. It is just like writing a multi-threaded application in which each thread writes data to MongoDB.
In fact, HDFS is independent of MapReduce, and MapReduce is independent of HDFS. So you can use them separately or together as you wish.
2.
If you want to use a database as MapReduce input/output, you should consider DBInputFormat, but that's another question.
Currently, Hadoop's DBInputFormat only supports JDBC. I'm not sure whether a MongoDB version of DBInputFormat exists. Maybe you can search for one or implement it yourself.

Is Cassandra good for storing files?

I'm developing a PHP platform that will make heavy use of images, documents, and any other file format that comes to mind, so I was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and replicates automatically among nodes.
Thanks for help.
From the cassandra wiki,
Cassandra's public API is based on Thrift, which offers no streaming abilities --
any value written or fetched has to fit in memory. This is inherent to Thrift's
design and is therefore unlikely to change. So adding large object support to
Cassandra would need a special API that manually split the large objects up
into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever
size you are comfortable with -- at least one person is using 64MB -- and making a file correspond
to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
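The chunking workaround described in the wiki quote can be sketched in plain Python, with a dict standing in for a Cassandra row whose column values are the chunks. No Cassandra client is used here; the function names and the zero-padded column naming scheme are assumptions made for illustration.

```python
def split_into_chunks(data, chunk_size):
    """Map zero-padded column names to consecutive chunks of `data`."""
    return {
        "chunk_%06d" % index: data[offset:offset + chunk_size]
        for index, offset in enumerate(range(0, len(data), chunk_size))
    }


def join_chunks(columns):
    """Reassemble the file by concatenating chunks in column-name order."""
    return b"".join(columns[name] for name in sorted(columns))


# A stand-in for file contents: 2560 bytes split into 1000-byte chunks.
payload = bytes(range(256)) * 10
row = split_into_chunks(payload, 1000)
print(sorted(row), join_chunks(row) == payload)
```

Zero-padding the column names keeps them in the right order when sorted as strings, so the chunks come back in sequence on read.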
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way- this isn't an ad)
As an update, Netflix provides utilities in their Cassandra client, Astyanax, for storing files as chunked object stores. Description and examples can be found here. Writing some tests with Astyanax can be a good starting point for evaluating Cassandra as file storage.

File I/O on NoSQL - especially HBase - is it recommended? or not?

I'm new at NoSQL and now I'm trying to use HBase for file storage. I'll store files in HBase as binary.
I don't need any statistics, only file storage.
IS IT RECOMMENDED? I worry about I/O speed.
The reason I use HBase for storage is that I have to use HDFS, but I can't install Hadoop on the client computer. Because of this, I was trying to find a library that lets the client connect to HDFS to get files. I couldn't find one, so I chose HBase instead of a connection library.
In this situation, what should I do?
I don't know about Hadoop, but MongoDB has GridFS which is designed for distributed file storage which enables you to scale horizontally, get replication for "free" and so on.
http://www.mongodb.org/display/DOCS/GridFS
There will be some overhead with storing files in chunks in MongoDB, so if your load is low to medium, and you need low response times, you will probably be better off with using the file system directly. Performance will also vary between different driver implementations.
I think the ability to mount HDFS as a regular file system should help you. http://wiki.apache.org/hadoop/MountableHDFS
You certainly can use HBase to store files. It is perhaps not ideal, and based on your file size distribution you may want to tweak some of the settings. Compared with HDFS, it is probably a much better alternative for large numbers of files.
Settings to look out for:
max region size (hbase.hregion.max.filesize): you will likely want to turn this up to 4GB
max cell size (hbase.client.keyvalue.maxsize): you will want to set this to 0 to disable this limit
You may also want to look at other kinds of alternatives (maybe even MapR).