Spark dataframe CSV vs Parquet - pyspark

I am beginner in Spark and trying to understand the mechanics of spark dataframes.
I am comparing performance of sql queries on spark sql dataframe when loading data from csv verses parquet. My understanding is once the data is loaded to a spark dataframe, it shouldn't matter where the data was sourced from (csv or parquet). However I see significant performance difference between the two. I am loading the data using the following commands and there writing queries against it.
dataframe_csv = sqlcontext.read.format("csv").load()
dataframe_parquet = sqlcontext.read.parquet()
Please explain the reason for the difference.

The reason because you see differente performance between csv & parquet is because parquet has a columnar storage and csv has plain text format. Columnar storage is better for achieve lower storage size but plain text is faster at read from a dataframe.

Related

Pyspark & HDFS: Add new dataframe column to existing parquet files in hdfs

let me first start with my scenario:
I have a huge dataframe stored in HDFS. I load the dataframe in a spark session
and create a new column without changing any of the existing content. After this, I want to store the dataframe to the original directory in HDFS.
Now, I know, I can practically do with with spark's write operation in the fashion df.parquet.write("my_df_path", mode="overwrite"). Since the data is immense, I'm investigating whether there is a so to speak column-wise append-mode or method, that does not write the complete dataframe back only the difference to the stored data. The final target is to save both memory and computational effort for the HDFS system.

save PostgreSQL data in Parquet format

I'm working a project which needs to generate parquet files from a huge PostgreSQL database. The data size can be gigantic (ex: 10TB). I'm very new to this topic and has done some research online but did not find a direct way to convert the data to Parquet file. Here are my questions:
The only feasible solution I saw is to load Postgres table to Apache Spark via JDBC and save as a parquet file. But I assume it will be very slow while transferring 10TB data.
Is it possible to generate a huge parquet file size that is 10 TB? Or is it better to create multiple parquet files?
Hope my question is clear and I really appreciate any helpful feedbacks. Thanks in advance!
Use the ORC format instead of the parquet format for this volume.
I assume the data is partitioned, so I think it's a good idea to extract in parallel taking advantage of data partitioning.

how to save output from cassandra table using spark

I want to save the output/rows read from cassandra table to a file in either csv or json format. Using, Spark 1.6.3:
scala>val results.sqlContext.sql("select * from myks.mytable")
scala>val.write.option("header","true").save("/tmp/xx.csv") -- writes to cfs:// filesystem
I am not able to find an option to write to the OS as csv or json format file.
Appreciate any help!
Use the spark Cassandra connector to read data from a Cassandra table into spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md

Is really Hive on Tez with ORC performance better than Spark SQL for ETL?

I have little experience in Hive and currently learning Spark with Scala. I am curious to know whether Hive on Tez really faster than SparkSQL. I searched many forums with test results but they have compared older version of Spark and most of them are written in 2015. Summarized main points below
ORC will do the same as parquet in Spark
Tez engine will give better performance like Spark engine
Joins are better/faster in Hive than Spark
I feel like Hortonworks supports more for Hive than Spark and Cloudera vice versa.
sample links :
link1
link2
link3
Initially I thought Spark would be faster than anything because of their in-memory execution. after reading some articles I got Somehow existing Hive also getting improvised with new concepts like Tez, ORC, LLAP etc.
Currently running with PL/SQL Oracle and migrating to big data since volumes are getting increased. My requirements are kind of ETL batch processing and included data details involved in every weekly batch runs. Data will increase widely soon.
Input/lookup data are csv/text formats and updating into tables
Two input tables which has 5 million rows and 30 columns
30 look up tables used to generate each column of output table which contains around 10 million rows and 220 columns.
Multiple joins involved like inner and left outer since many look up tables used.
Kindly please advise which one of below method I should choose for better performance with readability and easy to include minor updates on columns for future production deployment.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
Hope we can perform multiple inner and left outer join in Spark
The best way to implement the solution to your problem as below.
To load the data into the table the spark looks good option to me. You can read the tables from the hive metastore and perform the incremental updates using some kind of windowing functions and register them in hive. While ingesting as data is populated from various lookup table, you are able to write the code in programatical way in scala.
But at the end of the day, there need to be a query engine that is very easy to use. As your spark program register the table with hive, you can use hive.
Hive support three execution engines
Spark
Tez
Mapreduce
Tez is matured, spark is evolving with various commits from Facebook and community.
Business can understand hive very easily as a query engine as it is much more matured in the industry.
In short use spark to process the data for daily processing and register them with hive.
Create business users in hive.

Data Analysis Scala on Spark

I am new to Scala, and i have to use Scala and Spark's SQL, Mllib and GraphX in order to perform some analysis on huge data set. The analyses i want to do are:
Customer life cycle Value (CLV)
Centrality measures (degree, Eigenvector, edge-betweenness,
closeness) The data is in a CSV file (60GB (3 years transnational data))
located in Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations?
Should i load the data from the CSV file into dataframe and work on
the dataframe? or
Should i load the data from the CSV file and convert it into RDD and
then work on the RDD? or
Are there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help..
Dataframe gives you sql like syntax to work with the data where as RDD gives Scala collection like methods for data manipulation.
One extra benefit with Dataframes is underlying spark system will optimise your queries just like sql query optimisation. This is not available in case of RDD's.
As you are new to Scala its highly recommended to use Dataframes API initially and then Pick up RDD API later based on requirement.
You can use Databricks CSV reader api, which is easy to use and returns DataFrame. It automatically infer data types. If you pass the file with header it can automatically use that as Schema, otherwise you can construct schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0 Version , by default it support CSV datasource, please see the below link.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use.
https://github.com/databricks/spark-csv/issues/367