I am creating some Scala UDFs to process data in a DataFrame, and I am wondering whether it is possible to log information from within a UDF. Are there any examples of how to do this?
I'm quite new to PySpark, and I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are those the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering whether that is the only difference.)
Thanks
Answering directly:
Are those the same object? Looking at the type, it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library with its own API. It can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed across the cluster, while the latter lives on a single machine. When converting from one to the other, the data is transferred between the multiple machines of the cluster and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it can also be constructed in a wide range of ways.
Also, I recommend checking this documentation about Pandas on Spark
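To make the distinction concrete, here is a minimal sketch of moving between the three kinds of DataFrame. It assumes Spark 3.3 or later (where `DataFrame.pandas_api()` is available; Spark 3.2 used `to_pandas_on_spark()` instead) with pandas and PyArrow installed. The function name, column names, and local session are made up for illustration.

```python
def frame_kinds():
    """Build one small dataset as each kind of DataFrame and report its class module."""
    # Imported inside the function so this sketch can sit in a module that is
    # also loaded on machines without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("frame-kinds").getOrCreate()
    )

    # 1. PySpark DataFrame: distributed, Spark SQL style API.
    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # 2. pandas-on-Spark DataFrame: still distributed, pandas-like API.
    psdf = sdf.pandas_api()

    # 3. pandas DataFrame: collects every row onto the single driver machine.
    pdf = psdf.to_pandas()

    # Converting back to a PySpark DataFrame keeps the data distributed.
    sdf_again = psdf.to_spark()

    kinds = {
        "spark": type(sdf).__module__,
        "pandas_on_spark": type(psdf).__module__,
        "pandas": type(pdf).__module__,
        "round_trip_rows": sdf_again.count(),
    }
    spark.stop()
    return kinds
```

Calling `frame_kinds()` and printing the result shows three different class modules. Only the `to_pandas()` step moves all the data to one machine, so that is the call to be careful with on large tables.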
I'm very new to PySpark and I'm somewhat confused about data manipulation. What I learned lately is that we can manipulate tabular data either with SQL queries or with the built-in methods of PySpark DataFrames. My questions are:
Is there another way to manipulate tabular data in PySpark, other than SQL queries or the built-in DataFrame methods?
Why do some people manipulate the data with SQL and others with the built-in methods? It's often mentioned that Spark DataFrames can act like SQL tables, so why use the built-in functions?
As a best practice, when should the data be manipulated with SQL queries, and when with the built-in DataFrame methods?
I'm sorry if this is a basic question, but I'm very new to this and have been looking for articles that answer these questions, to no avail.
I am working with PST files. I have written custom record readers for MapReduce programs for different input formats, but this time the job is going to run on Spark.
I cannot find any hints or documentation on implementing record readers in Spark.
Can somebody help with this? Is it possible to implement this functionality in Spark?
I want to avoid writing the entire stream to a file and then loading it into a DataFrame. What's the right way to do this?
You can check the Spark Streaming documentation and the SqlNetworkWordCount example, which show that your problem can be solved by creating a singleton instance of SparkSession from the SparkContext used by Spark Streaming.
Going through the links above, where DataFrames are created from a streaming RDD, should give you a better idea.
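As a rough sketch of that pattern in PySpark, modelled on the `sql_network_wordcount.py` example that ships with Spark: each micro-batch RDD is turned into a DataFrame in memory, with no intermediate file. The function names, the `word` column, and the host and port passed to `run` are placeholders, not taken from the question.

```python
def get_spark_session_instance(spark_conf):
    """Create the SparkSession once per Python process and reuse it afterwards."""
    if "spark_session_singleton" not in globals():
        from pyspark.sql import SparkSession

        globals()["spark_session_singleton"] = (
            SparkSession.builder.config(conf=spark_conf).getOrCreate()
        )
    return globals()["spark_session_singleton"]


def process_batch(time, rdd):
    """Convert one micro-batch RDD of words into a DataFrame and query it."""
    if rdd.isEmpty():
        return  # nothing arrived in this batch interval
    from pyspark.sql import Row

    # Reuse the session built from the streaming job's own SparkContext config.
    spark = get_spark_session_instance(rdd.context.getConf())
    words_df = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    words_df.groupBy("word").count().show()


def run(host, port):
    """Wire a socket text stream into process_batch (host and port are placeholders)."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamToDataFrame")
    ssc = StreamingContext(sc, 1)  # 1-second batches
    words = ssc.socketTextStream(host, port).flatMap(lambda line: line.split(" "))
    words.foreachRDD(process_batch)
    ssc.start()
    ssc.awaitTermination()
```

The singleton matters because `foreachRDD` invokes the callback for every batch; without the cache, each batch would go through the session builder again.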
I am new to Scala, and I have to use Scala with Spark SQL, MLlib and GraphX to perform some analyses on a huge data set. The analyses I want to do are:
Customer life cycle value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transnational data) located in a Hadoop cluster.
My question is about the optimal approach to accessing the data and performing the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? Or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? Or
Is there any other approach to accessing the data and performing the analyses?
Thank you so much in advance for your help.
A DataFrame gives you SQL-like syntax for working with the data, whereas an RDD gives you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, much like SQL query optimisation in a database. This is not available for RDDs.
As you are new to Scala, it is highly recommended to use the DataFrame API initially and pick up the RDD API later, as requirements dictate.
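To illustrate the difference in style, here is a small sketch that runs the same per-customer sum both ways. It is written in PySpark for brevity (the Scala API has the same shape), and the column names `customer_id` and `amount` and the sample rows are invented for the example.

```python
def totals_with_dataframe(df):
    """Sum `amount` per `customer_id` with the DataFrame API (goes through the optimiser)."""
    from pyspark.sql import functions as F

    return df.groupBy("customer_id").agg(F.sum("amount").alias("total"))


def totals_with_rdd(rdd):
    """Same aggregation on an RDD of (customer_id, amount) pairs, executed as written."""
    return rdd.reduceByKey(lambda a, b: a + b)


def compare(spark):
    """Run both versions on the same tiny data set and return plain dicts."""
    rows = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5)]
    df = spark.createDataFrame(rows, ["customer_id", "amount"])
    from_df = {
        r["customer_id"]: r["total"] for r in totals_with_dataframe(df).collect()
    }
    from_rdd = dict(totals_with_rdd(spark.sparkContext.parallelize(rows)).collect())
    return from_df, from_rdd
```

Both versions give the same totals. The DataFrame version states what is wanted and leaves the execution plan to Spark, while the RDD version spells out each step.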
You can use the Databricks CSV reader API (spark-csv), which is easy to use and returns a DataFrame. It can infer data types automatically when the inferSchema option is enabled. If the file has a header line, the header option lets it take the column names from that line; otherwise you can construct the schema yourself using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0 or later, the CSV data source is supported by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it.
https://github.com/databricks/spark-csv/issues/367
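For reference, a minimal sketch of reading a CSV with an explicit schema using the built-in reader in Spark 2.0+, shown in PySpark (the Scala calls are analogous). The column names and types are assumptions about what a transactions file might contain, not something taken from the question.

```python
def read_transactions(spark, path):
    """Read a headered CSV into a DataFrame using an explicit StructType schema."""
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Hypothetical columns; replace with the real layout of the file.
    schema = StructType(
        [
            StructField("customer_id", StringType(), True),
            StructField("counterparty_id", StringType(), True),
            StructField("amount", DoubleType(), True),
        ]
    )
    # An explicit schema avoids the extra pass over the data that
    # inferSchema needs, which matters for a file of tens of gigabytes.
    return spark.read.csv(path, header=True, schema=schema)
```

With a `SparkSession` named `spark`, a call would look like `read_transactions(spark, "hdfs:///path/to/file.csv")`, where the path is a placeholder.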