Reading CSV in pyspark if there is no header on csv - pyspark

How do I create an RDD from a CSV file that does not have a header, and how do I combine two RDDs on a column, without using Spark SQL?
rdd1 = sc.textFile('transactions.csv')

It depends on whether you want a DataFrame or an RDD. If it is the former, try:
spark.read.format("csv").option("header", "false").load("transactions.csv")
The column names will be auto-generated (_c0, _c1, and so on). If it is the latter, do something like:
rdd1 = sc.textFile('transactions.csv').map(lambda row: row.split(","))

Related

Create an empty DF using schema from another DF (Scala Spark)

I have to compare a DF with another one of the same schema read from a specific path, but there may be no files in that path, so I thought I should compare it against an empty DF with the same columns as the original.
So I am trying to create a DF with the schema of another DF that contains a lot of columns, but I can't find a solution for this. I have been reading the following posts but none of them helps me:
How to create an empty DataFrame with a specified schema?
How to create an empty DataFrame? Why "ValueError: RDD is empty"?
How to create an empty dataFrame in Spark
How can I do it in Scala? Or is it better to take another approach?
originalDF.limit(0) will return an empty dataframe with the same schema.
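For example, a minimal sketch, assuming originalDF is any DataFrame already in scope:
val emptyDF = originalDF.limit(0)
// Same schema, zero rows: safe to use as the fallback side of the comparison.
assert(emptyDF.schema == originalDF.schema)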

Creating Separate Spark dataframe from existing arraytype column

I have a Spark dataframe with the following schema:
StructType(StructField("a", IntegerType, false), StructField("b", IntegerType, false), StructField("c", ArrayType(StructType(StructField("d", IntegerType, false), StructField("e", IntegerType, false)))))
I want to create a separate dataframe from column "c", which is of array type. The desired output is one row per array element, with the struct fields d and e as top-level columns.
Try this:
df.selectExpr("a", "b", "inline_outer(c)").show()
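For context, inline_outer explodes an array of structs into one row per element, promoting the struct fields to top-level columns; unlike inline, it also keeps rows whose array is null or empty. A minimal sketch with made-up values matching the schema above, assuming a SparkSession named spark is already in scope:
case class Inner(d: Int, e: Int)
import spark.implicits._
// Made-up row matching the schema above: two struct elements inside the array.
val df = Seq((1, 2, Seq(Inner(10, 20), Inner(30, 40)))).toDF("a", "b", "c")
df.selectExpr("a", "b", "inline_outer(c)").show()
// +---+---+---+---+
// |  a|  b|  d|  e|
// +---+---+---+---+
// |  1|  2| 10| 20|
// |  1|  2| 30| 40|
// +---+---+---+---+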

How to Create data frame schema from a header file

I have 2 files: one is a header file and the other is a data file.
The header file has 2 columns (Id, Tags). header.txt:
Id,Tags
Now I am trying to create a DataFrame schema out of the header file. (I have to use this approach because in the real data there are 1000s of columns in header.txt and data.txt, so manually creating a case class with 1000 columns is not possible.)
val dataFile=sparkSession.read.format("text").load("data.txt")
val headerFile=sparkSession.sparkContext.textFile("header.txt")
val fields = headerFile.flatMap(x => x.split(",")).map(fieldName => StructField(fieldName, StringType, true))
val schema=StructType(fields)
But the above line fails with "Cannot resolve overloaded method StructType".
Can someone please help?
StructType needs a collection of StructField, but fields is an RDD[StructField], so collect the RDD to create the StructType:
val fields = headerFile.flatMap(x => x.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields.collect)
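To then apply that schema while reading the data file, a sketch (assuming data.txt is comma-separated, which the question does not show):
val dataDF = sparkSession.read
  .schema(schema)   // the schema built from header.txt
  .csv("data.txt")  // assumes comma-separated values
dataDF.printSchema()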

How to read a CSV file and then save it as JSON in Spark Scala?

I am trying to read a CSV file that has around 7 million rows and 22 columns.
How do I save it as a JSON file after reading the CSV into a Spark DataFrame?
Read a CSV file as a dataframe
val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
val df = spark.read.csv("path to csv")
Now you can perform some operations on df and save it as JSON:
df.write.json("output path")
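If the CSV has a header row and you want Spark to infer column types, a common variant (header and inferSchema are standard Spark CSV reader options):
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // infer column types from the data
  .csv("path to csv")
df.write.mode("overwrite").json("output path")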
Hope this helps!

How to parse a csv string into a Spark dataframe using scala?

I would like to convert an RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is: how do I use the above two to create a dataframe in Spark? I am using Spark version 2.2.
I searched and found a post:
Can I read a CSV represented as a string into Apache Spark using spark-csv
However, it's not exactly what I need and I can't figure out a way to modify this piece of code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
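A sketch of that direct-read approach (people.csv is a hypothetical file name; the schema mirrors the four columns in the question, and a SparkSession named spark is assumed, as created below):
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("account", StringType, true),
  StructField("state", StringType, true),
  StructField("age", IntegerType, true)))
// Apply the schema explicitly since the file has no header line.
val df = spark.read.schema(schema).csv("people.csv")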
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns of the dataframe. In this case, this will be an RDD[(String, String, String, Int)], since there are four columns (the last column, age, is converted to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
.map{ case Array(name, account, state, age) => (name, account, state, age.toInt)}
.toDF(header.split(","):_*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+