How to drop first row from parquet file? - scala

I have parquet file which contain two columns(id,feature).file consists of 14348 row.file
How i drop first row id,feature from file
code
val df = spark.read.format("parquet").load("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
val header = df.first()
val data = df.filter(row => row != header)
data .show()
result seems as output

If you are trying to "ignore" the schema defined in the file, it is implicitly done once you read your file, using spark like:
spark.read.format("parquet").load(your_file)
If you are trying to only skip the first row on your DF and if you already know the id you can do: val filteredDF = originalDF.filter(s"id != '${excludeID}' "). If you don't know the id, you can use monotonically_increasing_id to tag it and then filter, similar like: filter spark dataframe based on maximum value of a column

You need to drop the first row based on id if you know that, else go for indexing approach i.e., assigning the row number and delete the first row.

I'm using Spark 2.4.0, and you could use the header option to the DataFrameReader call like so -
spark.read.format("csv").option("header", true).load(<path_to_file>)
Reference for the other options for DataFrameReader are here

Related

How to read JSON in data frame column

I'm reading a HDFS directory
val schema = spark.read.schema(schema).json("/HDFS path").schema
val df= spark.read.schema(schema).json ("/HDFS path")
Here selecting only PK and timestamp from JSON file
Val df2= df.select($"PK1",$"PK2",$"PK3" ,$"ts")
Then
Using windows function to get updated PK on the base of timestamp
val dfrank = df2.withColumn("rank",row_number().over(
Window.partitionBy($"PK1",$"PK2",$"PK3" ).orderBy($"ts".desc))
)
.filter($"rank"===1)
From this window function getting only updated primary keys & timestamp of updated JSON.
Now I have to add one more column where I want to get only JSON with updated PK and Timestamp
How I can do that
Trying below but getting wrong JSON instead of updated JSON
val df3= dfrank.withColumn("JSON",lit(dfrank.toJSON.first()))
Result shown in image.
Here, you convert the entire dataframe to JSON and collect it to the driver with toJSON (that's going to crash with a large dataframe) and add a column that contains a JSON version of the first row of the dataframe to your dataframe. I don't think this is what you want.
From what I understand, you have a dataframe and for each row, you want to create a JSON column that contains all of its columns. You could create a struct with all your columns and then use to_json like this:
val df3 = dfrank.withColumn("JSON", to_json(struct(df.columns.map(col) : _*)))

Spark-Scala: Get Dataframe Variable by concatenating two String Variables

I have a scenario where I need to form a dataframe name from two string variable. Which is pretty easy and can be done by concatenating.
Example: "df_" + "part1324"
The above code will return a String variable. I want this to be a Dataframe variable through which I can perform further operation on the data frame.
Map can be used for assign names to DataFrames:
val df = List(("df_value")).toDF()
val stringVariable = "part1324"
// assign name to dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// get dataframe by name
namedDataFrames("df_part1324").show(false)
Your question is confusing. What do you mean by dataframe variable? Concatenating two strings will always return String. In order to create a dataframe, you need to apply the different methods available to create a dataframe.
val df:Dataframe cannot be equal to df_part1234 (String)as per your example but to use it as dataframe, you need to do something like below
val df_part1234 = sc.range(1000).toDF("number") where sc is your Sparksession variable.
In case you need to generate this variable dynamically, place it under the logic of variable generation like Loop and add the statement to create the dataframe.
Please rewrite your question if you are trying to achieve something else (along with code snippet to reproduce the issue) or accept the answer if you are clear on the issue

How can create a new DataFrame from a list?

Hello guys i have this function that gets the row Values from a DataFrame, converts them into a list and the makes a Dataframe from it.
//Gets the row content from the "content column"
val dfList = df.select("content").rdd.map(r => r(0).toString).collect.toList
val dataSet = sparkSession.createDataset(dfList)
//Makes a new DataFrame
sparkSession.read.json(dataSet)
What i need to do to make a list with other column values so i can have another DataFrame with the other columns values
val dfList = df.select("content","collection", "h").rdd.map(r => {
println("******ROW********")
println(r(0).toString)
println(r(1).toString)
println(r(2).toString) //These have the row values from the other
//columns in the select
}).collect.toList
thanks
Approach doesn't look right, you don't need to collect dataframe to just add new columns. Try adding columns to directly to dataframe using withColumn() withColumnRenamed() https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html.
If you want to bring columns from another dataframe try joining. In any case it's not good idea to use collect as it will bring all your data to driver.

IndexOutOfBoundsException when writing dataframe into CSV

So, I'm trying to read an existing file, save that into a DataFrame, once that's done I make a "union" between that existing DataFrame and a new one I have already created, both have the same columns and share the same schema.
ALSO I CANNOT GIVE SIGNIFICANT NAME TO VARS NOR GIVE ANYMORE DATA BECAUSE OF RESTRICTIONS
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done I get the "start_ts" (a timestamp on Epoch format) that's duplicate in the union between the above dataframes (df4 and dfExist) and also I get rid of some characters I don't want
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this "start_ts" duplicate to filter the DataFrame and create 2 new DataFrames selecting the items of this duplicate timestamp, and the items that are not like this duplicate timestamp
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save in 2 different lists the avg_value and the Number_of_values
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I make some maths with the values on the lists (THIS IS WHERE I'M GETTING ISSUES)
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then save the duplicate timestamp (df7), the avg and the number of values into a list as a single item, this list transforms into a DataFrame and then I transform I get a new DataFrame with the columns how are supposed to be.
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally I join the DataFrame without duplicates with the new DataFrame which have the duplicate timestamp and the avg of the duplicate avg values and the sum of number of values.
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in SPARK it gives me the error, I supposed it was related to empty lists (since it's giving me the error when making some maths with the values of the lists) but If I delete the line where I write to CSV, the code runs perfectly, also I saved the lists and values of the math calcs into files and they are not empty.
My supposition, is that, is deleting the file before reading it (because of how spark distribute tasks between workers) and that's why the list is empty therefore I'm getting this error when trying to make maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? also I've been only 1 month with scala/spark so any code recommendation will be nice as well.
Thanks beforehand.
This error comes because of the Data. Any of your list does not contains columns as expected. When you refer to that index, the List gives this error to you
It was a problem related to reading files, I made a check (df.rdd.isEmpty) and wether the DF was empty I was getting this error. Made this as an if/else statement to check if the DF is empty, and now it works fine.

Dataframe column substring based on the value during join

I have a dataframe with column having values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx"
I need to compare this column with another column in a different dataframe based on the column value.
If column value have "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")
If the column value have "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how can we achieve this in Spark dataframe?
You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1")).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.