How can create a new DataFrame from a list?

How can create a new DataFrame from a list? - scala

Hello guys i have this function that gets the row Values from a DataFrame, converts them into a list and the makes a Dataframe from it.
//Gets the row content from the "content column"
val dfList = df.select("content").rdd.map(r => r(0).toString).collect.toList
val dataSet = sparkSession.createDataset(dfList)
//Makes a new DataFrame
sparkSession.read.json(dataSet)
What i need to do to make a list with other column values so i can have another DataFrame with the other columns values
val dfList = df.select("content","collection", "h").rdd.map(r => {
println("******ROW********")
println(r(0).toString)
println(r(1).toString)
println(r(2).toString) //These have the row values from the other
//columns in the select
}).collect.toList
thanks

Approach doesn't look right, you don't need to collect dataframe to just add new columns. Try adding columns to directly to dataframe using withColumn() withColumnRenamed() https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html.
If you want to bring columns from another dataframe try joining. In any case it's not good idea to use collect as it will bring all your data to driver.

Related

Spark 2.3: Flatten Array of Array of Structs, and create new columns

I have a dataframe with a column ids that looks like
ids
WrappedArray(WrappedArray([item1,micro], [item3, mini]), WrappedArray([item2,macro]))
WrappedArray(WrappedArray([item1,micro]), WrappedArray([item5,micro], [item6,macro]))
where the exact type of the column is
StructField(ids,ArrayType(ArrayType(StructType(StructField(identifier,StringType,true), StructField(identifierType,StringType,true)),true),true),true)
I want to create two new columns, one holding the value of all of the distinct identifier in the struct, and another column holding the max observed identifierType for that row (if there are ties, then return all the ties).
So in our example, I would like the output to be
list_of_identifiers, most_frequent_type
Array(item1, item2, item3), [micro, mini, macro]
Array(item1, item5, item6), [micro]
To achieve this, the first step I need to do is flatten the ids column to something like
ids
WrappedArray([item1,micro], [item3, mini], [item2,macro])
WrappedArray([item1,micro], [item5,micro], [item6,macro])
but I cannot figure out the way to do this.
Here is sample input table
val arrayStructData = Seq(
Row(List(List(Row("item1", "micro"),Row("item3", "mini")), List(Row("item2", "macro")))),
Row(List(List(Row("item1", "micro")), List(Row("item5", "micro"), Row("item6", "macro"))))
)
val arrayStructSchema = new StructType()
.add("ids", ArrayType(ArrayType(new StructType()
.add("identifier",StringType)
.add("identifierType",StringType))))
val df = spark.createDataFrame(spark.sparkContext
.parallelize(arrayStructData),arrayStructSchema)

How to drop first row from parquet file?

I have parquet file which contain two columns(id,feature).file consists of 14348 row.file
How i drop first row id,feature from file
code
val df = spark.read.format("parquet").load("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
val header = df.first()
val data = df.filter(row => row != header)
data .show()
result seems as output

If you are trying to "ignore" the schema defined in the file, it is implicitly done once you read your file, using spark like:
spark.read.format("parquet").load(your_file)
If you are trying to only skip the first row on your DF and if you already know the id you can do: val filteredDF = originalDF.filter(s"id != '${excludeID}' "). If you don't know the id, you can use monotonically_increasing_id to tag it and then filter, similar like: filter spark dataframe based on maximum value of a column

You need to drop the first row based on id if you know that, else go for indexing approach i.e., assigning the row number and delete the first row.

I'm using Spark 2.4.0, and you could use the header option to the DataFrameReader call like so -
spark.read.format("csv").option("header", true).load(<path_to_file>)
Reference for the other options for DataFrameReader are here

Dataframe column substring based on the value during join

I have a dataframe with column having values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx"
I need to compare this column with another column in a different dataframe based on the column value.
If column value have "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")
If the column value have "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how can we achieve this in Spark dataframe?

You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.

Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1")).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.

Generate distinct values from a column in a spark dataframe

I have a spark dataframe like below
id|name|age|sub
1 |ravi|21 |[M,J,J,K]
I don't want to explode on the column "sub" as it will create another extra set of rows. I want generate unique values from the "sub" column and assign it to new column sub_unique.
My output should be like
id|name|age|sub_unique
1 |ravi|21 |[M,J,K]

You can use udf
val distinct = udf((x: Seq[String]) => if (s != null) x.distinct else Seq[String]())
df.withColumn("subm_unique", distinct($"sub"))

Tagging a HBase Table using Spark RDD in Scala

I am trying add an extra "tag" column to an Hbase table. Tagging is done on the basis of words present in the rows of the table. Say for example, If "Dark" appears in a certain row, then its tag will be added as "Horror". I have read all the rows from the table in a spark RDD and have matched them with words based on which we would tag. A snippet to code looks like this:
var hBaseRDD2=sc.newAPIHadoopRDD(conf,classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
(Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieName"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieSummary"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieActor")))
)
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words which we would are looking for in the HBase Table rows are in a csv file. One of the column, let's say "synonyms" column, of the csv has the words which we would look for. Another column in the csv is a "target_tag" column, which has the words which would be tagged to the row corresponding to which there is match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
reading the synonyms column: (synonyms column is the second column, therefore "p(1)" in the below snippet)
val synonyms = csv.map(_.split(",")).map( p=>p(1))
reading the target_tag column: (target_tag is the 3rd column)
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targetag have more than one strings and are seperated by "###". The snippet to seperate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and further a row might have many strings, hence, to create a set of every string, I did this:(an empty set was created)
splitsyno.map(x=>x.foreach(y=>set += y)
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem which I am facing is that I don't exactly know that strings from what rows in csv are matching to strings from what rows in HBase table. This is needed as I would need to find corresponding target string and which row in HBase table to add to. How should I get it done? Any help would be highly appreciated.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How can create a new DataFrame from a list? - scala

Related

Spark 2.3: Flatten Array of Array of Structs, and create new columns

How to drop first row from parquet file?

Dataframe column substring based on the value during join

Generate distinct values from a column in a spark dataframe

Tagging a HBase Table using Spark RDD in Scala

Categories

Resources