Using this statement:
scala> val intento2 = sql("SELECT _CreationDate FROM tablaTemporal" )
intento2: org.apache.spark.sql.DataFrame = [_CreationDate: string]
scala> intento2.show(5, false)
I receive this output:
+-----------------------+
|_CreationDate |
+-----------------------+
|2008-07-31T00:00:00.000|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.317|
+-----------------------+
only showing top 5 rows
but the result I need is the same data without the symbols added by Scala/Spark:
2005-07-31T14:20:19.239
2007-07-31T14:20:31.287
2009-07-31T14:21:33.287
2005-07-31T14:23:36.287
2009-07-31T14:20:38.317
How can I print a clean output like the above?
Here, you're printing the dataframe.
What you want to do is print each record of the dataframe:
intento2.collect().map(_.getString(0)).foreach(println)
collect transforms the dataframe into an array of Row objects.
Then we map each Row to its first element with row.getString(0); in fact each Row contains only one element, the date.
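If you prefer a typed approach, here is an equivalent one-liner as a small sketch, assuming spark.implicits._ is in scope so the single string column can be read as a Dataset[String]:
import spark.implicits._
// Read the single-column DataFrame as a Dataset[String] and print each value
intento2.as[String].collect().foreach(println)
Either way, collect brings all rows to the driver, so for a large table you may want to limit the result first.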
Related
I need to add a new column to dataframe DF1, but the new column's value should be calculated using other columns' values present in that DF. Which of the other columns to be used will be given in another dataframe DF2.
e.g. DF1 -
|protocolNo| serialNum| testMethod | testProperty|
+----------+----------+------------+-------------+
|Product1  | AB       | testMethod1| TP1         |
|Product2  | CD       | testMethod2| TP2         |
DF2-
|action| type| value                    | exploded     |
+------+-----+--------------------------+--------------+
|append| hash| [protocolNo]             | protocolNo   |
|append| text| _                        | _            |
|append| hash| [serialNum,testProperty] | serialNum    |
|append| hash| [serialNum,testProperty] | testProperty |
Now, the values of the exploded column in DF2 will be column names of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1. The value should be calculated as below -
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of the column names, their corresponding row values should be used.
e.g. for Row 1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
This will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using the map and withColumn functions with a UDF on DF1. But inside the UDF the outer dataframe's value is not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
The newColumn values are the result after hashing.
Instead of using DF2 directly, you can translate DF2 into a case class of specifications, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
Spec("new_col_name",Seq("serialNum","testProperty"),"hash","append")
)
Then you can process the specifications as shown below:
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))
def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, t: String) => {
    t match {
      // have a case match on the action and do that, then append with df.withColumn
      case "append" => df
    }
  })
}
Syntax may not be correct
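For what it's worth, here is a compilable sketch of the same idea; ColumnSpec, its simplified shape, and the sha2-based hashing are assumptions for illustration, not the poster's exact requirements (DF1 is the question's dataframe):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// One specification per new column: which input columns to use and which action to apply
case class ColumnSpec(columnName: String, inputColumns: Seq[String], action: String)
def applySpec(spec: ColumnSpec)(df: DataFrame): DataFrame = spec.action match {
  case "hash" =>
    // Hash each input column, then join the hashed pieces with "_"
    val hashed = spec.inputColumns.map(c => sha2(col(c).cast("string"), 256))
    df.withColumn(spec.columnName, concat_ws("_", hashed: _*))
  case _ => df
}
val columnSpecs = Seq(ColumnSpec("newColumn", Seq("serialNum", "testProperty"), "hash"))
val transformed = columnSpecs.foldLeft(DF1)((df, spec) => df.transform(applySpec(spec)))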
Since DF2 has the column names that will be used to calculate a new column from DF1, I have made the assumption that DF2 will not be a huge DataFrame.
The first step would be to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter($"type" === "hash").select($"exploded").collect
Now, hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]; we need this to become a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row into a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
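Putting the pieces together, a minimal end-to-end sketch, assuming toy versions of DF1 and DF2 built from the tables in the question (hash here is Spark's built-in hash function from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions._
import spark.implicits._
// Toy versions of the question's DF1 and DF2, for illustration only
val DF1 = Seq(
  ("Product1", "AB", "testMethod1", "TP1"),
  ("Product2", "CD", "testMethod2", "TP2")
).toDF("protocolNo", "serialNum", "testMethod", "testProperty")
val DF2 = Seq(
  ("append", "hash", "[protocolNo]", "protocolNo"),
  ("append", "text", "_", "_"),
  ("append", "hash", "[serialNum,testProperty]", "serialNum"),
  ("append", "hash", "[serialNum,testProperty]", "testProperty")
).toDF("action", "type", "value", "exploded")
// Columns to hash, each hashed individually and concatenated with "_"
val hashColumns = DF2.filter($"type" === "hash").select($"exploded").collect
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
DF1.withColumn("newColumn", newColumnHash).show(false)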
Hope this helps!
I am using Spark and I have a table that has a specific string format in one of the columns called predictions. The format is always of the type - 0=some_probability,1=some_other_probability,2=some_other_probability .
Here are a few sample records from that table -
val table1 = Seq(
("0=0.5,1=0.3,2=0.2"),
("0=0.6,1=0.2,2=0.2"),
("0=0.1,1=0.1,2=0.8")
).toDF("predictions")
table1.show(false)
+-----------------+
|predictions |
+-----------------+
|0=0.5,1=0.3,2=0.2|
|0=0.6,1=0.2,2=0.2|
|0=0.1,1=0.1,2=0.8|
+-----------------+
Now, I also have metadata information about each of these indexes - 0,1,2...n in a separate string. The metadata string looks like -
val metadata = "AA::BB::CC"
I would like to write a UDF in Scala to map these indexes to each element in the string. The output of that UDF should give me a new column which looks like this -
+--------------------+
|labelled_predictions|
+--------------------+
|AA=0.5,BB=0.3,CC=0.2|
|AA=0.6,BB=0.2,CC=0.2|
|AA=0.1,BB=0.1,CC=0.8|
+--------------------+
So, 0 is replaced by AA since AA is the first element in the metadata string that is always split by ::.
How do I write a UDF in Scala/Spark to do this?
val metadata = "AA::BB::CC"
Based on the given data, this should work for you:
def myUDF(metadata: String) = udf((s: String) => {
  val metadataSplit = metadata.split("::")
  val dataSplit = s.split(",")
  val output = new Array[String](dataSplit.size)
  for (i <- 0 until dataSplit.size) {
    output(i) = metadataSplit(i) + "=" + dataSplit(i).split("=")(1)
  }
  output.mkString(",")
})
table1.withColumn("labelled_predictions", myUDF(metadata)(col("predictions"))).select("labelled_predictions").show(false)
output:
+--------------------+
|labelled_predictions|
+--------------------+
|AA=0.5,BB=0.3,CC=0.2|
|AA=0.6,BB=0.2,CC=0.2|
|AA=0.1,BB=0.1,CC=0.8|
+--------------------+
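A slightly more defensive variant of the same UDF, using zip so that any prediction entry without a matching label is simply dropped (myUDF2 is just an assumed name for this alternative):
import org.apache.spark.sql.functions._
// Same idea as above, but zip stops at the shorter of the two arrays
def myUDF2(metadata: String) = udf((s: String) => {
  val labels = metadata.split("::")
  s.split(",")
    .map(_.split("=")(1))
    .zip(labels)
    .map { case (prob, label) => s"$label=$prob" }
    .mkString(",")
})
table1.withColumn("labelled_predictions", myUDF2(metadata)(col("predictions"))).select("labelled_predictions").show(false)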
The line below is a simple syntax to search for a string in a particular column using the SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME that contains the particular string in its VALUES and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I am stuck as I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns, but in the real job I'll be reading multiple tables which can contain a dynamic number of columns.
+---+------+------+------+------------------+
|id | col1 | col2 | col3 | merged_cols      |
+---+------+------+------+------------------+
|0  | mango| man  | dit  | col1, col2       |
|1  | i-man| man2 | mane | col1, col2, col3 |
|2  | iman | mango| ho   | col1, col2       |
|3  | dim  | kim  | sim  |                  |
+---+------+------+------+------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df1.withColumn("merged_cols", lit(""))){(df, c) =>
    df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
  .withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is sent into the foldLeft.
The last line in the code simply removes the extra , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to instead use a UDF. This could be preferred when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not the SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df1.withColumn("merged_cols", concat_cols(array(df1.columns.map(col(_)): _*), typedLit(df1.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+; when using older versions, use array(df1.columns.map(lit(_)): _*) instead.
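Spelled out, that pre-2.2 variant would look like this (a minimal sketch of the note above):
// For Spark < 2.2, pass the column names as an array of literal columns instead of typedLit
val df2 = df1.withColumn(
  "merged_cols",
  concat_cols(array(df1.columns.map(col(_)): _*), array(df1.columns.map(lit(_)): _*))
)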
I have a dataframe that looks like this
+--------------------
| unparsed_data|
+--------------------
|02020sometext5002...
|02020sometext6682...
I need to split it up into something like this
+--------------------
|fips | Name | Id ...
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...
I have a list like this
val fields = List(
("fips", 5),
("Name", 8),
("Id", 27)
....more fields
)
I need the split to take the first 5 characters in unparsed_data and map them to fips, take the next 8 characters in unparsed_data and map them to Name, then the next 27 characters and map them to Id, and so on. I need the split to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
My Scala is still pretty weak and I assume the answer would look something like this
df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)
any suggestions/ideas much appreciated
You can use foldLeft to traverse your fields list to iteratively create columns from the original DataFrame using
substring. It applies regardless of the size of the fields list:
import org.apache.spark.sql.functions._
val df = Seq(
("02020sometext5002"),
("03030othrtext6003"),
("04040moretext7004")
).toDF("unparsed_data")
val fields = List(
("fips", 5),
("name", 8),
("id", 4)
)
val resultDF = fields.foldLeft( (df, 1) ){ (acc, field) =>
val newDF = acc._1.withColumn(
field._1, substring($"unparsed_data", acc._2, field._2)
)
(newDF, acc._2 + field._2)
}._1.
drop("unparsed_data")
resultDF.show
// +-----+--------+----+
// | fips| name| id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+
Note that a Tuple2[DataFrame, Int] is used as the accumulator for foldLeft to carry both the iteratively transformed DataFrame and next offset position for substring.
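If you prefer to avoid the tuple accumulator, here is a small alternative sketch over the same fields list: precompute the 1-based offsets with scanLeft and build all columns in a single select.
// Precompute each field's 1-based start position, then extract every field in one pass
val offsets = fields.scanLeft(1) { case (pos, (_, len)) => pos + len }
val cols = fields.zip(offsets).map { case ((name, len), pos) =>
  substring($"unparsed_data", pos, len).as(name)
}
df.select(cols: _*).show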
This can get you going. Depending on your needs it can get more and more complicated with variable lengths etc., which you do not state. But I think you can use a column list.
import org.apache.spark.sql.functions._
val df = Seq(
("12334sometext999")
).toDF("X")
val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6,8)", "substring(X, 14,3)")
df2.show
In this case it gives the following (you can rename the columns afterwards):
+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
| 12334| sometext| 999|
+------------------+------------------+-------------------+
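To build those expressions from a list and rename the columns in the same step, something like this should work (the positions and lengths below are assumptions matching the sample string):
// Build the substring expressions programmatically and alias each one
val fieldSpecs = Seq(("fips", 1, 5), ("name", 6, 8), ("id", 14, 3))
val exprs = fieldSpecs.map { case (name, pos, len) => s"substring(X, $pos, $len) as $name" }
df.selectExpr(exprs: _*).show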
I have a Spark DataFrame with a column that contains Vector values. The vector values are all n-dimensional, i.e. of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires the Spark context for createDataFrame. I only want to transform the existing DataFrame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
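If you also need to keep the original columns (just id in this toy example), prepend them to the generated expressions:
// Keep the id column alongside the extracted feature columns
dfArr.select((col("id") +: sqlExpr): _*).show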