I used rdd.collect() to create an Array, and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (fields separated by a pipe |).
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document
The column names are between Start-of-fields and End-of-fields.
I want to store the pipe-separated values in different columns of a DataFrame,
like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold the column names
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
But this is not giving me the desired DataFrame shown above.
Could you try this?
import spark.implicits._ // Add to use `toDS()` and `toDF()`
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,15) // slice(5,15), not slice(5,16): the 10 column names; `.mkString(",")` is not needed
val dataDS = rdd.collect.slice(17,21) // the four data rows, not the column-name lines
.map(_.trim()) // to remove whitespaces
.map(s => s.substring(0, s.length - 1)) // to remove last pipe '|'
.toSeq
.toDS
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.csv(dataDS)
.toDF(columns: _*)
df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time with huge data, because Spark scans the data to infer the schema.
In that case, you can specify the schema like below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
.map(c => s"$c STRING")
.mkString(",\n")
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.schema(schema) // no schema inference occurs
.csv(dataDS)
// .toDF(columns: _*) => unnecessary when schema is specified
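Alternatively, a small sketch (assuming the same columns array from above) that builds the equivalent schema programmatically with StructType instead of a DDL string:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same STRING-typed schema as the DDL string above, built from the column names.
val structSchema = StructType(columns.map(c => StructField(c, StringType)))

val df2 = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(structSchema)
  .csv(dataDS)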
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15) // it will hold the column names
val data = rdd.collect.slice(17,21)   // the four data rows

import spark.implicits._
import org.apache.spark.sql.functions._

val d = data.toSeq.toDF() // one "value" column, one row per line
val dd = d
  .withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column7", $"columnX".getItem(6))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")
display(dd)
You can see the output as below.
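A sketch of the expected output, derived by hand from the four sample rows above (so treat it as approximate):
+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
|column1|column2|column3|column4|column5|column6|column7|column8|column10|column11|
+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
|G0002B |0      |13     |IS     |LS     |Xys    |Xyz    |12     |23      |48      |
|G0002A |0      |13     |IS     |LS     |Xys    |Xyz    |12     |23      |45      |
|G0002x |0      |13     |IS     |LS     |Xys    |Xyz    |12     |23      |48      |
|G0002C |0      |13     |IS     |LS     |Xys    |Xyz    |12     |23      |48      |
+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+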
Assuming the following Dataframe df1 :
df1 :
+---------+--------+-------+
|A |B |C |
+---------+--------+-------+
|toto |tata |titi |
+---------+--------+-------+
I have an integer N = 3 which I want to use to create 3 duplicates of the row in the df2 DataFrame using df1:
df2 :
+---------+--------+-------+
|A |B |C |
+---------+--------+-------+
|toto |tata |titi |
|toto |tata |titi |
|toto |tata |titi |
+---------+--------+-------+
Any ideas?
From Spark 2.4+, you can use the arrays_zip + array_repeat + explode functions for this case.
val df=Seq(("toto","tata","titi")).toDF("A","B","C")
df.withColumn("arr",explode(array_repeat(arrays_zip(array("A"),array("B"),array("c")),3))).
drop("arr").
show(false)
//or dynamic way
val cols=df.columns.map(x => col(x))
df.withColumn("arr",explode(array_repeat(arrays_zip(array(cols:_*)),3))).
drop("arr").
show(false)
//+----+----+----+
//|A |B |C |
//+----+----+----+
//|toto|tata|titi|
//|toto|tata|titi|
//|toto|tata|titi|
//+----+----+----+
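If you are on a Spark version older than 2.4 (no array_repeat/arrays_zip), a hedged typed alternative is flatMap; Rec is a helper case class introduced here just to mirror df's columns:
import spark.implicits._

// Mirror df's columns so the typed Dataset API can be used.
case class Rec(A: String, B: String, C: String)

val n = 3
// Emit each row n times.
df.as[Rec].flatMap(r => Seq.fill(n)(r)).toDF().show(false)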
Alternatively, you can use foldLeft along with DataFrame's union:
import org.apache.spark.sql.DataFrame
object JoinDataFrames {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess // Constant is a helper object that builds the SparkSession
import spark.implicits._
val df = List(("toto","tata","titi")).toDF("A","B","C")
val N = 3
val resultDf = (1 until N).foldLeft( df)((dfInner : DataFrame, count : Int) => {
df.union(dfInner)
})
resultDf.show()
}
}
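The same idea fits in one line by replacing foldLeft with List.fill plus reduce (a sketch reusing df and N from the snippet above):
// N logical references to df, unioned pairwise.
val resultDf = List.fill(N)(df).reduce(_ union _)
resultDf.show()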
I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
| 123| ITA|1475600500|18.0|
| 123| ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+--------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]
+--------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])

val actualResult = df
  .map(r =>
    Array(
      Record(
        r.getAs[Int]("Value"),
        Map(
          r.getAs[String]("Country") ->
            Map(
              r.getAs[String]("Timestamp") -> BigDecimal(
                r.getAs[Double]("Sum").toString
              )
            )
        )
      )
    )
  )
The Timestamp column in the actualResult dataset doesn't get combined into the same Record row; instead it creates two separate rows.
+----------------------------------------------------+
|value |
+----------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]
[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]
+----------------------------------------------------+
Using groupBy and collect_list, and creating a combined column with struct, I was able to get a single row, as in the output below.
import org.apache.spark.sql.functions._
import spark.implicits._

val mycsv =
"""
|Value|Country|Timestamp|Sum
| 123|ITA|1475600500|18.0
| 123|ITA|1475600516|19.0
""".stripMargin('|').lines.toList.toDS()
val df: DataFrame = spark.read.option("header", true)
.option("sep", "|")
.option("inferSchema", true)
.csv(mycsv)
df.show
val df1 = df
  .groupBy("Value", "Country")
  .agg(collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes"))
  .drop("Country")
val json = df1.toJSON // you can save in to file
json.show(false)
Result: the two input rows are combined into one JSON row.
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+
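If you need the exact nested shape from the question, Map(country -> Map(timestamp -> sum)), rather than an array of structs, here is a minimal sketch with the typed groupByKey/mapGroups API, reusing the asker's Record case class; the numeric columns are read through java.lang.Number because inferSchema may type Value as double (it printed as 123.0 above):
import spark.implicits._

val nested = df
  .groupByKey(r => (r.getAs[Number]("Value").intValue, r.getAs[String]("Country")))
  .mapGroups { (key, rows) =>
    val (value, country) = key
    // Fold every timestamp -> sum pair of this (Value, Country) group into one map.
    val byTimestamp = rows.map { r =>
      r.getAs[Number]("Timestamp").longValue.toString ->
        BigDecimal(r.getAs[Number]("Sum").doubleValue)
    }.toMap
    Record(value, Map(country -> byTimestamp))
  }

nested.toJSON.show(false)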
I want to convert myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
to a Spark Scala DataFrame, with each key as a column name and each value as the column value. I am not
getting the expected result; please check my code and provide a solution.
import scala.collection.mutable.ListBuffer

var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap:Map[String,String] = Map.empty[String,String]
for ((k,v) <- myMap){
println(k+"->"+v)
finalBufferList += v
//finalDfColumnList += "\""+k+"\""
finalDfColumnList += k
}
val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result:
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result:
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
If you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you should create an RDD[Row] using the values as
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then you create a schema using the keys as
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
then finally use createDataFrame function to create the dataframe as
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
finally you should have
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful.
But remember, all of this would be unnecessary if you are working with a small dataset.
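As a side note, a more compact sketch for this one-row case pivots the map entries, so the keys become column names without any manual buffers (this assumes string values and a SparkSession in scope):
import org.apache.spark.sql.functions.first
import spark.implicits._

val pivoted = myMap.toSeq.toDF("key", "value")
  .groupBy()          // one global group => a single output row
  .pivot("key")       // Col_1, Col_2, Col_3 become columns
  .agg(first("value"))

pivoted.show(false)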
I have a dataframe (DF1) with two columns
+-------+------+
|words |value |
+-------+------+
|ABC |1.0 |
|XYZ |2.0 |
|DEF |3.0 |
|GHI |4.0 |
+-------+------+
and another dataframe (DF2) like this
+-----------------------------+
|string |
+-----------------------------+
|ABC DEF GHI |
|XYZ ABC DEF |
+-----------------------------+
I have to replace the individual string values in DF2 with their corresponding values in DF1. For example, after the operation, I should get back this dataframe.
+-----------------------------+
|stringToDouble |
+-----------------------------+
|1.0 3.0 4.0 |
|2.0 1.0 3.0 |
+-----------------------------+
I have tried multiple ways but I cannot seem to figure out the solution.
def createCorpus(conversationCorpus: Dataset[Row], dataDictionary: Dataset[Row]): Unit = {
import spark.implicits._
def getIndex(word: String): Double = {
val idxRow = dataDictionary.selectExpr("index").where('words.like(word))
val idx = idxRow.toString
if (!idx.isEmpty) idx.trim.toDouble else 1.0
}
conversationCorpus.map { // Eclipse doesn't like this map here; it throws an error
r =>
def row = {
val arr = r.getString(0).toLowerCase.split(" ")
val arrList = ArrayBuffer[Double]()
arr.map {
str =>
val index = getIndex(str)
}
Row.fromSeq(arrList.toSeq)
}
row
}
}
Combining multiple dataframes to create new columns requires a join. Looking at your two dataframes, it seems we can join on the words column of df1 and the string column of df2, but the string column needs an explode first and a re-combination later (which can be done by giving each row a unique id before the explode). monotonically_increasing_id gives a unique id to each row in df2, and the split function turns the string column into an array for the explode. After the join, the remaining steps combine the exploded rows back into the original rows using groupBy and aggregation.
Finally, the collected array column can be turned into the desired string column with a udf function.
Long story short, the following solution should work for you:
import org.apache.spark.sql.functions._
def arrayToString = udf((array: Seq[Double]) => array.mkString(" "))
df2.withColumn("rowId", monotonically_increasing_id())
.withColumn("string", explode(split(col("string"), " ")))
.join(df1, col("string") === col("words"))
.groupBy("rowId")
.agg(collect_list("value").as("stringToDouble"))
.select(arrayToString(col("stringToDouble")).as("stringToDouble"))
which should give you
+--------------+
|stringToDouble|
+--------------+
|1.0 3.0 4.0 |
|2.0 1.0 3.0 |
+--------------+
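One caveat with this approach: collect_list after a shuffle does not guarantee that the values keep the original word order. A hedged variant that carries the position from posexplode through the join and sorts inside the UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Sort by each word's original position, then render the values.
val orderedToString = udf((pairs: Seq[Row]) =>
  pairs.sortBy(_.getInt(0)).map(_.getDouble(1)).mkString(" "))

df2.withColumn("rowId", monotonically_increasing_id())
  .select(col("rowId"), posexplode(split(col("string"), " ")).as(Seq("pos", "word")))
  .join(df1, col("word") === col("words"))
  .groupBy("rowId")
  .agg(collect_list(struct(col("pos"), col("value"))).as("pairs"))
  .select(orderedToString(col("pairs")).as("stringToDouble"))
  .show(false)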
I have got a requirement, but I am confused about how to do it.
I have two dataframes. The first time, I got the data file below (file1):
prodid, lastupdatedate, indicator
00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A
The output should be:
0001,1900-01-01, 2400-01-01, A
0002,1981-01-25, 2400-01-01, A
0003,1982-01-26, 2400-01-01, A
0004,1985-12-20, 2400-01-01, A
The second time, I got another file (file2):
prodid, lastupdatedate, indicator
00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A
I want the end result to be like:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A
So whatever updates are in the second file, that date should go into the second column, the default date (2400-01-01) into the third column, along with the relevant indicator. The default indicator is A.
I have started like this:
val spark=SparkSession.builder()
.master("local")
.appName("creating data frame for csv")
.getOrCreate()
import spark.implicits._
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("d:/prod.txt")
val df1 = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("d:/prod1.txt")
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
if((df1("indicator")=='U') && (df1("prodid")== newdf("prodid"))){
val df3 = df1.except(newdf)
}
You should join them on prodid and use some when functions to shape the dataframes into the expected output. Then filter the updated rows into the second (current) rows and merge them back (I have included comments explaining each part of the code):
import org.apache.spark.sql.functions._
//filling empty lastupdatedate and changing the date to the expected format
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//changing the date to the expected format of the second dataframe
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//joining both dataframes and updating columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"),Seq("prodid"), "outer")
.select(col("prodid"),
when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate")).otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
when(col("table1.indicator").isNotNull, when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01"))).otherwise(lit("2400-01-01")).as("defaultdate"),
when(col("table2.indicator").isNull, col("table1.indicator")).otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))
//filtering tempdf for the updated products that need a second, current row
val filtereddf = tempdf.filter(col("indicator") === "I")
.withColumn("lastupdatedate", col("defaultdate"))
.withColumn("defaultdate", lit("2400-01-01"))
.withColumn("indicator", lit("A"))
//finally merging both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
which should give you
+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1 |1900-01-01 |2400-01-01 |A |
|2 |1981-01-25 |2018-01-25 |I |
|2 |2018-01-25 |2400-01-01 |A |
|3 |1982-01-26 |2400-01-01 |A |
|4 |1985-12-20 |2018-01-25 |I |
|4 |2018-01-25 |2400-01-01 |A |
|6 |2018-01-25 |2400-01-01 |A |
|8 |2018-01-25 |2400-01-01 |A |
+------+--------------+-----------+---------+
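As a follow-up sketch, if you later need only the currently active records, you can filter the merged result on the open-ended default date:
import org.apache.spark.sql.functions.col

// Active records are the ones whose validity window is still open.
val merged = tempdf.union(filtereddf)
merged.filter(col("defaultdate") === "2400-01-01")
  .sort("prodid")
  .show(false)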