I am using Spark Streaming and am creating this dataframe from the Kafka message:
+--------+-------------------+------------------+--------------------+-----------------+
|customer|initialLoadComplete|initialLoadRunning|      messageContent|        tableName|
+--------+-------------------+------------------+--------------------+-----------------+
|       A|              false|              true| TEFault_IdReason...|Timed_Event_Fault|
|       A|              false|              true| TEFault_IdReason...|Timed_Event_Fault|
+--------+-------------------+------------------+--------------------+-----------------+
Now I want to extract messageContent. messageContent is essentially a CSV: it contains the raw data, and its first line holds the column names.
I can extract the headers from the messageContent field as follows:
val Array1 = ssc.sparkContext.parallelize(rowD.getString(2).split("\u0002")(0))
So Array1 looks like this:
Array1: col1^Acol2^Acol3
Array2 is basically the raw data, with each column value separated by ^A and each record separated by ^B.
^A is the column separator; ^B is the record separator.
So this is what array2 could look like:
Array2 = value1^Avalue2^Avalue3^Bvalue4^Avalue5^Avalue6^Bvalue7^Avalue8^Avalue9
Basically I want to create a dataframe out of this so it looks like this:
col1 | col2 | col3
-------------------------
value1 | value2 | value3
value4 | value5 | value6
value7 | value8 | value9
^B is the record delimiter.
When we were reading from an HDFS file, we created a dataframe via this command:
val df = csc.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\u0001").load(hdfsFile)
But this time I am creating a dataframe from two in-memory arrays. Array1 holds the headers for the values in Array2, and Array2's records are separated by ^B.
What would be the equivalent way to create a dataframe from these two arrays, as I did when creating a dataframe from a file?
I am inferring the following from your question.
Array1 is an RDD with only one entry: col1^Acol2^Acol3
Array2 is an RDD in which each entry looks something like this: value1^Avalue2^Avalue3^Bvalue4^Avalue5^Avalue6^Bvalue7^Avalue8^Avalue9
With these assumptions in place, the following should work.
import spark.implicits._ // needed for rdd.toDF; `spark` is the SparkSession (available in spark-shell)

val array1 = sc.parallelize(Seq("col1\u0001col2\u0001col3"))
val array2 = sc.parallelize(Seq("value1\u0001value2\u0001value3\u0002value4\u0001value5\u0001value6\u0002value7\u0001value8\u0001value9"))

// split into records on ^B (\u0002), then into columns on ^A (\u0001)
val data = array2.flatMap(x => x.split("\u0002")).map(x => x.split('\u0001')).collect() // collected only to inspect the parsed records

val result = array2
  .flatMap(x => x.split("\u0002"))
  .map(x => x.split('\u0001'))
  .map { case Array(x, y, z) => (x, y, z) }
  .toDF(array1.flatMap(x => x.split('\u0001')).collect(): _*)
result.show()
+------+------+------+
| col1| col2| col3|
+------+------+------+
|value1|value2|value3|
|value4|value5|value6|
|value7|value8|value9|
+------+------+------+
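One caveat: the pattern match on Array(x, y, z) assumes exactly three columns. If the header width is not fixed, a minimal sketch (assuming the same array1/array2 RDDs as above, with \u0001 between columns and \u0002 between records, and a SparkSession named spark) can build the schema from the header instead:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// build a StructType from the header entry, one StringType field per column name
val header = array1.first().split('\u0001')
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

// one Row per ^B-separated record, one field per ^A-separated value
val rowRDD = array2
  .flatMap(_.split("\u0002"))
  .map(rec => Row.fromSeq(rec.split('\u0001').toSeq))

val resultDF = spark.createDataFrame(rowRDD, schema)
resultDF.show()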
Related
I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|  123|    ITA|1475600500|18.0|
|  123|    ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+--------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]
+--------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])
val actualResult = df
.map(r =>
Array(
Record(
r.getAs[Int]("Value"),
Map(
r.getAs[String]("Country") ->
Map(
r.getAs[String]("Timestamp") -> new BigDecimal(
r.getAs[Double]("Sum").toString
)
)
)
)
)
)
The Timestamp values in the actualResult dataset don't get combined into the same Record row; each input row produces a separate Record instead:
+----------------------------------------------------+
|value |
+----------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]
[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]
+----------------------------------------------------+
Using groupBy and collect_list, with a combined column built via struct, I was able to get a single row, as in the output below.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, struct}
import spark.implicits._ // for toDS

val mycsv =
  """
  |Value|Country|Timestamp|Sum
  | 123|ITA|1475600500|18.0
  | 123|ITA|1475600516|19.0
  """.stripMargin('|').lines.toList.toDS()

val df: DataFrame = spark.read.option("header", true)
  .option("sep", "|")
  .option("inferSchema", true)
  .csv(mycsv)

df.show

val df1 = df
  .groupBy("Value", "Country")
  .agg(collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes"))
  .drop("Country")

val json = df1.toJSON // you can save it to a file
json.show(false)
Result: the two rows are combined:
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+
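If the exact nested shape from the original question is required (Country -> Timestamp -> Sum as a Map inside a Map), one option is a typed groupByKey/mapGroups that folds each group into the Record case class. This is only a sketch under a few assumptions: the columns are cast to Int/String/Long/Double first (inferSchema read Value as double above), Record here uses Scala's Int and BigDecimal rather than the Java types, and spark.implicits._ is in scope:
import org.apache.spark.sql.functions.col
import spark.implicits._

case class Record(value: Int, attributes: Map[String, Map[String, BigDecimal]])

val nested = df
  .select(col("Value").cast("int"), col("Country"), col("Timestamp").cast("long"), col("Sum").cast("double"))
  .as[(Int, String, Long, Double)]
  .groupByKey { case (value, _, _, _) => value }
  .mapGroups { (value, rows) =>
    // build Country -> (Timestamp -> Sum) from all rows of this Value
    val attributes = rows.toSeq
      .groupBy(_._2)
      .map { case (country, rs) =>
        country -> rs.map { case (_, _, ts, sum) => ts.toString -> BigDecimal(sum) }.toMap
      }
    Record(value, attributes)
  }

nested.toJSON.show(false) // one JSON record per Value, with all timestamps folded into the nested map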
I have records as strings with 1000 comma-delimited fields in a dataframe, like:
"a,b,c,d,e.......up to 1000" - 1st record
"p,q,r,s,t......up to 1000" - 2nd record
I am using the solution suggested in this Stack Overflow question:
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
However, in my case I have 1000 columns, which I have in a JSON schema and can retrieve like this:
val column_seq: Seq[String] = Schema_func.map(_.name)
for (i <- 0 to column_seq.length - 1) { println(i + " " + column_seq(i)) }
which prints:
0 col1
1 col2
2 col3
3 col4
Now I need to pass all these indexes and column names to the DataFrame function below:
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
specifically this part:
$"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),
Since I can't write out the long statement with all 1000 columns, is there an effective way to pass all these arguments from the JSON schema mentioned above to the select function, so that I can split the columns, add the headers, and then convert the DF to Parquet?
You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.split
import spark.implicits._

val columns: Seq[Column] = Schema_func.map(_.name)
  .zipWithIndex // attach index to names
  .map { case (name, index) => $"_tmp".getItem(index) as name }

val result = df
  .withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// | a| b| c| d| e|
// +---+---+---+---+---+
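Since the question's end goal is Parquet, the resulting DataFrame can then be written out directly; the output path below is only a placeholder:
result.write.mode("overwrite").parquet("/path/to/output") // hypothetical path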
I want to create DataFrame df that should look as simple as this:
+----------+----------+
| timestamp| col2|
+----------+----------+
|2018-01-11| 123|
+----------+----------+
This is what I do:
val values = List(List("timestamp", "2018-01-11"),List("col2","123")).map(x =>(x(0), x(1)))
val df = values.toDF
df.show()
And this is what I get:
+---------+----------+
| _1| _2|
+---------+----------+
|timestamp|2018-01-11|
| col2| 123|
+---------+----------+
What's wrong here?
Use
val df = List(("2018-01-11", "123")).toDF("timestamp", "col2")
toDF expects the input list to contain one entry per resulting Row.
Each such entry should be a case class or a tuple.
It does not expect column "headers" in the data itself (to name the columns, pass the names as arguments to toDF).
If you don't know the names of the columns statically, you can use the following syntax sugar:
.toDF(columnNames: _*)
where columnNames is the List with the names.
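For instance (columnNames here is just an illustrative list):
val columnNames = List("timestamp", "col2")
val df = List(("2018-01-11", "123")).toDF(columnNames: _*)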
EDIT (sorry, I missed that you had the headers glued to each column).
Maybe something like this could work:
val values = List(
List("timestamp", "2018-01-11"),
List("col2","123")
)
val heads = values.map(_.head) // extracts headers of columns
val cols = values.map(_.tail) // extracts columns without headers
val rows = cols(0).zip(cols(1)) // zips two columns into list of rows
rows.toDF(heads: _*)
This would work if the "values" contained two longer lists, but it does not generalize to more lists.
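A sketch of a version that does generalize, assuming every inner list has the shape header :: values, all inner lists have the same length, and every column should stay a String:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val values = List(
  List("timestamp", "2018-01-11"),
  List("col2", "123")
)

val heads = values.map(_.head)           // column names
val rows  = values.map(_.tail).transpose // one inner list per resulting row

val schema = StructType(heads.map(StructField(_, StringType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows.map(r => Row.fromSeq(r))),
  schema
)
df.show()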
I have a DataFrame that contains (among other things) the names of CSV files to process.
The DataFrame has the file names in the first column, FileName, with FilterData as the second column and some extra columns (treat every column as a String), as follows:
FileName       FilterData  col3    col4
testFile.txt   XF          value1  value2
testFile1.txt  XM          value3  value4
The CSV files (under the FileName column) are stored on Azure Data Lake, with a total size of 5 TB.
I'd like to read the first column (which is a filename), then open/read that file and get the records that match the pattern in FilterData.
testFile.txt and testFile1.txt are as follows:
testFile.txt
1,XF,data1
2,XM,data2
testFile1.txt
1,XF,data3
2,XM,data4
I want to get the data of each file (reading the first column from the dataframe), then filter the records based on the FilterData column, i.e. if a record contains the FilterData string, select that record (it will always be only 1 record), and then join this data back with col3 and col4 of the DataFrame. Below is my expected output:
1 XF data1 value1 value2
2 XM data4 value3 value4
Are you sure you're using Spark for large-scale dataset processing (and not as a replacement for reading small configuration files)?
What I'd do is yourFileNameDataset.collect (see the Dataset API) and, with the rows available locally, pattern match on them to access the filenames and the values to filter on. That's pretty much plain Scala (not much Spark, really).
I'd then spark.read.csv (see the SparkSession and DataFrameReader APIs) and filter by the fields and values given in the source DataFrame. That's a mixture of Scala with some Spark development.
Sample Code
(not very efficient and certainly not exploiting Spark features like parallel processing, but given the criteria it does work)
scala> val filenames = spark.read.option("header", true).csv("filenames.csv")
filenames: org.apache.spark.sql.DataFrame = [FileName: string, FilterData: string ... 2 more fields]
scala> filenames.show
+-------------+----------+------+------+
| FileName|FilterData| col3| col4|
+-------------+----------+------+------+
| testFile.txt| XF|value1|value2|
|testFile1.txt| XM|value3|value4|
+-------------+----------+------+------+
scala> filenames.
as[(String, String, String, String)].
collect.
foreach { case (file, value, _, _) =>
spark.
read.
option("inferSchema", true).
option("header", true).
csv(file).
as[(Int, String, String)].
filter(_._2 == value).
show
}
+---+---+-----+
|  1| XF|data1|
+---+---+-----+
+---+---+-----+

+---+---+-----+
|  1| XF|data3|
+---+---+-----+
|  2| XM|data4|
+---+---+-----+
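To also join the extra columns back, as in the expected output in the question, a hedged variant of the same collect-and-read loop could carry col3/col4 along as literals and union the per-file results. This is only a sketch: the files are read without a header row, and the column names id/code/data are illustrative, not taken from the question:
import org.apache.spark.sql.functions.lit
import spark.implicits._

val perFile = filenames
  .as[(String, String, String, String)]
  .collect
  .map { case (file, filterValue, c3, c4) =>
    spark.read
      .csv(file)                  // the sample files have no header row
      .toDF("id", "code", "data") // illustrative column names
      .filter($"code" === filterValue)
      .withColumn("col3", lit(c3))
      .withColumn("col4", lit(c4))
  }

perFile.reduce(_ union _).show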
How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, then the following should be your solution:
val df = Seq(
(12345, "fnord"),
(42, "baz"))
.toDF("foo", "bar")
This creates the dataframe which you already have:
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
The next step is to extract the dataType of each field from the schema of the DataFrame into a list:
val fieldTypesList = df.schema.map(struct => struct.dataType)
The next step is to convert the dataframe rows into lists of values via the RDD and pair each value with its dataType from the list created above (zipWithIndex keeps the pairing correct even if the same value appears in more than one column):
val dfList = df.rdd.map(row => row.toString().replace("[", "").replace("]", "").split(",").toList)
val tuples = dfList.map(list => list.zipWithIndex.map { case (value, index) => (value, fieldTypesList(index)) })
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
You can iterate over this and work with it programmatically.
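The Row.toString / split(",") round trip above is fragile if any value itself contains a comma or a bracket. A minimal alternative sketch (same df as above, nulls not handled) zips each Row's values with the schema positionally instead:
val fieldTypes = df.schema.map(_.dataType) // Seq(IntegerType, StringType)

val tuples = df.rdd.map { row =>
  row.toSeq.zip(fieldTypes).map { case (value, dataType) => (value.toString, dataType) }.toList
}

tuples.foreach(println)
// List((12345,IntegerType), (fnord,StringType))
// List((42,IntegerType), (baz,StringType))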