I have data like the below in a dataframe. Note that Contents is the only column, and this dataframe has only one record, which holds all the data. Within that data, the first row is the header and lines are separated by LF.
How can I generate a new dataframe which has 3 columns and the corresponding data?
display(df)
Contents
============================
"DateNum","MonthNum","DayName"
"19910101","1","Tue"
"19910102","1","Wed"
"19910103","1","Thu"
You can split on the newline character to get an RDD[String], which can then be fed to the CSV reader (note that .toDS needs import spark.implicits._):
import spark.implicits._
val df2 = spark.read.option("header", true).csv(df.rdd.flatMap(_.getString(0).split("\n")).toDS)
df2.show
+--------+--------+-------+
| DateNum|MonthNum|DayName|
+--------+--------+-------+
|19910101| 1| Tue|
|19910102| 1| Wed|
|19910103| 1| Thu|
+--------+--------+-------+
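If you would rather avoid dropping down to the RDD API, here is a minimal sketch of the same idea using the Dataset API (this assumes the dataframe really has the single string column shown above):
import spark.implicits._
// Treat the single-column dataframe as a Dataset[String], split it into lines,
// and let the CSV reader pick up the header from the first line.
val lines = df.as[String].flatMap(_.split("\n"))
val df3 = spark.read.option("header", true).csv(lines)
df3.show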
I'm attempting to do a select on a dataframe but I'm having a little bit of trouble.
I have this initial dataframe
+---+-------+-------+-------+-------+
| id|value_a|value_b|value_c|value_d|
+---+-------+-------+-------+-------+
And what I have to do is sum value_a with value_b and keep the others the same. So I have this list
val select_list = List("id", "value_c", "value_d")
and after this I do the select
df.select(select_list.map(col):_*, (col("value_a") + col("value_b")).as("value_b"))
And I'm expecting to get this:
+---+-------+-------+-------+
| id|value_c|value_d|value_b|   --- this value_b is the sum of value_a and the original value_b
+---+-------+-------+-------+
But I'm getting "no : _* annotation allowed here". Keep in mind that in reality I have a lot of columns, so I need to use a list; I can't simply select each column. I'm running into this trouble because the new column that results from the sum has the same name as an existing column, so I can't just do select(col("*"), sum ...).drop("value_b") or I'd be dropping both the old column and the new one with the sum.
What is the correct syntax to add multiple and single columns in a single select, or how else can I solve this?
For now I decided to do this:
df.select(col("*"), (col("value_a") + col("value_b")).as("value_b_tmp"))
  .drop("value_a", "value_b").withColumnRenamed("value_b_tmp", "value_b")
This works fine, but I understand that withColumn and withColumnRenamed are expensive because I'm creating pretty much a new dataframe with a new or renamed column, and I'm looking for the least expensive operation possible.
Thanks in advance!
Simply use the withColumn function; it will replace the column if it already exists:
df
  .withColumn("value_b", col("value_a") + col("value_b"))
  .select((select_list :+ "value_b").map(col): _*)   // keep the replaced value_b in the projection
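If you specifically want everything in a single select (the situation the "no : _* annotation allowed here" error comes from), one option is to append the sum expression to the column list before expanding it; a minimal sketch:
import org.apache.spark.sql.functions.col
// Build one Seq[Column] and expand it once; mixing ": _*" with extra arguments
// in the same call is what triggers the compiler error above.
val cols = select_list.map(col) :+ (col("value_a") + col("value_b")).as("value_b")
val result = df.select(cols: _*)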
You can create a new sum field that accumulates the sum of the n columns like this:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val df: DataFrame =
  spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(1, 2, 3))),
    StructType(List(
      StructField("field1", IntegerType),
      StructField("field2", IntegerType),
      StructField("field3", IntegerType))))

val columnsToSum = df.schema.fieldNames

// Start from a zero-valued "sum" column and add each remaining column to it.
columnsToSum.filter(name => name != "field1")
  .foldLeft(df.withColumn("sum", lit(0)))((df, column) =>
    df.withColumn("sum", col("sum") + col(column)))
Gives:
+------+------+------+---+
|field1|field2|field3|sum|
+------+------+------+---+
| 1| 2| 3| 5|
| 1| 2| 3| 5|
+------+------+------+---+
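As a side note, the same result can be built as a single column expression instead of folding repeated withColumn calls; a minimal sketch under the same schema:
import org.apache.spark.sql.functions.col
// Sum all fields except field1 in one expression: one projection
// instead of one withColumn per column.
val sumExpr = columnsToSum.filter(_ != "field1").map(col).reduce(_ + _)
df.withColumn("sum", sumExpr).show()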
I have a .log file in ADLS which contains multiple nested JSON objects, as follows:
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read the above nested JSON objects in PySpark and convert them to a dataframe as follows:
EventType   Timestamp   Data.Id   .....  Data.Properties.CorrId  Data.Properties.ActionId
3735091736  2019-03-19  event-c2  .....  d69b7489                d0e2c3fd
3735091737  2019-03-18  event-d2  .....  f69b7489                d0f2c3fd
3735091738  2019-03-17  event-e2  .....  g69b7489                d0d2c3fd
For the above I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with this problem? Thanks!
1. You can read it into an RDD first. It will be read as a list of strings.
2. You need to convert each JSON string into a native Python datatype using json.loads().
3. Then you can convert the RDD into a dataframe; it can infer the schema directly using toDF().
4. Using the answer from Flatten Spark Dataframe column of map/dictionary into multiple columns, you can explode the Data column into multiple columns, given that your Id column is going to be unique. Note that explode returns key and value columns for each entry in the map type.
5. You can repeat the 4th point to explode the Properties column.
Solution:
import json
import pyspark.sql.functions as F

rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', "Timestamp", F.explode('Data'))\
.groupBy('Id', 'EventType', "Timestamp").pivot('key').agg(F.first('value'))
# Note: there is a duplicate Id column, which might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data with the following code.
from pyspark.sql.functions import *
DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'),col('EventType'),col('Timestamp'),col('Data.Id'),col('Data.Level'),col('Data.MessageTemplate'),
col('Data.Properties.CorrId'),col('Data.Properties.ActionId'))\
.show()
Result:
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
I have a fixed-width text file (sample) with data
2107abc2018abn2019gfh
where all the rows' data are combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:
2107abc
2018abn
2019gfh
where 2107 is one column and abc is another column.
Will the logic be applicable for a huge data file, like 1 GB or more?
I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line at length 7 and then again at length 4; you will get your columns separated. Below is the code for the same.
//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))
//Now first split at length 7 then again split at length 4 and create dataframe
val res = rdd.flatMap(_.grouped(7).map(x=>x.grouped(4).toSeq)).map(x=> (x(0),x(1)))
//print the rdd
res.foreach(println)
//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
If you want, you can also convert your RDD to a dataframe for further processing (toDF requires import spark.implicits._, which the spark-shell imports for you).
//convert to DF
val df = res.toDF("col1","col2")
//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
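Regarding the 1 GB question: a minimal DataFrame-based sketch, assuming each 7-character record sits on its own line (for a truly single-line 1 GB file the whole line still has to fit on one executor, which is a problem for any approach; the file path below is hypothetical):
import org.apache.spark.sql.functions.substring
import spark.implicits._
// Parse the fixed-width layout with column expressions instead of an RDD.
val lines = spark.read.text("path/to/fixed_width.txt")
val parsed = lines.select(
  substring($"value", 1, 4).as("col1"),   // positions 1-4 -> 2107
  substring($"value", 5, 3).as("col2"))   // positions 5-7 -> abc
parsed.show()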
I am reading a text (not CSV) file that has a header, content and a footer using
spark.read.format("text").option("delimiter","|")...load(file)
I can access the header with df.first(). Is there something close to df.last() or df.reverse().first()?
Sample data:
col1|col2|col3
100|hello|asdf
300|hi|abc
200|bye|xyz
800|ciao|qwerty
This is the footer line
Processing logic:
#load text file
txt = sc.textFile("path_to_above_sample_data_text_file.txt")
#remove header
header = txt.first()
txt = txt.filter(lambda line: line != header)
#remove footer
txt = txt.map(lambda line: line.split("|"))\
.filter(lambda line: len(line)>1)
#convert to dataframe
df=txt.toDF(header.split("|"))
df.show()
Output is:
+----+-----+------+
|col1| col2| col3|
+----+-----+------+
| 100|hello| asdf|
| 300| hi| abc|
| 200| bye| xyz|
| 800| ciao|qwerty|
+----+-----+------+
Assuming the file is not so large, we can use collect to get the dataframe as a list of rows and then access the last element as follows:
last_row = df.collect()[df.count() - 1]
Avoid using collect on large datasets.
Or we can use take to cut off the last row (note that take returns a list of rows, not a dataframe):
rows_without_footer = df.take(df.count() - 1)
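On Spark 3.0 or later there is also a tail method (available in both the Scala and Python APIs) that brings only the last n rows to the driver instead of collecting everything; a minimal sketch in Scala:
// Spark 3.0+ only: fetch just the footer row without collecting the whole dataframe.
val footer = df.tail(1).headOption   // Option[Row] holding the last line, if any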
Assuming your text file has a JSON header and footer, here is a Spark SQL way. Sample data:
{"":[{<field_name>:<field_value1>},{<field_name>:<field_value2>}]}
Here the header can be avoided with the following 3 lines (assuming there is no tilde in the data):
jsonToCsvDF=spark.read.format("com.databricks.spark.csv").option("delimiter", "~").load(<Blob Path1/ ADLS Path1>)
jsonToCsvDF.createOrReplaceTempView("json_to_csv")
spark.sql("SELECT SUBSTR(`_c0`,5,length(`_c0`)-5) FROM json_to_csv").coalesce(1).write.option("header",false).mode("overwrite").text(<Blob Path2/ ADLS Path2>)
Now the output will look like:
[{<field_name>:<field_value1>},{<field_name>:<field_value2>}]
In addition to the above answer, the solution below works well for files with multiple header and footer lines:
val data_delimiter = "|"
val skipHeaderLines = 5
val skipFooterLines = 3

//-- Read file into Dataframe and convert to RDD
val dataframe = spark.read.option("wholeFile", true).option("delimiter", data_delimiter).csv(s"hdfs://$in_data_file")
val rdd = dataframe.rdd
val cnt = rdd.count()

//-- RDD without header and footer
val dfRdd = rdd.zipWithIndex()
  .filter({ case (line, index) => index < (cnt - skipFooterLines) && index > (skipHeaderLines - 1) })
  .map({ case (line, index) => line })

//-- Dataframe without header and footer
val df = spark.createDataFrame(dfRdd, dataframe.schema)
I have a CSV file with one of the columns containing values enclosed in double quotes. This column also has commas in it. How do I read this type of column from a CSV in Spark using Scala into an RDD? The column values enclosed in double quotes should be read as an integer type, as they are values like total assets and total debts.
Example records from the CSV are:
Jennifer,7/1/2000,0,,0,,151,11,8,"25,950,816","5,527,524",51,45,45,45,48,50,2,,
John,7/1/2003,0,,"200,000",0,151,25,8,"28,255,719","6,289,723",48,46,46,46,48,50,2,"4,766,127,272",169
I would suggest you read it with SQLContext as a CSV file, as it has well-tested mechanisms and flexible APIs to satisfy your needs. You can do
val dataframe = sqlContext.read.csv("path to your csv file")
Output would be
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| _c0| _c1|_c2| _c3| _c4| _c5|_c6|_c7|_c8| _c9| _c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17| _c18|_c19|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| Jennifer|7/1/2000| 0|null| 0|null|151| 11| 8|25,950,816|5,527,524| 51| 45| 45| 45| 48| 50| 2| null|null|
|       John|7/1/2003|  0|null|200,000|   0|151| 25|  8|28,255,719|6,289,723|  48|  46|  46|  46|  48|  50|   2|4,766,127,272| 169|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
Now you can change the header names, cast the required columns to integers, and do a lot of other things. You can even change it to an RDD.
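For instance, here is a minimal sketch of casting the quoted numeric columns by stripping the embedded commas (the _c9/_c10 column names are taken from the output above; treat them and the new column names as assumptions):
import org.apache.spark.sql.functions.{col, regexp_replace}
// Remove the thousands separators and cast to long (the values can exceed the Int range).
val withNumbers = dataframe
  .withColumn("total_assets", regexp_replace(col("_c9"), ",", "").cast("long"))
  .withColumn("total_debts", regexp_replace(col("_c10"), ",", "").cast("long"))
withNumbers.show()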
Edit:
If you prefer to read into an RDD and stay in the RDD, then read the file with sparkContext as a textFile:
val rdd = sparkContext.textFile("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/test.csv")
Then split the lines on commas while ignoring the commas inside double quotes:
rdd.map(line => line.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)", -1))
@ibh, this is not Spark- or Scala-specific stuff. In Spark you will read the file the usual way:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("app_name").setMaster("local")
val ctx = new SparkContext(conf)
val file = ctx.textFile("<your file>.csv")

file.foreach { line =>
  // cleanup code as per the regex below
  val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
  // side effect: MyObject and mylist are placeholders for your own class and collection
  val myObject = new MyObject(tokens)
  mylist.add(myObject)
}
See this regex also.
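If you do stay at the RDD level, here is a small sketch (the helper name is just for illustration) of turning a quoted token like "25,950,816" into a number:
// Strip the surrounding quotes and the thousands separators, then parse.
// Long is used because values such as 4,766,127,272 exceed the Int range.
def parseQuotedNumber(token: String): Long =
  token.replaceAll("[\",]", "").toLong

parseQuotedNumber("\"25,950,816\"")   // 25950816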