Escape Comma inside a csv file using spark-shell - scala

I have a dataset containing the header and two rows below:
s.no,name,Country
101,xyz,India,IN
102,abc,UnitedStates,US
I am trying to treat the commas as delimiters for the first columns, but for the last column I want to keep the comma and both values together, and to get that output using spark-shell. I tried the code below, but it gives me a different output.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("delimiter", ",").option("escape", "\"")
  .load("/user/username/data.csv")
df.show()
The output it gives me is:
+----+-----+------------+
|s.no| name|     Country|
+----+-----+------------+
| 101|  xyz|       India|
| 102|  abc|UnitedStates|
+----+-----+------------+
But I am expecting the output to be like below. What am I missing here? Can anyone help me?
s.no name Country
101 xyz India,IN
102 abc UnitedStates,US

I suggest reading all the fields by providing a schema and ignoring the header present in the data, as below:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

case class Data(sno: String, name: String, country: String, country1: String)

val schema = Encoders.product[Data].schema

val df = spark.read
  .option("header", true)
  .schema(schema)
  .csv("data.csv")
  .withColumn("Country", concat($"country", lit(", "), $"country1"))
  .drop("country1")

df.show(false)
Output:
+---+----+----------------+
|sno|name|Country         |
+---+----+----------------+
|101|xyz |India, IN       |
|102|abc |UnitedStates, US|
+---+----+----------------+
Hope this helps!
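As a variation, here is a minimal sketch of the same idea without a case class, assuming Spark 2.3+ (where DataFrameReader.schema accepts a DDL string). Using lit(",") in the concat gives exactly the India,IN form the question asks for.
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

val df2 = spark.read
  .option("header", true)
  .schema("sno STRING, name STRING, country STRING, country1 STRING")  // schema given as a DDL string
  .csv("data.csv")
  .withColumn("Country", concat($"country", lit(","), $"country1"))    // keeps "India,IN" together
  .drop("country1")

df2.show(false)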

Related

Column Renaming in pyspark dataframe

I have column names with special characters. I renamed the columns and tried to save, but the save failed saying the columns have special characters. I ran printSchema on the dataframe and I see the column names without any special characters. Here is the code I tried:
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace("(", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(")", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(".", ""))
df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(stg_location)
and I get the following error:
Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
One more thing I noticed: when I do df_source.show() or display(df_source), both show the same error, while printSchema shows that there are no special characters.
Can someone help me find a solution for this?
Try it as below.
Input df:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
|                           xyz|  1|
+------------------------------+---+
Method 1
Use regular expressions to replace the special characters and then use toDF():
import re
cols = [re.sub(r"\.|\)|\(", "", i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
Method 2
Using .withColumnRenamed()
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
Method 3
Using .withColumn to create a new column and drop the existing column
df = (df.withColumn("Number_of_data_samples_input", col("Number_of_data_samples_(input)"))
        .drop("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
|  1|                         xyz|
+---+----------------------------+

Combine all dataframe columns into one column as a JSON with JSON types preserved

val someDF = Seq(
(8, """{"details":{"decision":"ACCEPT","source":"Rules"}"""),
(64, """{"details":{"decision":"ACCEPT","source":"Rules"}""")
).toDF("number", "word")
someDF.show(false):
+------+---------------------------------------------------------------+
|number|word |
+------+---------------------------------------------------------------+
|8 |{"details":{"decision":"ACCEPT","source":"Rules"} |
|64 |{"details":{"decision":"ACCEPT","source":"Rules"} |
+------+---------------------------------------------------------------+
Problem statement:
I want to combine all columns into one column, with the JSON types preserved inside the single output column; that is, no escaping of quotes etc. like I got below.
What I tried:
someDF.toJSON.toDF.show(false)
// this escaped the quotes, which I don't want
+------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}"} |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}"} |
+------------------------------------------------------------------------------------------------+
Same issue with someDF.select( to_json(struct(col("*"))).alias("value"))
What I want:
+------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
+------------------------------------------------------------------------------------------------+
Is there a way to do this?
Update:
Though I used a simple dataframe here, in reality I have hundreds of columns, so a manually defined schema doesn't work for me.
The "word" column in "someDF" is string type, so to_json treats it as a regular string. The key here is to convert the "word" column to a struct type before using to_json.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

val schema = StructType(Seq(StructField("details", StructType(Seq(StructField("decision", StringType), StructField("source", StringType))))))

someDF.select(to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")).show(false)
Result:
+-----------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
+-----------------------------------------------------------------------+
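Regarding the update about having hundreds of columns: if writing the struct schema by hand is not practical, one option is to let Spark infer it from a sample value and feed that to from_json. This is a sketch only, assuming Spark 2.4+ (where schema_of_json is available) and that every row of "word" shares the same layout; it would have to be repeated for each JSON-typed column.
import org.apache.spark.sql.functions._
import spark.implicits._

// take one sample JSON string from the "word" column and let Spark infer its schema
val sampleJson = someDF.select($"word").as[String].head

someDF.select(
  to_json(struct($"number", from_json($"word", schema_of_json(lit(sampleJson))).alias("word"))).alias("value")
).show(false)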
You can retrieve the list of columns using the columns method on your dataframe and then build your JSON string manually using a combination of the concat and concat_ws built-in functions:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}
val result = someDF.select(
  concat(
    lit("{"),
    concat_ws(
      ",",
      someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
    ),
    lit("}")
  ).as("value")
)

Split file with Space where column data is also having space

Hi, I have a data file which uses a space as the delimiter, and the data in each column also contains spaces. How can I split it in a Spark program using Scala?
Sample data file:
student.txt
3 columns:
Name
Address
Id
Name Address Id
Abhi Rishii Bangalore,Karnataka 1234
Rinki siyty Hydrabad,Andra 2345
The output dataframe should be:
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
Your file is a tab delimited file.
You can use Spark's csv reader to read this file directly into a dataframe.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val studentDf = spark.read.format("csv")          // use "csv" for both TSV and CSV
  .option("header", "true")
  .option("delimiter", "\t")                      // set the delimiter to tab
  .load("student.txt")
  .withColumn("_tmp", split($"Address", "\\,"))   // split Address into City and State
  .withColumn("City", $"_tmp".getItem(0))
  .withColumn("State", $"_tmp".getItem(1))
  .drop("_tmp")
  .drop("Address")

studentDf.show()
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
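If the columns are in fact separated by runs of spaces rather than tabs (which is what the question title suggests), a possible fallback is to read the file as plain text and split on two-or-more whitespace characters. This is only a sketch: it assumes at least two spaces between columns and a single header line starting with "Name".
import org.apache.spark.sql.functions._
import spark.implicits._

val raw = spark.read.text("student.txt")
val parts = split(trim($"value"), "\\s{2,}")   // split on runs of 2+ whitespace characters

val studentDf2 = raw
  .filter(!$"value".startsWith("Name"))        // drop the header line
  .select(
    parts.getItem(0).as("Name"),
    split(parts.getItem(1), ",").getItem(0).as("City"),
    split(parts.getItem(1), ",").getItem(1).as("State"),
    parts.getItem(2).as("Id")
  )

studentDf2.show()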

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work, and I'm not surprised; still, I think it helps show what I am trying to do in case my previous explanation was not sufficient. Thank you for your help ahead of time.
Convert the first dataframe df1 to a List[String], then create a UDF and apply the filter condition.
Spark-shell:
import org.apache.spark.sql.functions._
import spark.implicits._

// Convert df1 to a list of lowercase keys
val df1List = df1.select("key").map(row => row.getString(0).toLowerCase).collect.toList

// Create the UDF ("spark" stands for the SparkSession)
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)

// Apply the filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
|   key|   Value|
+------+--------+
|  Date|1/1/2018|
|Number|    17.5|
+------+--------+
Scala IDE:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1 = sparkSession.read.format("csv").option("header", "true").load("C:\\spark\\programs\\df1.csv")
val df2 = sparkSession.read.format("csv").option("header", "true").load("C:\\spark\\programs\\df2.csv")

import sparkSession.implicits._

val df1List = df1.select("key").map(row => row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to a List and df2 to a Dataset:
case class s(key: String, Value: String)
val df2Ds = df2.as[s]
Then we can use the filter method to filter out the records, somewhat like this:
def check(str: String): Boolean = {
  for (i <- df1List) {
    if (str.contains(i))
      return false
  }
  true
}

df2Ds.filter(s => check(s.Value)).collect
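For reference, the same filtering can also be written without a UDF, as a left anti join on a contains condition (a sketch, mirroring the case-insensitive match the UDF above uses):
import org.apache.spark.sql.functions.lower

// keep only the df2 rows whose Value does not contain any df1 key
val filtered = df2.join(df1, lower(df2("Value")).contains(lower(df1("key"))), "left_anti")
filtered.show(false)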

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+----+-----+
|id  |value|
+----+-----+
|4056|1    |
|56  |56   |
|2056|20   |
+----+-----+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result.
import org.apache.spark.sql.functions._
import spark.implicits._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful