Reading an encoded value in Spark 1.6 throws an error - Scala

I am receiving a file from an API which has encoded (non-ASCII) character values in 3 columns.
I am reading the file into a DataFrame in Spark 1.6:
val CleanData = sqlContext.sql("""SELECT
COL1,
COL2,
COL3
FROM CLEANFRAME
""")
The encoded values appear like this:
53004, �����������������������������
Could someone please help me fix this, if possible with Spark 1.6 and Scala?
Spark 1.6, Scala

This can be achieved by using regexp_replace:
import spark.implicits._
import org.apache.spark.sql.functions.regexp_replace
val df = spark.sparkContext.parallelize(List(("503004","d$üíõ$F|'.h*Ë!øì=(.î; ,.¡|®!®","3-2-704"))).toDF("col1","col2","col3")
df.withColumn("col2_new", regexp_replace($"col2", "[^a-zA-Z]", "")).show()
Output:
+------+--------------------+-------+--------+
| col1| col2| col3|col2_new|
+------+--------------------+-------+--------+
|503004|d$üíõ$F|'.h*Ë!øì=...|3-2-704| dFh|
+------+--------------------+-------+--------+
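Since the question targets Spark 1.6 (where there is no SparkSession named spark), a minimal sketch of the same approach against the 1.6 API could look like this, assuming the CLEANFRAME temp table from the question and that only ASCII letters should be kept:
import org.apache.spark.sql.functions.regexp_replace
// Spark 1.6: work through sqlContext instead of a SparkSession
val cleanData = sqlContext.sql("SELECT COL1, COL2, COL3 FROM CLEANFRAME")
// Strip every character that is not a plain ASCII letter from COL2;
// widen the character class (e.g. add 0-9 or \\s) if more should be kept
val fixed = cleanData.withColumn("COL2_clean", regexp_replace(cleanData("COL2"), "[^a-zA-Z]", ""))
fixed.show()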

Related

How to remove all characters that start with "_" from a spark string column

I'm trying to modify a column of my DataFrame by removing the suffix from all the rows under that column, and I need it in Scala.
The values in the column have different lengths and the suffixes also differ.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap it in a call to selectExpr.
While @AminMal has provided a working solution using a UDF, a native Spark function is preferable for performance when one is available.[1]
import spark.implicits._
import org.apache.spark.sql.functions.{col, substring_index}

val df = List(
  "09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
  "0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
  "0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
  "0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
  "22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")

df
  .withColumn("col0", substring_index(col("col0"), "_", 1))
  .show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
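For contrast, a UDF-based version along the lines of what @AminMal suggested (the exact code is assumed here, not quoted from that answer) might look like this; it produces the same result but pays the UDF serialization overhead:
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical UDF: keep everything before the first "_"
val stripSuffix = udf((s: String) => if (s == null) null else s.split("_", 2)(0))
df.withColumn("col0", stripSuffix(col("col0"))).show(false)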
[1] Is there a performance penalty when composing spark UDFs

Can't query Spark DF from Hive after `saveAsTable` - Spark SQL specific format, which is NOT compatible with Hive

I am trying to save a DataFrame as an external table which will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data with Hive. It works fine in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how I query it in Hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col |
+------------------+
+------------------+
No rows selected (0.213 seconds)
But it works fine in Spark:
scala> spark.sql("SELECT * FROM hive_random").show
+---+------+
| id|random|
+---+------+
| 3| 13.0|
| 15| 13.0|
...
| 8| 46.0|
| 9| 65.0|
+---+------+
There is a warning after your saveAsTable call, and that's where the hint lies:
'Persisting bucketed data source table default.hive_random into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling DataFrame.saveAsTable.
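A minimal sketch of that workaround, assuming the schema and S3 path from the reproduction above (adjust the DDL to your real table; bucketing is left out because, as the bucketing caveat quoted further down explains, Spark does not produce Hive-compatible buckets on write):
// Hypothetical DDL: define a Hive-compatible table up front
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")
// Then write into the pre-created table instead of letting saveAsTable define it
df.write.mode("overwrite").insertInto("hive_random")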
I suggest trying a couple of things. First, try setting the Hive execution engine to Spark:
set hive.execution.engine=spark;
Second, try creating the external table in the metastore and then saving the data to that table.
The semantics of bucketed tables differ between Spark and Hive.
The documentation details the differences in semantics. It states that:
Data is written to bucketed tables but the output does not adhere with expected
bucketing spec. This leads to incorrect results when one tries to consume the
Spark written bucketed table from Hive.
Workaround: If reading from both engines is the requirement, writes need to happen from Hive.

Spark Scala read CSV which has a comma in the data

My CSV file, which is inside a zip file, has the data below:
"Potter, Jr",Harry,92.32,09/09/2018
John,Williams,78,01/02/1992
I read it using the Spark Scala CSV reader. If I use
.option("quote", "\"")
.option("escape", "\"")
I do not get a fixed number of columns in the output: line 1 produces 5 columns and line 2 produces 4. The desired output should have 4 columns only. Is there any way to read it as a DF or RDD?
Thanks,
Ash
For the given input data, I was able to read the data using:
val input = spark.read.csv("input_file.csv")
This gave me a DataFrame with 4 string columns.
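If the quote and escape options need to be set explicitly (the question used Python-style single quotes, which are not valid Scala), the equivalent Scala call would look roughly like this:
// Scala syntax for the options from the question; quote defaults to " already,
// and escape = " handles doubled quotes inside quoted fields (RFC 4180 style)
val parsed = spark.read
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("input_file.csv")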
Check this.
val df = spark.read.csv("in/potter.txt").toDF("fname","lname","value","dt")
df.show()
+----------+--------+-----+----------+
| fname| lname|value| dt|
+----------+--------+-----+----------+
|Potter, Jr| Harry|92.32|09/09/2018|
| John|Williams| 78|01/02/1992|
+----------+--------+-----+----------+

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data into the format I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import spark.implicits._
import org.apache.spark.sql.{functions => F}

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
A couple of regular expressions should give you the required result.
import spark.implicits._
import org.apache.spark.sql.functions.{explode, regexp_extract, split}
df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output DataFrame:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful.

Replace null values in Spark DataFrame

I saw a solution here, but when I tried it, it didn't work for me.
First I import a cars.csv file :
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/usr/local/spark/cars.csv")
Which looks like the following:
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
Then I do this:
df.na.fill("e",Seq("blank"))
But the null values didn't change.
Can anyone help me?
This is basically very simple. You'll need to create a new DataFrame. I'm using the DataFrame df that you have defined earlier.
val newDf = df.na.fill("e",Seq("blank"))
DataFrames are immutable structures.
Each time you perform a transformation that you need to keep, you must assign the transformed DataFrame to a new value.
You can achieve the same in Java this way:
Dataset<Row> filteredData = dataset.na().fill(0);
If the column were of string type,
val newdf = df.na.fill("e", Seq("blank"))
would work. Since it is a float/double column (as the image shows), you need to use
val newdf = df.na.fill(0.0, Seq("blank"))
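To make that type-matching behaviour concrete, here is a small self-contained sketch (column names assumed, not taken from the original cars.csv) showing that na.fill only touches columns whose type matches the fill value:
import sqlContext.implicits._   // on Spark 2+, use spark.implicits._
// Hypothetical data: one string column and one nullable double column
val demo = Seq(("2012", Some(1.0)), ("2015", None)).toDF("year", "blank")
// A string fill value applies only to string columns, so the double
// column "blank" is left untouched and the null remains
demo.na.fill("e", Seq("blank")).show()
// A numeric fill value matches the double column, so the null becomes 0.0
demo.na.fill(0.0, Seq("blank")).show()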