I am creating a DataFrame from a JSON file in HDFS (e.g. hdfs://localhost:9000/usr/sparkApp/finalTxtTweets/-1606454440000.txt/part-00000) in the Scala Spark shell. I have defined a schema, as I only need the id and text parts of the JSON, with val schema = new StructType().add("UserID", StringType, true).add("Text", StringType, true), and with that schema I create the DataFrame like this:
val df = spark.read.schema(schema).option("multiLine", true).option("mode", "PERMISSIVE").json("hdfs://localhost:9000/usr/sparkApp/finalTxtTweets/-1606454440000.txt/part-00000")
However, it returns a DataFrame with the two columns (UserID and Text), but every row is null.
This is what the content of part-00000 looks like:
JObject(List((UserID,JInt(2308359613)), (Text,JString(RT #ONEEsports: The ONE Esports MPL Invitational is now live! Join us as 20 of the best teams from Indonesia, Singapore, Malaysia, Myanmar…))))
Does anyone have an idea how to correctly parse this file, or is there anything wrong with my code? Thank you very much for helping!
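The content shown above appears to be the toString of a json4s JObject rather than JSON text, which would explain the nulls: with an explicit schema, spark.read.json returns null for every field it cannot find in the record. For comparison, a minimal sketch (made-up sample data, not the asker's file; Spark 2.2+ assumed for the json(Dataset[String]) overload) of input that this schema does parse:
// Sample data is hypothetical, shown only to illustrate what spark.read.json
// with this schema parses successfully: one JSON object per record whose field
// names match the schema ("UserID", "Text").
import spark.implicits._
import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType().add("UserID", StringType, true).add("Text", StringType, true)

val sample = Seq("""{"UserID": "2308359613", "Text": "RT #ONEEsports: The ONE Esports MPL Invitational is now live!"}""")
val sampleDF = spark.read.schema(schema).json(sample.toDS())
sampleDF.show(false)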
In Spark Streaming, when the input source is a CSV file read through a socket (Java), a Dataset<Row> is created with only a single string column, and the value of each row is a line sent through the socket.
When I know the format of each line, e.g. the first two values of the CSV line are Strings, the next is an integer, and so on, is it possible to declare my schema and create another Dataset<Row> based on that schema, placing the data accordingly?
Thank you in advance.
First of all, if it is CSV, I don't see any point in using Spark Streaming for that. It is historical data; the data is not changing, so you should use Spark SQL alone to read and process the CSV.
You can create your schema by creating StructFields and declaring the data types, as in the sketch below.
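A minimal sketch of that suggestion (the file path and column names are hypothetical; adjust the types to your data):
import org.apache.spark.sql.types._

// Hypothetical schema: two string columns followed by an integer column.
val schema = StructType(Seq(
  StructField("first_col", StringType, true),
  StructField("second_col", StringType, true),
  StructField("count", IntegerType, true)
))

// Read the CSV directly with Spark SQL and the declared schema (path is hypothetical).
val df = spark.read
  .schema(schema)
  .option("header", "false")
  .csv("/path/to/input.csv")
df.printSchema()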
I am trying to load a CSV file into a DataFrame with my own defined schema.
All I need is to identify bad data while loading it into the DataFrame.
Bad data in my sense is:
1) the input contains a String where the schema says Integer
2) an unexpected/bad delimiter
3) any column value appearing as below when my schema says nullable => false:
3.1) null
3.2) ''
3.3) nothing at all (blank)
I am using an extra "_corrupt_record" column to redirect records matching the above scenarios.
I can see this extra column getting populated for case 1) and case 2).
But when the data is blank (case 3.3), the record is not redirected into this extra column; it does work for 3.1 and 3.2.
What am I doing wrong?
Alternatively, can you suggest any other way that you use in real-world projects to handle/redirect bad data into a file while loading raw files into DataFrames?
input file Products.txt
schema : Product (product_id, product_name, product_type, product_version, product_price)
code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val products_schema = StructType(List(
  StructField("product_id", IntegerType, false),
  StructField("product_name", StringType, false),
  StructField("product_type", StringType, true),
  StructField("product_version", StringType, true),
  StructField("product_price", StringType, true),
  StructField("_corrupt_record", StringType, true)
))

val products_Staging_df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(products_schema)
  .csv("C:\\Users\\u6062310\\Desktop\\DBS\\Product.txt")

products_Staging_df.printSchema()
products_Staging_df.show()
I put in some bad records. When I use df.show(), I expected the records with a blank product_id to also land in the _corrupt_record column, but they do not.
Only null and '' are handled correctly. How do I handle blanks?
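One common workaround (a hedged sketch, not from the original post): PERMISSIVE mode only routes rows that fail to parse into _corrupt_record, so blank or empty required fields have to be checked explicitly after the read, for example:
import org.apache.spark.sql.functions.{col, trim}

// Columns declared nullable = false in the schema.
val requiredCols = Seq("product_id", "product_name")

// A row is "bad" if the parser flagged it or any required column is null/blank.
val isBad = requiredCols
  .map(c => col(c).isNull || trim(col(c).cast("string")) === "")
  .reduce(_ || _) || col("_corrupt_record").isNotNull

val badRecords  = products_Staging_df.filter(isBad)
val goodRecords = products_Staging_df.filter(!isBad)

// Redirect bad records to a file (output path is hypothetical).
badRecords.write.mode("overwrite").option("header", "true").csv("C:\\Users\\u6062310\\Desktop\\DBS\\bad_records")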
DataFrame saveAsTable persists all the column values properly, but the insertInto function does not store all the columns: in particular, the JSON data is truncated and the subsequent column is not stored in the Hive table.
Our Environment
Spark 2.2.0
EMR 5.10.0
Scala 2.11.8
The sample data is
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]}
DF SaveAsTable
val spark = SparkSession.builder().appName("Spark SQL Test").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
enableHiveSupport().getOrCreate()
val zoneStatus = spark.table("zone_status")
zoneStatus.select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).saveAsTable("dwh_zone_status")
saveAsTable stores the data properly in the result table:
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]} 1520453589
DF insertInto
zoneStatus.
select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")
But insertInto does not persist all the contents. The JSON string is stored only partially and the subsequent column is not stored at all.
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13" NULL
We use insertInto in our projects and only recently encountered the issue when parsing the JSON data to pull other metrics: the config content is not stored in full. We are planning to change to saveAsTable, but we could avoid the code change if there is a workaround that can be added to the Spark configuration.
You can use the alternative ways below to insert data into the table.
val zoneStatusDF = zoneStatus.
select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts"))
zoneStatusDF.registerTempTable("zone_status_insert")
Or
zoneStatus.sqlContext.sql("create table zone_status_insert as select * from zone_status")
The reason is that the target table was created with
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
so the commas inside the JSON config value are treated as field separators. After removing ROW FORMAT DELIMITED FIELDS TERMINATED BY ',', insertInto is able to save the entire contents; a sketch of the corrected setup follows.
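A hedged sketch of that corrected setup (column names follow the sample data; the name of the lit(0) column is an assumption): recreate the target table without the comma field delimiter so the embedded commas in the JSON config no longer split the row, then insertInto as before.
// Recreate the target table without ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
// (the default text SerDe delimiter keeps the JSON value in one field).
spark.sql("""
  CREATE TABLE zone_status_insert (
    `site-id` STRING,
    `org-id` STRING,
    groupid STRING,
    zid STRING,
    type STRING,
    flag INT,
    config STRING,
    ts BIGINT
  )
  STORED AS TEXTFILE
""")

zoneStatus.
  select(col("site-id"), col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
  write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")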
I have a Hive table column (Json_String String) with some 1000 rows, where each row is a JSON document of the same structure. I am trying to read the JSON into a DataFrame as below:
val df = sqlContext.read.json("select Json_String from json_table")
but it throws the exception below:
java.io.IOException: No input paths specified in job
Is there any way to read all the rows into a DataFrame, as we do with JSON files using a wildcard?
val df = sqlContext.read.json("file:///home/*.json")
I think what you're asking for is to read the Hive table as usual and transform the JSON column using from_json function.
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.
Given you use sqlContext in your code, I'm afraid that you use Spark < 2.1.0 which then does not offer from_json (which was added in 2.1.0).
The solution then is to use a custom user-defined function (UDF) to do the parsing yourself; a sketch follows at the end of this answer.
val df = sqlContext.read.json("select Json_String from json_table")
The above won't work since the json operator expects a path or paths to JSON files on disk, not the result of executing a query against a Hive table.
json(paths: String*): DataFrame Loads a JSON file (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
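A minimal sketch of the UDF approach for Spark < 2.1 (the field name "some_field" is hypothetical; json4s, which ships with Spark, is used for the parsing):
import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Extract one field from the JSON string; returns null for unparseable rows.
val extractField = udf { json: String =>
  implicit val formats: Formats = DefaultFormats
  scala.util.Try((parse(json) \ "some_field").extractOpt[String]).toOption.flatten.orNull
}

val jsonTable = sqlContext.table("json_table")
val parsed = jsonTable.withColumn("some_field", extractField(jsonTable("Json_String")))
parsed.show(false)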
I need to implement converting csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp and I only have a week of dataset. The timestamp format is:
'yyyy-MM-dd hh:mm:ss'
The output that I desire is that for every day there is a folder (or partition) where the Parquet files for that specific date are located, so there would be 7 output folders or partitions.
I only have a faint idea of how to do this; only sc.textFile comes to mind. Is there a function in Spark that can convert to Parquet? How do I implement this for S3 and HDFS?
Thanks for your help.
If you look into the Spark DataFrame API and the Spark-CSV package, this will achieve the majority of what you're trying to do: reading the CSV file into a DataFrame and then writing the DataFrame out as Parquet will get you most of the way there.
You'll still need to add a step to parse the timestamp and use the result to partition the data, roughly as in the sketch below.
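A hedged sketch of those steps, assuming Spark 2.x (with the older spark-csv package the read call differs slightly); the paths and the "event_time" column name are placeholders:
import org.apache.spark.sql.functions.{col, to_date}

// Gzipped CSVs are decompressed transparently; the same call works for s3a:// and hdfs:// paths.
val raw = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/input/*.csv.gz")

// 'yyyy-MM-dd hh:mm:ss' strings cast cleanly to a date via to_date.
raw.withColumn("event_date", to_date(col("event_time")))
  .write
  .partitionBy("event_date")            // one folder (partition) per day
  .parquet("hdfs:///user/output/pageviews-parquet")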
Old topic, but I think it is important to answer even old topics if they have not been answered correctly.
In Spark version >= 2 the CSV package is already included; before that you need to add the Databricks CSV package to your job, e.g. "--packages com.databricks:spark-csv_2.10:1.5.0".
Example csv:
id,name,date
1,pete,2017-10-01 16:12
2,paul,2016-10-01 12:23
3,steve,2016-10-01 03:32
4,mary,2018-10-01 11:12
5,ann,2018-10-02 22:12
6,rudy,2018-10-03 11:11
7,mike,2018-10-04 10:10
First you need to create the Hive table so that the data written by Spark is compatible with the Hive schema (this might not be needed anymore in future versions).
create table:
create table part_parq_table (
id int,
name string
)
partitioned by (date string)
stored as parquet
After you've done that, you can easily read the CSV and save the DataFrame to that table. The second step overwrites the date column with the date format "yyyy-MM-dd". For each distinct value, a folder will be created containing the matching lines.
SCALA Spark-Shell example:
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
The first two lines are Hive configurations, which are needed to create partition folders that do not already exist.
var df=spark.read.format("csv").option("header","true").load("/tmp/test.csv")
df=df.withColumn("date",substring(col("date"),0,10))
df.show(false)
df.write.format("parquet").mode("append").insertInto("part_parq_table")
After the insert is done, you can directly query the table with "select * from part_parq_table".
The folders will be created in the table folder, on a default Cloudera setup e.g. hdfs:///users/hive/warehouse/part_parq_table
Hope that helps
BR
Read the CSV file /user/hduser/wikipedia/pageviews-by-second-tsv:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
The following code uses Spark 2.0.
import org.apache.spark.sql.types._
var wikiPageViewsBySecondsSchema = StructType(Array(StructField("timestamp", StringType, true),StructField("site", StringType, true),StructField("requests", LongType, true) ))
var wikiPageViewsBySecondsDF = spark.read.schema(wikiPageViewsBySecondsSchema).option("header", "true").option("delimiter", "\t").csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
Convert String-timestamp to timestamp
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.withColumn("timestampTS", $"timestamp".cast("timestamp")).drop("timestamp")
or
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.select($"timestamp".cast("timestamp"), $"site", $"requests")
Write into parquet file.
wikiPageViewsBySecondsDF.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")
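To get the per-day folders the original question asks for, one option (a sketch, assuming the second variant above where the column is still called timestamp; the output path is a placeholder) is to derive a date column and partition the Parquet output by it:
import org.apache.spark.sql.functions.to_date

wikiPageViewsBySecondsDF
  .withColumn("date", to_date($"timestamp"))
  .write
  .partitionBy("date")                 // creates date=2015-03-16/... folders, one per day
  .parquet("/user/hduser/wikipedia/pageviews-by-second-parquet-partitioned")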