Count distinct users from RDD - pyspark

I have a JSON file which I loaded into my program using textFile. I want to count the number of distinct users in my JSON data. I cannot convert it to a DataFrame or Dataset. I tried the following code, but it gave me a Java EOF error.
jsonFile = sc.textFile('some.json')
dd = jsonFile.filter(lambda x: x[1]).distinct().count()
# 2nd column is the user ID column
Sample data
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,text":"Total bill for this horrible service? Over $8Gs","date":"2013-05-07 04:34:36"}

Use:
spark.read.json(Json_File, multiLine=True)
to read the JSON directly into a DataFrame.
Try multiLine=True or multiLine=False depending on your file's layout (a single multi-line JSON document vs. one JSON object per line).
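For completeness, a rough sketch of both routes, assuming the field is named user_id as in the sample line above (adjust names and paths to your data):
import json

# DataFrame route, following the suggestion above
df = spark.read.json('some.json')
n_users_df = df.select('user_id').distinct().count()

# Pure-RDD route, if converting to a DataFrame really is not an option:
# parse each line with the standard json module, pull out user_id, then count distinct values
jsonFile = sc.textFile('some.json')
n_users_rdd = (jsonFile
               .map(lambda line: json.loads(line)['user_id'])
               .distinct()
               .count())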

Related

I have a file which has a transaction ID and XML data separated by a comma (,). I want to keep only the XML by removing the transaction ID

I have a file which has a transaction ID and XML data in it, separated by a comma (,).
I want to keep only the XML by removing the transaction ID and then process the XML data, but I am not able to do so in pyspark.
Method 1 - I followed:
I tried to read the file as CSV, dropped the first column, concatenated the remaining columns, and wrote the result out as text.
from pyspark.sql.functions import concat_ws, col

df_spike = spark.read.format('csv').option('delimiter', ',').load(readLocation)
df_spike = df_spike.drop("_c0")
columns = df_spike.columns
df_spike = df_spike.withColumn('fresh', concat_ws(",", *[col(x) for x in columns]))
df_final = df_spike.select('fresh')
df_final.write.text(location)
After writing this data as text, when I try to read it back as XML, not all of the rows are reflected.
Method 2 - I followed:
I read the data as a text file, collected the elements of the column one by one, and stripped the transaction ID prefix from each row.
list_data = []
for i in range(df_spike.count()):
    df_collect = df_spike.collect()[i][0]
    df_list_data = df_collect[12:]
    list_data.append(df_list_data)
This method works, but it takes excessive time because it traverses the data one row at a time.
Is there any efficient method to achieve this?
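A distributed alternative (a sketch, assuming the transaction ID itself never contains a comma; writeLocation is a placeholder) is to read each line as raw text and drop everything up to and including the first comma, so nothing has to be collected on the driver:
from pyspark.sql.functions import regexp_replace

# read each line as a single string column named "value"
df_raw = spark.read.text(readLocation)
# strip everything up to and including the first comma, keeping only the XML payload
df_xml = df_raw.withColumn("value", regexp_replace("value", "^[^,]*,", ""))
df_xml.write.text(writeLocation)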

Adding an additional column containing file name to pyspark dataframe

I am iterating through CSV files in a folder using a for loop and performing some operations on each CSV (getting the count of rows for each unique id and storing all these outputs in a pyspark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?
You can get the file name as a column using the function pyspark.sql.functions.input_file_name, and if your files all have the same schema and you want to apply the same processing pipeline, you don't need to loop over them; you can read them all at once with a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
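From there, if you also want the row count for each unique id per file, as described in the question, a sketch (assuming the id column is literally named id):
counts = df.groupBy("file_name", "id").count()
counts.show(truncate=False)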

how to assign column names available in csv file as header to orc file

I have column names in one .csv file and want to assign these as column headers to a DataFrame in Scala. Since it is a generic script, I don't want to hard-code them in the script but rather pass the values from the csv file.
You can do it:
val columns = spark.read.option("header","true").csv("path_to_csv").schema.fieldNames
val df: DataFrame = ???
df.toDF(columns:_*).write.format("orc").save("your_orc_dir")
in pyspark:
columns = spark.read.option("header","true").csv("path_to_csv").columns
df.toDF(*columns).write.format("orc").save("your_orc_dir")
but storing the schema separately from the data is a bad idea
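A slightly fuller PySpark sketch, assuming the data itself sits in a separate, headerless CSV (the path names are placeholders):
columns = spark.read.option("header", "true").csv("path_to_csv").columns
df = spark.read.option("header", "false").csv("path_to_data_csv")
df.toDF(*columns).write.format("orc").save("your_orc_dir")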

Scala - Writing dataframe to a file as binary

I have a Hive table stored as Parquet, with a column Content storing various documents as base64-encoded strings.
Now I need to read that column and write each row out to a file in HDFS, so that the base64 column is converted back into a document for each row.
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/")
profileDF.registerTempTable("profiles")
val contentsDF = sqlContext.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")
Now contentsDF is storing the binary form of a document in each row, which I need to write to a file. I tried different options but couldn't get the dataframe content back out to a file.
Appreciate any help regarding this.
I would suggest saving it as Parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or convert to RDD and do save as object:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)
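In PySpark terms (the question itself is in Scala, but the idea is the same; the output paths below are placeholders), the two suggestions look roughly like this:
# Option 1: keep the decoded binary column and save it as Parquet
contentsDF.write.parquet("/hdfspath/decoded_contents/")

# Option 2: drop to the RDD level and save each row's raw bytes as an object file
contentsDF.rdd.map(lambda row: bytes(row.contents)).saveAsObjectFile("/hdfspath/decoded_contents_obj/")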

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version :2.2.0
Scala version :2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will return 10 elements, but the output doesn't look very readable.
So a second alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are long, it puts "..." instead of the actual value, which is annoying.
Hence there is third option
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all these cases you won't get a fair sample of the data, as the first 10 rows will always be picked. So to truly pick randomly from the dataframe you can use
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can do for example name.show() to show content of the DataFrame. You can also do collect or collectAsMap to materialize results on driver, but be aware, that data amount should not be too big for driver
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
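And if the goal is literally to land a single value in a driver-side variable, a PySpark-flavoured sketch of the same idea (the question is in Scala; the column name "name" is taken from the question):
first_name = df.select("name").first()["name"]                     # a single value
all_names = [row["name"] for row in df.select("name").collect()]   # all values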