How to query a DataFrame and RDD without a schema - PySpark

How do I load a CSV file that has no schema into a Spark RDD and DataFrame, and then assign a schema?
I have a file with data like this:
AA,19970101,47.82,47.82,47.82,47.82,0
The schema should be:
stockname,date,highprice,lowprice,openprice,closeprice,volume

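You can first create an RDD from the input file, split and cast each line so that it matches the schema, and then create the DataFrame on top of that RDD with an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

def to_row(line):
    # Split one CSV line and cast each field to the type declared in the schema.
    f = line.split(",")
    return (f[0], f[1], float(f[2]), float(f[3]), float(f[4]), float(f[5]), int(f[6]))

rdd = sc.textFile("//path/to/textfile/file.txt").map(to_row)
schema = StructType([
    StructField("stockname", StringType(), True),
    StructField("date", StringType(), True),
    StructField("highprice", DoubleType(), True),
    StructField("lowprice", DoubleType(), True),
    StructField("openprice", DoubleType(), True),
    StructField("closeprice", DoubleType(), True),
    StructField("volume", LongType(), True),
])
df = sqlContext.createDataFrame(rdd, schema)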
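The per-line parsing step can be written and checked as plain Python before it is handed to Spark. The following is a minimal sketch, assuming the seven-column layout from the question; the function name parse_stock_line, the choice to turn the date into a datetime.date, and the decision to return None for malformed lines are all assumptions made for illustration.

```python
from datetime import datetime

def parse_stock_line(line):
    """Turn one CSV line into a typed tuple, or None if the line is malformed."""
    fields = line.strip().split(",")
    if len(fields) != 7:
        return None  # wrong number of columns
    try:
        return (
            fields[0],                                      # stockname
            datetime.strptime(fields[1], "%Y%m%d").date(),  # date
            float(fields[2]),                               # highprice
            float(fields[3]),                               # lowprice
            float(fields[4]),                               # openprice
            float(fields[5]),                               # closeprice
            int(fields[6]),                                 # volume
        )
    except ValueError:
        return None  # a field could not be converted

row = parse_stock_line("AA,19970101,47.82,47.82,47.82,47.82,0")
print(row)
```

With Spark, a function like this could be applied with rdd.map(parse_stock_line) followed by a filter that drops the None results; the date field would then be declared as DateType() rather than StringType() in the schema.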

Related

PySpark JSON DataFrame created with all null values

I have created a DataFrame from a JSON file. The DataFrame is created with the full schema, but every value is null. It's a valid JSON file.
df = spark.read.json(path)
When I display the data using df.display(), all I can see is null in the DataFrame. Can anyone tell me what the issue could be?
Reading the JSON file without enabling the multiline option might be the cause.
Here is a sample demonstration.
My sample JSON:
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when multiline was not used.
When multiline was enabled I got the proper result.
df = spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
If you also want to supply the schema explicitly, you can do it like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField('first_name', StringType(), True),
    StructField('id', IntegerType(), True),
    StructField('last_name', StringType(), True),
])
df = spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()
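To see why the option matters: by default Spark's JSON reader expects one self-contained JSON document per line, and a pretty-printed array does not fit that shape. The sketch below imitates that with the standard json module only. It is an analogy, not Spark's actual parser, and the helper name parse_per_line is made up for this example.

```python
import json

# A shortened version of the sample file: one JSON array spread over two lines.
sample = '''[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"}]'''

def parse_per_line(text):
    """Parse each physical line as its own JSON document; None marks a failure."""
    results = []
    for line in text.splitlines():
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)
    return results

per_line = parse_per_line(sample)  # every line fails on its own
whole = json.loads(sample)         # the file as a whole is valid JSON

print(per_line)
print(whole)
```

Neither line is valid JSON by itself (the first ends with a dangling comma, the second with an unmatched bracket), which is the same situation that leaves a line-oriented reader with nothing usable for the declared columns.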

Create a column for each struct in an array of structs in a PySpark DataFrame

Suppose I have a DataFrame where I have an id and a distinct list of keys and values, such as the following:
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType
schema = StructType(
    [
        StructField('id', StringType(), True),
        StructField('columns',
            ArrayType(
                StructType([
                    StructField('key', StringType(), True),
                    StructField('value', IntegerType(), True)
                ])
            )
        )
    ]
)
data = [
    ('1', [('Savas', 5)]),
    ('2', [('Savas', 5), ('Ali', 3)]),
    ('3', [('Savas', 5), ('Ali', 3), ('Ozdemir', 7)])
]
df = spark.createDataFrame(data, schema)
df.show()
For each struct in the array column I want to create a column, as follows:
df1 = (
    df
    .withColumn('temp', fun.explode('columns'))
    .select('id', 'temp.key', 'temp.value')
    .groupby('id')
    .pivot('key')
    .agg(fun.first(fun.col('value')))
    .sort('id')
)
df1.show()
Is there a more efficient way to achieve the same result?
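For reference, the reshaping that the explode/pivot chain is meant to produce can be sketched in plain Python on the same sample data. This only illustrates the target result, not a faster Spark approach; the helper pivot_row is hypothetical, and the keys are sorted here purely to make the column order deterministic.

```python
data = [
    ('1', [('Savas', 5)]),
    ('2', [('Savas', 5), ('Ali', 3)]),
    ('3', [('Savas', 5), ('Ali', 3), ('Ozdemir', 7)]),
]

# Every distinct key becomes one output column.
keys = sorted({key for _, pairs in data for key, _ in pairs})

def pivot_row(row_id, pairs):
    """Spread one id's (key, value) pairs over the key columns; missing keys give None."""
    lookup = {}
    for key, value in pairs:
        lookup.setdefault(key, value)  # keep the first value seen per key
    return (row_id, *[lookup.get(k) for k in keys])

header = ('id', *keys)
rows = [pivot_row(row_id, pairs) for row_id, pairs in data]

print(header)
for r in rows:
    print(r)
```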

Issue with pyspark df.show

I am reading a gz file in PySpark, creating an RDD and a schema, and then using that RDD to create the DataFrame, but I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True),
])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
I also want to count the rows in my table, which is why I used the SQL query. But whenever I call .show(), an error is thrown.
The data in the gz file looks something like this:
A'String' some_number some_number some_number
The some_number values are in string format.
Please tell me what I am doing wrong.
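One likely cause is visible in the mapping step: int(p[3]).strip() calls .strip() on the int that int() returns, and int has no such method. RDD transformations are lazy, so the failure only surfaces when an action such as .show() runs. The sketch below reproduces this in plain Python; the sample line is hypothetical and only follows the tab-separated layout described in the question.

```python
def parse_broken(line):
    p = line.split("\t")
    # Mirrors the question: .strip() is applied to an int, which has no such method.
    return (p[0], int(p[1]), int(p[2]), int(p[3]).strip())

def parse_fixed(line):
    p = line.split("\t")
    # Strip the string first, then convert it.
    return (p[0], int(p[1]), int(p[2]), int(p[3].strip()))

sample_line = "A'String'\t1\t2\t3\n"  # hypothetical line, not taken from the real file

try:
    parse_broken(sample_line)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

fixed = parse_fixed(sample_line)
print(error_name, fixed)
```

If this is the only problem, swapping the order of .strip() and int() in the lambda should let both df.show() and the COUNT(*) query run; df.count() would also return the row count without going through SQL.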

Get the column names of malformed records while reading a csv file using pyspark

I am reading a CSV file using PySpark with a predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True),
])
df = (
    spark.read
    .schema(schema)
    .option("header", True)
    .option("delimiter", ",")
    .csv(path)
)
In the CSV file, col1 contains a float value and col3 contains a string value. I need to raise an exception and get the names of these columns (col1, col3), because they contain values whose data type differs from the one defined in the schema.
How do I achieve this?
In PySpark versions >2.2 you can use columnNameOfCorruptRecord with CSV:
schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", FloatType(), True),
        StructField("corrupted", StringType(), True),
    ]
)
df = spark.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3|   corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma-delimited file with one row and two floating-point columns, holding the euro amounts 0,10 and 1,00 (written with a decimal comma). The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
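The ambiguity is easy to reproduce with Python's standard csv module. This is a small sketch using the two-line file above; it only shows how the record splits and makes no claim about how Spark reports it.

```python
import csv
import io

text = "col1,col2\n0,10,1,00\n"

# Read the header and the single data record.
header, record = list(csv.reader(io.StringIO(text)))

# The record splits into more fields than there are columns.
extra = len(record) - len(header)

print(header)
print(record)
print(extra)
```

The record yields four fields for two columns, and nothing in the data says which pair of fields belongs together, so the reader can only flag the whole line.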

Is there a way to make an empty schema with column names from a list?

I am trying to make an empty PySpark dataframe in the case where it didn't exist before. I also have a list of column names. Is it possible to define an empty PySpark dataframe without manual assignment?
I have a list of columns final_columns, which I can use to select a subset of columns from a dataframe. However, in the case when this dataframe doesn't exist, I would like to create an empty dataframe with the same columns in final_columns. I would like to do this without manually assigning the names.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']
try:
    sdf = sqlContext.table('test_table')
except Exception:
    print("test_table is empty")
    mySchema = StructType([
        StructField("colA", StringType(), True),
        StructField("colB", StringType(), True),
        StructField("colC", StringType(), True),
        StructField("colD", StringType(), True),
        StructField("colE", DoubleType(), True),
    ])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)
sdf = sdf.select(final_columns)
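One way to avoid typing out each field is to generate the schema from the list. The sketch below builds a DDL-style schema string in plain Python; the helper ddl_schema and its overrides parameter are hypothetical, and it assumes the column names are simple identifiers that need no quoting.

```python
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']

def ddl_schema(columns, default_type="string", overrides=None):
    """Build a DDL schema string such as "colA string, colB string"."""
    overrides = overrides or {}
    return ", ".join(
        f"{name} {overrides.get(name, default_type)}" for name in columns
    )

# colE is a double in the question's schema; every other column is a string.
schema_ddl = ddl_schema(final_columns, overrides={"colE": "double"})
print(schema_ddl)
```

createDataFrame accepts a schema given as a DDL-formatted string, so something like spark.createDataFrame([], schema_ddl) should produce the empty DataFrame. The same list could instead feed a comprehension of StructField(name, StringType(), True) entries if a StructType object is preferred.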