I am trying to read a file that contains numbers with decimals, and when I read the CSV file with Spark, I get null for some columns and only a few digits for other columns. I guess it has to do with the options I set during spark.read. Here is my code:
from pyspark.sql.types import DateType, DecimalType, StringType, StructField, StructType
schema = StructType([
StructField("Date", DateType(), False),
StructField("Total MV", DecimalType(16,5), False),
StructField("Total TWR", DecimalType(16,5), False),
StructField("Prod1 MV", DecimalType(16,5), False),
StructField("Prod1TWR", DecimalType(16,5), False),
StructField("Prod2 MV", DecimalType(16,5), False),
StructField("Prod2TWR", DecimalType(16,5), False),
StructField("StockAll", DecimalType(16,5), False)
])
df_mr = (spark.read
.option("delimiter", ";")
.option("inferSchema", True)
.csv("here is the link of the file", locale="sv_SE"))
df_mr.schema
df = (
spark.read
.option("delimiter", ";")
.schema(schema)
.csv("here is the link to the file", locale="sv_SE")
)
df.createOrReplaceTempView("output")
df.show()
The output I get is shown below, and when I then use SQL
%sql
select * from output
to get the table, I get the following SQL table. I don't understand why I get nulls and number formats that differ from the first image. The sample input data is shown in indata.
I tried with your sample data.
DecimalType(20,5) gives a precision of 20 digits in total, with 5 of them after the decimal point, so larger values fit without overflowing, and adding .option("header", True) treats the first line as the header. Without that option the header row is parsed against the Date/Decimal schema and turns into a row of nulls, which is where the nulls were coming from.
from pyspark.sql.types import DateType, DecimalType, StringType, StructField, StructType
from pyspark.sql.functions import col, unix_timestamp, to_date
from pyspark.sql import functions as F
schema = StructType([
StructField("Date", DateType(), False),
StructField("Total MV", DecimalType(20,5), False),
StructField("Total TWR", DecimalType(20,5), False),
StructField("Prod1 MV", DecimalType(20,5), False),
StructField("Prod1TWR", DecimalType(20,5), False),
StructField("Prod2 MV", DecimalType(20,5), False),
StructField("Prod2TWR", DecimalType(20,5), False),
StructField("StockAll", DecimalType(20,5), False)
])
df_mr = (spark.read
.option("delimiter", ";")
.option("header", True)
.option("inferSchema", True)
.csv("/FileStore/tables/samplecsvdatadeci.txt"))
df_mr.printSchema()
df_mr.show()
df = (
spark.read
.option("delimiter", ";")
.option("header", True)
.schema(schema)
.csv("/FileStore/tables/samplecsvdatadeci.txt")
)
df.printSchema()
df.show()
df.createOrReplaceTempView("output")
%sql
select * from output
Output
Suppose I have a dataframe where I have an id and a distinct list of keys and values, such as the following:
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType
schema = StructType(
[
StructField('id', StringType(), True),
StructField('columns',
ArrayType(
StructType([
StructField('key', StringType(), True),
StructField('value', IntegerType(), True)
])
)
)
]
)
data = [
('1', [('Savas', 5)]),
('2', [('Savas', 5), ('Ali', 3)]),
('3', [('Savas', 5), ('Ali', 3), ('Ozdemir', 7)])
]
df = spark.createDataFrame(data, schema)
df.show()
For each struct in the array-type column I want to create a column, as follows:
df1 = df\
.withColumn('temp', fun.explode('columns'))\
.select('id', 'temp.key', 'temp.value')\
.groupby('id')\
.pivot('key')\
.agg(fun.first(fun.col('value')))\
.sort('id')
df1.show()
Is there a more efficient way to achieve the same result?
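One hedged sketch of an alternative, assuming the key set ('Savas', 'Ali', 'Ozdemir') is known up front: build a map from the array of structs with map_from_entries (available in Spark 2.4+) and select each key directly, which avoids the explode/groupby/pivot shuffle. This is only a sketch; if the keys are not known in advance, the pivot approach above is still needed.
import pyspark.sql.functions as fun
known_keys = ['Savas', 'Ali', 'Ozdemir']  # assumption: the key set is known in advance
df1 = df.select(
'id',
*[fun.map_from_entries('columns').getItem(k).alias(k) for k in known_keys]
)
df1.show()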
I am reading a gz file in PySpark, creating an RDD and a schema, and then using that RDD to create the DataFrame. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField("word", StringType(), True),
StructField("count1", IntegerType(), True),
StructField("count2", IntegerType(), True),
StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
Moreover, I also want to calculate the number of rows in my table, which is why I used the SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file looks something like this:
A'String' some_number some_number some_number
The some_number fields are in string format.
Please guide me on what I am doing wrong.
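A hedged guess at the cause: int(p[3]).strip() calls .strip() on an int, which raises an AttributeError, and because RDDs are evaluated lazily the error only surfaces when an action such as df.show() runs. A minimal sketch with the strip applied to the string before the cast, plus df.count() for the row count:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
# strip the raw string first, then cast to int
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3].strip())))
schema = StructType([
StructField("word", StringType(), True),
StructField("count1", IntegerType(), True),
StructField("count2", IntegerType(), True),
StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
print(df.count())  # row count without a SQL query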
I am new to Apache Spark. I created the schema and DataFrame, and it shows me a result, but the format is not good and it is so messy that I can hardly read the lines. So I want to show my result in pandas format. I attached a screenshot of my DataFrame result, but I don't know how to show my result in pandas format.
Here's my code
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip
schema = StructType([StructField("crimeid", StringType(), True),
StructField("Month", StringType(), True),
StructField("Reported_by", StringType(), True),
StructField("Falls_within", StringType(), True),
StructField("Longitude", FloatType(), True),
StructField("Latitue", FloatType(), True),
StructField("Location", StringType(), True),
StructField("LSOA_code", StringType(), True),
StructField("LSOA_name", StringType(), True),
StructField("Crime_type", StringType(), True),
StructField("Outcome_type", StringType(), True),
])
df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()
PATH = "crimes.gz"
csvfile = spark.read.format("csv")\
.option("header", "false")\
.schema(schema)\
.load(PATH)
df1 =csvfile.show()
It shows the result like below,
but I want this data in pandas form.
Thanks
You can try showing the rows vertically, or truncate long values if you like:
df.show(2, vertical=True)
df.show(2, truncate=4, vertical=True)
Please try:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip
schema = StructType([StructField("crimeid", StringType(), True),
StructField("Month", StringType(), True),
StructField("Reported_by", StringType(), True),
StructField("Falls_within", StringType(), True),
StructField("Longitude", FloatType(), True),
StructField("Latitue", FloatType(), True),
StructField("Location", StringType(), True),
StructField("LSOA_code", StringType(), True),
StructField("LSOA_name", StringType(), True),
StructField("Crime_type", StringType(), True),
StructField("Outcome_type", StringType(), True),
])
df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()
pandasDF = df.toPandas()  # convert the PySpark DataFrame to a pandas DataFrame
print(pandasDF.head())  # print the first 5 rows
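One caveat worth hedging: toPandas() collects the entire DataFrame onto the driver, so for a quick look at a large file it may be safer to limit the rows first, for example:
pandasDF = df.limit(5).toPandas()  # bring only 5 rows to the driver
print(pandasDF)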
I am reading a CSV file using PySpark with a predefined schema.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
schema = StructType([
StructField("col1", IntegerType(), True),
StructField("col2", StringType(), True),
StructField("col3", FloatType(), True)
])
df = (spark.read
.schema(schema)
.option("header", True)
.option("delimiter", ",")
.csv(path))
Now, in the CSV file there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values of a different data type than the one defined in the schema.
How do I achieve this?
In pyspark versions >2.2 you can use columnNameOfCorruptRecord with csv:
schema = StructType(
[
StructField("col1", IntegerType(), True),
StructField("col2", StringType(), True),
StructField("col3", FloatType(), True),
StructField("corrupted", StringType(), True),
]
)
df = spark.read.csv(
path,
schema=schema,
header=True,
sep=",",
mode="PERMISSIVE",
columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3| corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
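A minimal sketch of what the permissive reader does with that file, assuming it is saved at the hypothetical path euro.csv: the row 0,10,1,00 splits into four tokens against the two data columns, so the whole record is flagged as malformed and the raw line lands in the corrupt-record column.
from pyspark.sql.types import StructType, StructField, FloatType, StringType
euro_schema = StructType([
StructField("col1", FloatType(), True),
StructField("col2", FloatType(), True),
StructField("corrupted", StringType(), True),
])
spark.read.csv(
"euro.csv",  # hypothetical path holding the two lines shown above
schema=euro_schema,
header=True,
sep=",",
mode="PERMISSIVE",
columnNameOfCorruptRecord="corrupted",
).show(truncate=False)
# The full line "0,10,1,00" shows up in "corrupted"; depending on the Spark
# version the data columns come back null or only partially parsed, which is
# the point: corruption is reported per record, not per field.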
I have a text file similar to the one below:
20190920
123456789,6325,NN5555,123,4635,890,C,9
985632465,6467,KK6666,654,9780,636,B,8
258063464,6754,MM777,789,9461,895,N,5
And I am using Spark 1.6 with Scala to read this text file:
val df = sqlcontext.read.format("com.databricks.spark.csv")
.option("header","false").option("inferSchema","false").load(path)
df.show()
When I used the above command, it read only the first column. Is there anything I can add to read the file with all column values?
Output I got:
20190920
123456789
985632465
258063464
3
In this case you should provide a schema; since the first line has only one field, the reader otherwise determines that the file has a single column. So your code will look like this:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val mySchema = StructType(
List(
StructField("col1", StringType, true),
StructField("col2", StringType, true),
// and other columns ...
)
)
val df = sqlcontext.read
.schema(mySchema)
.option("com.databricks.spark.csv")
.option("header","false")
.option("inferSchema","false")
.load(path)