reading multiple csv files using pyspark

I have a requirement to read multiple csv files in one go. These csv files may have a variable number of columns, in any order, and I need to read only specific columns from them. How do I do that? I have tried defining a custom schema, but then I get the wrong data in the columns.
For example, a CSV file with the header:
ID, Name, Address
How do I select only the ID and Address columns? If I use select(ID, Address) with a custom schema, I get the ID and Name data in the Address column. I want to select only the ID and Address columns by header name while reading.
Thanks,
Naveed

You can iterate over the files and create a final dataframe like:
from pyspark.sql import types as t

files = ['path/to/file1.csv', 'path/to/file2.csv', 'path/to/file3.csv', 'path/to/file4.csv']

# define the output dataframe's schema; column names and types should be correct
schema = t.StructType([
    t.StructField("a", t.StringType(), True),
    t.StructField("c", t.StringType(), True)
])
output_df = spark.createDataFrame([], schema)

for file in files:
    df = spark.read.csv(file, header=True)
    output_df = output_df.union(df.select('a', 'c'))

output_df.show()
output_df will contain your desired output.
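If some of the files might be missing one of the wanted columns, a variation like the sketch below selects by header name and fills missing columns with nulls. This is only a sketch: it assumes the wanted columns are strings and that files is the list from above.
from pyspark.sql import functions as F, types as t

wanted = ["ID", "Address"]  # columns to keep, matched by header name
schema = t.StructType([t.StructField(c, t.StringType(), True) for c in wanted])
output_df = spark.createDataFrame([], schema)

for file in files:
    df = spark.read.csv(file, header=True)
    # use the column if this file has it, otherwise fill it with nulls
    cols = [F.col(c) if c in df.columns else F.lit(None).cast("string").alias(c) for c in wanted]
    output_df = output_df.union(df.select(*cols))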

Related

Pyspark dynamic column name

I have a dataframe which contains months and will change quite frequently. I am saving these dataframe values as a list, e.g. months = ['202111', '202112', '202201']. I am using a for loop to iterate through all list elements and trying to provide dynamic column names with the following code:
for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999)).alias(
                concat("col3_"), f.lit(i.col)
            )
        )
    )
So basically, in the alias I am trying to give the column name as a combination of a constant (minInv_) and a variable (e.g. 202111), but I am getting an error. How can I give a column name as a combination of a fixed string and a variable?
Thanks in advance!
.alias("col3_" + str(i))
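Putting it together with the loop from the question (months, adjustment_1_prepared_df and condition are as defined there; collecting the per-month frames into a results list is just one illustrative way to keep them around):
from pyspark.sql import functions as f

results = []
for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999))
             .alias("col3_" + str(i))  # fixed string + the month value from the loop
        )
    )
    results.append(df)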

How to drop first row from parquet file?

I have a parquet file which contains two columns (id, feature). The file consists of 14348 rows.
How do I drop the first row (id, feature) from the file?
Code:
val df = spark.read.format("parquet").load("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
val header = df.first()
val data = df.filter(row => row != header)
data.show()
If you are trying to "ignore" the schema defined in the file, that is done implicitly once you read the file with Spark, like:
spark.read.format("parquet").load(your_file)
If you are trying to only skip the first row of your DF and you already know the id, you can do: val filteredDF = originalDF.filter(s"id != '${excludeID}'"). If you don't know the id, you can use monotonically_increasing_id to tag the rows and then filter, similar to: filter spark dataframe based on maximum value of a column
You need to drop the first row based on its id if you know it; otherwise go for the indexing approach, i.e. assign a row number and delete the first row.
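For illustration, here is a rough PySpark sketch of that indexing idea (the same approach translates to the Scala API; _row_id is just a temporary helper column, and the path is the one from the question):
from pyspark.sql import functions as F

df = spark.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
# tag each row with an increasing id; the first row read gets the smallest id
# (note: parquet has no guaranteed row order, so "first" depends on how the file is read)
indexed = df.withColumn("_row_id", F.monotonically_increasing_id())
first_id = indexed.agg(F.min("_row_id")).collect()[0][0]
# keep everything except that row, then drop the helper column
data = indexed.filter(F.col("_row_id") != first_id).drop("_row_id")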
I'm using Spark 2.4.0, and you could use the header option on the DataFrameReader call like so:
spark.read.format("csv").option("header", true).load(<path_to_file>)
The reference for the other DataFrameReader options is here.

Load multiple .csv-files into one table and create ID per .csv

Hi. I am using PostgreSQL 9.5 and I am struggling with a problem.
I have multiple .csv files (40), and all of them have the same column count and column names. I would now like to import them into one table, but I want an ID per .csv file. Is it possible to automate this in Postgres (including adding a new id column)? And how?
The approach might look like this:
test1.csv ==> table_agg ==> set ID = 1
test2.csv ==> table_agg ==> set ID = 2
.
.
.
test40.csv ==> table_agg ==> set ID = 40
I would be very glad if someone could help me.
Add a table that contains the filename and other info you would like to add to each dataset. Add a serial column, that you can use as a foreign key in your data table, i.e. a dataset identifier.
Create the data table. Add a foreign key field to refer to the dataset entry in the other table.
Use a Python script to parse and import the csv files into the database. First add the entry to the datasets table. Then determine the dataset ID and insert the rows into the data table with the corresponding dataset ID set.
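A minimal sketch of that approach, assuming psycopg2 as the driver, a datasets table with a serial id and a filename column, and a data table whose columns end with a dataset_id foreign key (all table, column and connection names here are illustrative):
import csv
import glob
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # illustrative connection string
cur = conn.cursor()

for path in sorted(glob.glob("PathToFolder/*.csv")):
    # register the file and get its generated dataset id
    cur.execute("INSERT INTO datasets (filename) VALUES (%s) RETURNING id", (path,))
    dataset_id = cur.fetchone()[0]

    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # the column list of the data table is illustrative
            cur.execute(
                "INSERT INTO data_table (col_a, col_b, dataset_id) VALUES (%s, %s, %s)",
                (*row, dataset_id),
            )

conn.commit()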
My simple solution in Python to assign an ID to each .csv file and to combine all .csv files into one output:
import glob, os, pandas as pd

path = r'PathToFolder'
# all .csv files in this folder
allFiles = glob.glob(path + "/*.csv")
# save DFs in list_
list_ = []
# DF for later concat
frame = pd.DataFrame()
# ID per DF/.csv
count = 0
for file_ in allFiles:
    # read .csv files
    df = pd.read_csv(file_, index_col=None, skiprows=[1], header=0)
    # new column with ID per DF
    df['new_id'] = count
    list_.append(df)
    count = count + 1
frame = pd.concat(list_)
frame.to_csv('PathToOutputCSV', index=False)
Continue with SQL:
CREATE TABLE statement..
COPY TABLE_NAME FROM 'PathToCSV' DELIMITER ',' CSV HEADER;

Filter columns having count equal to the input file rdd

I'm filtering Integer columns from the input parquet file with the logic below, and I've been trying to modify this logic to add an additional validation: if any of the input columns has a count equal to the input parquet file's rdd count, I want to filter out that column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file rdd count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get array of StructFields for the integer columns
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the column names as an array of strings
val m = z.columns
The new logic would be like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt // pseudocode
I do not want to pass the column name explicitly in the new condition, since the column whose count equals the input file count will change (val d = .. above).
How do we write the logic for this?
According to my understanding of your question, you are trying to filter in columns with integer as dataType and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add the column count filter to your existing filter as:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
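For reference, a rough PySpark version of the same idea (df and the "inputfile" path are taken from the question; the distinct-count interpretation is the one described above):
cnt = spark.read.parquet("inputfile").count()

# keep integer columns whose distinct count differs from the input file row count
keep = [field.name for field in df.schema.fields
        if field.dataType.typeName() == "integer"
        and df.select(field.name).distinct().count() != cnt]
z = df.select(*keep)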
Jeanr and Ramesh suggested the right approach, and here is what I did to get the desired output; it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)

Spark: Read a csv file into a map like structure using scala

I have a csv file of the format:
key, age, marks, feature_n
abc, 23, 84, 85.3
xyz, 25, 67, 70.2
Here the number of features can vary. In this example I have 3 features (age, marks and feature_n). I have to convert it into a Map[String,String] as below:
[key,value]
["abc","age:23,marks:84,feature_n:85.3"]
["xyz","age:25,marks:67,feature_n:70.2"]
I have to join the above data with another dataset A on column 'key' and append the 'value' to another column in dataset A. The csv file can be loaded into a dataframe with schema (schema defined by first row of the csv file).
val newRecords = sparkSession.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/records.csv");
Post this I will join the dataframe newRecords with dataset A and append the 'value' to one of the columns of dataset A.
How can I iterate over each column for each row, excluding the column "key" and generate the string of format "age:23,marks:84,feature_n:85.3" from newRecords?
I can alter the format of csv file and have the data in JSON format if it helps.
I am fairly new to Scala and Spark.
I would suggest the following solution:
val updated: RDD[String] = newRecords.drop(newRecords.col("key")).rdd.map(el => { val a = el.toSeq; "age:" + a.head + ",marks:" + a(1) + ",feature_n:" + a(2) })
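Since the number of features can vary, one way is to build the value string from whatever columns are present besides key. Here is a rough sketch in PySpark (concat_ws also exists in the Scala API, so the same idea carries over):
from pyspark.sql import functions as F

newRecords = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/records.csv")
value_cols = [c for c in newRecords.columns if c != "key"]
# build "age:23,marks:84,feature_n:85.3" from all non-key columns
with_value = newRecords.select(
    "key",
    F.concat_ws(",", *[F.concat_ws(":", F.lit(c), F.col(c)) for c in value_cols]).alias("value"),
)
The resulting with_value frame can then be joined with dataset A on key and the value column appended there.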