When using PySpark to write Parquet files, is there a way to set the encoding? It looks like the default is "Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)". The Parquet format defines, among others:
Plain (PLAIN = 0)
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
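One thing that may work (an untested sketch, not a documented DataFrameWriter option): Parquet-MR exposes a writer property, parquet.enable.dictionary, that can be set on the Hadoop configuration. Note this only toggles dictionary encoding on or off; it does not let you pick an arbitrary encoding per column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-encoding-test").getOrCreate()

# Assumption: the Parquet output format used by Spark honours this
# Hadoop property and falls back to plain encoding when it is false.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.enable.dictionary", "false"
)

df = spark.range(1000)  # small example DataFrame
df.write.mode("overwrite").parquet("/tmp/parquet_no_dictionary")  # example path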
I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append all the filenames stored in the source folder to it:
# read the filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# convert the filenames to single-element tuples
d = [(x,) for x in filenames]
print(d)
Read the first row from each file and store it, together with the filename, in a list:
# build the data by reading the first row of each file
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with the column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
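For convenience, here is the same approach condensed into one self-contained sketch; the folder path and header option are carried over from above, and dbutils is assumed to be available as it is in Databricks notebooks.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
source = "/FileStore/tables/"  # example folder from above

# list the CSV files in the source folder (dbutils is Databricks-specific)
filenames = [f.name for f in dbutils.fs.ls(source) if f.name.endswith(".csv")]

# collect only the first row of each file; .limit(1) keeps the collected
# result to a single row, although Spark may still scan part of the file
data = []
for name in filenames:
    first = spark.read.option("header", "true").csv(source + name).limit(1).collect()
    data.append((name, str(first[0]) if first else None))

df = spark.createDataFrame(data, ["filename", "filedata"])
df.show(truncate=False)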
Similar to How do I parse large compressed csv files in Foundry?, but without the compression: I have a system-generated (>10GB) CSV file that needs to be parsed into a Foundry dataset.
A dataset this size normally causes the driver to OOM, so how can I parse this file?
Using the dataset's filesystem, you can read the file line by line and yield one row per line, splitting on the separator (a comma in this case).
from pyspark.sql import Row

fs = raw_dataset.filesystem()

def process_file(fl):
    # open the file referenced by this row of fs.files() rather than a hard-coded name
    with fs.open(fl.path, "r") as f:
        header = [x.strip() for x in f.readline().split(",")]
        Log = Row(*header)
        for line in f:
            # strip the trailing newline before splitting into columns
            yield Log(*line.rstrip("\n").split(","))

rdd = fs.files().rdd
rdd = rdd.flatMap(process_file)
df = rdd.toDF()
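If this needs to run as a build rather than interactively, a sketch of how the same generator might be wired into a transforms-python transform is below, assuming the usual transforms.api decorators; the dataset paths are placeholders.

from pyspark.sql import Row
from transforms.api import transform, Input, Output

@transform(
    out=Output("/Project/folder/parsed_csv"),      # placeholder output path
    raw_dataset=Input("/Project/folder/raw_csv"),  # placeholder input path
)
def parse_csv(out, raw_dataset):
    fs = raw_dataset.filesystem()

    def process_file(fl):
        # same streaming generator as above: one row at a time, never the whole file
        with fs.open(fl.path, "r") as f:
            header = [x.strip() for x in f.readline().split(",")]
            Log = Row(*header)
            for line in f:
                yield Log(*line.rstrip("\n").split(","))

    out.write_dataframe(fs.files().rdd.flatMap(process_file).toDF())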
I am trying to split the rows of a .tsv.gz file into good and bad rows by counting the number of delimiters, and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the value is displayed correctly in the Spark shell:
good.first()
But when I try to write to an output file, I see the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;#1287b635
[Ljava.lang.String;#2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to the arrays' object representation being written instead of the string values.
You should convert each array of strings back into a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
Here is the string-to-binary conversion program I'm using to convert the text "haseeb" into binary form:
str='haseeb'
int16bStr = uint16(str);
typecast(int16bStr,'uint8');
a=dec2bin(int16bStr)
This gives the following output:
str =
haseeb
a =
1101000
1100001
1110011
1100101
1100101
1100010
I want to convert this binary output of "a" back to text (i.e. "haseeb"). I am using MATLAB R2009b.
You can just do this:
word = reshape(char(bin2dec(a)), 1, [])
And you can skip the typecast just by doing:
int8bStr = uint8(str);
I'm new to PowerBuilder 12 and would like to know whether there is any way to determine the encoding (e.g. Unicode, Big5) of an input file. Any comments and code samples are appreciated! Thanks!
From the PB 12.5 help file:
FileEncoding ( filename )
filename : The name of the file you want to test for encoding type
Return Values
EncodingANSI!
EncodingUTF8!
EncodingUTF16LE!
EncodingUTF16BE!
If filename does not exist, returns null.
Finding Unicode is pretty easy if you assume the Unicode file has a BOM prefix (although in reality not all Unicode files have one). Some code to do this is below. However, I have no idea about Big5; at first glance at the spec (I've never had occasion to use it), it doesn't look like it has a similar prefix.
Good luck,
Terry
function of_filetype (string as_filename) returns encoding
long ll_File
encoding le_Return
blob lblb_UTF16BE, lblb_UTF16LE, lblb_UTF8, lblb_Test, lblb_BOMTest
lblb_UTF16BE = Blob ("~hFE~hFF", EncodingANSI!)
lblb_UTF16LE = Blob ("~hFF~hFE", EncodingANSI!)
lblb_UTF8 = Blob ("~hEF~hBB~hBF", EncodingANSI!)
SetNull (le_Return)
// Get a set of bytes to test
ll_File = FileOpen (as_FileName, StreamMode!, Read!, Shared!)
FileRead (ll_File, lblb_Test)
FileClose (ll_File)
// test for BOMs: UTF-16BE (FE FF), UTF-16LE (FF FE), UTF-8 (EF BB BF)
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16BE))
IF lblb_BOMTest = lblb_UTF16BE THEN RETURN EncodingUTF16BE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16LE))
IF lblb_BOMTest = lblb_UTF16LE THEN RETURN EncodingUTF16LE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF8))
IF lblb_BOMTest = lblb_UTF8 THEN RETURN EncodingUTF8!
//I've removed a hack from here that I wouldn't encourage others to use, basically checking for
//0x00 in places I'd "expect" them to be if it was a Unicode file, but that makes assumptions
RETURN le_Return