How to load dynamic data into a Cassandra table? How to read a CSV file with a header? - scala

I want to load a CSV file (its columns change) into a Cassandra table.
The file sometimes comes with 10 columns and sometimes with 8; given that, how do I insert the data into a Cassandra table?
Is there any way to load it using Scala or batch commands?
Also, how do I read a CSV file that has a header?

There are a number of options here really. You could code your own solution using one of the DataStax drivers, you could use the cqlsh COPY command, or you could use the DataStax Bulk Loader tool.
The fact that your source file changes format throws a bit of a curve ball at you here. Assuming you don't have any control over the files you have to load, in each case you'll need to create something that first parses the file or transforms it into a common format with the same number of columns.
For example, if you're using the shell you could count the columns using something like awk and then base your actions upon that. A simple example with bash to count the number of columns:
$ cat csv.ex1
apples,bananas,grapes,pineapples
$ cat csv.ex2
oranges,mangos,melons,pears,raspberries,strawberries,blueberries
$ cat csv.ex1 | awk -F "," '{print "num of cols: "NF}'
num of cols: 4
$ cat csv.ex2 | awk -F "," '{print "num of cols: "NF}'
num of cols: 7
Once you have this you should then be able to parse or transform your file accordingly and load it into Cassandra as you would with any other CSV file.
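For illustration, here is a rough sketch of the driver-based route using the DataStax Python driver (a Scala or Java driver version would follow the same pattern). The keyspace my_ks, the table my_table and the file name are placeholders, and it assumes the CSV header names columns that exist in the table:
import csv
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)   # the first line is read as the header
    columns = reader.fieldnames  # works whether the file has 8 or 10 columns
    query = "INSERT INTO my_table ({}) VALUES ({})".format(
        ", ".join(columns), ", ".join(["%s"] * len(columns))
    )
    for row in reader:
        session.execute(query, [row[c] for c in columns])

cluster.shutdown()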

Related

Restore a PostgreSQL 9.6 database from an old complete dump and/or an up-to-date base directory alone, plus other rescued files

I'm trying to restore/rescue a database from the pieces that I have:
I have all the recent files in PGDATA/base (/var/lib/postgresql/9.6/main/base/), but I do not have the complete /var/lib/postgresql/9.6/main/.
I have all the files from an old (and not very different) backup dump that I restored into a new installation of PostgreSQL 9.6.
I have a lot of files rescued from the hard drive (with ddrescue), including thousands of files without a name (they have a "#" followed by a number instead, and sit in the lost+found directory); so, for instance:
I have the pg_class file
I have the pg_clog directory with its 0000 file
Edit:
Probably I have the contents of pg_xlog, but I don't have the names of the files. I have 5 files of 16777216 bytes each:
#288294 (date 2019-04-01)
#288287 (date 2019-05-14)
#288293 (date 2019-07-02)
#261307 (date 2019-11-27)
#270185 (date 2020-01-28)
Also, my old dump is from 2019-04-23, so the first one could be from around the same time?
So my next step is going to be to try to read those files with pg_xlogdump, and/or to rename them following the WAL naming scheme (starting from 00000001000000000000000A, ordered by date, which is how I saw the system naming them) and put them into the new pg_xlog directory. Could that work? Also, I realized that the last one has the date of the day the hard drive crashed, so I do have the most recent one.
The PGDATA/base directory I rescued from the (damaged) hard drive contains directories 1, 12406, 12407 and 37972 with a lot of files inside. I checked with pg_filedump -fi that my up-to-date data is stored in the files in directory 37972.
The same (but older) data is stored in the files in directory PGDATA/base/16387 of the restored dump.
I tried to copy the files directly from one to the other, mixing the up-to-date data over the old database, but it doesn't work. After solving permission errors I can get into the "Frankenstein" database this way:
postgres#host:~$ postgres --single -P -D /var/lib/postgresql/9.6/main/ dbname
And I tried to do something, like reindex, and I get this error:
PostgreSQL stand-alone backend 9.6.16
backend> reindex system dbname;
ERROR: could not access status of transaction 136889
DETAIL: Could not read from file "pg_subtrans/0002" at offset 16384: Success.
CONTEXT: while checking uniqueness of tuple (1,7) in relation "pg_toast_2619"
STATEMENT: reindex system dbname;
Certainly the pg_subtrans/0002 file is part of the "Frankenstein" database and not the good one (I haven't found the original yet, at least not under that name), so I tried two things: first copying another file that looked similar, and then generating 8192 zeroes into that file with dd. In both cases I get the same error (and if the file doesn't exist at all I get DETAIL: Could not open file "pg_subtrans/0002": No such file or directory.). Anyway, I have no idea what should be in that file. Do you think I could get that data from another file? Or could I find the missing file using some tool? By the way, pg_filedump shows me nothing for the other file in that directory, pg_subtrans/0000.
Extra note: I found this useful blog post that talks about restoring from nothing but rescued files using pg_filedump, the pg_class file, reindex system and other tools, but it is hard for me to work out how to adapt it to my more concrete, and I think easier, problem (easier because I have a dump): https://www.commandprompt.com/blog/recovering_a_lost-and-found_database/
Finally we completely restored the database based on the PGDATA/base/37972 directory, in four parts:
Checking, and grepping with pg_filedump -fi, which file corresponds to each table. To make the grepping easier we wrote a script.
#!/bin/bash
# Run pg_filedump -fi on every file in the current directory and grep its
# output for the string passed as the first argument.
for filename in ./*; do
    echo "$filename"
    pg_filedump -fi "$filename" | grep "$1"
done
NOTE: Only useful with small strings.
Executing the great tool pg_filedump -D. -D is a newer option (available from postgresql-filedump version ≥ 10) that decodes tuples using a given comma-separated list of types.
As we know the types (because we designed the database), we "just" need to give a comma-separated list of the types of the table's columns. I wrote "just" because in some cases it can get a little complicated. One of our tables needed this kind of command:
pg_filedump -D text,text,text,text,text,text,text,text,timestamp,text,text,text,text,int,text,text,int,text,int,text,text,text,text,text,text,text,text,text,text,int,int,int,int,int,int,int,int,text,int,int 38246 | grep COPY > restored_table1.txt
From pg_filedump -D manual:
Supported types:
bigint
bigserial
bool
char
charN -- char(n)
date
float
float4
float8
int
json
macaddr
name
oid
real
serial
smallint
smallserial
text
time
timestamp
timetz
uuid
varchar
varcharN -- varchar(n)
xid
xml
~ -- ignores all attributes left in a tuple
All of those text columns were for us of type character varying(255), but varcharN didn't work for us, so after further tests we finally changed it to text.
The timestamp column was for us of type timestamp with time zone, but timetz didn't work for us, so after further tests we finally changed it to timestamp and accepted losing the time zone information.
These changes worked perfectly for this table.
Other tables were much easier:
pg_filedump -D int,date,int,text 38183 | grep COPY > restored_table2.txt
As we only get "raw" data, we have to re-format it into CSV. So we wrote a Python program to convert the pg_filedump -D output into CSV.
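Our actual program isn't included here, but a rough sketch of that conversion step could look like the following. It assumes the pg_filedump -D lines we kept start with "COPY:" and are tab-separated, and it writes the '|'-delimited CSV used in the next step; adjust the parsing to your own pg_filedump output:
import csv
import sys

# Usage: python pgfd_to_csv.py <pg_filedump_output> <out.csv>
with open(sys.argv[1]) as src, open(sys.argv[2], "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="|")
    for line in src:
        if line.startswith("COPY:"):
            # Strip the prefix and split the decoded tuple into fields.
            writer.writerow(line[len("COPY:"):].strip().split("\t"))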
We inserted each CSV into PostgreSQL (after creating each empty table again):
COPY scheme."table2"(id_comentari,id_observacio,text,data,id_usuari,text_old)
FROM '<path>/table2.csv' DELIMITER '|' CSV HEADER;
I hope this will help other people :)
That is doomed. Without the information in pg_xlog and (particularly) pg_clog, you cannot get the information back.
A knowledgeable forensics expert might be able to salvage some of your data, but it is not a straightforward process.

What data type does YCSB load into a database?

I am loading data into Cassandra through YCSB using this command:
bin/ycsb load cassandra-10 -p hosts="132.67.105.254" -P workloads/workloada > workloada_res.txt
I just want to know what sort of data is loaded by the above command. I mean, is it a single character or a string?
Have you tried to run the same command with the basic switch instead of the cassandra-10 one?
From the documentation:
The basic parameter tells the Client to use the dummy BasicDB layer. [...]
If you used BasicDB, you would see the insert statements for the database [in the terminal]. If you used a real DB interface layer, the records would be loaded into the database.
You will then notice that YCSB generates rowkeys like user###, with ### < recordcount. The columns are named field0 to fieldN with N = fieldcount - 1 and the content of each cell is a random string of fieldlength characters.
recordcount is given in workloada and equals 1000.
fieldcount defaults to 10 and fieldlength to 100, but you can override both in your workload file or using the -p switch.
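This is not YCSB's actual code, just an illustration of the shape of one loaded record under those defaults (fieldcount=10, fieldlength=100); the real values are random strings of that length:
import random
import string

fieldcount, fieldlength = 10, 100
key = "user42"  # row keys look like user0 .. user{recordcount-1}
record = {
    "field{}".format(i): "".join(
        random.choices(string.ascii_letters + string.digits, k=fieldlength)
    )
    for i in range(fieldcount)
}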

Copying untitled columns from tsv file to postgresql?

By tsv I mean a file delimited by tabs. I have a pretty large (6GB) data file that I have to import into a PostgreSQL database, and out of 56 columns, the first 8 are meaningful, then out of the other 48 there are several columns (like 7 or so) with 1's sparsely distributed with the rest being 0's. Is there a way to specify which columns in the file you want to copy into the table? If not, then I am fine with importing the whole file and just extracting the desired columns to use as data for my project, but I am concerned about allocating excessively large memory to a table in which less than 1/4 of the data is meaningful. Will this pose an issue, or will I be fine accommodating the meaningful columns into my table? I have considered using that table as a temp table and then importing the meaningful columns to another table, but I have been instructed to try to avoid doing an intermediary cleaning step, so I should be fine directly using the large table if it won't cause any problems in PostgreSQL.
With PostgreSQL 9.3 or newer, COPY accepts a program as input. This option is precisely meant for that kind of pre-processing. For instance, to keep only tab-separated fields 1 to 4 and 7 from a TSV file, you could run:
COPY destination_table FROM PROGRAM 'cut -f1-4,7 /path/to/file' (format csv, delimiter E'\t');
This also works with \copy in psql, in which case the program is executed client-side.
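If you are doing the same pre-processing from application code rather than psql, a comparable client-side sketch with psycopg2 might look like this; the connection string, file path and column choice (fields 1 to 4 and 7, mirroring the cut example) are placeholders:
import csv
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
buf = io.StringIO()

with open("/path/to/file", newline="") as src:
    for row in csv.reader(src, delimiter="\t"):
        kept = row[0:4] + [row[6]]  # keep fields 1-4 and 7 (1-based)
        buf.write("\t".join(kept) + "\n")

buf.seek(0)
with conn, conn.cursor() as cur:
    # COPY ... FROM STDIN defaults to text format, whose delimiter is a tab.
    cur.copy_expert("COPY destination_table FROM STDIN", buf)
conn.close()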

How to write csv output into one file with pyspark

I use this method to write a csv file, but it generates a directory with multiple part files. That is not what I want; I need it in one file. I also found another post using Scala to force everything to be calculated on one partition so that you get one file.
First question: how do I achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note:
when you use the coalesce function you will lose parallelism.
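If you also want the column names in that single output file, you can ask the writer for a header; a minimal sketch, assuming a DataFrame named df:
# Still writes a directory containing one part file; the header option only
# controls whether that part file starts with the column names.
df.coalesce(1).write.option("header", True).csv("result_csv")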
You can do this by using the cat command-line utility as shown below. This will concatenate all of the part files into one csv; there is no need to repartition down to 1 partition.
import os
# Spark writes a directory of part files; concatenate them into a single csv.
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor, which means the RDD partitions spread across executors will be shuffled to a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting csv file.
First we can keep a utility function to make the data CSV compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The RDD saveAsPickleFile method can be used to serialize the data being saved in order to save space; use sc.pickleFile to read the pickled file back.
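For reference, a minimal sketch of that round trip (the path is a placeholder):
# Serialize the RDD compactly, then read it back with the SparkContext.
unionHeaderRDD.saveAsPickleFile("MyPickledLocation")
restored = sc.pickleFile("MyPickledLocation")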
I needed my csv output in a single file, with headers, saved to an S3 bucket with the filename I provided. The current accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step produced just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)

How should I open a PostgreSQL dump file and add actual data to it?

I have a pretty basic database. I need to drop a good-sized users list into the db. I have the dump file; I need to convert it to a .pg file and then somehow load this data into it.
The data I need to add are in CSV format.
I assume you already have a .pg file, which I take to be a database dump in the "custom" format.
PostgreSQL can load data in CSV format using the COPY statement, so the absolute simplest thing to do is just add your data to the database that way.
If you really must edit your dump, and the file is in the "custom" format, there is unfortunately no way to edit it manually. However, you can use pg_restore to create a plain SQL script from the custom-format dump and edit that instead: pg_restore with no -d argument generates an SQL script.
As suggested by Daniel, the simplest solution is to keep your data in CSV format and just import it into Postgres as is.
If you're trying to merge this CSV data into a third-party Postgres dump file, then you'll need to first convert the data into SQL INSERT statements.
One possible unix solution:
awk -F, '{printf "INSERT INTO my_tab VALUES (\047%s\047, \047%s\047, \047%s\047);\n", $1, $2, $3}' data.csv
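If the data may contain commas inside quoted fields, a CSV-aware approach is safer. A rough Python equivalent (my_tab and the three-column layout are just the placeholders from the example above):
import csv

# csv.reader copes with quoted fields; single quotes are doubled so text
# values survive as SQL string literals.
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        values = ", ".join("'{}'".format(v.replace("'", "''")) for v in row[:3])
        print("INSERT INTO my_tab VALUES ({});".format(values))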