I'm able to write pytest functions by manually supplying column names and values to create a DataFrame and passing it to the production code to check all the transformed field values in a Palantir Foundry code repository.
Instead of manually passing column names and their respective values, I want to store all the required data in a dataset, import that dataset into the pytest function, fetch all the required values, and pass them to the production code to check the transformed field values.
Is there any way to accept a dataset as an input to a test function in a Palantir code repository?
You can probably do something like this:
Let's say you have your CSV inside a fixtures/ folder next to your test:
test_yourtest.py
fixtures/yourfilename.csv
You can just read it directly and use it to create a new DataFrame. I didn't test this code, but it should look something like this:
import os
from pathlib import Path

def load_file(filename="yourfilename.csv"):
    # Resolve the fixture path relative to this test file
    file_path = os.path.join(Path(__file__).parent, "fixtures", filename)
    with open(file_path) as f:
        return f.read()
Now that you can load your CSV, it's just a matter of reading it into a DataFrame and passing it to the PySpark logic you want to test. See: Get CSV to Spark dataframe
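For example, a rough, untested sketch (my_transform, its module path, and the spark_session fixture name are placeholders; adapt them to your own repository):

import os
from pathlib import Path

# placeholder import for the production transform logic under test
from myproject.datasets.transforms import my_transform

def test_my_transform(spark_session):
    # Read the fixture CSV into a Spark DataFrame
    file_path = os.path.join(Path(__file__).parent, "fixtures", "yourfilename.csv")
    input_df = spark_session.read.csv(file_path, header=True, inferSchema=True)

    # Run the transform and check the transformed fields
    result_df = my_transform(input_df)
    assert result_df.count() > 0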
I have a massive delimited file and many normalized tables to input the data. Is there a best practice for bringing in the data and inserting the data into its proper fields and tables?
For instance, right now I've created a temp table that holds all the arbitrary data. Some logic runs against each row to determine what values will go into which table. Without going into too many specifics, the part that concerns me looks something like:
INSERT INTO table VALUES (
(SELECT TOP 1 field1 FROM #tmpTable),
(SELECT TOP 1 field30 FROM #tmpTable),
(SELECT TOP 1 field2 FROM #tmpTable),
...
(SELECT TOP 1 field4 FROM #tmpTable))
With that, my questions are: Is it reasonable to use a temp table for this purpose? And is it poor practice to use these SELECT statements so liberally? It feels sort of hacky; is there a better way to handle mass data importing and separation like this?
You should try SSIS.
SSIS How to Create an ETL Package
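If you stay in plain T-SQL instead, a single set-based INSERT ... SELECT can replace the per-column scalar subqueries from the question; a rough sketch with made-up column and table names:

-- Map the staging columns to the target columns in one pass,
-- inserting one row per staged row instead of running
-- one SELECT TOP 1 per column.
INSERT INTO targetTable (col1, col30, col2, col4)
SELECT field1, field30, field2, field4
FROM #tmpTable;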
I am a bit disappointed with Slick and its TableQuerys: the model of an application can be, for example, a class Persons(tag: Tag) extends Table[Person] (where Person is a case class with some fields like name, age, address...).
The weird point is that val persons = TableQuery[Persons] contains all the records.
To have for example all the adults, we can use:
val adults = persons.filter(p => p.age >= 18).list()
Is the content of the database loaded in the variable persons?
Is there, on the contrary, a mechanism that allows evaluating not persons but adults (a sort of lazy variable)?
Can we say something like 'at any time, "persons" contains the entire database'?
Are there good practices, some important ideas that can help the developer?
Thanks.
You are mistaken in your assumption that persons contains all of the records. The Table and TableQuery classes are representations of a SQL table, and the whole point of the library is to ease interaction with SQL databases by providing a convenient, Scala-like syntax.
When you say
val adults = persons.filter{ p => p.age >= 18 }
You've essentially created a SQL query that you can think of as
SELECT * FROM PERSONS WHERE AGE >= 18
Then when you call .list(), it executes that query, transforming the result rows from the database back into instances of your Person case class. Most of the methods on Slick's Table and Query classes are focused on generating queries (i.e. "select" statements). They don't actually load any data until you invoke them (e.g. by calling .list() or .foreach).
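A minimal sketch of that distinction (untested, using the same Slick 2.x-era API as the question; db is your Database instance):

val adults = persons.filter(p => p.age >= 18)   // just a Query object; no SQL has run yet
val names  = adults.map(_.name)                 // still a Query; composition stays lazy

db.withSession { implicit session =>
  val result: List[Person] = adults.list        // the SQL executes here, rows become Person instances
}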
As for good practices and important ideas, I'd suggest you read through their documentation, as well as take a look at the scaladocs for any of the classes you are curious about.
http://slick.typesafe.com/docs/
As the question title indicates: what actually is a schema in PostgreSQL, which I see at the top level of the hierarchy in pgAdmin (III)?
OK, I'm answering my own question to help other people (who don't have time to read the docs or want a more simplified version):
You can think of a schema as a namespace/package (just like in Java or C++). For example, let us assume mydb is the name of our database, and A and B are the names of two different schemas that exist in that same database (mydb).
Now, we can use the same table name in two different schemas in the same single database:
mydb -> A -> myTable
mydb -> B -> myTable
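In SQL terms (a minimal sketch; the schema and table names are just examples):

CREATE SCHEMA a;
CREATE SCHEMA b;

-- the same table name can exist once per schema
CREATE TABLE a.mytable (id integer);
CREATE TABLE b.mytable (id integer);

SELECT * FROM a.mytable;  -- schema-qualified access picks the right one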
Hope that clarifies it. For more detail: PostgreSQL 9.3.1 Documentation - 5.7. Schemas
I am looking for a document DB supporting Windows XP 32 bits, satisfying the following requirements:
The support must not be discontinued, i.e. I want to be able to install the most recent version of the DB. MongoDB does not fit, since they dropped support for XP, and CouchDB does not fit, since they dropped support for 32-bit Windows entirely.
It should be relatively simple. Obviously, the application is not an enterprise one, so a complex DB like Cassandra is out. In fact, I would like to avoid column databases, since I think they exist to solve enterprise-level problems, which is not the case here. On the other hand, I do not want a relational DB, because I want to avoid schema changes each time new fields are added (and they will be added).
It should support indexing on part of the document, like MongoDB. I could use a relational DB like HSQLDB and store the data as a JSON string. That makes adding new fields easy - no schema needs to change - but those fields would not be indexable by the database, again unlike MongoDB.
Finally, the DB will run on the same machine as the application itself - one more strike against MongoDB, which would steal all the RAM from the application for itself.
So, in a sense, I am looking for something like MongoDB, but with support for Windows XP 32 bits.
Any advice?
P.S.
I know that Windows XP has one year to live before MS drops support for it. However, I have to support XP anyway.
With HSQLDB and some other relational databases, you can store the document as a CLOB. The CLOB can be accessed via a single table which contains the index entries for all the indexed fields. For example:
CREATE TABLE DATAINDEX(
  DOCID BIGINT GENERATED BY DEFAULT AS IDENTITY,
  FIELDNAME VARCHAR(128),
  FIELD VARCHAR(10000),
  DOCUMENT CLOB,
  PRIMARY KEY (DOCID, FIELDNAME))

CREATE INDEX IDS ON DATAINDEX(FIELDNAME, FIELD);
The whole document is the CLOB. A copy of each selected field that needs to be searchable is stored in the (FIELDNAME, FIELD) columns. Rows with the same DOCID have the same CLOB in the DOCUMENT column. One row is inserted with the first field and the CLOB; it is then duplicated by selecting and re-inserting the existing DOCID and CLOB with the second field, and so on.
-- use this to insert the CLOB with the first field
INSERT INTO DATAINDEX VALUES (DEFAULT, 'f1', 'fieldvalue 1', ?)

-- use this to insert the second, third and other fields
INSERT INTO DATAINDEX VALUES (
  IDENTITY(), 'f2', 'fieldvalue 2',
  (SELECT DOCUMENT FROM DATAINDEX WHERE DOCID = IDENTITY() LIMIT 1))
The above is just one example. You can create your own DOCID. The principle is to use the same DOCID and to insert the first row with the CLOB. The second and third rows select the DOCID and the CLOB from the previously inserted row to create new rows with the other fields. You will probably use JDBC parameters to insert into the FIELDNAME and FIELD columns.
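For instance, a rough JDBC sketch for the first insert (untested; assumes an open java.sql.Connection named conn and the document text in a String named json):

PreparedStatement ps = conn.prepareStatement(
    "INSERT INTO DATAINDEX VALUES (DEFAULT, ?, ?, ?)");
ps.setString(1, "f1");            // FIELDNAME
ps.setString(2, "fieldvalue 1");  // FIELD, the searchable copy of the value
ps.setString(3, json);            // DOCUMENT; setString is commonly accepted for CLOB columns
ps.executeUpdate();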
This allows you to perform searches such as:
SELECT DOCID, DOCUMENT FROM DATAINDEX
WHERE FIELDNAME = 'COMPANY NAME' AND FIELD LIKE 'Corp%'
This may not satisfy all your requirements, but the answer is intended to cover what is possible with HSQLDB.
Which programming framework are you using? If .NET is a possibility, you can try RavenDB. It can be used as both an embedded and a standalone database.
For Java you can try out OrientDB. It is also embeddable: https://github.com/nuvolabase/orientdb/wiki/Embedded-Server