I have a strict requirement to save null values to MongoDB. I am aware that storing nulls is generally discouraged in NoSQL stores, but my business requirement calls for it.
A sample CSV file with a null value:
a,b,c,id
,2,3,A
4,4,4,B
Code to save the CSV to MongoDB:
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("a", DataTypes.IntegerType, true),
    DataTypes.createStructField("b", DataTypes.IntegerType, true),
    DataTypes.createStructField("c", DataTypes.IntegerType, true),
    DataTypes.createStructField("id", DataTypes.StringType, true),
});

Dataset<Row> g = spark.read()
    .format("csv")
    .schema(schema)
    .option("header", "true")
    .option("inferSchema", "false")
    .load("/home/Documents/SparkLogs/a.csv");

MongoSpark.save(g
    .write()
    .option("database", "A")
    .option("collection", "b")
    .mode("overwrite")
);
MongoDB output:
{
"_id" : ObjectId("5d663b6bec20c94c990e6d0c"),
"a" : 4,
"b" : 4,
"c" : 4,
"id" : "B"
}
/* 2 */
{
"_id" : ObjectId("5d663b6bec20c94c990e6d0d"),
"b" : 2,
"c" : 3,
"id" : "A"
}
My requirement is to have the 'a' field present with a null value.
Saving a Dataset with MongoSpark ignores null-valued keys by default, so my workaround is to convert the Dataset to a JavaPairRDD of BSONObject values.
Code
/** imports ***/
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.MongoOutputFormat;

import scala.Tuple2;
/** imports ***/
private static void saveToMongoDB_With_Null(Dataset<Row> ds, Configuration outputConfig, String[] cols) {
    JavaPairRDD<Object, BSONObject> document = ds
        .toJavaRDD()
        .mapToPair(f -> {
            BSONObject doc = new BasicBSONObject();
            // getAs returns null for null cells, and BasicBSONObject keeps the key
            for (String p : cols) {
                doc.put(p, f.getAs(p));
            }
            // generate a string UUID as _id (matches the output shown below)
            doc.put("_id", UUID.randomUUID().toString());
            return new Tuple2<Object, BSONObject>(null, doc);
        });
    document.saveAsNewAPIHadoopFile(
        "file:///this-is-completely-unused",
        Object.class,
        BSONObject.class,
        MongoOutputFormat.class,
        outputConfig);
}
Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri",
"mongodb://192.168.0.19:27017/database.collection");
outputConfig.set("mongo.output.format",
"com.mongodb.hadoop.MongoOutputFormat");
Dataset<Row> g = spark.read()
.format("csv")
.schema(schema)
.option("header", "true")
.option("inferSchema","false")
.load("/home/Documents/SparkLogs/a.csv");
saveToMongoDB_With_Null(g, outputConfig,g.columns());
Required Maven dependency:
<!-- https://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-core -->
<dependency>
<groupId>org.mongodb.mongo-hadoop</groupId>
<artifactId>mongo-hadoop-core</artifactId>
<version>2.0.2</version>
</dependency>
MongoDB output after the workaround:
{
"_id" : "a62e9b02-da97-493b-9563-fc19054df60e",
"a" : null,
"b" : 2,
"c" : 3,
"id" : "A"
}
{
"_id" : "fed373a8-e671-44a4-8b85-7c7e2ff59585",
"a" : 4,
"b" : 4,
"c" : 4,
"id" : "B"
}
Downsides
Pulling the data out of the high-level Dataset API into low-level RDDs loses Spark's ability to optimize the query plan, so the trade-off is performance.
Related
I have the following piece of code that I use to read the incoming JSON which looks like this:
{
"messageTypeId": 2,
"messageId": "19223201",
"BootNotification" :
{
"reason": "PowerUp",
"chargingStation": {
"serialNumber" : "12345",
"model" : "",
"vendorName" : "",
"firmwareVersion" : "",
"modem": {
"iccid": "",
"imsi": ""
}
}
}
}
I have the following reads using play-json that would process this JSON:
implicit val ocppCallRequestReads: Reads[OCPPCallRequest] = Reads { jsValue =>
val messageTypeId = (jsValue \ 0).toOption
val messageId = (jsValue \ 1).toOption
val actionName = (jsValue \ 2).toOption
val payload = (jsValue \ 3).toOption
(messageTypeId.zip(messageId.zip(actionName.zip(payload)))) match {
case Some(_) => JsSuccess(
OCPPCallRequest( // Here I know all 4 exists, so safe to call head
messageTypeId.head.as[Int],
messageId.head.as[String],
actionName.head.as[String],
payload.head
)
)
case None => JsError( // Here, I know that I have to send a CallError back!
"ERROR OCCURRED" // TODO: Work on this!
)
}
}
It is not playing nicely when it comes to delivering an exact error message for the None case. It is all or nothing: in the None block I would like to avoid inspecting each Option individually just to populate the corresponding error message. Any ideas on how I could make this more functional?
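A minimal sketch of one more functional approach (it assumes, like the reads above, that the incoming frame is a JSON array and that OCPPCallRequest takes an Int, two Strings and a JsValue): JsResult is monadic, so chaining validate calls in a for-comprehension short-circuits on the first failing lookup and surfaces its error instead of collapsing everything into a single None.
implicit val ocppCallRequestReads: Reads[OCPPCallRequest] = Reads { jsValue =>
  for {
    // each validate returns a JsResult; the first failure short-circuits
    // and its JsError becomes the overall result
    messageTypeId <- (jsValue \ 0).validate[Int]
    messageId     <- (jsValue \ 1).validate[String]
    actionName    <- (jsValue \ 2).validate[String]
    payload       <- (jsValue \ 3).validate[JsValue]
  } yield OCPPCallRequest(messageTypeId, messageId, actionName, payload)
}
If all failures need to be reported at once rather than just the first, play-json's applicative combinators are the usual next step, but the for-comprehension already answers which lookup failed and why.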
So far I have implemented the logic below in Scala for this:
val hadoopConf = new Configuration(sc.hadoopConfiguration);
//hadoopConf.set("textinputformat.record.delimiter", "2016-")
hadoopConf.set("textinputformat.record.delimiter", "^([0-9]{4}.*)")
val accessLogs = sc.newAPIHadoopFile("/user/root/sample.log", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf).map(x=>x._2.toString)
I want to use a regex so that if a line starts with a date it is treated as a new record; otherwise the line is appended to the previous record. But this is not working. If I pass the date prefix manually, as in the commented-out line below, it works fine:
//hadoopConf.set("textinputformat.record.delimiter", "2016-")
Please help with this. Thanks in advance.
Below is the sample log format:
2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx : - Outbound Message
---------------------------
ID: 1978
Address: https://sample.domain.com/SampleService.xxx/basic
Encoding: UTF-8
Content-Type: text/xml
Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
Payload: <soap:Envelope>
</soap:Envelope>
2016-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US
2016-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
I couldn't get it working with a regex: textinputformat.record.delimiter is treated as a literal string, not a regular expression. The best I could do was hadoopConf.set("textinputformat.record.delimiter", "\n20"), which may work for you if you don't have those characters in the middle of a log entry. This approach will also give you some future-proofing, supporting dates up to 2099.
If you need a regex, you could try http://dronamk.blogspot.co.uk/2013/03/regex-custom-input-format-for-hadoop.html
My code:
// Create some dummy data
val s = """2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx : - Outbound Message
|---------------------------
| ID: 1978
| Address: https://sample.domain.com/SampleService.xxx/basic
| Encoding: UTF-8
| Content-Type: text/xml
| Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
| Payload: <soap:Envelope>
| </soap:Envelope>
|2016-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US
|2016-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
""".stripMargin
import java.io._
val pw = new PrintWriter(new File("log.txt"))
pw.write(s)
pw.close
// Now process the data
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkContext, SparkConf}
val conf = sc.getConf
sc.stop()
conf.registerKryoClasses(Array(classOf[org.apache.hadoop.io.LongWritable]))
val sc = new SparkContext(conf)
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n20")
val accessLogs = sc.newAPIHadoopFile("log.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
accessLogs.map(x => x._2.toString).zipWithIndex().collect().foreach(println)
Note that I'm using zipWithIndex just for debugging purposes. The output is:
(2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx : - Outbound Message
---------------------------
ID: 1978
Address: https://sample.domain.com/SampleService.xxx/basic
Encoding: UTF-8
Content-Type: text/xml
Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
Payload:
,0)
(16-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US,1)
(16-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
,2)
Note the index is the second field in the output.
I ran this code on an IBM Data Science Experience notebook running Scala 2.10 and Spark 1.6.
I would like to merge a local file, /opt/one.txt, with a file on my HDFS, hdfs://localhost:54310/dummy/two.txt.
Contents of one.txt: f,g,h
Contents of two.txt: 2424244r
My code:
val cfg = new Configuration()
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"))
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"))
try {
  val srcPath = "/opt/one.txt"
  val dstPath = "/dumCBF/two.txt"
  val srcFS = FileSystem.get(URI.create(srcPath), cfg)
  val dstFS = FileSystem.get(URI.create(dstPath), cfg)
  FileUtil.copyMerge(srcFS,
    new Path(srcPath),
    dstFS,
    new Path(dstPath),
    true,
    cfg,
    null)
  println("end proses")
} catch {
  case m: Exception => m.printStackTrace()
  case k: Throwable => k.printStackTrace()
}
I was following this tutorial: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
It is not working at all; the error is below:
java.io.FileNotFoundException: File does not exist: /opt/one.txt
I don't know why it reports that; the file one.txt does exist.
I then added some code to check whether the file exists:
if(new File(srcPath).exists()) println("file is exist")
Any ideas or references? Thanks!
EDIT 1,2 : typo extensions
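A likely explanation (a guess, not confirmed in the question): srcPath has no scheme, so FileSystem.get(URI.create("/opt/one.txt"), cfg) resolves against the default filesystem from core-site.xml, i.e. HDFS, and copyMerge then looks for /opt/one.txt on HDFS rather than on the local disk. A minimal sketch of the adjustment, resolving the source against the local filesystem explicitly and checking existence through the same Hadoop FileSystem that copyMerge will use (it uses the /dummy/two.txt path from the question text; the code above used /dumCBF/two.txt):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val cfg = new Configuration()
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"))
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"))

val srcPath = "/opt/one.txt"
val dstPath = "/dummy/two.txt"

// the source lives on the local disk, so resolve it against the local filesystem
val srcFS = FileSystem.getLocal(cfg)
// the destination resolves against the default filesystem (HDFS from core-site.xml)
val dstFS = FileSystem.get(cfg)

// check existence through the Hadoop API instead of java.io.File
println(s"source exists: ${srcFS.exists(new Path(srcPath))}")

// note: in many Hadoop versions copyMerge expects the source to be a directory
// of part files, so pointing it at a single file may simply return false
FileUtil.copyMerge(srcFS, new Path(srcPath), dstFS, new Path(dstPath), true, cfg, null)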
Writing to MongoDB via the Java driver, I am getting the following error:
com.mongodb.WriteConcernException: { "serverUsed" : "127.0.0.1:27017" , "err" : "_a != -1" , "n" : 0 , "connectionId" : 3 , "ok" : 1.0}
at com.mongodb.CommandResult.getWriteException(CommandResult.java:90)
at com.mongodb.CommandResult.getException(CommandResult.java:79)
at com.mongodb.CommandResult.throwOnError(CommandResult.java:131)
at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:135)
at com.mongodb.DBTCPConnector.access$000(DBTCPConnector.java:39)
at com.mongodb.DBTCPConnector$1.execute(DBTCPConnector.java:186)
at com.mongodb.DBTCPConnector$1.execute(DBTCPConnector.java:181)
at com.mongodb.DBTCPConnector.doOperation(DBTCPConnector.java:210)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:181)
at com.mongodb.DBCollectionImpl.insertWithWriteProtocol(DBCollectionImpl.java:528)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:193)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:165)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:147)
at com.mongodb.DBCollection$insert.call(Unknown Source)
Can't find any reference in docs to "err" : "_a != -1". Any thoughts?
EDIT:
Adding the code I used (not all of it, as it relies on other libraries to parse files):
MongoClient mongoClient = new MongoClient()
mongoClient.setWriteConcern(WriteConcern.SAFE)
DB db = mongoClient.getDB("vcf")
List<DBObject> documents = new ArrayList<DBObject>()
DBCollection recordsColl = db.getCollection("records")
//loop through file
BasicDBObject mongoRecord = new BasicDBObject()
//add data to mongoRecord
documents.add(mongoRecord)
//end loop
recordsColl.insert(documents)
mongoClient.close()
I am using Spring Data MongoDB to interact with my MongoDB setup. I was testing different write concerns and noticed that with the Unacknowledged write concern, the time to update 1000 documents was around 5-6 seconds, even though the Unacknowledged write concern does not wait for any acknowledgement.
I tested the same with the raw Java driver and the time was around 40 ms.
What could be the cause of this huge time difference between the raw Java driver and the Spring Data MongoDB update?
Note that I am using the Unacknowledged write concern and MongoDB v2.6.1 with default configurations.
Adding Code Used for comparison:-
Raw Java driver code:-
MongoClient mongoClient = new MongoClient("localhost", 27017);
DB db = mongoClient.getDB( "testdb" );
DBCollection collection = db.getCollection("product");
WriteResult wr = null;
try {
long start = System.currentTimeMillis();
wr = collection.update(
new BasicDBObject("productId", new BasicDBObject("$gte", 10000000)
.append("$lt", 10001000)),
new BasicDBObject("$inc", new BasicDBObject("price", 100)),
false, true, WriteConcern.UNACKNOWLEDGED);
long end = System.currentTimeMillis();
System.out.println(wr + " Time taken: " + (end - start) + " ms.");
}
Spring Code:-
Config.xml
<mongo:mongo host="localhost" port="27017" />
<mongo:db-factory dbname="testdb" mongo-ref="mongo" />
<bean id="Unacknowledged" class="com.mongodb.WriteConcern">
<constructor-arg name="w" type="int" value="0"/>
</bean>
<bean id="mongoTemplate" class="org.springframework.data.mongodb.core.MongoTemplate">
<constructor-arg name="mongoDbFactory" ref="mongoDbFactory" />
<property name="writeConcern" ref="Unacknowledged"/>
</bean>
Java code for the update function, which is part of ProductDAOImpl:-
public int update(long fromProductId, long toProductId, double changeInPrice)
{
Query query = new Query(new Criteria().andOperator(
Criteria.where("productId").gte(fromProductId),
Criteria.where("productId").lt(toProductId)));
Update update = new Update().inc("price", changeInPrice);
WriteResult writeResult =
mongoTemplate.updateMulti(query, update, Product.class);
return writeResult.getN();
}
Accessing code:-
ProductDAOImpl productDAO = new ProductDAOImpl();
productDAO.setMongoTemplate(mongoTemplate);
long start = System.currentTimeMillis();
productDAO.update(10000000, 10001000, 100);
long end = System.currentTimeMillis();
System.out.println("Time taken = " + (end - start) + " ms.");
Schema:-
{
"_id" : ObjectId("53b64d000cf273a0d95a1a3d"),
"_class" : "springmongo.domain.Product",
"productId" : NumberLong(6),
"productName" : "product6",
"manufacturer" : "company30605739",
"supplier" : "supplier605739",
"category" : "category30605739",
"mfgDate" : ISODate("1968-04-26T05:00:00.881Z"),
"price" : 665689.7224373372,
"tags" : [
"tag82",
"tag61",
"tag17"
],
"reviews" : [
{
"name" : "name528965",
"rating" : 6.5
},
{
"name" : "name818975",
"rating" : 7.5
},
{
"name" : "name436239",
"rating" : 3.9
}
],
"manufacturerAdd" : {
"state" : "state55",
"country" : "country155",
"zipcode" : 718
},
"supplierAdd" : {
"state" : "state69",
"country" : "country69",
"zipcode" : 691986
}
}
Hope it helps.