I'm coding something which will read PDFs online and return a set of keywords that are found in the document. However I keep running into a problem with the extractText() function from the PyPDF2 package.
Here's my code to open the PDFs and read it:
x = myurl.pdf
if ".pdf" in x:
remoteFile = urlopen(Request(x, headers={"User-Agent": "Magic-Browser"})).read()
memoryFile = StringIO(remoteFile)
pdfFile = PyPDF2.PdfFileReader(memoryFile, strict=False)
num_pages = pdfFile.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfFile.getPage(count)
count += 1
text += pageObj.extractText()
The error that I keep running into on the extractText() line goes like this:
Traceback (most recent call last):
File "errortest.py", line 30, in <module>
text += pageObj.extractText()
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2595, in extractText
content = ContentStream(content, self.pdf)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2674, in __init__
self.__parseContentStream(stream)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2706, in __parseContentStream
operands.append(readObject(stream, None))
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 98, in readObject
return NumberObject.readFromStream(stream)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 271, in readFromStream
return FloatObject(num)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 231, in __new__
return decimal.Decimal.__new__(cls, str(value))
File "/anaconda2/lib/python2.7/decimal.py", line 547, in __new__
"Invalid literal for Decimal: %r" % value)
File "/anaconda2/lib/python2.7/decimal.py", line 3872, in _raise_error
raise error(explanation)
decimal.InvalidOperation: Invalid literal for Decimal: '99.-72'
Would be great if someone could help me out! Thanks!
There is too little information to be certain, but PyPDF2 (and now pypdf) improved a lot in 2022. You will probably just need to upgrade to the latest version of pypdf.
If you encounter a bug in pypdf again, please open an issue: https://github.com/py-pdf/pypdf
A good bug ticket contains (1) your pypdf version (2) the code + PDF document that caused the issue.
My Problem:
I want to use functions of opencv like the MIL-Tracker or MedianFlow-Tracker in Matlab (these functions are not in mexopencv). But I don't know how or understand how to do this. The documentation of opencv/mexopencv doesn't help me. This doesn't help: how do OpenCV shared libraries in matlab? - because the link in the answer is down.
So is there a way to use these functions in Matlab? And if- How?
Why?: As a part of my bachelor thesis I have to compare different already implemented ways to track people.
If you would like to use these functions specifically in MATLAB you could always write your own MEX file in C/C++ and send the data back/forward between the two calls, however this would require some basic C++ knowledge and understanding creating MEX files.
Personally I would definately recommend trying this with Python and the OpenCV Python interface since its so widely used and more supported than using the calls in MATLAB (plus its always a useful skill to be able to switch between Python and MATLAB as and when needed).
There is a full example with the MIL-Tracker and the MedianFlow-Tracker (and others) here (Which demonstrates using them in C++ and Python!).
Python Example :
import cv2
import sys
(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')
if __name__ == '__main__' :
# Set up tracker.
# Instead of MIL, you can also use
tracker_types = ['BOOSTING', 'MIL','KCF', 'TLD', 'MEDIANFLOW', 'GOTURN']
tracker_type = tracker_types[2]
if int(minor_ver) < 3:
tracker = cv2.Tracker_create(tracker_type)
else:
if tracker_type == 'BOOSTING':
tracker = cv2.TrackerBoosting_create()
if tracker_type == 'MIL':
tracker = cv2.TrackerMIL_create()
if tracker_type == 'KCF':
tracker = cv2.TrackerKCF_create()
if tracker_type == 'TLD':
tracker = cv2.TrackerTLD_create()
if tracker_type == 'MEDIANFLOW':
tracker = cv2.TrackerMedianFlow_create()
if tracker_type == 'GOTURN':
tracker = cv2.TrackerGOTURN_create()
# Read video
video = cv2.VideoCapture("videos/chaplin.mp4")
# Exit if video not opened.
if not video.isOpened():
print "Could not open video"
sys.exit()
# Read first frame.
ok, frame = video.read()
if not ok:
print 'Cannot read video file'
sys.exit()
# Define an initial bounding box
bbox = (287, 23, 86, 320)
# Uncomment the line below to select a different bounding box
bbox = cv2.selectROI(frame, False)
# Initialize tracker with first frame and bounding box
ok = tracker.init(frame, bbox)
while True:
# Read a new frame
ok, frame = video.read()
if not ok:
break
# Start timer
timer = cv2.getTickCount()
# Update tracker
ok, bbox = tracker.update(frame)
# Calculate Frames per second (FPS)
fps = cv2.getTickFrequency() / (cv2.getTickCount() - timer);
# Draw bounding box
if ok:
# Tracking success
p1 = (int(bbox[0]), int(bbox[1]))
p2 = (int(bbox[0] + bbox[2]), int(bbox[1] + bbox[3]))
cv2.rectangle(frame, p1, p2, (255,0,0), 2, 1)
else :
# Tracking failure
cv2.putText(frame, "Tracking failure detected", (100,80), cv2.FONT_HERSHEY_SIMPLEX, 0.75,(0,0,255),2)
# Display tracker type on frame
cv2.putText(frame, tracker_type + " Tracker", (100,20), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (50,170,50),2);
# Display FPS on frame
cv2.putText(frame, "FPS : " + str(int(fps)), (100,50), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (50,170,50), 2);
# Display result
cv2.imshow("Tracking", frame)
# Exit if ESC pressed
k = cv2.waitKey(1) & 0xff
if k == 27 : break
I would definately try it using Python (if this is an option). Otherwise if MATLAB is a must then probably try implementing the C++ example code shown in the link before as a MEX file and linking openCV during the compilation i.e.
mex trackerMexOpenCV.cpp 'true filepath location to openCV lib'
I hope this helps!
I seem to get this error when I am using the callback function modelcheckpoint..
I read from a github issue that the solution would be make use of model.get_weight, but I am implicitly only storing that since i am only storing the one with best weight.
Keras only seem to save weights using h5, which make me question is there any other way to do store them using the eras API, if so how? If not, how do i store it?
Made an example to recreate the problem:
#!/usr/bin/python
import glob, os
import sys
from os import listdir
from os.path import isfile, join
import numpy as np
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from keras.utils import np_utils
from keras import metrics
import keras
from keras import backend as K
from keras.models import Sequential
from keras.optimizers import SGD, Adam
from keras.layers.core import Dense, Activation, Lambda, Reshape,Flatten
from keras.layers import Conv1D,Conv2D,MaxPooling2D, MaxPooling1D, Reshape
#from keras.utils.visualize_util import plot
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.merge import Concatenate, Add
import h5py
import random
import tensorflow as tf
import math
from keras.callbacks import CSVLogger
from keras.callbacks import ModelCheckpoint
if len(sys.argv) < 5:
print "Missing Arguments!"
print "python keras_convolutional_feature_extraction.py <workspace> <totale_frames> <fbank-dim> <window-height> <batch_size>"
print "Example:"
print "python keras_convolutional_feature_extraction.py deltas 15 40 5 100"
sys.exit()
total_frames = int(sys.argv[2])
total_frames_with_deltas = total_frames*3
dim = int(sys.argv[3])
window_height = int(sys.argv[4])
inserted_batch_size = int(sys.argv[5])
stride = 1
splits = ((dim - window_height)+1)/stride
#input_train_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_input"
#output_train_data ="/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_output"
#input_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_input"
#output_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_output"
#train_files =[f for f in listdir(input_train_data) if isfile(join(input_train_data, f))]
#test_files =[f for f in listdir(input_test_data) if isfile(join(input_test_data, f))]
#print len(train_files)
np.random.seed(100)
print "hallo"
def train_generator():
while True:
# input = random.choice(train_files)
# h5f = h5py.File(input_train_data+'/'+input, 'r')
# train_input = h5f['train_input'][:]
# train_output = h5f['train_output'][:]
# h5f.close()
train_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
train_list_list = []
train_input = train_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
train_input_list = np.split(train_input,splits*total_frames_with_deltas,axis=1)
for i in range(len(train_input_list)):
train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,window_height,3)
#for i in range(len(train_input_list)):
# train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
train_output = np.random.randint(5, size = (1,total_frames,5))
middle = int(math.ceil(total_frames/2))
train_output = train_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
#print train_output.shape
#print len(train_input_list)
#print train_input_list[0].shape
yield (train_input_list, train_output)
print "hallo"
def test_generator():
while True:
# input = random.choice(test_files)
# h5f = h5py.File(input_test_data+'/'+input, 'r')
# test_input = h5f['test_input'][:]
# test_output = h5f['test_output'][:]
# h5f.close()
test_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
test_input = test_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
test_input_list = np.split(test_input,splits*total_frames_with_deltas,axis=1)
#test_input_list = np.split(test_input,45,axis=3)
for i in range(len(test_input_list)):
test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,window_height,3)
#for i in range(len(test_input_list)):
# test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
test_output = np.random.randint(5, size = (1,total_frames,5))
middle = int(math.ceil(total_frames/2))
test_output = test_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
yield (test_input_list, test_output)
print "hallo"
def fws():
#print "Inside"
# Params:
# batch , lr, decay , momentum, epochs
#
#Input shape: (batch_size,40,45,3)
#output shape: (1,15,50)
# number of unit in conv_feature_map = splitd
next(train_generator())
model_output = []
list_of_input = [Input(shape=(8,3)) for i in range(splits*total_frames_with_deltas)]
output = []
#Conv
skip = total_frames_with_deltas
for steps in range(total_frames_with_deltas):
conv = Conv1D(filters = 100, kernel_size = 8)
column = 0
for _ in range(splits):
#print "column " + str(column) + "steps: " + str(steps)
output.append(conv(list_of_input[(column*skip)+steps]))
column = column + 1
#print len(output)
#print splits*total_frames_with_deltas
conv = []
for section in range(splits):
column = 0
skip = splits
temp = []
for _ in range(total_frames_with_deltas):
temp.append(output[((column*skip)+section)])
column = column + 1
conv.append(Add()(temp))
#print len(conv)
output_conc = Concatenate()(conv)
#print output_conc.get_shape
output_conv = Reshape((splits, -1))(output_conc)
#print output_conv.get_shape
#Pool
pooled = MaxPooling1D(pool_size = 6, strides = 2)(output_conv)
reshape = Reshape((1,-1))(pooled)
#Fc
dense1 = Dense(units = 1024, activation = 'relu', name = "dense_1")(reshape)
#dense2 = Dense(units = 1024, activation = 'relu', name = "dense_2")(dense1)
dense3 = Dense(units = 1024, activation = 'relu', name = "dense_3")(dense1)
final = Dense(units = 5, activation = 'relu', name = "final")(dense3)
model = Model(inputs = list_of_input , outputs = final)
sgd = SGD(lr=0.1, decay=1e-1, momentum=0.9, nesterov=True)
model.compile(loss="categorical_crossentropy", optimizer=sgd , metrics = ['accuracy'])
print "compiled"
model_yaml = model.to_yaml()
with open("model.yaml", "w") as yaml_file:
yaml_file.write(model_yaml)
print "Model saved!"
log= CSVLogger('/home/carl/kaldi-trunk/dnn/experimental/yesno_cnn_50_training_total_frames_'+str(total_frames)+"_dim_"+str(dim)+"_window_height_"+str(window_height)+".csv")
filepath='yesno_cnn_50_training_total_frames_'+str(total_frames)+"_dim_"+str(dim)+"_window_height_"+str(window_height)+"weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_weights_only=True, mode='max')
print "log"
#plot_model(model, to_file='model.png')
print "Fit"
hist_current = model.fit_generator(train_generator(),
steps_per_epoch=444,#len(train_files),
epochs = 10000,
verbose = 1,
validation_data = test_generator(),
validation_steps=44,#len(test_files),
pickle_safe = True,
workers = 4,
callbacks = [log,checkpoint])
fws()
Execute the script by: python name_of_script.py yens 50 40 8 1
which give me a full traceback:
full traceback
Error:
carl#ca-ThinkPad-T420s:~/Dropbox$ python mini.py yesno 50 40 8 1
Using TensorFlow backend.
Couldn't import dot_parser, loading of dot files will not be possible.
hallo
hallo
hallo
compiled
Model saved!
log
Fit
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:2252: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
warnings.warn('\n'.join(msg))
Epoch 1/10000
2017-05-26 13:01:45.851125: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851345: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851392: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
443/444 [============================>.] - ETA: 4s - loss: 100.1266 - acc: 0.3138Epoch 00000: saving model to yesno_cnn_50_training_total_frames_50_dim_40_window_height_8weights-improvement-00-0.48.hdf5
Traceback (most recent call last):
File "mini.py", line 205, in <module>
File "mini.py", line 203, in fws
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1933, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 411, in on_epoch_end
self.model.save_weights(filepath, overwrite=True)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2503, in save_weights
save_weights_to_hdf5_group(f, self.layers)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 93, in __setitem__
self.create(name, data=value, dtype=base.guess_dtype(value))
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 183, in create
attr = h5a.create(self._id, self._e(tempname), htype, space)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5a.pyx", line 47, in h5py.h5a.create (/tmp/pip-4rPeHA-build/h5py/h5a.c:1904)
RuntimeError: Unable to create attribute (Object header message is too large)
If you look at the amount of data Keras is trying to save under layer_names attribute (inside the output HDF5 file being create), you will find that it takes more than 64kB.
np.asarray([layer.name.encode('utf8') for layer in model.layers]).nbytes
>> 77100
I quote from https://support.hdfgroup.org/HDF5/faq/limits.html:
Is there an object header limit and how does that affect HDF5 ?
There is a limit (in HDF5-1.8) of the object header, which is 64 KB.
The datatype for a dataset is stored in the object header, so there is
therefore a limit on the size of the datatype that you can have. (See
HDFFV-1089)
The code above was (almost entirely) copied from the traceback:
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
I am using numpy asarray method to get the figure fast but h5py gets similar figure (I guess), see https://github.com/h5py/h5py/blob/master/h5py/_hl/attrs.py#L102 if you want to find exact figure.
Anyway, either you will need to implement your own methods for saving/loading of the weights (or use existing workarounds), or you need to give a really short name to ALL the layers inside your model :), something like this:
list_of_input = [Input(shape=(8,3), name=('i%x' % i)) for i in range(splits*total_frames_with_deltas)]
conv = Conv1D(filters = 100, kernel_size = 8, name='cv%x' % steps)
conv.append(Add(name='add%x' % section)(temp))
output_conc = Concatenate(name='ct')(conv)
output_conv = Reshape((splits, -1), name='rs1')(output_conc)
pooled = MaxPooling1D(pool_size = 6, strides = 2, name='pl')(output_conv)
reshape = Reshape((1,-1), name='rs2')(pooled)
dense1 = Dense(units = 1024, activation = 'relu', name = "d1")(reshape)
dense2 = Dense(units
= 1024, activation = 'relu', name = "d2")(dense1)
dense3 = Dense(units = 1024, activation = 'relu', name = "d3")(dense1)
final = Dense(units = 5, activation = 'relu', name = "fl")(dense3)
You mustn't forget to name all the layers because the (numpy) string array into which the layer names are converted is using the size of the longest string for each individual string in it when it is saved!
After renaming the layers as proposed above (which takes almost 26kB) the model is saved successfully. Hope this elaborate answer helps someone.
Update: I have just made a PR to Keras which should fix the issue without implementing any custom loading/saving methods, see 7508
A simple solution, albeit possibly not the most elegant, could be to run a while loop with epochs = 1.
Get the weights at the end of every epoch together with the accuracy and the loss
Save the weights to file 1 with model.get_weight
if accuracy is greater than at the previous epoch (i.e. loop), store the weights to a different file (file 2)
Run the loop again loading the weights from file 1
Break the loops setting a manual early stopping so that it breaks if the loss does not improve for a certain number of loops
You can use get_weights() together with numpy.save.
It's not the best solution, because it will save several files, but it actually works.
The problem is that you won't have the "optimizer" saved with the current states. But you can perhaps work around that by using smaller learning rates after loading.
Custom callback using numpy.save:
def myCallback(epoch,logs):
global storedLoss
#do your comparisons here using the "logs" var.
print(logs)
if (logs['loss'] < storedLoss):
storedLoss = logs['loss']
for i in range(len(model.layers)):
WandB = model.layers[i].get_weights()
if len (WandB) > 0: #necessary because some layers have no weights
np.save("W" + "-" + str(i), WandB[0],False)
np.save("B" + "-" + str(i), WandB[1],False)
#remember that get and set weights use a list: [weights,biases]
#it may happen (not sure) that there is no bias, and thus you may have to check it (len(WandB)==1).
The logs var brings a dictionary with named metrics, such as "loss", and "accuracy", if you used it.
You can store the losses within the callback in a global var, and compare if each loss is better or worse than the last.
When fitting, use the lambda callback:
from keras.callbacks import LambdaCallback
model.fit(...,callbacks=[LambdaCallback(on_epoch_end=myCallback)])
In the example above, I used the LambdaCallback, which has more possibilities than just on_epoch_end.
For loading, do a similar loop:
#you have to create the model first and then set the layers
def loadModel(model):
for i in range(len(model.layers)):
WandBForCheck = model.layers[i].get_weights()
if len (WandBForCheck) > 0: #necessary because some layers have no weights
W = np.load(Wfile + str(i))
B = np.load(Bfile + str(i))
model.layers[i].set_weights([W,B])
See follow-up at https://github.com/fchollet/keras/issues/6766 and https://github.com/farizrahman4u/keras-contrib/pull/90.
I saw the YAML and the root cause is probably that you have so many Inputs. A few Inputs with many dimensions is preferred to many Inputs, especially if you can use scanning and batch operations to do everything efficiently.
Now, ignoring that entirely, here is how you can save and load your model if it has too much stuff to save as JSON efficiently:
You can pass save_weights_only=True. That won't save optimizer weights, so isn't a great solution.
Just put together a PR for saving model weights and optimizer weights but not configuration. When you want to load, first instantiate and compile the model as you did when you were going to train it, then use load_all_weights to load the model and optimizer weights into that model. I'll try to merge it soon so you can use it from the master branch.
You could use it something like this:
from keras.callbacks import LambdaCallback
from keras_contrib.utils.save_load_utils import save_all_weights, load_all_weights
# do some stuff to create and compile model
# use `save_all_weights` as a callback to checkpoint your model and optimizer weights
model.fit(..., callbacks=[LambdaCallback(on_epoch_end=lambda epoch, logs: save_all_weights(model, "checkpoint-{:05d}.h5".format(epoch))])
# use `load_all_weights` to load model and optimizer weights into an existing model
# if not compiled (no `model.optimizer`), this will just load model weights
load_all_weights(model, 'checkpoint-1337.h5')
So I don't endorse the model, but if you want to get it to save and load anyways this should probably work for you.
As a side note, if you want to save weights in a different format, something like this would work.
pickle.dump([K.get_value(w) for w in model.weights], open( "save.p", "wb" ) )
Cheers
Your model architecture must be too large to be saved.
USE get_weights AND set_weights TO SAVE AND LOAD MODEL, RESPECTIVELY.
Do not use callback model checkpoint. just once the training ends, save its weights with pickle.
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
Here is a scenario:
I write an iPhone app using NSLocalizedString incase I decide to release it in different countries.
I decide to release the App over in France.
The translator takes my Localized.strings and does a great job translating
I update the app, and need some more translating.
I'm using genstrings and it overwrites the good work the translator did, is there a easy way for me to manage my translations over App versions?
Check out this project on GitHub, which provides a python scripts which makes genstrings a little bit smarter.
Since I don't like link-only answers (links may die), I'll also drop here the python script (all credits go to the author of the linked project)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Localize.py - Incremental localization on XCode projects
# João Moreno 2009
# http://joaomoreno.com/
# Modified by Steve Streeting 2010 http://www.stevestreeting.com
# Changes
# - Use .strings files encoded as UTF-8
# This is useful because Mercurial and Git treat UTF-16 as binary and can't
# diff/merge them. For use on iPhone you can run an iconv script during build to
# convert back to UTF-16 (Mac OS X will happily use UTF-8 .strings files).
# - Clean up .old and .new files once we're done
# Modified by Pierre Dulac 2012 http://friendcashapp.com
# Changes
# - use logging instead of print
# Adds
# - MIT Licence
# - the first parameter in the command line to specify the path of *.lproj directories
# - an optional paramter to control the debug level (set to info by default)
# Fixes
# - do not convert a file if it is already in utf-8
# - allow multiline translations generated by genstrings by modifing the re_translation regex
# -
# MIT Licence
#
# Copyright (C) 2012 Pierre Dulac
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
# associated documentation files (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substantial
# portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
# LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
from sys import argv
from codecs import open
from re import compile
from copy import copy
import os
import shutil
import optparse
import logging
logging.getLogger().level = logging.INFO
__version__ = "0.1"
__license__ = "MIT"
USAGE = "%prog [options] <url>"
VERSION = "%prog v" + __version__
re_translation = compile(r'^"((?:[^"]|\\")+)" = "((?:[^"]|\\")+)";(?:\n)?$')
re_comment_single = compile(r'^/\*.*\*/$')
re_comment_start = compile(r'^/\*.*$')
re_comment_end = compile(r'^.*\*/$')
class LocalizedString():
def __init__(self, comments, translation):
self.comments, self.translation = comments, translation
self.key, self.value = re_translation.match(self.translation).groups()
def __unicode__(self):
return u'%s%s\n' % (u''.join(self.comments), self.translation)
class LocalizedFile():
def __init__(self, fname=None, auto_read=False):
self.fname = fname
self.reset()
if auto_read:
self.read_from_file(fname)
def reset(self):
self.strings = []
self.strings_d = {}
def read_from_file(self, fname=None):
self.reset()
fname = self.fname if fname == None else fname
try:
#f = open(fname, encoding='utf_8', mode='r')
f = open(fname, encoding='utf_8', mode='r')
except:
print 'File %s does not exist.' % fname
exit(-1)
try:
line = f.readline()
logging.debug(line)
except:
logging.error("Can't read line for file: %s" % fname)
raise
i = 1
while line:
comments = [line]
if not re_comment_single.match(line):
while line and not re_comment_end.match(line):
line = f.readline()
comments.append(line)
line = f.readline()
i += 1
# handle multi lines
while len(line) > 1 and line[-2] != u';':
line += f.readline()
i += 1
logging.debug("%d %s" % (i, line.rstrip('\n')))
if line and re_translation.match(line):
translation = line
else:
logging.error("Line %d of file '%s' raising the exception: %s" % (i, self.fname, line))
raise Exception('invalid file')
line = f.readline()
i += 1
while line and line == u'\n':
line = f.readline()
i += 1
string = LocalizedString(comments, translation)
self.strings.append(string)
self.strings_d[string.key] = string
f.close()
def save_to_file(self, fname=None):
fname = self.fname if fname == None else fname
try:
f = open(fname, encoding='utf_8', mode='w')
except:
print 'Couldn\'t open file %s.' % fname
exit(-1)
# sort by key
self.strings.sort(key=lambda item: item.key)
for string in self.strings:
f.write(string.__unicode__())
f.close()
def merge_with(self, new):
merged = LocalizedFile()
for string in new.strings:
if self.strings_d.has_key(string.key):
new_string = copy(self.strings_d[string.key])
new_string.comments = string.comments
string = new_string
merged.strings.append(string)
merged.strings_d[string.key] = string
return merged
def update_with(self, new):
for string in new.strings:
if not self.strings_d.has_key(string.key):
self.strings.append(string)
self.strings_d[string.key] = string
def merge(merged_fname, old_fname, new_fname):
try:
old = LocalizedFile(old_fname, auto_read=True)
new = LocalizedFile(new_fname, auto_read=True)
merged = old.merge_with(new)
merged.save_to_file(merged_fname)
except Exception, inst:
logging.error('Error: input files have invalid format.')
raise
STRINGS_FILE = 'Localizable.strings'
def localize(path, excluded_paths):
languages = [os.path.join(path,name) for name in os.listdir(path) if name.endswith('.lproj') and os.path.isdir(os.path.join(path,name))]
print "languages found", languages
for language in languages:
original = merged = language + os.path.sep + STRINGS_FILE
old = original + '.old'
new = original + '.new'
if os.path.isfile(original):
try:
open(original, encoding='utf_8', mode='r').read()
os.rename(original, old)
except:
os.system('iconv -f UTF-16 -t UTF-8 "%s" > "%s"' % (original, old))
# gen
os.system('find %s -name \*.m -not -path "%s" | xargs genstrings -q -o "%s"' % (path, excluded_paths, language))
try:
open(original, encoding='utf_8', mode='r').read()
shutil.copy(original, new)
except:
os.system('iconv -f UTF-16 -t UTF-8 "%s" > "%s"' % (original, new))
# merge
merge(merged, old, new)
logging.info("Job done for language: %s" % language)
else:
os.system('genstrings -q -o "%s" `find %s -name "*.m" -not -path "%s"`' % (language, path, excluded_paths))
os.rename(original, old)
try:
open(old, encoding='utf_8', mode='r').read()
except:
os.system('iconv -f UTF-16 -t UTF-8 "%s" > "%s"' % (old, original))
if os.path.isfile(old):
os.remove(old)
if os.path.isfile(new):
os.remove(new)
def parse_options():
"""parse_options() -> opts, args
Parse any command-line options given returning both
the parsed options and arguments.
"""
parser = optparse.OptionParser(usage=USAGE, version=VERSION)
parser.add_option("-d", "--debug",
action="store_true", default=False, dest="debug",
help="Set to DEBUG the logging level (default to INFO)")
parser.add_option("-p", "--path",
action="store", type="str", default=os.getcwd(), dest="path",
help="Path (relative or absolute) to use for searching for *.lproj directories")
parser.add_option("-e", "--exclude",
action="store", type="str", default=None, dest="excluded_paths",
help="Regex for paths to exclude ex. ``./Folder1/*``")
opts, args = parser.parse_args()
return opts, args
if __name__ == '__main__':
opts, args = parse_options()
if opts.debug:
logging.getLogger().level = logging.DEBUG
if opts.path:
opts.path = os.path.realpath(opts.path)
if opts.excluded_paths:
opts.excluded_paths = os.path.realpath(opts.excluded_paths)
logging.info("Running the script on path %s" % opts.path)
localize(opts.path, opts.excluded_paths)
I use:
http://www.loc-suite.com
To only translate the new parts
I was having a similar issue. I changed a lot of keys for my NSLocalizedString-macros and was frightened that I'd ship the App with missing translations (didn't want to run through the whole App manually and check if everything's there either...).
I tried out the github project that Gabriella Petronella posted but I wasn't really that happy with it, so I wrote my own python module to accomplish what I wanted to do.
(I'm not gonna post the code here, since it's a whole module and not only one script :D)
Here is the couple of options you can chose to go with:
You can use some hand-written solution like the script mentioned above which will not completely rewrite the old files while adding a recently translated strings to them.
You can also create an additional strings.h file which will contain all the strings you do have so you will not need to rewrite them all the time, just in one place. So genstrings is not necessary anymore. However there is a con of using this: the string.h file will be unstructured which is probably not convenient for the big projects.
Thanks to Best practice using NSLocalizedString
// In strings.h
#define YOUR_STRING_KEY NSLocalizedString(#"Cancel", nil)
// Somewhere else in you code
NSLog(#"%#", YOUR_STRING_KEY);
I actually started using a tool called PhraseApp https://phraseapp.com/projects
It's worth looking into if you have to localise an app!