I'm trying to implement the DOSNES algorithm from this publication but in Python for a project. I found this Matlab Implementation which works well but I probably mistranslated one or more steps in my code (mainly with axis I guess) because I clearly don't reach the same result. This is the part I'm strugglering with in Matlab:
P(1:n + 1:end) = 0; % set diagonal to zero
P = 0.5 * (P + P'); '% symmetrize P-values
P = max(P ./ sum(P(:)), realmin); % make sure P-values sum to one
const = sum(P(:) .* log(P(:))); % constant in KL divergence
ydata = .0001 * randn(n, no_dims);
y_incs = zeros(size(ydata));
gains = ones(size(ydata));
% Run the iterations
for iter=1:max_iter
% Compute joint probability that point i and j are neighbors
sum_ydata = sum(ydata .^ 2, 2);
num = 1 ./ (1 + bsxfun(@plus, sum_ydata, bsxfun(@plus, sum_ydata', -2 * (ydata * ydata')))); % Student-t distribution
num(1:n+1:end) = 0; % set diagonal to zero
Q = max(num ./ sum(num(:)), realmin); % normalize to get probabilities
% Compute the gradients (faster implementation)
L = (P - Q) .* num;
y_grads = 4 * (diag(sum(L, 1)) - L) * ydata;
% Update the solution
gains = (gains + .2) .* (sign(y_grads) ~= sign(y_incs)) ... % note that the y_grads are actually -y_grads
+ (gains * .8) .* (sign(y_grads) == sign(y_incs));
gains(gains < min_gain) = min_gain;
y_incs = momentum * y_incs - epsilon * (gains .* y_grads);
ydata = ydata + y_incs;
% Spherical projection
ydata = bsxfun(@minus, ydata, mean(ydata, 1));
r_mean = mean(sqrt(sum(ydata.^2,2)),1);
ydata = bsxfun(@times, ydata, r_mean./ sqrt(sum(ydata.^2,2)) );
% Update the momentum if necessary
if iter == mom_switch_iter
momentum = final_momentum;
end
% Print out progress
if ~rem(iter, 10)
cost = const - sum(P(:) .* log(Q(:)));
disp(['Iteration ' num2str(iter) ': error is ' num2str(cost)]);
end
end
and this is my python version :
no_dims = 3
n = X.shape[0]
min_gain = 0.01
momentum = 0.5
final_momentum = 0.8
epsilon = 500
mom_switch_iter = 250
max_iter = 1000
P[np.diag_indices_from(P)] = 0.
P = ( P + P.T )/2
P = np.max(P / np.sum(P), axis=0)
const = np.sum( P * np.log(P) )
ydata = 1e-4 * np.random.random(size=(n, no_dims))
y_incs = np.zeros(shape=ydata.shape)
gains = np.ones(shape=ydata.shape)
for iter in range(max_iter):
sum_ydata = np.sum(ydata**2, axis = 1)
bsxfun_1 = sum_ydata.T + -2*np.dot(ydata, ydata.T)
bsxfun_2 = sum_ydata + bsxfun_1
num = 1. / ( 1 + bsxfun_2 )
num[np.diag_indices_from(num)] = 0.
Q = np.max(num / np.sum(num), axis=0)
L = (P - Q) * num
t = np.diag( L.sum(axis=0) ) - L
y_grads = 4 * np.dot( t , ydata )
gains = (gains + 0.2) * ( np.sign(y_grads) != np.sign(y_incs) ) \
+ (gains * 0.8) * ( np.sign(y_grads) == np.sign(y_incs) )
# gains[np.where(np.sign(y_grads) != np.sign(y_incs))] += 0.2
# gains[np.where(np.sign(y_grads) == np.sign(y_incs))] *= 0.8
gains = np.clip(gains, a_min = min_gain, a_max = None)
y_incs = momentum * y_incs - epsilon * gains * y_grads
ydata += y_incs
ydata -= ydata.mean(axis=0)
alpha = np.sqrt(np.sum(ydata ** 2, axis=1))
r_mean = np.mean(alpha)
ydata = ydata * (r_mean / alpha).reshape(-1, 1)
if iter == mom_switch_iter:
momentum = final_momentum
if iter % 10 == 0:
cost = const - np.sum( P * np.log(Q) )
print( "Iteration {} : error is {}".format(iter, cost) )
If you want to do trials, you can download the repository here which uses Iris Dataset and an attached library. test.py is my test implementation with Iris dataset and visu.py is the result the paper has on MNIST dataset but restricted to 1000k random points.
Many thanks for your support,
Nicolas
EDIT 1
This is the final code working as expected :
P[np.diag_indices_from(P)] = 0.
P = ( P + P.T )/2
P = P / np.sum(P)
const = np.sum(xlogy(P, P))
ydata = 1e-4 * np.random.random(size=(n, no_dims))
y_incs = np.zeros(shape=ydata.shape)
gains = np.ones(shape=ydata.shape)
for iter in range(max_iter):
sum_ydata = np.sum(ydata**2, axis = 1)
bsxfun_1 = sum_ydata.T + -2*np.dot(ydata, ydata.T)
bsxfun_2 = sum_ydata + bsxfun_1
num = 1. / ( 1 + bsxfun_2 )
num[np.diag_indices_from(num)] = 0.
Q = num / np.sum(num)
L = (P - Q) * num
t = np.diag( L.sum(axis=0) ) - L
y_grads = 4 * np.dot( t , ydata )
gains = (gains + 0.2) * ( np.sign(y_grads) != np.sign(y_incs) ) \
+ (gains * 0.8) * ( np.sign(y_grads) == np.sign(y_incs) )
gains = np.clip(gains, a_min = min_gain, a_max = None)
y_incs = momentum * y_incs - epsilon * gains * y_grads
ydata += y_incs
ydata -= ydata.mean(axis=0)
alpha = np.sqrt(np.sum(ydata ** 2, axis=1))
r_mean = np.mean(alpha)
ydata = ydata * (r_mean / alpha).reshape(-1, 1)
if iter == mom_switch_iter:
momentum = final_momentum
if iter % 10 == 0:
cost = const - np.sum( xlogy(P, Q) )
print( "Iteration {} : error is {}".format(iter, cost) )
maxin matlab (it has two arguments, so it will compare those one by one and return a full sizeP) with a reducingmaxin python (axis=0will reduce along this axis, meaning that the result will have one dimension less).maxappears to be to prevent the subsequentlogfrom choking at a zero. This is the kind of hobbyist programming that should make you very suspicious of the code to put it kindly. The proper way of doing this in python is not taking themaxbut usingscipy.special.xlogyinstead ofP * np.log(P).Pthen the element wise product ofPandlog(P)at this location will be0times-infwhich gives youNaN. The correct value, however, is notNaNbut0as you can verify for example using L'Hopital's rule. As I mentioned in my other comment the way to go in Python is usingscipy.special.xlogywhich treats zero correctly.