
I'm trying to implement the DOSNES algorithm from this publication in Python for a project. I found this MATLAB implementation, which works well, but I have probably mistranslated one or more steps in my code (mainly the axis handling, I guess) because I clearly don't reach the same result. This is the part I'm struggling with in MATLAB:

P(1:n + 1:end) = 0;                                 % set diagonal to zero
P = 0.5 * (P + P');                                 % symmetrize P-values
P = max(P ./ sum(P(:)), realmin);                   % make sure P-values sum to one

const = sum(P(:) .* log(P(:)));                     % constant in KL divergence

ydata = .0001 * randn(n, no_dims);

y_incs  = zeros(size(ydata));
gains = ones(size(ydata));

% Run the iterations
for iter=1:max_iter

    % Compute joint probability that point i and j are neighbors
    sum_ydata = sum(ydata .^ 2, 2);
    num = 1 ./ (1 + bsxfun(@plus, sum_ydata, bsxfun(@plus, sum_ydata', -2 * (ydata * ydata')))); % Student-t distribution
    num(1:n+1:end) = 0;                                                 % set diagonal to zero
    Q = max(num ./ sum(num(:)), realmin);                               % normalize to get probabilities

    % Compute the gradients (faster implementation)
    L = (P - Q) .* num;
    y_grads = 4 * (diag(sum(L, 1)) - L) * ydata;

    % Update the solution
    gains = (gains + .2) .* (sign(y_grads) ~= sign(y_incs)) ...         % note that the y_grads are actually -y_grads
          + (gains * .8) .* (sign(y_grads) == sign(y_incs));
    gains(gains < min_gain) = min_gain;
    y_incs = momentum * y_incs - epsilon * (gains .* y_grads);
    ydata = ydata + y_incs;

    % Spherical projection
    ydata = bsxfun(@minus, ydata, mean(ydata, 1));        
    r_mean = mean(sqrt(sum(ydata.^2,2)),1);
    ydata = bsxfun(@times, ydata, r_mean./ sqrt(sum(ydata.^2,2)) );


    % Update the momentum if necessary
    if iter == mom_switch_iter
        momentum = final_momentum;
    end

    % Print out progress
    if ~rem(iter, 10)
        cost = const - sum(P(:) .* log(Q(:)));
        disp(['Iteration ' num2str(iter) ': error is ' num2str(cost)]);
    end

end

And this is my Python version:

import numpy as np

no_dims = 3
n = X.shape[0]
min_gain = 0.01
momentum = 0.5
final_momentum = 0.8
epsilon = 500
mom_switch_iter = 250
max_iter = 1000

P[np.diag_indices_from(P)] = 0.

P = ( P + P.T )/2

P = np.max(P / np.sum(P), axis=0)

const = np.sum( P * np.log(P) )

ydata = 1e-4 * np.random.random(size=(n, no_dims))

y_incs  = np.zeros(shape=ydata.shape)
gains = np.ones(shape=ydata.shape)

for iter in range(max_iter):
    sum_ydata = np.sum(ydata**2, axis = 1)

    bsxfun_1 = sum_ydata.T + -2*np.dot(ydata, ydata.T)
    bsxfun_2 = sum_ydata + bsxfun_1
    num = 1. / ( 1 + bsxfun_2 )

    num[np.diag_indices_from(num)] = 0.

    Q = np.max(num / np.sum(num), axis=0)

    L = (P - Q) * num

    t =  np.diag( L.sum(axis=0) ) - L
    y_grads = 4 * np.dot( t , ydata )

    gains = (gains + 0.2) * ( np.sign(y_grads) != np.sign(y_incs) ) \
          + (gains * 0.8) * ( np.sign(y_grads) == np.sign(y_incs) )

    # gains[np.where(np.sign(y_grads) != np.sign(y_incs))] += 0.2
    # gains[np.where(np.sign(y_grads) == np.sign(y_incs))] *= 0.8

    gains = np.clip(gains, a_min = min_gain, a_max = None)

    y_incs = momentum * y_incs - epsilon * gains * y_grads
    ydata += y_incs

    ydata -= ydata.mean(axis=0)

    alpha = np.sqrt(np.sum(ydata ** 2, axis=1))
    r_mean = np.mean(alpha)
    ydata = ydata * (r_mean / alpha).reshape(-1, 1)

    if iter == mom_switch_iter:
        momentum = final_momentum

    if iter % 10 == 0:
        cost = const - np.sum( P * np.log(Q) )
        print( "Iteration {} : error is {}".format(iter, cost) )

If you want to run trials, you can download the repository here, which uses the Iris dataset and an attached library. test.py is my test implementation on the Iris dataset, and visu.py reproduces the result the paper shows on the MNIST dataset, restricted to 1000 random points.

Many thanks for your support,

Nicolas

EDIT 1

This is the final code, working as expected:

import numpy as np
from scipy.special import xlogy

P[np.diag_indices_from(P)] = 0.

P = ( P + P.T )/2

P = P / np.sum(P)

const = np.sum(xlogy(P, P))

ydata = 1e-4 * np.random.random(size=(n, no_dims))

y_incs  = np.zeros(shape=ydata.shape)
gains = np.ones(shape=ydata.shape)

for iter in range(max_iter):
    sum_ydata = np.sum(ydata**2, axis = 1)

    bsxfun_1 = sum_ydata.T + -2*np.dot(ydata, ydata.T)
    bsxfun_2 = sum_ydata + bsxfun_1
    num = 1. / ( 1 + bsxfun_2 )

    num[np.diag_indices_from(num)] = 0.
    Q = num / np.sum(num)

    L = (P - Q) * num

    t =  np.diag( L.sum(axis=0) ) - L
    y_grads = 4 * np.dot( t , ydata )

    gains = (gains + 0.2) * ( np.sign(y_grads) != np.sign(y_incs) ) \
          + (gains * 0.8) * ( np.sign(y_grads) == np.sign(y_incs) )

    gains = np.clip(gains, a_min = min_gain, a_max = None)

    y_incs = momentum * y_incs - epsilon * gains * y_grads
    ydata += y_incs

    ydata -= ydata.mean(axis=0)

    alpha = np.sqrt(np.sum(ydata ** 2, axis=1))
    r_mean = np.mean(alpha)
    ydata = ydata * (r_mean / alpha).reshape(-1, 1)

    if iter == mom_switch_iter:
        momentum = final_momentum

    if iter % 10 == 0:
        cost = const - np.sum( xlogy(P, Q) )
        print( "Iteration {} : error is {}".format(iter, cost) )
  • Right at the beginning you seem to replace a nonreducing max in MATLAB (it has two arguments, so it compares them element by element and returns a full-size P) with a reducing max in Python (axis=0 reduces along that axis, meaning the result has one dimension less). Commented Mar 31, 2018 at 11:41
  • Btw. the purpose of this max appears to be to prevent the subsequent log from choking on a zero. This is, to put it kindly, the kind of hobbyist programming that should make you very suspicious of the code. The proper way of doing this in Python is not taking the max but using scipy.special.xlogy instead of P * np.log(P). Commented Mar 31, 2018 at 11:51
  • Thanks for your feedback, I'll try it now. I used the MATLAB documentation and, based on example 2, I thought that max() without a given axis means the maximum of each column. If the objective is just to "normalize" the matrix, what is the purpose of using max()? P ./ sum(P(:)) should be enough, right? Commented Mar 31, 2018 at 12:01
  • The reason is that if there is a zero in P, then the element-wise product of P and log(P) at that location will be 0 times -inf, which gives you NaN. The correct value, however, is not NaN but 0, as you can verify, for example, using L'Hopital's rule. As I mentioned in my other comment, the way to go in Python is using scipy.special.xlogy, which treats zero correctly. Commented Mar 31, 2018 at 12:10
  • It's awesome, now it works! You can submit this as an answer. Thank you so much :D Commented Mar 31, 2018 at 12:16

1 Answer


Right at the beginning you seem to replace a nonreducing max in MATLAB (it has two arguments, so it compares them element by element and returns a full-size P) with a reducing max in Python (axis=0 reduces along that axis, meaning the result has one dimension less).
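To illustrate with a minimal sketch on a toy array (not part of the original code): MATLAB's two-argument max(P, realmin) corresponds to NumPy's element-wise np.maximum (with np.finfo(float).tiny playing the role of realmin), whereas np.max(P, axis=0) reduces a dimension.

import numpy as np

P = np.array([[0.0, 0.3],
              [0.2, 0.5]])

# Element-wise comparison keeps P's shape -- the analogue of MATLAB's max(P, realmin)
elementwise = np.maximum(P, np.finfo(float).tiny)   # shape (2, 2)

# A reducing max along axis 0 drops a dimension -- what the original Python code did
reduced = np.max(P, axis=0)                          # shape (2,)

print(elementwise.shape, reduced.shape)              # (2, 2) (2,)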

My advice, however, is to leave out the max altogether: it looks like an amateurish attempt at sidestepping the fact that p log p is defined at 0 only via the limit p -> 0, which by L'Hopital's rule can be shown to be 0, whereas the computer will return NaN when asked to compute 0 * log(0).

The proper way of going about this is using scipy.special.xlogy which treats 0 correctly.
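For example, a small standalone sketch (values chosen only for illustration): xlogy returns 0 where the first argument is 0, while the naive product produces NaN.

import numpy as np
from scipy.special import xlogy

p = np.array([0.0, 0.5, 0.5])

naive = p * np.log(p)   # 0 * log(0) -> nan (with runtime warnings)
safe = xlogy(p, p)      # xlogy(0, 0) is defined to be 0

print(naive)            # [nan, -0.3466, -0.3466] (approx.)
print(safe)             # [ 0., -0.3466, -0.3466] (approx.)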


2 Comments

I'll create a library for this implementation; may I mention you as a contributor, or would you prefer not?
@NicolasM. I leave it at your discretion.
