Why am I getting different results from these two functions? Is there something specific to math.h::pow() that I'm doing incorrectly? Or do I need to use a different abs() function than cmath::abs()?
library(Rcpp)
sourceCpp(...) # below C++ function
#include <Rcpp.h>
#include <cmath>
#include <math.h>
#include <iostream>
using namespace Rcpp;
// [[Rcpp::export]]
double dist_q (NumericVector& x, NumericVector& y, int& q) {
int nx= x.size(), ny = y.size();
if (nx != ny) {
std::cout << "ERROR: Length of x and y differ." << std::endl;
return -1;
}
double temp = 0.0;
int m = 0;
for (int i = 0; i < nx; i++) {
if (!NumericVector::is_na(x[i]) && !NumericVector::is_na(y[i])) {
++m;
temp += pow(abs(x[i] - y[i]), (double) q);
}
}
temp = (1 / (double) m * temp);
return pow(temp, (1 / (double) q));
}
#--------------------------------------------------
# R function
dist_qR <- function(x, y, q= 2) {
if (!is.numeric(x) | !is.numeric(y)) stop("Both x and y must be numeric.")
if (q < 1 | q %% 1 != 0) stop("q must be an integer >= 1")
m <- sum(!is.na(x) & !is.na(y))
return((1 / m * sum(abs(x - y)^q, na.rm=TRUE))^(1/q))
}
# test
set.seed(1415)
x <- rnorm(1000)
y <- rnorm(1000)
x2 <- x; x2[x2>1.5] <- NA
y2 <- y; y2[y2 > 1.5] <- NA
> dist_q(x,y,2)
[1] 1.089495
> dist_qR(x,y,2)
[1] 1.438455
> dist_q(x2,y2,2)
[1] 0.9119293
> dist_qR(x2,y2,2)
[1] 1.249269
Re: @Konrad -- R> ?"&&":
& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
Timing
Using fabs as below:
x <- list(x= rnorm(100),
x2= rnorm(1000),
x3= rnorm(10000),
x4= rnorm(100000))
y <- list(x= rnorm(100),
x2= rnorm(1000),
x3= rnorm(10000),
x4= rnorm(100000))
x2 <- lapply(x, function(l) {l[l>1.5] <- NA; return(l)})
y2 <- lapply(y, function(l) {l[l>1.5] <- NA; return(l)})
library(microbenchmark)
microbenchmark(n100_r= dist_qR(x[[1]], y[[1]], 3),
n1000_r= dist_qR(x[[2]], y[[2]], 3),
n10000_r= dist_qR(x[[3]], y[[3]], 3),
n100000_r= dist_qR(x[[4]], y[[4]], 3),
n100_c= dist_q(x[[1]], y[[1]], 3),
n1000_c= dist_q(x[[2]], y[[2]], 3),
n10000_c= dist_q(x[[3]], y[[3]], 3),
n100000_c= dist_q(x[[4]], y[[4]], 3), times= 50)
Unit: microseconds
expr min lq mean median uq max neval cld
n100_r 33.431 42.521 72.43820 83.8690 95.599 120.525 50 a
n1000_r 125.803 133.720 163.82538 167.1510 189.731 208.792 50 a
n10000_r 954.812 1061.260 1190.66434 1086.6260 1131.053 3577.318 50 b
n100000_r 10169.212 10806.144 11702.11890 11041.3280 12827.495 15562.607 50 d
n100_c 12.317 14.662 20.83844 20.5270 26.099 39.002 50 a
n1000_c 91.200 97.651 104.58960 105.1295 109.674 119.938 50 a
n10000_c 875.928 939.270 953.09926 951.4395 970.060 1015.807 50 b
n100000_c 8272.492 9227.011 9472.04164 9339.7640 9531.108 12799.929 50 c
# missing values not shown, but I get roughly the same timing
It looks like the speed up is minimal for large vectors--likely since the mathematics functions in R are implemented in C. But the speedup is substantial for short vectors. I'm guessing this is likely due to the overhead I use for type-checking in R.
||and&&rather than|and&for logical conditions of length 1.