Commit 3330d66

pitrou authored and wesm committed
ARROW-4118: [Python] Fix benchmark setup for "asv run"
"conda activate" unfortunately isn't available from a non-interactive shell, and running bash as interactive doesn't look like a workable solution.

Also fix a setup slowness issue in the Parquet benchmarks, and fix a C++ ABI issue by downloading packages from Anaconda rather than conda-forge.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#3357 from pitrou/ARROW-4118-fix-asv-run and squashes the following commits:

b07b68e <Antoine Pitrou> ARROW-4118: Fix benchmark setup for "asv run"
1 parent bcfacaa commit 3330d66

4 files changed

Lines changed: 37 additions & 24 deletions

docs/source/python/benchmarks.rst

Lines changed: 13 additions & 11 deletions
@@ -19,35 +19,37 @@ Benchmarks
 ==========
 
 The ``pyarrow`` package comes with a suite of benchmarks meant to
-run with `asv`_. You'll need to install the ``asv`` package first
+run with `ASV`_. You'll need to install the ``asv`` package first
 (``pip install asv`` or ``conda install -c conda-forge asv``).
 
-The benchmarks are run using `asv`_ which is also their only requirement.
-
 Running the benchmarks
 ----------------------
 
-To run the benchmarks, call ``asv run --python=same``. You cannot use the
-plain ``asv run`` command at the moment as asv cannot handle python packages
-in subdirectories of a repository.
+To run the benchmarks for a locally-built Arrow, run ``asv dev`` or
+``asv run --python=same``.
 
-Running with arbitrary revisions
---------------------------------
+Running for arbitrary Git revisions
+-----------------------------------
 
 ASV allows to store results and generate graphs of the benchmarks over
-the project's evolution. For this you have the latest development version of ASV:
+the project's evolution. You need to have the latest development version of ASV:
 
 .. code::
 
     pip install git+https://github.com/airspeed-velocity/asv
 
+The build scripts assume that Conda's ``activate`` script is on the PATH
+(the ``conda activate`` command unfortunately isn't available from
+non-interactive scripts).
+
 Now you should be ready to run ``asv run`` or whatever other command
-suits your needs.
+suits your needs. Note that this can be quite long, as each Arrow needs
+to be rebuilt for each Git revision you're running the benchmarks for.
 
 Compatibility
 -------------
 
 We only expect the benchmarking setup to work with Python 3.6 or later,
-on a Unix-like system.
+on a Unix-like system with bash.
 
 .. _asv: https://asv.readthedocs.org/

python/asv-build.sh

Lines changed: 12 additions & 5 deletions
@@ -21,7 +21,9 @@ set -e
 
 # ASV doesn't activate its conda environment for us
 if [ -z "$ASV_ENV_DIR" ]; then exit 1; fi
-conda activate $ASV_ENV_DIR
+# Avoid "conda activate" because it's only set up in interactive shells
+# (https://github.com/conda/conda/issues/8072)
+source activate $ASV_ENV_DIR
 echo "== Conda Prefix for benchmarks: " $CONDA_PREFIX " =="
 
 # Build Arrow C++ libraries
@@ -32,6 +34,8 @@ export ORC_HOME=$CONDA_PREFIX
 export PROTOBUF_HOME=$CONDA_PREFIX
 export BOOST_ROOT=$CONDA_PREFIX
 
+export CXXFLAGS="-D_GLIBCXX_USE_CXX11_ABI=1"
+
 pushd ../cpp
 mkdir -p build
 pushd build
@@ -40,9 +44,11 @@ cmake -GNinja \
       -DCMAKE_BUILD_TYPE=release \
       -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
       -DARROW_CXXFLAGS=$CXXFLAGS \
-      -DARROW_PYTHON=ON \
-      -DARROW_PLASMA=ON \
-      -DARROW_BUILD_TESTS=OFF \
+      -DARROW_USE_GLOG=off \
+      -DARROW_PARQUET=on \
+      -DARROW_PYTHON=on \
+      -DARROW_PLASMA=on \
+      -DARROW_BUILD_TESTS=off \
       ..
 cmake --build . --target install
 
@@ -52,7 +58,8 @@ popd
 # Build pyarrow wrappers
 export SETUPTOOLS_SCM_PRETEND_VERSION=0.0.1
 export PYARROW_BUILD_TYPE=release
-export PYARROW_PARALLEL=4
+export PYARROW_PARALLEL=8
+export PYARROW_WITH_PARQUET=1
 export PYARROW_WITH_PLASMA=1
 
 python setup.py clean
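
The build script above relies on ``source activate``, so Conda's ``activate`` script has to be findable on the PATH. A minimal pre-flight check might look like the sketch below. The helper name ``find_sourceable`` is hypothetical and not part of this commit. It assumes that bash's ``source`` searches PATH for a plain file without requiring the executable bit, and it ignores bash's fallback to the current directory.

```python
import os


def find_sourceable(name, path=None):
    """Return the first regular file called `name` found on PATH, else None.

    Hypothetical helper: approximates how bash's `source NAME` locates a
    script via PATH. Empty PATH entries are skipped in this sketch.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # `source` does not need the file to be executable, so only check
        # that a regular file exists at this location.
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    location = find_sourceable("activate")
    if location is None:
        print("Conda's 'activate' script was not found on PATH")
    else:
        print("Found activate script at", location)
```

Running this before ``asv run`` gives a quick hint as to whether the ``source activate`` step in the build script is likely to succeed.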

python/asv.conf.json

Lines changed: 3 additions & 1 deletion
@@ -35,6 +35,7 @@
     // of the repository.
     "repo_subdir": "python",
 
+    // Custom build commands for Arrow.
     "build_command": ["/bin/bash {build_dir}/asv-build.sh"],
     "install_command": ["/bin/bash {build_dir}/asv-install.sh"],
     "uninstall_command": ["/bin/bash {build_dir}/asv-uninstall.sh"],
@@ -56,7 +57,8 @@
     // determined by looking for tools on the PATH environment
     // variable.
     "environment_type": "conda",
-    "conda_channels": ["conda-forge", "defaults"],
+    // Avoid conda-forge to avoid C++ ABI issues
+    "conda_channels": ["defaults"],
 
     // the base URL to show a commit for the project.
     "show_commit_url": "https://github.com/apache/arrow/commit/",

python/benchmarks/parquet.py

Lines changed: 9 additions & 7 deletions
@@ -15,11 +15,12 @@
 # specific language governing permissions and limitations
 # under the License.
 
-import pandas as pd
-import random
 import shutil
 import tempfile
 
+import numpy as np
+import pandas as pd
+
 import pyarrow as pa
 try:
     import pyarrow.parquet as pq
@@ -38,18 +39,19 @@ class ParquetManifestCreation(object):
 
     def setup(self, num_partitions, num_threads):
         if pq is None:
-            raise NotImplementedError
+            raise NotImplementedError("Parquet support not enabled")
 
         self.tmpdir = tempfile.mkdtemp('benchmark_parquet')
-        num1 = [random.choice(range(0, num_partitions))
-                for _ in range(self.size)]
-        num2 = [random.choice(range(0, 1000)) for _ in range(self.size)]
+        rnd = np.random.RandomState(42)
+        num1 = rnd.randint(0, num_partitions, size=self.size)
+        num2 = rnd.randint(0, 1000, size=self.size)
         output_df = pd.DataFrame({'num1': num1, 'num2': num2})
         output_table = pa.Table.from_pandas(output_df)
         pq.write_to_dataset(output_table, self.tmpdir, ['num1'])
 
     def teardown(self, num_partitions, num_threads):
-        shutil.rmtree(self.tmpdir)
+        if self.tmpdir is not None:
+            shutil.rmtree(self.tmpdir)
 
     def time_manifest_creation(self, num_partitions, num_threads):
         pq.ParquetManifest(self.tmpdir, metadata_nthreads=num_threads)
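
The setup change above replaces per-row ``random.choice`` calls with two vectorized draws from a seeded ``np.random.RandomState``. The sketch below isolates that difference outside of ASV. The function names ``make_columns_slow`` and ``make_columns_fast`` are made up for illustration and do not appear in the commit. The ``size`` argument stands in for the benchmark's ``self.size``.

```python
import random

import numpy as np


def make_columns_slow(size, num_partitions):
    """Pre-change approach: one Python-level random.choice call per row."""
    num1 = [random.choice(range(0, num_partitions)) for _ in range(size)]
    num2 = [random.choice(range(0, 1000)) for _ in range(size)]
    return num1, num2


def make_columns_fast(size, num_partitions, seed=42):
    """Post-change approach: two vectorized draws from a seeded generator."""
    rnd = np.random.RandomState(seed)
    # randint samples integers from the half-open interval [low, high).
    num1 = rnd.randint(0, num_partitions, size=size)
    num2 = rnd.randint(0, 1000, size=size)
    return num1, num2


if __name__ == "__main__":
    num1, num2 = make_columns_fast(size=10, num_partitions=4)
    print(num1, num2)
```

Because the generator is seeded, repeated calls to ``make_columns_fast`` with the same arguments return identical arrays, so every benchmark run should see the same input data.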
