Skip to content

Commit 0ff776f

Browse files
tanyaschlusserpitrou
authored andcommitted
ARROW-4313: Define general benchmark database schema
This is a work in progress. It provides the ddl for a proposed schema for the benchmark database. - The Dockerfile loads the ddl into the appropriate directory so that it will be added upon startup. - The `examples` directory will probably be deleted before this is completed. Some of the queries do not work (was surprised that 'null' does not compare to equal). - My SQL style may be ugly (depending on preference) so it can change; no problem. Thank you for this opportunity to learn! Author: Tanya Schlusser <tanya@tickel.net> Closes apache#3586 from tanyaschlusser/ARROW-4313 and squashes the following commits: 25e0b76 <Tanya Schlusser> ARROW-4313: Ingestion functions moved out of the trigger functions file; redundant permissions revocation removed d47b1e5 <Tanya Schlusser> ARROW-4313: Add 'summarized_tables_view' to the views file; improve documentation functions; re-push generated documentation files. e5ce80a <Tanya Schlusser> ARROW-4313: Add ingestion function 'ingest_benchmark_runs_with_context' and example 72132cc <Tanya Schlusser> ARROW-4313: Standard JSON (not strings) is acceptable for field entries 3fa5f3a <Tanya Schlusser> ARROW-4313: Add license to '.env' 7637cac <Tanya Schlusser> ARROW-4313: allow Travis to skip .csv files 7e0ef1d <Tanya Schlusser> ARROW-4313: Add a benchmarking database schema
1 parent 41fc38f commit 0ff776f

41 files changed

Lines changed: 4777 additions & 0 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

dev/benchmarking/.env

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one or more
2+
# contributor license agreements. See the NOTICE file distributed with
3+
# this work for additional information regarding copyright ownership.
4+
# The ASF licenses this file to You under the Apache License, Version 2.0
5+
# (the "License"); you may not use this file except in compliance with
6+
# the License. You may obtain a copy of the License at
7+
#
8+
# http://www.apache.org/licenses/LICENSE-2.0
9+
#
10+
# Unless required by applicable law or agreed to in writing, software
11+
# distributed under the License is distributed on an "AS IS" BASIS,
12+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
# See the License for the specific language governing permissions and
14+
# limitations under the License.
15+
#
16+
17+
PG_USER=benchmark
18+
PG_PASS=benchmark

dev/benchmarking/.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
/machine.json

dev/benchmarking/Dockerfile

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one
2+
# or more contributor license agreements. See the NOTICE file
3+
# distributed with this work for additional information
4+
# regarding copyright ownership. The ASF licenses this file
5+
# to you under the Apache License, Version 2.0 (the
6+
# "License"); you may not use this file except in compliance
7+
# with the License. You may obtain a copy of the License at
8+
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
11+
# Unless required by applicable law or agreed to in writing,
12+
# software distributed under the License is distributed on an
13+
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
# KIND, either express or implied. See the License for the
15+
# specific language governing permissions and limitations
16+
# under the License.
17+
#
18+
FROM postgres:11-alpine
19+
20+
# Any `.sh` and `.sql` files copied to the entrypoint directory
21+
# will be run during startup. See `docker-entrypoint.sh` in
22+
# https://github.com/docker-library/postgres/blob/master/11/alpine/
23+
COPY ddl/* /docker-entrypoint-initdb.d/

dev/benchmarking/README.md

Lines changed: 255 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,255 @@
1+
<!--
2+
~ Licensed to the Apache Software Foundation (ASF) under one
3+
~ or more contributor license agreements. See the NOTICE file
4+
~ distributed with this work for additional information
5+
~ regarding copyright ownership. The ASF licenses this file
6+
~ to you under the Apache License, Version 2.0 (the
7+
~ "License"); you may not use this file except in compliance
8+
~ with the License. You may obtain a copy of the License at
9+
~
10+
~ http://www.apache.org/licenses/LICENSE-2.0
11+
~
12+
~ Unless required by applicable law or agreed to in writing,
13+
~ software distributed under the License is distributed on an
14+
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
~ KIND, either express or implied. See the License for the
16+
~ specific language governing permissions and limitations
17+
~ under the License.
18+
-->
19+
20+
21+
> NOTE: For those deploying this database, Postgres does not by default use
22+
> UTF-8, however it is [required for the jsonb][pg-jsonb] format used in
23+
> some columns to always work. This [stackoverflow post][so-utf8] describes
24+
> how to do it for Amazon RDS. This [section of the docs][pg-charset]
25+
> states how to do it in general, i.e.: `initdb -E UTF8`.
26+
27+
# Benchmark database
28+
29+
This directory contains files related to the benchmark database.
30+
31+
- 'ddl/\*.sql' contains the database definition.
32+
- 'examples/' contain code to test the database and demonstrate its use.
33+
- 'Dockerfile' and 'docker-compose.yml' are for developing benchmarks
34+
against a testing database.
35+
- An auto-generated summary of views in the [Data model][./data_model.rst].
36+
37+
## Setup
38+
39+
To create a 'machine.json' file that will uniquely identify a computer for
40+
benchmark submission, run the provided shell script and fill in the prompts
41+
to identify the GPU.
42+
43+
> NOTE: this does not work on VMs or Windows.
44+
45+
```shell
46+
./make_machine_json.sh
47+
```
48+
49+
Submit the machine details via http using the command
50+
51+
> NOTE: This will only work if we have selected graphql as a client
52+
> and have it running in production or if during development
53+
> you have run `docker-compose up` to create and run both a
54+
> database Docker container and graphql client Docker container.
55+
56+
```shell
57+
./graphql_submit.sh machine machine.json localhost:5000/graphql
58+
```
59+
60+
or submit after starting up the psql client from this directory, using
61+
62+
```
63+
\set content `cat machine.json`
64+
SELECT ingest_machine_view(:'content'::jsonb);
65+
```
66+
67+
> NOTE: If you don't have a "machine.json" file generated,
68+
> use the example file "examples/machine.json" instead.
69+
70+
## Local testing
71+
72+
There is a file named "[.env][.env]" in this directory that is used by
73+
`docker-compose` to set up the postgres user and password for the
74+
local containers. Currentlty the name and password are both
75+
`benchmark`. This will be the password for the psql client as well.
76+
77+
The Postgres Alpine image runs any added '\*.sql' and '\*.sh' scripts placed
78+
in '/docker-entrypoint-initdb.d/' during its startup script, so the local
79+
database will be set up automatically once the container is running.
80+
81+
To start the containers, be sure to have [Docker installed][docker],
82+
and then run the following from this directory (arrow/dev/benchmarking).
83+
84+
85+
```
86+
docker-compose up
87+
```
88+
89+
This will start a process that will show logs from both the running
90+
Postgres container and the running GraphQL container.
91+
To stop the running containers gracefully, background the process
92+
and run
93+
94+
```
95+
docker-compose down
96+
fg # To re-foreground the backgrounded process while it exits
97+
```
98+
99+
You will still have the container images "benchmarking_pg",
100+
"graphile/postgraphile", and "postgres:11-alpine" on your
101+
computer. You should keep them if you want to run this again.
102+
If you don't, then remove them with the command:
103+
104+
```
105+
docker rmi benchmarking_pg postgres:11-alpine graphile/postgraphile
106+
```
107+
108+
### Postgres client
109+
110+
The `psql` shell client is bundled with the PostgreSQL core distribution
111+
available from the [Postgres download page][postgres-downloads].
112+
Using the `PG_USER` defined in the `.env` file (currently "benchmark"),
113+
the command to connect to the container is:
114+
```shell
115+
psql -h localhost -p 5432 -U benchmark
116+
```
117+
There is an example script in [examples/example.sql](examples/example.sql) that
118+
runs some queries against the database. To run it in the psql client, type
119+
the following in the psql command-line interface:
120+
121+
```
122+
\i examples/example.sql
123+
```
124+
125+
#### Bulk ingestion using CSV
126+
127+
An example CSV file for bulk ingestion is in
128+
[examples/benchmark_run_example.csv](examples/benchmark_run_example.csv).
129+
The columns are listed in the same order as they are defined, to avoid having
130+
to explicitly name every column in ingestion. The "id" column is left empty
131+
and will be automatically assigned on insert.
132+
133+
To ingest the example CSV file from the command line,
134+
use the command below:
135+
136+
```shell
137+
CSV='examples/benchmark_run_example.csv' && \
138+
psql -U benchmark -h localhost -p 5432 \
139+
-c "\copy benchmark_run_view FROM '${CSV}' WITH (FORMAT csv, HEADER);"
140+
```
141+
142+
#### Bulk ingestion using JSON
143+
144+
To ingest the example JSON file using the psql client, use the command below.
145+
146+
```
147+
\set content `cat examples/benchmark_example.json`
148+
SELECT ingest_benchmark_view(:'content'::jsonb);
149+
```
150+
151+
### HTTP client
152+
153+
This section requires an actual HTTP client to be up, either
154+
for the production database or via the testing setup.
155+
(See the [local testing section](#local-testing) for how to set it up).
156+
157+
The 'graphile/postgraphile' container provides an HTTP interface
158+
to the database via two url routes:
159+
160+
- A GraphiQL page ([localhost:5000/graphiql][graphiql])
161+
to aid visual exploration of the data model.
162+
(The `--watch` flag on the command line. Not recommended for production.)
163+
- An endpoint that receives POST requests only (localhost:5000/graphql).
164+
165+
#### Ingestion
166+
167+
The script [graphql_submit.sh](./graphql_submit.sh) simplifies submission
168+
to the database via curl. Examples:
169+
170+
```shell
171+
./graphql_submit.sh benchmarks examples/benchmark_example.json
172+
./graphql_submit.sh runs examples/benchmark_run_example.json
173+
```
174+
175+
#### Querying
176+
177+
The output of the query is a JSON object that is hard to read on the command line.
178+
Here is an example query in the shell:
179+
```shell
180+
curl -X POST \
181+
-H "Content-Type: application/json" \
182+
--data '{"query": "{projectDetails{ projectName }}"}' \
183+
localhost:5000/graphql
184+
```
185+
186+
which (if you have previously run the "examples.sql" command) yields
187+
188+
```
189+
{"data":{"projectDetails":{"projectName":"Apache Arrow"}}}
190+
```
191+
192+
Here is an example query using Python:
193+
```python
194+
import json
195+
import requests
196+
197+
uri = "http://localhost:5000/graphql"
198+
query = json.load(open("examples/graphql_query_environment_view.json"))
199+
response = requests.post(uri, json=query)
200+
message = "{benchmarkLanguage}: {languageImplementationVersion}, {dependencies}"
201+
202+
for row in response.json()['data']['allEnvironmentViews']['edges']:
203+
print(message.format(**row['node']))
204+
205+
# result:
206+
#
207+
# Python: CPython 2.7, {"six":"","numpy":"1.14","other_lib":"1.0"}
208+
# Python: CPython 2.7, {"six":"","numpy":"1.15","other_lib":"1.0"}
209+
# Python: CPython 3.6, {"boost":"1.42","numpy":"1.15"}
210+
```
211+
212+
## Deployment
213+
214+
(work in progress).
215+
216+
> NOTE: For those deploying this database, Postgres does not by default use
217+
> UTF-8, however it is [required for the jsonb][pg-jsonb] format used in
218+
> some columns to always work. This [stackoverflow post][so-utf8] describes
219+
> how to do it for Amazon RDS. This [section of the docs][pg-charset]
220+
> states how to do it in general, i.e.: `initdb -E UTF8`.
221+
222+
223+
## Quick reference
224+
225+
- String variables `'have single quotes'`
226+
- Arrays `'{"have", "curly", "braces"}'::text[]` or `'{1, 2, 3}'::integer[]`
227+
- JSONb `'{"has":"this", "format":42}'::jsonb`
228+
- Elements inserted using JSON-formatted strings can use standard
229+
JSON-formatted arrays (`[1, 2, 3]`) and do not have to use the above
230+
string formats.
231+
- When comparing nullable values use `x IS NOT DISTINCT FROM y` rather than `x = y`
232+
- An auto-generated summary of the [Data model][./data_model.rst].
233+
234+
## Data model documentation
235+
236+
To recreate the data model documentation,
237+
(1) install the [psql client][postgres-downloads]
238+
(sorry you need to download the whole thing),
239+
(2) start the docker container using `docker-compose up`,
240+
(3) and then run these scripts:
241+
242+
```
243+
./make_dotfile.sh
244+
./make_data_model_rst.sh
245+
```
246+
247+
[pg-jsonb]: https://www.postgresql.org/docs/11/datatype-json.html#id-1.5.7.22.3
248+
[so-utf8]: https://stackoverflow.com/a/33557023
249+
[pg-charset]: https://www.postgresql.org/docs/9.3/multibyte.html#AEN34424
250+
[docker]: https://www.docker.com/get-started
251+
[citext-limitations]: https://www.postgresql.org/docs/11/citext.html#id-1.11.7.17.7
252+
[postgres-downloads]: https://www.postgresql.org/download/
253+
[graphiql]: http://localhost:5000/graphiql
254+
[postgraphile-lambda]: https://github.com/graphile/postgraphile-lambda-example
255+
[postgraphile-cli]: https://www.graphile.org/postgraphile/usage-cli/

0 commit comments

Comments
 (0)