A SQL program consists of the table definitions, followed by view definitions. e.g.:
-- define Person table
CREATE TABLE Person
(
name VARCHAR NOT NULL,
age INT,
present BOOLEAN
);
CREATE VIEW Adult AS SELECT Person.name FROM Person WHERE Person.age > 18;Given this program as input, the SQL compiler generates a Rust function which implements the query as a function: given the input data, it produces the output data.
The compiler can also generate a function which will incrementally maintain the
view Adult when presented with changes to table Person:
table changes
V
tables -----> SQL-to-DBSP compiler ------> DBSP circuit
views V
view changes
The compiler is invoked using a Linux shell script, called
sql-to-dbsp, residing in the directory SQL-compiler directory.
Here is an example:
$ ./sql-to-dbsp -h
Usage: sql-to-dbsp [options] Input file to compile
Options:
--alltables
Generate an input for each CREATE TABLE, even if the table is not used
by any view
Default: false
--anonymize
Produce in the output file an anonymized version of the input program
Default: false
--correlatedColumns
Dump information about the columns that are used in join equality
comparisons
Default: false
--crates
Followed by a program name. Generates code using multiple crates;
`outputFile` is interpreted as a directory.
Default: <empty string>
--dataflow
Emit the Dataflow graph of the program in the specified JSON file
--enterprise
Generate code supporting enterprise features
Default: false
--errors
Error output file; stderr if not specified
Default: <empty string>
--handles
Use handles (true) or Catalog (false) in the emitted Rust code
Default: false
-h, --help, -?
Show this message and exit
--ignoreOrder
Ignore ORDER BY clauses at the end
Default: false
--je, -je
Emit error messages as a JSON array to the error output
Default: false
--jit
Emit a JSON representation suitable for an interpreter
Default: false
--jpg, -jpg
Emit a jpg image of the circuit instead of Rust
Default: false
--js, -js
Emit a JSON file containing the schema of all views and tables in thes
specified file.
--lenient
Lenient SQL validation. If true it allows duplicate column names in a
view
Default: false
--no-restrict-io
Do not restrict the types of columns allowed in tables and views
Default: false
--noRust
Do not generate Rust output files
Default: false
--outputsAreSets
Ensure that outputs never contain duplicates
Default: false
--plan
Emit the Calcite plan of the program in the specified JSON file
--png, -png
Emit a png image of the circuit instead of Rust
Default: false
--runtime
Followed by a path. Path to the runtime to use. Used in conjunction
with '--crates'.
Default: <empty string>
--streaming
Compiling a streaming program, where only inserts are allowed
Default: false
--trimInputs
Do not ingest unused fields of input tables
Default: false
--unaryPlusNoop
Compile unary plus into a no-operation; similar to sqlite
Default: false
-O
Optimization level (0, 1, or 2)
Default: 2
-T
Specify logging level for a class (can be repeated)
Syntax: -Tkey=value
Default: {}
-i
Generate an incremental circuit
Default: false
-o
Output file; stdout if not specified
Default: <empty string>
-q
Quiet: do not print warnings
Default: false
-v
Output verbosity
Default: 0Here is a description of the non-obvious command-line options:
--correlatedColumns: Runs a compiler analysis over the input program which detects table columns that are directly compared in equijoin comparisons. This produces an output of the form:
[Correlated:] [table0.column1, table2.column0, table3.column4]
Note that a list of columns may contain more than two
table.column pairs, when some columns are used in sequences of
joins.
--handles: The Rust generated code can expose the input tables and
output views in two ways: through explicit handles, and through a
Catalog object. The catalog allows one to retrieve the handles
by name, but offers only untyped handles, that require a
serializer (for output handles) or a deserializer (for input
handles) to transmit data. The handles API gives access to typed
stream handles, which allow data insertion and retrieval without
using serialization/deserialization.
--ignoreOrder: ORDER BY clauses are not naturally incrementalizable.
Using this flag directs the compiler to ignore ORDER BY clauses
that occur last in a view definition (thus giving unsorte
outputs). This will not affect ORDER BY clauses in an OVER
clause, or ORDER BY clauses followed by LIMIT clauses. The
use of this flag is recommended with the -i flag that
incrementalizes the compiler output.
--lenient: Some SQL queries generate output views having multiple columns with the same name. Such views can cause problems with other tools that interface with the compiler outputs. By default, the compiler will emit an error when given such views. For example, the following definition:
`CREATE VIEW V AS SELECT T.COL2, S.COL2 from T, S`
will create a view with two columns named `COL2`. The workaround is to
explicitly name the view columns, e.g.:
`CREATE VIEW V AS SELECT T.COL2 AS TCOL2, S.COL2 AS SCOL2 from T, S`
Using the `--lenient` flag will only emit warnings, but compile such programs.
--plan: Generates a JSON structure for each view compiled, representing the Calcite optimized plan for computing the view
--outputsAreSets: SQL queries can produce outputs that contain duplicates, but
such outputs are rarely useful in practice. Using this flag will ensure that
each output VIEW does not contain duplicates. This can also be ensured by
using a SELECT DISTINCT statement in the view definition. An example query
that can produce duplicates is:
`CREATE VIEW V AS SELECT T.COL1 FROM T`
-O: sets the optimization level. A higher values implies more optimizations.
-d: Sets the lexical rules used. SQL dialects differ in rules for allowed identifiers, quoting identifiers, conversions to uppercase, case sensitivity of identifiers.
--streaming: Equivalent to adding the following property to all program tables:
'appendOnly' = 'true'.
The following command-line compiles a script called x.sql and writes
the result in a file lib.rs:
$ ./sql-to-dbsp x.sql --handles -o ../temp/src/lib.rsLet's assume we are compiling a file containing the program in the example above.
In the generated program every CREATE TABLE is translated to an
input, and every CREATE VIEW is translated to an output. The result
produced will look like this^[1]:
[1] Note: the compiler output changes as the compiler implementation evolves. This code is shown for illustrative purposes only.
$ cat ../temp/src/lib.rs
// Automatically-generated file
[...boring stuff removed...]
fn circuit(workers: usize) -> Result<(DBSPHandle, (CollectionHandle<Tup3<String, Option<i32>, Option<bool>>, Weight>, OutputHandle<SpineSnapshot<OrdZSet<Tup1<String>, Weight>>>, )), DbspError> {
let (circuit, streams) = Runtime::init_circuit(workers, |circuit| {
// CREATE TABLE `PERSON` (`NAME` VARCHAR NOT NULL, `AGE` INTEGER, `PRESENT` BOOLEAN)
#[derive(Clone, Debug, Eq, PartialEq)]
struct r#PERSON_0 {
r#field: String,
r#field_0: Option<i32>,
r#field_1: Option<bool>,
}
impl From<PERSON_0> for Tup3<String, Option<i32>, Option<bool>> {
fn from(table: r#PERSON_0) -> Self {
Tup3::new(table.r#field,table.r#field_0,table.r#field_1,)
}
}
impl From<Tup3<String, Option<i32>, Option<bool>>> for r#PERSON_0 {
fn from(tuple: Tup3<String, Option<i32>, Option<bool>>) -> Self {
Self {
r#field: tuple.0,
r#field_0: tuple.1,
r#field_1: tuple.2,
}
}
}
deserialize_table_record!(PERSON_0["PERSON", 3] {
(r#field, "NAME", false, String, None),
(r#field_0, "AGE", false, Option<i32>, Some(None)),
(r#field_1, "PRESENT", false, Option<bool>, Some(None))
});
serialize_table_record!(PERSON_0[3]{
r#field["NAME"]: String,
r#field_0["AGE"]: Option<i32>,
r#field_1["PRESENT"]: Option<bool>
});
// DBSPSourceMultisetOperator 312(32)
// CREATE TABLE `PERSON` (`NAME` VARCHAR NOT NULL, `AGE` INTEGER, `PRESENT` BOOLEAN)
let (PERSON, handlePERSON) = circuit.add_input_zset::<Tup3<String, Option<i32>, Option<bool>>, Weight>();
// rel#36:LogicalFilter.(input=LogicalTableScan#1,condition=>($1, 18))
// DBSPFilterOperator 332(57)
let stream3: Stream<_, OrdZSet<Tup3<String, Option<i32>, Option<bool>>, Weight>> = PERSON.filter(move |t: &Tup3<String, Option<i32>, Option<bool>>, | ->
bool {
wrap_bool(gt_i32N_i32((*t).1, 18i32))
});
// rel#38:LogicalProject.(input=LogicalFilter#36,inputs=0)
// DBSPMapOperator 350(79)
let stream4: Stream<_, OrdZSet<Tup1<String>, Weight>> = stream3.map(move |t: &Tup3<String, Option<i32>, Option<bool>>, | ->
Tup1<String> {
Tup1::new((*t).0.clone())
});
// CREATE VIEW `ADULT` AS
// SELECT `PERSON`.`NAME`
// FROM `schema`.`PERSON` AS `PERSON`
// WHERE `PERSON`.`AGE` > 18
#[derive(Clone, Debug, Eq, PartialEq)]
struct r#ADULT_0 {
r#field: String,
}
impl From<ADULT_0> for Tup1<String> {
fn from(table: r#ADULT_0) -> Self {
Tup1::new(table.r#field,)
}
}
impl From<Tup1<String>> for r#ADULT_0 {
fn from(tuple: Tup1<String>) -> Self {
Self {
r#field: tuple.0,
}
}
}
deserialize_table_record!(ADULT_0["ADULT", 1] {
(r#field, "NAME", false, String, None)
});
serialize_table_record!(ADULT_0[1]{
r#field["NAME"]: String
});
let handleADULT = stream4.accumulate_output();
Ok((handlePERSON, handleADULT, ))
})?;
Ok((circuit, streams))
}You can compile the generated Rust code:
$ cd ../temp
$ cargo buildThe generated file contains a Rust function called circuit (you can
change its name using the compiler option -f). Calling circuit
will return an executable DBSP circuit handle, and a tuple containing
a handle for each input and output stream, in the order they are
declared in the SQL program.
We can write a unit test to exercise this circuit:
#[test]
pub fn test() {
let (mut circuit, (person, adult) ) = circuit(2).unwrap();
// Feed two input records to the circuit.
// First input has a count of "1"
person.push( (SqlString::from_ref("Bob"), Some(12), Some(true)).into(), 1 );
// Second input has a count of "2"
person.push( (SqlString::from_ref("Tom"), Some(20), Some(false)).into(), 2 );
// Execute the circuit on these inputs
circuit.transaction().unwrap();
// Read the produced output
let out = adult.consolidate();
// Print the produced output
println!("{}", out);
}The unit test can be exercised with cargo test -- --nocapture.
This will print the output as:
layer:
("Tom",) -> 2
In general a circuit will connect with external data sources, and thus
will need to convert the data to/from other representations. Here is
an example using the Catalog circuit API. We compile the same
program as before, with different command-line flags:
$ ./sql-to-dbsp x.sql -i -o ../temp/src/lib.rsThis time we are producing an incremental version of the circuit. The difference between this circuit and the previous one is as follows:
-
For the non-incremental circuit, every time we supply an input value, the table
PERSONSis cleared and filled with the supplied value. Thecircuit.transaction()function computes the contents of the output viewADULTS. Reading the output handle gives us the entire contents of this view. -
For the incremental circuit, the
PERSONStable is initially empty. Every time we supply an input it is added to the table. (Input encoding formats such asJSONcan also specify deletions from the table.) Thecircuit.transaction()function computes the changes to the output viewADULTS. Reading the output handle gives us the latest changes to the contents of this view. Formats such as CSV cannot represent deletions, so they may be insufficient for representing changes. In the following example we input only insertions using CSV, but the output is received in a JSON format which can describe deletions too.
We exercise this circuit by inserting data using a CSV format:
#[test]
pub fn test() {
use dbsp_adapters::{CircuitCatalog, RecordFormat};
use use feldera_types::format::csv::CsvParserConfig;
let (mut circuit, catalog) = circuit(2)
.expect("Failed to build circuit");
let persons = catalog
.input_collection_handle(&SqlIdentifier::from("PERSON"))
.expect("Failed to get input collection handle");
let mut persons_stream = persons
.handle
.configure_deserializer(RecordFormat::Csv(Default::default())
.expect("Failed to configure deserializer");
persons_stream
.insert(b"Bob,12,true", &None)
.expect("Failed to insert data");
persons_stream
.insert(b"Tom,20,false", &None)
.expect("Failed to insert data");
persons_stream
.insert(b"Tom,20,false", &None)
.expect("Failed to insert data"); // Insert twice
persons_stream.flush();
// Execute the circuit on these inputs
circuit
.transaction()
.unwrap();
let adult = &catalog
.output_handles(&SqlIdentifier::from("ADULT"))
.expect("Failed to get output collection handles")
.delta_handle;
// Read the produced output
let reader = adult.concat().consolidate();
let mut cursor = reader
.cursor(RecordFormat::Csv(CsvParserConfig::default()))
.unwrap();
while cursor.key_valid() {
let mut w = cursor.weight();
println!("{}: {}", cursor.key_to_json().unwrap(), w);
cursor.step_key();
}
}The -js compiler flag is followed by a file name. If the flag is
supplied, the compiler will write information about the input tables
and output views in JSON in the supplied file name. Here is an
example of the generated JSON for the following program:
CREATE TABLE T (
COL1 INT NOT NULL
, COL2 DOUBLE NOT NULL FOREIGN KEY REFERENCES S(COL0)
, COL3 VARCHAR(3) NOT NULL PRIMARY KEY
, COL4 VARCHAR(3) ARRAY
)
CREATE VIEW V AS SELECT COL1 AS xCol FROM T
CREATE VIEW V1 (yCol) AS SELECT COL1 FROM T
```sql
Output:
```json
{
"inputs" : [ {
"name" : "t",
"fields" : [ {
"name" : "col1",
"case_sensitive" : false,
"columntype" : {
"type" : "INTEGER",
"nullable" : false
}
}, {
"name" : "col2",
"case_sensitive" : false,
"columntype" : {
"type" : "DOUBLE",
"nullable" : false
}
}, {
"name" : "col3",
"case_sensitive" : false,
"columntype" : {
"type" : "VARCHAR",
"nullable" : false,
"precision" : 3
}
}, {
"name" : "col4",
"case_sensitive" : false,
"columntype" : {
"type" : "ARRAY",
"nullable" : true,
"component" : {
"type" : "VARCHAR",
"nullable" : false,
"precision" : 3
}
}
} ],
"primary_key" : [ "col3" ],
"materialized" : false,
"foreign_keys" : [ {
"columns" : [ "col2" ],
"refers" : "s",
"tocolumns" : [ "col0" ]
} ]
} ],
"outputs" : [ {
"name" : "v",
"fields" : [ {
"name" : "xCol",
"case_sensitive" : false,
"columntype" : {
"type" : "INTEGER",
"nullable" : false
}
} ]
}, {
"name" : "V1",
"fields" : [ {
"name" : "yCol",
"case_sensitive" : true,
"columntype" : {
"type" : "INTEGER",
"nullable" : false
}
} ]
} ]
}