Add the inequalities to the leftjoin() and leftjoin!() functions.#85
Add the inequalities to the leftjoin() and leftjoin!() functions.#85kyo227 wants to merge 10 commits intosl-solution:masterfrom
Conversation
|
Please add tests for you functions. |
…r) in LeftJoin, that the ranges and inbit do not match, so the result will have extra lines. Last version use the filter() function, which greatly affected the efficency, so i changed it.
|
I've added tests to |
|
This is a good PR, however, I have some concerns about the way you implemented it.
|
src/join/join.jl
Outdated
| cnt = 1 | ||
| cnt_r = 1 | ||
| if length(r2) == 0 | ||
| _res = allowmissing(_res) |
There was a problem hiding this comment.
this shouldn't be here. If _res needs to support missing, it must be set when it is initialised.
src/join/join.jl
Outdated
| end | ||
| revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads) | ||
| inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
| ra = Vector{UnitRange{T}}(undef, nrow(dsl)) |
There was a problem hiding this comment.
ra is the lo:hi computed from the original new_ends, which corresponds to the inbits. For example, inbits = [1,0,1,0,0], ra = [1:1, 2:2, 3:4, 5:5], that means the 4th rows need to be deleted. And i modify new_ends later to get the right number of rows.
So i can use new_ends to calculate the lo:hi in _fill_right_cols_table_left!(). I have to calculate it in there.
There was a problem hiding this comment.
if you calculate new_ends = map(x->max(length(x), 1), ranges) in innerjoin and when ranges is 1:0 the corresponding row in the left dataset be set as missing in the output then it should be fine. Additionally, when for a row all inbits are false the output dataset should produce missing. So, the point is with a little modification ofinnerjoin we can achieve left range join.
There was a problem hiding this comment.
The ranges only check the left part of inequality(type = :left). For example, 2 => (:low, :high), the ranges is the result of 2 => (:low, nothing). So, maybe some rows shouldn't be in the result, but its ranges are not 1:0. So, the new_ends and the inbits can not find the correct number of rows for res. We need to calculate the lo2:hi2 to find the correct size of res. And inbits can just show how many rows can be matched, we don't know which rows don't match at all(these rows need to be add to result with missing).
There was a problem hiding this comment.
you may use new_ends = map(x->max(length(x), 1), ranges) and exploits inbits to figure out which rows do not have matching rows in the right data set.
There was a problem hiding this comment.
After we get the number of rows in result, the method in innerjoin use the r2.start(r2 is lo2:hi2) as the index of _res. But lo2:hi2 do not have the missing rows.

The _res[2] should be missing, but if we use the method in innerjoin, the _res[2] will be the 3rd rows.

So, in my code, i calculate the new new_ends as the index of _res. but the old new_ends is also needed, so ra is the old new_ends .
There was a problem hiding this comment.
I see.
However, I am thinking something like this:
Suppose we have (:lower, :upper) condition and ranges are [2:3,1:0,4:5,1:0] so we creates new_ends as cumsum([2,1,2,1]) and inbits is [0,0,0,0,0,0]. _mark_lt_part! will works on inbits in locations [1:2, 4:5]. Suppose that the first one, i.e. 1:2 does not have any matching rows but the second one has one. So _mark_lt_part! will set the first element of ranges to 1:0 and inbits will be something like [0,0,0,0,1,0]. At this moment we know that ragnes with values 1:0 only exist in the left data set and inbits will help us to figure out the matching rows in the right data set.
There was a problem hiding this comment.
But it is hard to find the correct location of _res by that way. Because in a single loop for one row, we only know the ranges, lo:hi and lo2:hi2 for that row. We don't know how many rows in previous rows because of multithreading. For your example, ranges is [1:0, 1:0, 4:5, 1:0], for 3rd rows, we just know lo:hi is 4:5, lo2:hi2 is 1:1. We can not know where the data in 3rd row goes in the _res.
And this is why I calculated lo:hi and lo2:hi2 in fornt of the loop of rows, and use map(x -> max(1, length(x)), inbit_ranges) to get the starting position of each row.
There was a problem hiding this comment.
I see your point, however, new_ends shows us exactly which rows of inbits belongs to a specific row in the left table, e.g. in the example above new_ends = [2,3,5,6], thus the first row in the left table has two rows in inbits, additionally we already know that row 1 is 1:0 and it only creates one row in the output data set, and so on.
src/join/join.jl
Outdated
| revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads) | ||
| inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
| ra = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
| @_threadsfor threads for i in 1:length(ranges) |
There was a problem hiding this comment.
put the core computations inside a separate function - it helps reducing allocations
There was a problem hiding this comment.
Not sure if we need this at all.
…s the test of test/join.jl. leftjoin!() may be not necessary to implement the range join function, so i didn't change it. I write a function named _ranges_join(), which can implement the leftjoin and innerjoin. _ranges_join() use the parameter join_type to distinguished the leftjoin and innerjoin.
|
I made changes to the implementation of range join in leftjoin and it passed the test in test/join.jl. |
|
Sorry, I found some problems in my code. This code is not ready yet. |

I refer to the inequality that implementation on the on parameter in innerjoin() function, then implemented it on the leftjoin() and leftjoin!() function.
Now, we can use leftjoin like this: leftjoin(dsl,dsr,on = [1=>1, 2=>(:lower,nothing)]).
The methods used to find ranges and idx are the same as the methods in innerjoin, i just modified the way that uses ranges and idx to generate the result dataset in leftjoin().
I have done some testing, and this function works fine, and the run time and allocations are similar with innerjoin().