modified the listFiles() function #736
@icaiyu Thanks for your contribution. But why do you think an alphabetical order is the "correct" order? The order needs to be consistent per run (to resume), but it does not actually need to be alphabetical. AFAIU, the current ordering is at least consistent (per file system or else), even if it may not be alphabetical. An alphabetical order may be nice to have for usability, but its absence is not a "problem".

I think we cannot accept this PR as-is because it may break compatibility for users who resume. Consider that a user has used … An acceptable option may be to have a new configuration option like … What do you think?
|
@dmikurube |
|
@icaiyu Can you share a real example of what you observed? As I mentioned above, the order may not be alphabetical, but I don't think that would be a problem when resuming.
|
@icaiyu Summary.

How to resume?
Let's explain about … You can download all of the sample files from here.

Input Data
The following example uses three files. Line 3 in …

run
So … resume file.

After fail
The …

in_task:
  FileInputTaskSource:
    Files: [hoge/csv/sample_01.csv, hoge/csv/sample_02.csv, hoge/csv/sample_03.csv]
in_reports:
- {} # Task1
- null # Task2 <-- this task did not load completely; the 2nd file is hoge/csv/sample_02.csv
- {} # Task3
out_reports:
- {}
- {}
- null
- null
- {}
- {}
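To illustrate why the file order only has to be consistent between the first run and the resumed run, here is a minimal Java sketch of how the resume state above can be read. It is not Embulk's implementation: the class `ResumeStateSketch` and the method `unfinishedFiles` are hypothetical names, and the sketch assumes (as the comments in the resume file suggest) that the i-th entry of `in_reports` belongs to the i-th entry of `Files`, with `null` meaning the task did not finish.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Embulk. It models the resume file as two
// parallel lists: inReports.get(i) == null means task i did not complete.
class ResumeStateSketch {

    // Returns the files whose tasks have no report yet, in task order.
    static List<String> unfinishedFiles(List<String> files, List<Object> inReports) {
        if (files.size() != inReports.size()) {
            throw new IllegalArgumentException("files and in_reports must have the same length");
        }
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            if (inReports.get(i) == null) {
                pending.add(files.get(i));
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "hoge/csv/sample_01.csv", "hoge/csv/sample_02.csv", "hoge/csv/sample_03.csv");
        // Task1 and Task3 reported; Task2 did not.
        List<Object> inReports = Arrays.asList(new Object(), null, new Object());
        System.out.println(unfinishedFiles(files, inReports)); // prints [hoge/csv/sample_02.csv]
    }
}
```

Because reports are matched to files by index, any stable order works for resuming; if the order changed between runs, the `null` entry would point at the wrong file.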
|
@icaiyu Apart from this issue itself, we've found that the … We'll be discussing it at #740. Everything is TBD, but our future options may include removing the resume feature. If you could share your practical, real-world use cases and requirements for the resume feature, they would be very helpful for us in deciding, and maybe in redesigning it. Thanks!
|
@hiroyuki-sato

My sample
sample_01.csv.gz, sample_02.csv.gz, sample_03.csv.gz, ... , sample_07.csv.gz

My config.yml
in:

My run
embulk run config.yml -o resume.yml
2017-07-24 10:33:20.339 +0200: Embulk v0.8.27

Expected result
Since sample_04.csv.gz has an invalid record, … so that I can fix the invalid record in sample_04.csv.gz

Actual result
in:
|
Your … You have to set … Your … The resume file is generated if …
|
@hiroyuki-sato |
|
As @dmikurube mentioned above, … Probably, … That's why he asked you below.
Maybe you can't delete …
|
@hiroyuki-sato In my case, the order is very important: Embulk should process the files one by one. Embulk should stop and record the … Then I fix the invalid data in sample_04.csv and run with resume.yml, and Embulk should start again: sample_04.csv OK, upload it to the server. It seems that Embulk doesn't work the way I want. I am trying to solve this problem in my plugin. Maybe I can write a dedicated resume() function in my plugin.
|
If you want to load those files one by one, I recommend the following.

#!/bin/bash
# Process the queued files one at a time and stop at the first failure.
while [ $( ls queue | wc -l ) -ne 0 ] ; do
  # Move the first queued file into the datas directory read by the input config.
  ls -1 queue/sample* | head -1 | xargs -I % mv % datas
  embulk run conf.yml -c diff.yml >> embulk.log 2>&1
  ret=$?
  if [ $ret -ne 0 ] ; then
    echo "Failed: $ret" >> embulk.log
    exit 1
  fi
  echo ""
done

in:
  type: file
  path_prefix: datas/sample_
  # ...
Use an ArrayList instead of an ImmutableList for the files so that we can sort them before setting them on the task.
To load the files in exactly the correct order, max_threads should be 1. With multiple threads, it is impossible to guarantee that the files are loaded in the correct order.
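The idea in this description can be sketched as follows. This is a minimal illustration and not the PR's actual diff: `SortedFileListSketch` and `sortedCopy` are hypothetical names, and `Collections.unmodifiableList` stands in for Guava's `ImmutableList` only to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the PR's code: collect the listed files in a
// mutable ArrayList, sort them, and only then freeze the list for the task.
class SortedFileListSketch {

    static List<String> sortedCopy(List<String> listed) {
        List<String> files = new ArrayList<>(listed); // mutable copy, so it can be sorted
        Collections.sort(files);                      // natural (lexicographic) String order
        return Collections.unmodifiableList(files);   // read-only view handed to the task
    }

    public static void main(String[] args) {
        // The file system may return entries in any order.
        List<String> listed = Arrays.asList(
                "sample_03.csv.gz", "sample_01.csv.gz", "sample_02.csv.gz");
        System.out.println(sortedCopy(listed));
        // prints [sample_01.csv.gz, sample_02.csv.gz, sample_03.csv.gz]
    }
}
```

Natural String order is lexicographic, so it matches numeric order only for zero-padded names such as `sample_01`; a name like `sample_10` would sort before `sample_2`.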