modified the listFiles() function #736

Open
icaiyu wants to merge 1 commit into embulk:master from icaiyu:master

@icaiyu icaiyu commented Jul 21, 2017

This change uses an ArrayList instead of an ImmutableList for the files so that they can be sorted before they are set on the task.
To load the files in exactly the sorted order, max_threads should equal 1. With multiple threads, the files cannot be loaded in that order.
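
As a rough illustration of the idea described above (this is not the actual diff in this PR, and not Embulk's real listFiles()), the sketch below collects matching paths into a mutable ArrayList and sorts them before returning. The class name SortedFileListing, the method listFilesSorted, and its parameters are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Embulk.
final class SortedFileListing {
    // Lists regular files under `directory` whose file name starts with
    // `fileNamePrefix`, then sorts the paths lexicographically.
    static List<String> listFilesSorted(Path directory, String fileNamePrefix) throws IOException {
        // A mutable list, so it can be sorted after the walk finishes.
        List<String> files = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(directory)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().startsWith(fileNamePrefix))
                 .forEach(p -> files.add(p.toString()));
        }
        // The walk order depends on the file system, so sort explicitly.
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String prefix = args.length > 1 ? args[1] : "sample_";
        listFilesSorted(dir, prefix).forEach(System.out::println);
    }
}
```

Sorting only fixes the order of the task list. As noted above, the order in which files are actually loaded still depends on how many threads run the tasks.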

…tableList for the files so that we can sort them before set them to the task
@dmikurube
Member

@icaiyu Thanks for your contribution.

But why do you think an alphabetical order is the "correct" order? It needs to be consistent per run (to resume), but it does not need to be an alphabetical order actually. AFAIU, the current ordering is at least consistent (per file system or else) even if it may not be in alphabetical order. It may be nice to have an alphabetical order just for usability, but it's not a "problem".

I think we cannot accept this PR as-is because it may break compatibility for users who resume. Consider a user who has run LocalFileInput with the current Embulk version and resumed with some files done. The config diff is generated by the current version. Imagine the user upgrades Embulk to a new version (with this PR merged) and resumes with the config diff generated by the older Embulk version. The behavior can be totally "broken": some files may be uploaded twice, and some files may never be uploaded. That is a bigger problem than a non-alphabetical ordering.

An acceptable option may be to add a new configuration option, such as sort: alphabetical, so that existing behavior is not broken.

What do you think?
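
One way to picture the opt-in option suggested above is the sketch below. It assumes a hypothetical sort setting whose only supported value is "alphabetical"; neither the option nor the FileOrdering class exists in Embulk. When the option is absent, the listing order is returned unchanged, so existing configs and config diffs would see the same order as before.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of an opt-in sort option; not an Embulk API.
final class FileOrdering {
    // Assumed option value, taken from the "sort: alphabetical" suggestion.
    static final String ALPHABETICAL = "alphabetical";

    static List<String> applySortOption(List<String> listedFiles, Optional<String> sortOption) {
        // Copy first so the caller's list is never modified.
        List<String> files = new ArrayList<>(listedFiles);
        if (!sortOption.isPresent()) {
            // Default: keep the listing order, so existing behavior is unchanged.
            return files;
        }
        if (ALPHABETICAL.equals(sortOption.get())) {
            Collections.sort(files);
            return files;
        }
        throw new IllegalArgumentException("Unsupported sort option: " + sortOption.get());
    }
}
```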

@icaiyu
Author

icaiyu commented Jul 23, 2017

@dmikurube
Thank you for your reply, but I am now more confused about Embulk's resume function.
Say that I have 9 files:
sample_01.csv, sample_02.csv, sample_03.csv, ..., sample_09.csv
I want to upload them to the server using the LocalFileInput plugin.
After Embulk uploads 4 files, what if the network goes down, or I press Ctrl+C to stop the upload? Can Embulk record which sample files have not been uploaded yet?
In this case, it should record sample_04.csv in the diff,
so that I could resume the upload from sample_05.csv.

@dmikurube
Member

dmikurube commented Jul 24, 2017

@icaiyu Can you share a real example of what you observed? As I mentioned above, the order may not be alphabetical, but I don't think that would be a problem when resuming.

@hiroyuki-sato
Member

hiroyuki-sato commented Jul 24, 2017

@icaiyu
I hope this comment answers your question.

Summary

  • LocalFileInput loads data in parallel (number of files = number of tasks).
  • So the data is not always loaded in alphabetical order.

How to resume?

Let me explain the resume behavior.

You can download all of the sample files from here.

Input Data

The following example uses three files.

hoge/csv/sample_01.csv
hoge/csv/sample_02.csv <-- this file has an invalid record
hoge/csv/sample_03.csv

Line 3 in sample_02.csv has an invalid record.
aaa isn't a valid number.

aaa,14824,2015-01-27 19:01:23,20150127,embulk jruby

Run

So embulk run fails.

embulk run -r resume.yml config.yml

[INFO] Writing resume state to 'resume.yml'
[INFO] Resume state is written. Run the transaction again with -r option to resume or use "cleanup" subcommand to delete intermediate data.
java.lang.RuntimeException: org.embulk.spi.DataException: Invalid record at line 3: aaa,14824,2015-01-27 19:01:23,20150127,embulk jruby
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:360)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:173)
	at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:475)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:99)
	at org.embulk.cli.EmbulkBundle.checkBundleWithEmbulkVersion(EmbulkBundle.java:42)
	at org.embulk.cli.EmbulkBundle.checkBundle(EmbulkBundle.java:15)
	at org.embulk.cli.Main.main(Main.java:26)

Resume file

After embulk run fails, Embulk generates the resume.yml file.
In this case, LocalFileInput loaded sample_01.csv and sample_03.csv correctly,
but sample_02.csv has not been loaded yet.

A null entry in the in_reports part means "that task has not completed yet".

in_task:
  FileInputTaskSource:
    Files: [hoge/csv/sample_01.csv, hoge/csv/sample_02.csv, hoge/csv/sample_03.csv]
in_reports:
- {}   # Task1
- null # Task2 <-- this task did not complete; the 2nd file is hoge/csv/sample_02.csv
- {}   # Task3 
out_reports:
- {}
- {}
- null
- null
- {}
- {}
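
To make the reading of the resume state above concrete, here is a minimal sketch that pairs each entry of Files with the in_reports entry at the same position and returns the files whose report is null. It assumes, as the example above suggests, that task i corresponds to the i-th file. ResumeStateInspector and unfinishedFiles are hypothetical names, and the sketch works on already-parsed lists rather than reading resume.yml itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not an Embulk API.
final class ResumeStateInspector {
    // Returns the files whose in_reports entry is null, i.e. tasks that
    // had not completed when the resume state was written.
    static List<String> unfinishedFiles(List<String> files, List<Map<String, Object>> inReports) {
        // Assumption: one input task per file, in the same order.
        if (files.size() != inReports.size()) {
            throw new IllegalArgumentException("Expected one in_reports entry per file");
        }
        List<String> unfinished = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            if (inReports.get(i) == null) {
                unfinished.add(files.get(i));
            }
        }
        return unfinished;
    }
}
```

For the resume.yml shown above, the three files with reports {}, null, {} would yield only hoge/csv/sample_02.csv.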

@dmikurube
Member

@icaiyu Apart from this issue itself, we've found that the -r option was not working very well. It turned out that very very few people have actually tried using the resume feature... including ourselves as well.

We'll be discussing it at #740. Everything is TBD, but our future options may include removing the resume feature.

If you could share your practical and real use-cases/requirements for the resume feature, those may be very helpful for us to decide, and maybe to re-design. Thanks!

@icaiyu
Author

icaiyu commented Jul 24, 2017

@hiroyuki-sato
The -r option doesn't generate resume.yml.
I tried:
embulk run config.yml -o resume.yml

My sample

sample_01.csv.gz, sample_02.csv.gz, sample_03.csv.gz, ... , sample_07.csv.gz
sample_04.csv.gz has an invalid record.

My config.yml

in:
  type: file
  path_prefix: /home/cy/data/try1/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}

My run

embulk run config.yml -o resume.yml

2017-07-24 10:33:20.339 +0200: Embulk v0.8.27
2017-07-24 10:33:20.531 +0200: Run with -o option is deprecated. Please use -c option instead. For example,
2017-07-24 10:33:20.541 +0200:
2017-07-24 10:33:20.541 +0200: $ embulk run config.yml -c diff.yml
2017-07-24 10:33:20.541 +0200:
2017-07-24 10:33:20.542 +0200: This -c option stores only diff of the next configuration.
2017-07-24 10:33:20.542 +0200: The diff will be merged to the original config.yml file.
2017-07-24 10:33:20.544 +0200:
2017-07-24 10:33:23.091 +0200 [INFO] (0001:transaction): Listing local files at directory '/home/cy/data/try1/csv' filtering filename by prefix 'sample_'
2017-07-24 10:33:23.097 +0200 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-07-24 10:33:23.099 +0200 [INFO] (0001:transaction): Loading files [/home/cy/data/try1/csv/sample_05.csv.gz, /home/cy/data/try1/csv/sample_02.csv.gz, /home/cy/data/try1/csv/sample_03.csv.gz, /home/cy/data/try1/csv/sample_07.csv.gz, /home/cy/data/try1/csv/sample_06.csv.gz, /home/cy/data/try1/csv/sample_04.csv.gz, /home/cy/data/try1/csv/sample_01.csv.gz]
2017-07-24 10:33:23.177 +0200 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=7
2017-07-24 10:33:23.193 +0200 [INFO] (0001:transaction): {done: 0 / 7, running: 0}
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
4,11270,2015-01-29 11:54:36,20150129,
2017-07-24 10:33:23.417 +0200 [INFO] (0001:transaction): {done: 2 / 7, running: 1}
2017-07-24 10:33:23.417 +0200 [INFO] (0001:transaction): {done: 2 / 7, running: 1}
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2017-07-24 10:33:23.446 +0200 [INFO] (0001:transaction): {done: 4 / 7, running: 1}
2017-07-24 10:33:23.447 +0200 [INFO] (0001:transaction): {done: 4 / 7, running: 1}
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2017-07-24 10:33:23.451 +0200 [INFO] (0001:transaction): {done: 5 / 7, running: 1}
1,32864,2015-01-27 19:23:49,20150127,embulk
2017-07-24 10:33:23.458 +0200 [WARN] (0015:task-0005): Skipped line 3 (java.lang.NumberFormatException: For input string: "aaa"): aaa,14824,2015-01-27 19:01:23,20150127,embulk jruby
1,32864,2015-01-27 19:23:49,20150127,embulk
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2017-07-24 10:33:23.464 +0200 [INFO] (0001:transaction): {done: 6 / 7, running: 1}
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2017-07-24 10:33:23.466 +0200 [INFO] (0001:transaction): {done: 7 / 7, running: 0}
2017-07-24 10:33:23.470 +0200 [INFO] (main): Committed.
2017-07-24 10:33:23.470 +0200 [INFO] (main): Next config diff: {"in":{"last_path":"/home/cy/data/try1/csv/sample_07.csv.gz"},"out":{}}

Expected result

Since sample_04.csv.gz has an invalid record,
I expect Embulk to stop at sample_04.csv.gz.
Thus resume.yml should contain
last_path: sample_04.csv.gz

so that I can fix the invalid record in sample_04.csv.gz
and then continue from sample_04.csv.gz.

Actual result

in:
  type: file
  path_prefix: /home/cy/data/try1/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
  last_path: /home/cy/data/try1/csv/sample_07.csv.gz
out: {type: stdout}

@hiroyuki-sato
Member

@icaiyu

  • Add stop_on_invalid_record: true to your config.yml
  • Execute embulk run -r resume.yml config.yml

Your embulk run will fail and generate resume.yml.

You have to set stop_on_invalid_record: true. Otherwise, Embulk skips invalid records.
https://github.com/hiroyuki-sato/embulk-support/blob/master/core_736_resume_test/config.yml#L16

Your embulk run generated the following line.

2017-07-24 10:33:23.458 +0200 [WARN] (0015:task-0005): Skipped line 3 (java.lang.NumberFormatException: For input string: "aaa"): aaa,14

The resume file is generated when embulk run fails.

@icaiyu
Author

icaiyu commented Jul 24, 2017

@hiroyuki-sato
Thank you, I got the resume.yml now. What if I fix the invalid record in sample_04.csv.gz (for example, by deleting that record manually)?
Could I run Embulk again so that it parses only sample_04.csv.gz?

@hiroyuki-sato
Member

@icaiyu

As @dmikurube mentioned above,
#736 (comment)
we've found that the -r option was not working very well.

So the -r option probably does not work as you expect.

That's why he asked you the following.

If you could share your practical and real use-cases/requirements for the resume feature, those may be very helpful for us to decide, and maybe to re-design. Thanks!

You probably can't delete sample_04.csv.gz; the file needs to exist even if it is empty.
But I've never tested that.

@icaiyu
Author

icaiyu commented Jul 24, 2017

@hiroyuki-sato
I mean that sample_04.csv contains many records and only some of them are invalid. After I fix those invalid records, I hope that Embulk can continue with sample_04.csv, and then sample_05.csv, sample_06.csv, ...

In my case the order is very important; Embulk should process the files one by one:
sample_01.csv OK. Upload it to the server.
sample_02.csv OK. Upload it to the server.
sample_03.csv OK. Upload it to the server.
sample_04.csv Oh, there are some invalid records in this file!

Embulk should stop and record
"last_path: sample_04.csv" in resume.yml.

Then I fix the invalid data in sample_04.csv and run with resume.yml, and Embulk should start again.

sample_04.csv OK. Upload it to the server.
sample_05.csv OK. Upload it to the server.
..
sample_09.csv OK. Upload it to the server.

It seems that Embulk doesn't work the way I want. I am trying to solve this problem in my plugin; maybe I can write a resume() function there.
Anyway, thank you for your replies. They really helped me understand how Embulk works.
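
The one-by-one behavior described above could look roughly like the sketch below. It is an illustration under stated assumptions, not Embulk's resume logic: SequentialLoader, loadInOrder, and the loadOneFile callback are hypothetical, and the rule "skip every path that sorts at or before last_path" is assumed for the example. Unlike the wording above, the sketch records the last file that loaded successfully (sample_03.csv), so that the failed file is retried on the next run.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of sequential, stop-at-first-failure loading.
final class SequentialLoader {
    // Loads files in the given (already sorted) order, skipping every path
    // that sorts at or before lastPath, and stops at the first failure.
    // Returns the last successfully loaded path, or lastPath if none loaded.
    static Optional<String> loadInOrder(
            List<String> sortedFiles,
            Optional<String> lastPath,
            Predicate<String> loadOneFile) {
        Optional<String> lastLoaded = lastPath;
        for (String file : sortedFiles) {
            if (lastPath.isPresent() && file.compareTo(lastPath.get()) <= 0) {
                continue; // loaded in a previous run
            }
            if (!loadOneFile.test(file)) {
                break; // stop here so the failed file is retried next time
            }
            lastLoaded = Optional.of(file);
        }
        return lastLoaded;
    }
}
```

Feeding the returned value back in as lastPath on the next run resumes from the file that failed, which is the sequence sketched in the comment above.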

@hiroyuki-sato
Member

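If you want to load those files one by one, I recommend the following.
It is the simplest way.

  • Put the data files into a queue directory.
  • Create a data load directory (datas in the script below).
  • Save the following script as load.sh.
  • Execute load.sh.
    • It moves one file from the queue directory to the data directory.
    • It executes Embulk, which loads the single file (sample_01.csv).
    • Embulk saves diff.yml, recording that sample_01.csv has been loaded.
    • It moves the next file from the queue directory to the data directory (sample_02.csv).
    • It executes Embulk, which loads only sample_02.csv because Embulk knows sample_01.csv is already loaded.
  • If load.sh stops, simply fix the file that caused the error (for example, sample_04.csv).

#!/bin/bash
# Load queued files one at a time; stop at the first failed run.

# Count only queue/sample* so the loop ends once no matching file is left.
while [ $( ls queue/sample* 2>/dev/null | wc -l ) -ne 0 ] ; do
  # Move the first queued file (in ls sort order) into the load directory.
  ls -1 queue/sample* | head -1 | xargs -I % mv % datas
  # -c diff.yml stores the config diff so already loaded files are skipped.
  embulk run conf.yml -c diff.yml >> embulk.log 2>&1
  ret=$?
  if [ $ret -ne 0 ] ; then
    echo "Failed: $ret" >> embulk.log
    exit 1
  fi
  echo ""
done

conf.yml:

in:
  type: file
  path_prefix: datas/sample_
  # ...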
