Add delimiter option for reading CSV files for Feathr#307
Add delimiter option for reading CSV files for Feathr#307xiaoyongzhu merged 22 commits intofeathr-ai:mainfrom
Conversation
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
2b1c0a6 to
ddb5643
Compare
c764d1e to
36fe0b3
Compare
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> (cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
|
@ahlag thanks for the contribution! This PR looks good to me, but I'm not sure why the test fails. Spent a bit time to investigate and feel it might be caused by the newly added tests? |
|
@xiaoyongzhu |
|
Talked with @ahlag offline and asked him to run |
Signed-off-by: changyonglik <theeahlag@gmail.com>
|
@xiaoyongzhu TestFileFormat.scala val sqlContext = ss.sqlContext
sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")FileFormat.scala val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",") |
Hmm it's a bit weird. Is it possible to force set the delimiters? |
|
Look like there can only be one SparkContext |
|
I think I will try a new approach. Since an existing SparkContext cannot be edited nor can another one be created, I will test it with e2e by passing the config from the client. |
I did some research and found this answer: Maybe you can try sqlContext.sql? |
|
Ok, I'll give this a shot |
|
@xiaoyongzhu |
hangfei
left a comment
There was a problem hiding this comment.
Also update wiki that we support tsv.
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Show resolved
Hide resolved
src/test/scala/com/linkedin/feathr/offline/util/TestSourceUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/util/SourceUtils.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
880ea90 to
3ec3fdf
Compare
Signed-off-by: changyonglik <theeahlag@gmail.com>
3842194 to
7711253
Compare
Signed-off-by: changyonglik <theeahlag@gmail.com>
062d916 to
2d41211
Compare
Signed-off-by: changyonglik <theeahlag@gmail.com>
f2ecf06 to
0b01cc6
Compare
Signed-off-by: changyonglik <theeahlag@gmail.com>
78b3cbd to
b4e55fb
Compare
|
@xiaoyongzhu @hangfei |
|
@ahlag would you mind merge latest main and resolve the conflicts so that we can get this merged, thanks for your time! |
|
@xiaoyongzhu @blrchen |
* Added documentation Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Added delimiter to CSVLoader Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> (cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0) * Added delimiter to BatchDataLoader, FileFormat and SourceUtils Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Added test case for BatchDataLoader Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Added test case for FileFormat Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Added test case for BatchDataLoader Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Added test case and fixed indent Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> * Passing failure Signed-off-by: changyonglik <theeahlag@gmail.com> * Removed unused imports from BatchDataLoader Signed-off-by: changyonglik <theeahlag@gmail.com> * Fixed test failures Signed-off-by: changyonglik <theeahlag@gmail.com> * Added release version Signed-off-by: changyonglik <theeahlag@gmail.com> * Removed tailing space Signed-off-by: changyonglik <theeahlag@gmail.com> * Removed wildcard imports Signed-off-by: changyonglik <theeahlag@gmail.com> * Paraphrased comments and docstring Signed-off-by: changyonglik <theeahlag@gmail.com> * Added DelimiterUtils Signed-off-by: changyonglik <theeahlag@gmail.com> * Refactored utils Signed-off-by: changyonglik <theeahlag@gmail.com> * Updated wiki to support both tsv and csv Signed-off-by: changyonglik <theeahlag@gmail.com> * Fixed spelling error Signed-off-by: changyonglik <theeahlag@gmail.com> * trigger GitHub actions Signed-off-by: Chang Yong Lik <theeahlag@gmail.com> Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik theeahlag@gmail.com
Description
Resolves #241
How was this patch tested?
Tested locally with
sbtProgress Tracker
Does this PR introduce any user-facing changes?
Allows users to specify delimiters