I have a BigQuery table that's configured to be an External Table looking at Cloud Storage. The Source URI is: gs://[bucketname]/test_file*

I did not specify maxBadRecords in my table creation request, so it takes its default value of 0. I also did not specify ignoreUnknownValues, which defaults to false. I would therefore expect any row in my source JSON files with extra or missing fields to count as a "bad row" and produce an error. The schema for the table is:

[
  {
    "name": "second_column",
    "mode": "NULLABLE",
    "type": "INTEGER",
    "description": null,
    "fields": []
  },
  {
    "name": "first_column",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  }
]
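
For reference, the setup described above can be sketched as a `tables.insert` request body. This is a minimal illustration rather than the actual request: the function name and its parameters are hypothetical, the bucket is left as a parameter, and the two defaults are written out explicitly even though the original request omitted them.

```python
import json

# Schema from the question (description/fields omitted for brevity).
SCHEMA_FIELDS = [
    {"name": "second_column", "mode": "NULLABLE", "type": "INTEGER"},
    {"name": "first_column", "mode": "NULLABLE", "type": "STRING"},
]


def external_table_body(project_id, dataset_id, table_id, bucket):
    """Build a request body for a JSON external table (hypothetical helper)."""
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "schema": {"fields": SCHEMA_FIELDS},
        "externalDataConfiguration": {
            "sourceUris": [f"gs://{bucket}/test_file*"],
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            # Both values below are the documented defaults; the original
            # request left them unset.
            "maxBadRecords": 0,
            "ignoreUnknownValues": False,
        },
    }


def to_json(body):
    """Serialize the body the way it would be sent over REST."""
    return json.dumps(body, indent=2)
```

Spelling out the defaults makes it easier to see which settings are expected to govern how extra or missing fields are handled.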

I have 3 files:

test_file_1_1_1_1_1_1.json:

{"first_column": "a good value goes here", "second_column": 9}

test_file_2_2_2_2_2_2.json:

{"first_column": "a good value goes here", "second_column": 9, "extraneous_column":  "uh oh"}

test_file_3_3_3_3_3_3.json:

{"first_column": "a good value goes here"}

So file 1 matches the schema, file 2 has an extra column, and file 3 is missing a column.
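
To make the comparison concrete, here is a small local check of each file's keys against the schema. It only compares field names and is not BigQuery's validation logic; the helper name `compare_to_schema` is made up for this sketch.

```python
import json

# Column names taken from the table schema above.
SCHEMA_COLUMNS = {"first_column", "second_column"}

# File contents copied from the question.
FILES = {
    "test_file_1_1_1_1_1_1.json":
        '{"first_column": "a good value goes here", "second_column": 9}',
    "test_file_2_2_2_2_2_2.json":
        '{"first_column": "a good value goes here", "second_column": 9, '
        '"extraneous_column":  "uh oh"}',
    "test_file_3_3_3_3_3_3.json":
        '{"first_column": "a good value goes here"}',
}


def compare_to_schema(line, schema_columns=SCHEMA_COLUMNS):
    """Return the extra and missing field names for one JSON record."""
    keys = set(json.loads(line))
    return {
        "extra": sorted(keys - schema_columns),
        "missing": sorted(schema_columns - keys),
    }


# Map each file name to its differences from the schema.
REPORT = {name: compare_to_schema(line) for name, line in FILES.items()}

for name, diff in REPORT.items():
    print(name, diff)
```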

Querying the table when only file 1 is in the bucket returns the expected data from the file. When I add file 2 to the bucket, I get this result:

[Screenshot: rows from both files are displayed]

which is not what I'd expect. When I also add the third file, I get:

[Screenshot of the resulting error; image caption: "What?"]

Is this expected behavior? I'm confused, because I'd expect even a single bad row to produce such an error.

4 comments
  • It seems the error is caused by the extra column “extraneous_column”. When you load files that contain columns not present in the table schema, it is recommended to set the "Ignore unknown values" and "Number of bad records allowed" options according to your requirements. You can find more information from this link. Commented Jun 13, 2024 at 8:04
  • @kiranmathew but it succeeds when that file and the "good" file are in the bucket. It only starts failing when the third file is also added. Commented Jun 13, 2024 at 14:08
  • Also, I have confirmed that maxBadRecords and ignoreUnknownValues are indeed not set, via a get_table request from the Python client. Commented Jun 13, 2024 at 14:56
  • Hi @Jeffrey Van Laethem, it appears that this issue needs further investigation. If you have a support plan, please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing the problem. Commented Jun 17, 2024 at 12:17
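
The `get_table` check mentioned in the comments might look roughly like the sketch below. It assumes the `google-cloud-bigquery` Python client; the helper name `external_error_settings` is hypothetical, and a value of `None` is assumed to mean the field was not set on the table.

```python
def external_error_settings(client, table_id):
    """Return the error-tolerance settings of an external table.

    `client` is expected to be a google.cloud.bigquery.Client and
    `table_id` a "project.dataset.table" string.
    """
    table = client.get_table(table_id)
    config = table.external_data_configuration
    if config is None:
        raise ValueError(f"{table_id} is not an external table")
    return {
        "source_uris": list(config.source_uris),
        "max_bad_records": config.max_bad_records,
        "ignore_unknown_values": config.ignore_unknown_values,
    }


# Example usage (requires credentials and the client library):
#   from google.cloud import bigquery
#   settings = external_error_settings(bigquery.Client(), "project.dataset.table")
#   print(settings)
```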
