
[C++][JSON] kMaxParserNumRows Value Increase/Removal #28994

@asfimport

I'm a new PyArrow user and have been investigating occasional errors that surface as the Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON Lines files with pyarrow.json.read_json(). Digging in, the source of this exception appears to be line 703 of cpp/src/arrow/json/parser.cc, which returns the error once the number of rows parsed reaches kMaxParserNumRows.

    for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
      auto ok = reader.Parse<parse_flags>(json, handler);
      switch (ok.Code()) {
        case rj::kParseErrorNone:
          // parse the next object
          continue;
        case rj::kParseErrorDocumentEmpty:
          // parsed all objects, finish
          return Status::OK();
        case rj::kParseErrorTermination:
          // handler emitted an error
          return handler.Error();
        default:
          // rj emitted an error
          return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
      }
    }
    return Status::Invalid("Exceeded maximum rows");
  }

This constant appears to be defined on line 53 of arrow/json/parser.h and has had this value since that file's initial commit.

constexpr int32_t kMaxParserNumRows = 100000;
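
To make the control flow easier to reason about, here is a minimal Python sketch of the quoted loop. It is a toy model only: `parse_rows` and `K_MAX_PARSER_NUM_ROWS` are hypothetical names, each list element stands in for one JSON object, and it says nothing about how Arrow feeds input to the real parser. It suggests that, as the loop is written, an input with exactly the maximum number of rows is also rejected, because the loop ends before the "document empty" case can be observed.

```python
K_MAX_PARSER_NUM_ROWS = 100_000  # value quoted from arrow/json/parser.h

_END = object()  # sentinel standing in for rj::kParseErrorDocumentEmpty


def parse_rows(rows, max_rows=K_MAX_PARSER_NUM_ROWS):
    """Toy model of the bounded parse loop (hypothetical helper, not Arrow code).

    Returns "OK" when the input is exhausted inside the loop, otherwise
    the same message the C++ code returns after the loop.
    """
    it = iter(rows)
    num_rows = 0
    while num_rows < max_rows:
        row = next(it, _END)
        if row is _END:
            # parsed all objects, finish
            return "OK"
        # one object parsed successfully, move on to the next
        num_rows += 1
    return "Invalid: Exceeded maximum rows"


if __name__ == "__main__":
    # Small limit so the boundary behaviour is easy to see.
    print(parse_rows(["{}"] * 2, max_rows=3))
    print(parse_rows(["{}"] * 3, max_rows=3))
```

In this model an input is accepted only when it holds fewer rows than the limit, since one extra loop iteration is needed to detect the end of the input.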


I could not find a comment in the code, the commit, or the PR explaining this row limit.


I'm wondering what the reason for this maximum is, and whether it could be removed, increased, or made overridable in the C++ library and in the Python bindings built on it. It is common to need to process JSON files of arbitrary length (application logs, files from third-party vendors, etc.) where the consumer of the data has no control over the size of the file.
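
Until the limit is changed, one possible user-side workaround is to split the input into pieces that each stay well under the row limit before parsing. The sketch below uses only the Python standard library; `iter_jsonl_chunks` and its `max_rows` default are hypothetical, and it assumes every JSON object occupies exactly one line.

```python
import io
from typing import BinaryIO, Iterator


def iter_jsonl_chunks(stream: BinaryIO, max_rows: int = 50_000) -> Iterator[bytes]:
    """Yield chunks of a JSON Lines stream with at most max_rows rows each.

    Hypothetical helper: assumes one JSON object per line and skips blank lines.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    buf = []
    for line in stream:
        if not line.strip():
            continue  # ignore blank lines
        if not line.endswith(b"\n"):
            line += b"\n"  # normalise a missing trailing newline
        buf.append(line)
        if len(buf) == max_rows:
            yield b"".join(buf)
            buf = []
    if buf:
        yield b"".join(buf)  # final, possibly shorter, chunk


if __name__ == "__main__":
    data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
    for chunk in iter_jsonl_chunks(io.BytesIO(data), max_rows=2):
        print(chunk)
```

Each chunk could then be wrapped in io.BytesIO, passed to pyarrow.json.read_json(), and the resulting tables combined with pyarrow.concat_tables(). Note that schema inference would run separately per chunk, so the inferred types may need to be reconciled before concatenating.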

Reporter: Ryan Stalets

Note: This issue was originally created as ARROW-13318. Please see the migration documentation for further details.
