External Formats and Options

Applies to: ❌ Data Cloud SQL ✅ Tableau Hyper API

Data Cloud SQL supports various external data formats.

Some external formats carry schema information while others don’t. For formats that carry schema information, Data Cloud SQL can infer the schema automatically. For formats without schema information, the schema must be given explicitly, or, in a COPY FROM statement, it’s taken from the target table.

The format of an external source or target is set through the FORMAT option. If the FORMAT option isn’t specified, Data Cloud SQL tries to infer the format from the file extension. If multiple files are read or written, they’re all required to have the same extension.

Data Cloud SQL supports these formats.

| Format | FORMAT Option Value | Recognized File Extensions | Schema Inference? | Description |
| --- | --- | --- | --- | --- |
| Text | 'text' | .txt, .txt.gz | No | Text format, as in PostgreSQL. Optionally gzip compressed. |
| CSV | 'csv' | .csv, .csv.gz | No | Comma-Separated Value format, as in PostgreSQL. Optionally gzip compressed. |
| Apache Parquet | 'parquet' | .parquet | Yes | The Apache Parquet format. Versions 1 and 2 are supported. |
| Apache Iceberg | 'iceberg' | Specified path must point to the table directory | Yes | The Apache Iceberg format. Versions 1 and 2 are supported; version 3 isn’t. |
| Apache Arrow | 'arrowfile', 'arrowstream' | .arrow, .arrows | No | The Apache Arrow format, version 1.3, with restrictions. |

Array types can’t be written to Apache Iceberg and Apache Parquet.

You can use format options to customize the way the external format is read and interpreted. The available options depend on the format. Syntactically, options are defined with the named-parameter syntax, for example option => value. Separate multiple options with commas. In the COPY FROM, COPY TO, and CREATE EXTERNAL TABLE statements, define options in the WITH clause. For the external function, define options as function arguments after the initial argument describing the source or target. The following example statements all read from a CSV file products.csv by using the field delimiter '|'.
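They’re sketches of the three statement forms, assuming a target table products whose columns match the file (table and column names are illustrative):

    -- COPY FROM: options go in the WITH clause.
    COPY products FROM 'products.csv' WITH ( FORMAT => 'csv', DELIMITER => '|' );

    -- external function: options follow the initial argument describing the source.
    SELECT * FROM external(
        'products.csv',
        COLUMNS => DESCRIPTOR( name text, price int ),
        FORMAT => 'csv',
        DELIMITER => '|'
    );

    -- CREATE TEMPORARY EXTERNAL TABLE: options also go in the WITH clause.
    CREATE TEMPORARY EXTERNAL TABLE products
        FOR 'products.csv'
        WITH ( FORMAT => 'csv', DELIMITER => '|' );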

These options are available for all or multiple external formats.

FORMAT => 'format_name'

The file format. This parameter can be omitted if the names of the files to be scanned end in a recognized file extension.

COLUMNS => DESCRIPTOR(<column_def> [, ... ])

The schema of the file. A column_def consists of a column name followed by a column type. It can also include a COLLATE clause or a NOT NULL restriction. External functions require the COLUMNS option to read data. COPY and CREATE TEMPORARY EXTERNAL TABLE assume that the file has the same schema as the table.
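For example, a sketch of the COLUMNS option on the external function (the file name, column names, and types are assumptions):

    -- Declare the schema explicitly for a headerless CSV file.
    SELECT name, price
    FROM external(
        'products.csv',
        COLUMNS => DESCRIPTOR( name text NOT NULL, price int ),
        FORMAT => 'csv'
    );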

SANITIZE => boolean

Data Cloud SQL requires input text to be valid UTF-8 and verifies the input when reading external data. If SANITIZE is false (the default), Data Cloud SQL raises an error when invalid UTF-8 data is encountered. If SANITIZE is set to true, Data Cloud SQL replaces invalid UTF-8 sequences in text fields with the replacement character "�".
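For instance, a hedged sketch that tolerates invalid UTF-8 while loading (table and file names are illustrative):

    -- Replace invalid UTF-8 sequences with the replacement character instead of raising an error.
    COPY products FROM 'products.csv' WITH ( FORMAT => 'csv', SANITIZE => true );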

COMPRESSION => { 'auto' | 'none' | 'gzip' }

The 'auto' option (default) assumes that files with the .gz extension are gzipped and that all other files are uncompressed. If set to 'none', all input files are treated as uncompressed. If set to 'gzip', all input files are expected to be gzip compressed. Gzip compression isn’t available for Apache Parquet, Apache Arrow, or Iceberg.

MAX_FILE_SIZE => integer

If set, the COPY TO statement writes to multiple files, each of the size specified by MAX_FILE_SIZE, except for the last file.
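For example, a sketch of a COPY TO that splits its output (the table name, file name, and size value are illustrative; the naming of the resulting files isn’t shown here):

    -- Write the table to multiple CSV files, each limited by MAX_FILE_SIZE except the last.
    COPY products TO 'products_export.csv' WITH ( FORMAT => 'csv', MAX_FILE_SIZE => 104857600 );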

If you define the FORMAT => 'text' option or use a file extension of .txt, the data is read or written as a text file with one line per table row. Gzip-compressed text files can be read with the extension .txt.gz.

The delimiter character separates columns in a row. The column values are string representations of the values, as if the values were cast to the text type. The specified null string represents a SQL null value. An error is raised if any line of the input file contains more or fewer columns than expected.

Backslash characters (\) can be used in the data to quote data characters that otherwise represent row or column delimiters. Precede these characters with a backslash if they appear as part of a column value: backslash, newline, carriage return, and the current delimiter character.

The input is matched against the null string before removing backslashes. Therefore, a null string such as \N can’t be confused with the actual data value \N (represented as \\N).

Special backslash sequences are:

| Sequence | Represents |
| --- | --- |
| \b | Backspace (ASCII 8) |
| \f | Form feed (ASCII 12) |
| \n | Newline (ASCII 10) |
| \r | Carriage return (ASCII 13) |
| \t | Tab (ASCII 9) |
| \v | Vertical tab (ASCII 11) |
| \<digits> | Backslash followed by 1–3 octal digits specifies the character with that numeric code. |
| \x<digits> | Backslash x followed by one or two hex digits specifies the character with that numeric code. |

It’s strongly recommended that applications generating data convert newlines and carriage returns in the data to the \n and \r sequences.

Besides the common format options, the text format supports these options.

DELIMITER => 'delimiter'

Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format and a comma in CSV format. This must be a single one-byte character.

NULL => 'null_string'

Specifies the string that represents a null value. The default is \N (backslash-N) in text format and an unquoted empty string in CSV format.

ENCODING => { 'utf-8' | 'utf-16' | 'utf-16-le' | 'utf-16-be' }

Specifies the encoding of the file read. utf-16 automatically infers the endianness of the file from its byte order mark. If no byte order mark is present, little endian is used. utf-16-le (little endian) and utf-16-be (big endian) can be used to explicitly specify the endianness. The default is utf-8.

ON_CAST_FAILURE => { 'error' | 'set_null' }

Specifies the behavior if a cast failure occurs. error raises an error. set_null sets the value that failed to cast to NULL if the field is nullable; if the field isn’t nullable, an error is raised. The default is error.
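To illustrate several text-format options together, here’s a hedged sketch that reads a pipe-delimited UTF-16 file (the file name and columns are assumptions):

    -- Read a UTF-16 text file that uses '|' as the delimiter and 'NULL' as the null string;
    -- values that fail to cast are set to NULL instead of raising an error.
    SELECT *
    FROM external(
        'measurements.txt',
        COLUMNS => DESCRIPTOR( sensor text, reading double precision ),
        FORMAT => 'text',
        DELIMITER => '|',
        NULL => 'NULL',
        ENCODING => 'utf-16',
        ON_CAST_FAILURE => 'set_null'
    );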

Use the FORMAT => 'csv' format option or a file extension of .csv to read and write the comma-separated value (CSV) file format. Instead of the escaping rules used by Data Cloud SQL’s standard text format, it produces and recognizes the common CSV escaping system. Gzip-compressed CSV files can also be read with the extension .csv.gz.

The values in each record are separated by the DELIMITER character. If a value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or a line feed character, the whole value is prefixed and suffixed with the QUOTE character. Any occurrence within the value of the QUOTE character or the ESCAPE character is preceded by the escape character.

The CSV format has no standard way of distinguishing a NULL value from an empty string. Data Cloud SQL handles this by quoting. A NULL is represented by the NULL option string and isn’t quoted, while a non-NULL value matching the NULL option string is quoted. For example, with the default settings, a NULL is represented as an unquoted empty string, while an empty string data value is represented with double quotes (""). You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns. You can also use FORCE_NULL to convert quoted null string data values to NULL.

FORCE_NULL and FORCE_NOT_NULL can be used simultaneously on the same column, which results in converting only quoted null strings to null values while unquoted null strings remain unchanged.

In CSV format, all characters are significant. A quoted value surrounded by white space, or by any characters other than DELIMITER, includes those characters. This can cause errors if you import data from a system that pads CSV lines with white space to a fixed width. Before importing the data into Data Cloud SQL, preprocess the CSV file to remove the trailing white space.

Besides the common format options and the options for the text format, the CSV format supports these options.

HEADER => boolean

Specifies whether the file contains a header line with the names of each column in the file. Default is false.

QUOTE => 'quote_character'

Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character.

ESCAPE => 'escape_character'

Specifies the character that appears before a data character that matches the QUOTE value. The default is the same as the QUOTE value, so the quoting character is doubled if it appears in the data. This must be a single one-byte character.

FORCE_NOT_NULL

Don’t match the specified columns’ values against the null string. When the null string is empty (the default in CSV format), this means that empty values are read as zero-length strings rather than nulls, even if they aren’t quoted.

FORCE_NULL

Match the specified columns’ values against the null string even if it’s been quoted, and if a match is found, set the value to NULL. In the default case where the null string is empty, a quoted empty string is converted into NULL.
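To illustrate several CSV options together, here’s a hedged sketch (the table name, file name, and option values are illustrative):

    -- The file's first line contains column names; values are quoted with single quotes,
    -- and the string 'NA' is read as NULL.
    COPY products FROM 'products.csv' WITH (
        FORMAT => 'csv',
        HEADER => true,
        QUOTE => '''',
        NULL => 'NA'
    );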

The FORMAT => 'parquet' option or a file extension of .parquet enables reading and writing the Apache Parquet format. Data stored in Parquet is usually smaller than in text format. If only a few columns are selected, Data Cloud SQL reads only the parts of the file where those columns are stored, which speeds up queries. This is helpful when reading from an external location with limited bandwidth, for example when accessing Amazon S3 from a machine that isn’t an AWS server.

The Apache Parquet format stores schema information, so Data Cloud SQL can infer the schema for Parquet files.
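For example, because the schema is inferred, a Parquet file can be queried without the COLUMNS option (the file name is illustrative):

    -- No COLUMNS option is needed; the schema comes from the Parquet file itself.
    SELECT * FROM external('products.parquet', FORMAT => 'parquet');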

Data Cloud SQL supports Parquet versions 1 and 2, including the DataPageV2 page format. However, these data types, encodings, and compressions aren’t supported when reading Parquet:

  • Nested columns and the nested types MAP and LIST

  • The types BSON, UUID, and ENUM

  • The physical type FIXED_LEN_BYTE_ARRAY without any logical or converted type

  • The types TIME_MILLIS and TIME_NANOS. Consider using TIME_MICROS instead.

  • The deprecated BIT_PACKED encoding

  • The DELTA_LENGTH_BYTE_ARRAY and BYTE_STREAM_SPLIT encodings

Supported compressions are SNAPPY, GZIP, ZSTD, and LZ4_RAW.

If a Parquet file contains columns with unsupported data types or encodings, Data Cloud SQL can still read the other columns in the file, as long as you don’t select any unsupported columns.

Besides the common format options, use these options to write to Parquet files.

ROWS_PER_ROW_GROUP => integer

The number of rows to include in a row group.

CODEC => { 'uncompressed' | 'snappy' | 'gzip' | 'zstd' | 'lz4_raw' }

The compression codec to apply to the data.

WRITE_STATISTICS => { 'none' | 'rowgroup' | 'page' | 'columnindex' }

How the statistics are written. none writes no statistics. rowgroup writes row group statistics. page writes row group and page statistics. columnindex writes row group, column, and offset statistics, without page statistics.
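A hedged sketch writing a table to Parquet with these options (the table name, file name, and option values are illustrative):

    -- Write Zstandard-compressed Parquet with 100,000 rows per row group and page statistics.
    COPY products TO 'products.parquet' WITH (
        FORMAT => 'parquet',
        CODEC => 'zstd',
        ROWS_PER_ROW_GROUP => 100000,
        WRITE_STATISTICS => 'page'
    );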

The FORMAT => 'iceberg' option enables reading the Apache Iceberg format, a binary format designed for huge analytic tables. Apache Iceberg provides metadata such as versioning, partitioning, and statistics on top of storage formats like Parquet. Data Cloud SQL uses this metadata to access Iceberg tables efficiently and to read only the data from the Parquet files required to answer a query.

To use Iceberg in a query, provide the path to the table root directory, which is the directory that has the metadata/ and data/ subdirectories. The location of the Iceberg table can be specified as described in External Locations. You can query Iceberg tables from local disk and from Amazon S3. Here’s an example of how to read an Iceberg table.
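It’s a minimal sketch, assuming the Iceberg table lives in a local directory named inventory_iceberg (the path is illustrative; for Amazon S3, specify the location as described in External Locations):

    -- The path points to the Iceberg table root, the directory containing metadata/ and data/.
    SELECT * FROM external('inventory_iceberg', FORMAT => 'iceberg');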

Data Cloud SQL supports Iceberg v1 and v2. The Iceberg format has these restrictions and features.

  • Data Cloud SQL only supports Apache Parquet as the data storage format. Other formats like ORC aren’t supported.

  • The restrictions for Parquet file support apply.

  • Merge on Read is supported for positional deletes. Equality deletes aren’t supported.

  • Schema changes like added, dropped, and renamed columns are supported.

To read the columnar binary Apache Arrow format, use the FORMAT => 'arrowfile' or FORMAT => 'arrowstream' option. Because the data is stored in a columnar format, Data Cloud SQL doesn’t read the whole file if only a subset of columns is selected. Loading partial Arrow files from S3 isn’t supported.

Data Cloud SQL supports the IPC streaming format (file extension .arrows) with FORMAT => 'arrowstream'. It also supports the IPC file format, sometimes referred to as Feather V2 (file extensions .arrow or .feather) with FORMAT => 'arrowfile'.

Arrow supports the common format options. The COLUMNS option is required. There is a 1-to-1 mapping from Arrow types to SQL types. Data Cloud SQL doesn’t perform implicit casts. The COMPRESSION option isn’t supported.
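For example, a sketch reading an Arrow IPC streaming file (the file name and columns are assumptions; the declared SQL types must correspond to the Arrow types according to the table below):

    -- COLUMNS is required for Arrow; Data Cloud SQL performs no implicit casts.
    SELECT *
    FROM external(
        'events.arrows',
        COLUMNS => DESCRIPTOR( id bigint, payload text ),
        FORMAT => 'arrowstream'
    );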

| Arrow Type | Hyper Type | Notes |
| --- | --- | --- |
| Bool | Bool | |
| Int signed, 8 bit | SmallInt | |
| Int signed, 16 bit | SmallInt | |
| Int signed, 32 bit | Integer | |
| Int signed, 64 bit | BigInt | |
| Int unsigned, 8 bit | SmallInt | |
| Int unsigned, 16 bit | Integer | |
| Int unsigned, 32 bit | BigInt | |
| Int unsigned, 64 bit | Numeric(20,0) | Not recommended; processing is slower than for other Int types. |
| FloatingPoint with any precision | DOUBLE PRECISION | Half floats aren’t supported. |
| Decimal(precision, scale) | NUMERIC(precision, scale) | Hyper doesn’t support precision > 38. For better performance, we recommend precision <= 18. |
| Date | Date | |
| Time | Time | Hyper doesn’t support nanosecond precision and truncates it to microseconds. |
| Timestamp without timezone | TIMESTAMP | Hyper doesn’t support nanosecond precision. |
| Timestamp with timezone | TIMESTAMPTZ | Hyper doesn’t support nanosecond precision. |
| UTF8 | Text | |

Data Cloud SQL supports version 1.3 of the Arrow specification. These Arrow features aren’t supported.

  • Data types: Interval, LargeUtf8, Struct, lists, union, map types, Binary, LargeBinary, FixedSizeBinary, HalfFloat

  • Run-end or dictionary encoding

  • Arrow-internal compression (LZ4 compression) on record batches. See RecordBatch::compression in the Arrow message schema.

  • Big-Endian encoding

  • Duplicated column names

  • Schemas with zero columns

If an Arrow file contains columns with unsupported data types, Data Cloud SQL can still read the other columns in the file, as long as you don’t select any unsupported columns. This isn’t the case for unsupported encodings, because Data Cloud SQL can’t read dictionary-encoded Arrow data.

Because Data Cloud SQL parallelizes queries on the granularity of record batches, we recommend batch sizes of a few thousand rows.