External Formats and Options
Applies to: ❌ Data Cloud SQL ✅ Tableau Hyper API
Data Cloud SQL supports various external data formats.
Some external formats carry schema information, while others don’t. For formats with schema information, Data Cloud SQL can infer the schema automatically. For formats without schema information, the schema must be given explicitly, or it’s taken from the target table in a `COPY FROM` statement.
The format of an external source or target is set through the `FORMAT` option. If the `FORMAT` option isn’t specified, Data Cloud SQL tries to infer the format from the file extension. If multiple files are read or written, they’re all required to have the same extension.
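As an illustration, both statements below read Parquet data; the first relies on the recognized `.parquet` extension, while the second names the format explicitly. This is a minimal sketch: the table `products` and the file names are assumptions, not part of the reference.

```sql
-- Format inferred from the .parquet file extension (hypothetical table and file)
COPY products FROM 'products.parquet';

-- Format stated explicitly, so the file doesn't need a recognized extension
COPY products FROM 'products_export' WITH (FORMAT => 'parquet');
```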
Data Cloud SQL supports these formats.
Format | FORMAT Option Value | Recognized File Extensions | Schema Inference? | Description |
---|---|---|---|---|
Text | 'text' | .txt , .txt.gz | No | Text format, as in PostgreSQL. Optionally, gzip compressed. |
CSV | 'csv' | .csv , .csv.gz | No | Comma-Separated Value format, as in PostgreSQL. Optionally, gzip compressed. |
Apache Parquet | 'parquet' | .parquet | Yes | The Apache Parquet format. Versions 1 and 2 are supported. |
Apache Iceberg | 'iceberg' | Specified path must point to table directory | Yes | The Apache Iceberg format. Versions 1 and 2 are supported, but version 3 isn’t supported. |
Apache Arrow | 'arrowfile' , 'arrowstream' | .arrow , .arrows | No | The Apache Arrow format version 1.3, with restrictions. |
Array types can’t be written to Apache Iceberg and Apache Parquet.
You can use format options to customize the way the external format is read and interpreted. The available options depend on the format. Options are defined with the named-parameter syntax, for example `option => value`; separate multiple options with commas. In the `COPY FROM`, `COPY TO`, and `CREATE EXTERNAL TABLE` statements, define options in the `WITH` clause. For the `external` function, define options as function arguments after the initial argument describing the source or target. The example statements below all read from a CSV file `products.csv` using the field delimiter `'|'`.
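What follows is a sketch of such statements. Table names, column names, and the `COLUMNS` descriptor are placeholders assumed for illustration; only the file name `products.csv` and the `'|'` delimiter come from the text above.

```sql
-- COPY FROM: options go in the WITH clause; the target table is assumed to exist
COPY products FROM 'products.csv' WITH (FORMAT => 'csv', DELIMITER => '|');

-- CREATE TEMPORARY EXTERNAL TABLE: options also go in the WITH clause
CREATE TEMPORARY EXTERNAL TABLE products_ext FOR 'products.csv'
    WITH (FORMAT => 'csv', DELIMITER => '|');

-- external function: options follow the initial source argument;
-- a COLUMNS descriptor is needed because CSV carries no schema information
SELECT * FROM external(
    'products.csv',
    COLUMNS => DESCRIPTOR(name text, price numeric(10, 2)),
    FORMAT => 'csv',
    DELIMITER => '|'
);
```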
These options are available for all or multiple external formats.
`FORMAT => 'format_name'`
The file format. This parameter can be omitted if the names of the files to be scanned end in a recognized file extension.

`COLUMNS => DESCRIPTOR(<column_def> [, ... ])`
The schema of the file. A `column_def` consists of a column name followed by a column type. It can also include a `COLLATE` clause or a `NOT NULL` restriction. External functions require the `COLUMNS` option to read data. `COPY` and `CREATE TEMPORARY EXTERNAL TABLE` assume that the file has the same schema as the table.

`SANITIZE => boolean`
Data Cloud SQL requires input text to be valid UTF-8 and verifies the input when reading external data. If `SANITIZE` is `false` (the default), Data Cloud SQL raises an error when it encounters invalid UTF-8 data. If `SANITIZE` is `true`, Data Cloud SQL replaces invalid UTF-8 sequences in text fields with the replacement character "�".

`COMPRESSION => { 'auto' | 'none' | 'gzip' }`
The `'auto'` option (the default) assumes that files with the `.gz` extension are gzipped and that all other files are uncompressed. If set to `'none'`, all input files are treated as uncompressed. If set to `'gzip'`, all input files are expected to be gzip compressed. `'gzip'` isn’t available for Apache Parquet, Apache Arrow, or Apache Iceberg.

`MAX_FILE_SIZE => integer`
If set, the `COPY TO` statement writes to multiple files, each of the size specified by `MAX_FILE_SIZE` except for the last file.
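As a sketch of how several of these common options combine; the table, file names, and the assumption that `MAX_FILE_SIZE` is given in bytes are illustrative, not taken from the reference:

```sql
-- Reading: replace invalid UTF-8 sequences instead of raising an error,
-- and treat every input file as gzip compressed regardless of its extension
COPY events FROM 'events_dump.csv'
    WITH (FORMAT => 'csv', SANITIZE => true, COMPRESSION => 'gzip');

-- Writing: split the output into multiple files (size unit assumed to be bytes)
COPY events TO 'events_export.csv'
    WITH (FORMAT => 'csv', MAX_FILE_SIZE => 104857600);
```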
If you define the `FORMAT => 'text'` option or use a file extension of `.txt`, the data is read or written as a text file with one line per table row. Gzip-compressed text files can be read with the extension `.txt.gz`.

The delimiter character separates columns in a row. The column values are string representations of the values, as if the values were cast to the `text` type. The specified null string represents a SQL null value. An error is raised if any line of the input file contains more or fewer columns than expected.
Backslash characters (`\`) can be used in the data to quote data characters that otherwise represent row or column delimiters. Precede these characters with a backslash if they appear as part of a column value: backslash, newline, carriage return, and the current delimiter character.

The input is matched against the null string before removing backslashes. Therefore, a null string such as `\N` can’t be confused with the actual data value `\N` (represented as `\\N`).
Special backslash sequences are:
Sequence | Represents |
---|---|
\b | Backspace (ASCII 8) |
\f | Form feed (ASCII 12) |
\n | Newline (ASCII 10) |
\r | Carriage return (ASCII 13) |
\t | Tab (ASCII 9) |
\v | Vertical tab (ASCII 11) |
\<digits> | Backslash followed by 1–3 octal digits specifies the character with that numeric code. |
\x<digits> | Backslash x followed by one or two hex digits specifies the character with that numeric code. |
It’s strongly recommended that applications generating data convert data newlines and carriage returns to the `\n` and `\r` sequences.
Besides the common format options, the text format supports these options.
`DELIMITER => 'delimiter'`
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format and a comma in CSV format. This must be a single one-byte character.

`NULL => 'null_string'`
Specifies the string that represents a null value. The default is `\N` (backslash-N) in text format and an unquoted empty string in CSV format.

`ENCODING => { 'utf-8' | 'utf-16' | 'utf-16-le' | 'utf-16-be' }`
Specifies the encoding of the file read. `utf-16` automatically infers the endianness of the file from its byte order mark; if no byte order mark is present, little endian is used. `utf-16-le` (little endian) and `utf-16-be` (big endian) can be used to specify the endianness explicitly. The default is `utf-8`.

`ON_CAST_FAILURE => { 'error' | 'set_null' }`
Specifies the behavior if a cast failure occurs. `error` raises an error. `set_null` sets the value with the cast failure to NULL if the field is nullable; if the field isn’t nullable, an error is raised. The default is `error`.
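A sketch of reading a text file with these options; the table, file name, and column choices are assumptions for illustration:

```sql
-- Read a pipe-delimited, UTF-16 encoded text file in which the string 'NULL'
-- marks missing values; values that fail to cast into the column type become NULL
COPY measurements FROM 'measurements.txt'
    WITH (FORMAT => 'text',
          DELIMITER => '|',
          NULL => 'NULL',
          ENCODING => 'utf-16',
          ON_CAST_FAILURE => 'set_null');
```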
Use the `FORMAT => 'csv'` option or a file extension of `.csv` to read and write the comma-separated value (CSV) file format. Instead of the escaping rules used by Data Cloud SQL’s standard text format, it produces and recognizes the common CSV escaping mechanism. Gzip-compressed CSV files can also be read with the extension `.csv.gz`.
The values in each record are separated by the `DELIMITER` character. If a value contains the delimiter character, the `QUOTE` character, the `NULL` string, a carriage return, or a line feed character, the whole value is prefixed and suffixed with the `QUOTE` character. Any occurrence within the value of a `QUOTE` character or the `ESCAPE` character is preceded by the escape character.
The CSV format has no standard way of distinguishing a `NULL` value from an empty string. Data Cloud SQL handles this by quoting. A `NULL` is represented by the `NULL` option string and isn’t quoted, while a non-`NULL` value matching the `NULL` option string is quoted. For example, with the default settings, a `NULL` is represented as an unquoted empty string, while an empty string data value is represented with double quotes (`""`). You can use `FORCE_NOT_NULL` to prevent `NULL` input comparisons for specific columns. You can also use `FORCE_NULL` to convert quoted null string data values to `NULL`.

`FORCE_NULL` and `FORCE_NOT_NULL` can be used simultaneously on the same column; the result is that only quoted null strings are converted to null values, while unquoted null strings remain unchanged.
In CSV format, all characters are significant. If a quoted value is surrounded by white space or by any characters other than the `DELIMITER`, those characters are included in the value. This can cause errors if you import data from a system that pads CSV lines with white space to a fixed width. Before importing such data into Data Cloud SQL, preprocess the CSV file to remove the trailing white space.
Besides the common format options and the options for the text format, the CSV format supports these options.
`HEADER => boolean`
Specifies whether the file contains a header line with the names of each column in the file. The default is `false`.

`QUOTE => 'quote_character'`
Specifies the quoting character to be used when a data value is quoted. The default is a double quote. This must be a single one-byte character.

`ESCAPE => 'escape_character'`
Specifies the character that appears before a data character that matches the `QUOTE` value. The default is the same as the `QUOTE` value, so that the quoting character is doubled if it appears in the data. This must be a single one-byte character.

`FORCE_NOT_NULL`
Don’t match the specified columns’ values against the null string. In the default case where the null string is empty, this means that empty values are read as zero-length strings rather than nulls, even when they aren’t quoted.

`FORCE_NULL`
Match the specified columns’ values against the null string even if it’s quoted, and if a match is found, set the value to `NULL`. In the default case where the null string is empty, a quoted empty string is converted into `NULL`.
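For instance, a CSV file with a header row and single-quote quoting could be read as sketched below; the table and file names are placeholders, and the column-list syntax for `FORCE_NULL`/`FORCE_NOT_NULL` isn’t shown here.

```sql
-- The first line of the file holds the column names; values are quoted
-- with single quotes, and an embedded quote is escaped with a backslash
COPY products FROM 'products.csv'
    WITH (FORMAT => 'csv',
          HEADER => true,
          QUOTE => '''',
          ESCAPE => '\');
```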
The `FORMAT => 'parquet'` option or a file extension of `.parquet` enables reading and writing the Apache Parquet format. Data stored in Parquet is usually smaller than in text format. If only a few columns are selected, Data Cloud SQL reads only the parts of the file where the selected columns are stored, which speeds up queries. This is especially helpful when reading from an external location with limited bandwidth, such as accessing Amazon S3 from a machine that isn’t an AWS server.
The Apache Parquet format stores schema information, so Data Cloud SQL can infer the schema for Parquet files.
Data Cloud SQL supports Parquet versions 1 and 2, including the DataPageV2 page format. However, some data types, encodings, and compressions aren’t supported. These restrictions apply when reading Parquet.
- Nested columns and the nested types `MAP` and `LIST`
- The types `BSON`, `UUID`, and `ENUM`
- The physical type `FIXED_LEN_BYTE_ARRAY` without any logical or converted type
- The types `TIME_MILLIS` and `TIME_NANOS`. Consider using `TIME_MICROS` instead.
- The deprecated `BIT_PACKED` encoding
- The `DELTA_LENGTH_BYTE_ARRAY` encoding and the `BYTE_STREAM_SPLIT` encoding

Supported compressions are `SNAPPY`, `GZIP`, `ZSTD`, and `LZ4_RAW`.
If a Parquet file contains columns with unsupported data types or encodings, Data Cloud SQL can still read the other columns in the file, as long as you don’t select any unsupported columns.
Besides the common format options, use these options to write to Parquet files.
`ROWS_PER_ROW_GROUP => integer`
The number of rows to include in a row group.

`CODEC => { 'uncompressed' | 'snappy' | 'gzip' | 'zstd' | 'lz4_raw' }`
The compression applied to the data.

`WRITE_STATISTICS => { 'none' | 'rowgroup' | 'page' | 'columnindex' }`
How statistics are written. `rowgroup` writes row group statistics. `page` writes row group and page statistics. `columnindex` writes row group, column, and offset statistics, without page statistics.
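A sketch of writing a Parquet file with these options; the table, file name, and the chosen values are illustrative assumptions:

```sql
-- Write zstd-compressed Parquet with 100,000 rows per row group
-- and page-level statistics in addition to row group statistics
COPY sales TO 'sales.parquet'
    WITH (FORMAT => 'parquet',
          ROWS_PER_ROW_GROUP => 100000,
          CODEC => 'zstd',
          WRITE_STATISTICS => 'page');
```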
The `FORMAT => 'iceberg'` option enables reading the Apache Iceberg format, which is a binary format designed for huge data analytics tables. Apache Iceberg provides metadata information like versioning, partitioning, and statistics on top of storage formats like Parquet. Data Cloud SQL uses the metadata information to access Iceberg tables efficiently and reads only the data from the Parquet files required to answer a query.

To use Iceberg in a query, provide the path to the table root directory, which is the directory that contains the `metadata/` and `data/` subdirectories. The location of the Iceberg table can be specified as described in External Locations. You can query Iceberg tables from local disk and from Amazon S3; a sketch of such a query follows below.
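The statement below is a minimal sketch, assuming an Iceberg table stored in a local directory `./sales_iceberg` that contains the `metadata/` and `data/` subdirectories; because Iceberg carries schema information, no `COLUMNS` descriptor is given.

```sql
-- Point the query at the Iceberg table root directory;
-- the schema is inferred from the Iceberg metadata
SELECT COUNT(*) FROM external('./sales_iceberg', FORMAT => 'iceberg');
```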
Data Cloud SQL supports Iceberg v1 and v2. The Iceberg format has these restrictions and features.
- Data Cloud SQL supports only Apache Parquet as the data storage format. Other formats like ORC aren’t supported.
- The restrictions for Parquet file support apply.
- Merge-on-read is supported for positional deletes. Equality deletes aren’t supported.
- Schema changes like added, dropped, and renamed columns are supported.
The `FORMAT => 'arrowfile'` or `FORMAT => 'arrowstream'` option enables reading the columnar binary Apache Arrow format. Because the data is stored in a columnar layout, Data Cloud SQL doesn’t read the whole file if only a subset of columns is selected. Loading partial Arrow files from S3 isn’t supported.

Data Cloud SQL supports the IPC streaming format (file extension `.arrows`) with `FORMAT => 'arrowstream'`. It also supports the IPC file format, sometimes referred to as Feather V2 (file extensions `.arrow` or `.feather`), with `FORMAT => 'arrowfile'`.
Arrow supports the common format options. The `COLUMNS` option is required. There is a 1-to-1 mapping from Arrow types to SQL types. Data Cloud SQL doesn’t perform implicit casts. The `COMPRESSION` option isn’t supported.
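For example, an Arrow IPC file could be read as sketched below; the file name, column names, and types are assumptions, and the declared SQL types must match the Arrow types according to the mapping table that follows.

```sql
-- COLUMNS is required for Arrow because the schema isn't inferred;
-- the SQL types must correspond 1-to-1 to the Arrow types in the file
SELECT * FROM external(
    'metrics.arrow',
    COLUMNS => DESCRIPTOR(id bigint, recorded_at timestamptz, value double precision),
    FORMAT => 'arrowfile'
);
```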
Arrow Type | Hyper Type | Notes |
---|---|---|
Bool | Bool | |
Int signed, 8 bit | SmallInt | |
Int signed, 16 bit | SmallInt | |
Int signed, 32 bit | Integer | |
Int signed, 64 bit | BigInt | |
Int unsigned, 8 bit | SmallInt | |
Int unsigned, 16 bit | Integer | |
Int unsigned, 32 bit | BigInt | |
Int unsigned, 64 bit | Numeric(20,0) | Not recommended - processing is slower than other Int types. |
FloatingPoint with any precision | DOUBLE PRECISION | Half floats aren’t supported. |
Decimal(precision, scale) | NUMERIC(precision, scale) | Hyper doesn’t support precision > 38. For better performance, we recommend using precision <= 18. |
Date | Date | |
Time | Time | Hyper doesn’t support nano second precision and truncates it to microseconds. |
Timestamp without timezone | TIMESTAMP | Hyper doesn’t support nanosecond precision. |
Timestamp with timezone | TIMESTAMPTZ | Hyper doesn’t support nanosecond precision. |
UTF8 | Text | |
Data Cloud SQL supports version 1.3 of the Arrow specification. These Arrow features aren’t supported.
- Data types: Interval, LargeUtf8, Struct, lists, union types, map types, Binary, LargeBinary, FixedSizeBinary, HalfFloat
- Run-end or dictionary encoding
- Arrow-internal compression (LZ4 compression on record batches). See `RecordBatch::compression` in the Arrow messages schema.
- Big-endian encoding
- Duplicated column names
- Schemas with zero columns
If an Arrow file contains columns with unsupported data types, Data Cloud SQL can still read the other columns in the file, as long as you don’t select any unsupported columns. This isn’t the case for unsupported encodings, because Data Cloud SQL can’t read dictionary-encoded Arrow data.
Because Data Cloud SQL parallelizes queries on the granularity of record batches, we recommend batch sizes of a few thousand rows.