Parquet Sink

The Parquet sink writes data to Apache Parquet files — a columnar format optimized for analytical queries with excellent compression ratios. Use it for cold storage and data warehousing.

Configuration

[sinks.archive]
type = "parquet"
path = "/data/parquet"
compression = "snappy"
rotation = "hourly"

Field	Default	Notes
`path`	—	Output directory (required)
`compression`	`"snappy"`	`"snappy"`, `"zstd"`, `"lz4"`, or `"uncompressed"`
`rotation`	`"hourly"`	`"hourly"` or `"daily"`
`buffer_size`	10,000	Rows buffered before flush
`row_group_size`	100,000	Max rows per row group
`flush_interval`	`"60s"`	Time-based flush interval
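The defaults in the table can be summarized as a small resolution step. This is a minimal sketch, not the sink's actual loader: the helper name `resolve_parquet_sink` is hypothetical, and it assumes the `[sinks.archive]` table has already been parsed into a plain dict.

```python
# Defaults copied from the field table; `path` has no default and is required.
DEFAULTS = {
    "compression": "snappy",
    "rotation": "hourly",
    "buffer_size": 10_000,
    "row_group_size": 100_000,
    "flush_interval": "60s",
}

# Allowed values as listed in the Notes column.
ALLOWED = {
    "compression": {"snappy", "zstd", "lz4", "uncompressed"},
    "rotation": {"hourly", "daily"},
}


def resolve_parquet_sink(raw: dict) -> dict:
    """Hypothetical helper: overlay user settings on the documented defaults."""
    if raw.get("type") != "parquet":
        raise ValueError("not a parquet sink")
    if not raw.get("path"):
        raise ValueError("`path` is required and has no default")
    cfg = {**DEFAULTS, **raw}  # user-supplied values win over defaults
    for field, allowed in ALLOWED.items():
        if cfg[field] not in allowed:
            raise ValueError(
                f"{field} must be one of {sorted(allowed)}, got {cfg[field]!r}"
            )
    return cfg


cfg = resolve_parquet_sink(
    {"type": "parquet", "path": "/data/parquet", "compression": "zstd"}
)
print(cfg["compression"], cfg["rotation"], cfg["buffer_size"])
```

Only `compression` is overridden here, so every other field falls back to its table default.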

Schemas

Data is separated into different Parquet files by type, each with an optimized schema. Fields are ordered for predicate pushdown: frequently filtered columns come first, large payloads last.

Events (9 fields): timestamp, batch_timestamp, workspace_id, event_type, event_name, device_id, session_id, source_ip, payload

Logs (10 fields): timestamp, batch_timestamp, workspace_id, level, event_type, source, service, session_id, source_ip, payload

Snapshots (7 fields): timestamp, batch_timestamp, workspace_id, source, entity, source_ip, payload
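The column order can be written down as data, which makes it easy to arrange a record's values in schema order before writing or comparing. This is an illustrative sketch only: the `SCHEMAS` dict and the `to_row` helper are hypothetical names, and filling missing fields with `None` is an assumption, since the field lists do not say which columns are nullable.

```python
# Field order copied from the schema lists in this section.
SCHEMAS = {
    "events": [
        "timestamp", "batch_timestamp", "workspace_id", "event_type",
        "event_name", "device_id", "session_id", "source_ip", "payload",
    ],
    "logs": [
        "timestamp", "batch_timestamp", "workspace_id", "level", "event_type",
        "source", "service", "session_id", "source_ip", "payload",
    ],
    "snapshots": [
        "timestamp", "batch_timestamp", "workspace_id", "source", "entity",
        "source_ip", "payload",
    ],
}


def to_row(kind: str, record: dict) -> list:
    """Hypothetical helper: return the record's values in schema column order."""
    columns = SCHEMAS[kind]
    unknown = set(record) - set(columns)
    if unknown:
        raise KeyError(f"fields not in the {kind} schema: {sorted(unknown)}")
    # Assumption: absent fields are written as nulls.
    return [record.get(column) for column in columns]


row = to_row("snapshots", {"payload": "{}", "timestamp": 1736935200})
print(row)
```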

File organization

parquet/
└── {workspace_id}/
    └── {date}/
        └── {hour}/
            ├── events.parquet
            ├── logs.parquet
            └── snapshots.parquet
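The layout above maps a workspace, a timestamp, and a data type to a single file path. The sketch below shows that mapping; `partition_path` is a hypothetical helper, and the `YYYY-MM-DD` date format and zero-padded two-digit hour are assumptions inferred from example paths such as `parquet/1/2025-01-15/10/`. Which time zone the sink uses for partitioning is not stated here, so the example passes a UTC timestamp purely for illustration.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

KINDS = ("events", "logs", "snapshots")


def partition_path(root: str, workspace_id, ts: datetime, kind: str) -> str:
    """Hypothetical helper: build {root}/{workspace_id}/{date}/{hour}/{kind}.parquet."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}, got {kind!r}")
    return str(
        PurePosixPath(root)
        / str(workspace_id)
        / ts.strftime("%Y-%m-%d")  # assumed date format
        / ts.strftime("%H")        # assumed zero-padded hour
        / f"{kind}.parquet"
    )


ts = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
print(partition_path("parquet", 1, ts, "events"))
```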

Reading Parquet files

Parquet is a standard format readable by most analytics tools:

-- DuckDB (SQL)
SELECT * FROM 'parquet/1/2025-01-15/10/*.parquet';

# Polars
import polars as pl
df = pl.read_parquet("parquet/1/2025-01-15/10/events.parquet")

# Pandas
import pandas as pd
df = pd.read_parquet("parquet/1/2025-01-15/10/events.parquet")

Also readable by Apache Spark, ClickHouse (file() table function), and PyArrow.
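Because each hour lands in its own directory, reading a full day means combining several files. The following is a rough sketch using pandas, with hypothetical helper names `day_files` and `read_day`; it assumes the directory layout from the File organization section and that the hour directories sort correctly as strings. Passing `columns` reads only the named columns, which is where a columnar format pays off.

```python
import glob


def day_files(root, workspace_id, date, kind="events"):
    """Hypothetical helper: list one data type's hourly files for a single day."""
    pattern = f"{root}/{workspace_id}/{date}/*/{kind}.parquet"
    return sorted(glob.glob(pattern))


def read_day(root, workspace_id, date, kind="events", columns=None):
    """Hypothetical helper: load a day's files into one DataFrame."""
    import pandas as pd  # imported lazily so day_files works without pandas

    paths = day_files(root, workspace_id, date, kind)
    if not paths:
        raise FileNotFoundError(f"no {kind} files for {workspace_id} on {date}")
    # Read only the requested columns from each hourly file, then stack them.
    frames = [pd.read_parquet(path, columns=columns) for path in paths]
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    df = read_day("parquet", 1, "2025-01-15",
                  columns=["timestamp", "event_type", "event_name"])
    print(len(df))
```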

When to use Parquet vs Arrow IPC

Use Parquet for cold data: archival, compliance, data warehousing. Compression ratios are excellent (especially with Zstd) and the format is widely supported.

Use Arrow IPC for hot data: real-time dashboards, frequent reads, inter-process communication. Arrow IPC is ~10x faster to read but files are larger.
