Clickhouse

The Clickhouse scraper executes SQL queries against a Clickhouse database, or against data in cloud storage systems (AWS S3, Azure Blob Storage), and creates configuration items from the query results. This allows you to treat data stored in Clickhouse or cloud storage as configuration items that can be tracked and monitored.

Use Cases

  • Data Warehouse Integration: Treat analytical data as configuration items
  • Business Logic Tracking: Monitor business rule changes stored in Clickhouse
  • Cloud Storage Monitoring: Track files and data stored in S3 or Azure Blob Storage
  • Data Pipeline Monitoring: Monitor data transformation results as configuration changes
  • Historical Data Analysis: Create configuration items from time-series data
clickhouse-scraper.yaml

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-scraper
spec:
  clickhouse:
    - query: |
        SELECT
          concat('ORD-', toString(10000 + number)) as order_id,
          ['Electronics', 'Clothing', 'Books', 'Home', 'Sports'][rand() % 5 + 1] as category,
          ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'][rand() % 5 + 1] as city,
          round((rand() % 50000) / 100, 2) as amount,
          ['completed', 'pending', 'cancelled'][rand() % 3 + 1] as status,
          toDateTime('2024-01-01 00:00:00') + toIntervalSecond(rand() % 31536000) as order_date
        FROM numbers(1000)
      type: Order
      id: $.order_id
      transform:
        # full: true
        expr: "[config].toJSON()"
| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| logLevel | Specify the level of logging. | string | |
| schedule | Specify the interval to scrape in cron format. Defaults to every 60 minutes. | Cron | |
| full | Set to true to extract changes from scraped configurations. Defaults to false. | bool | |
| retention | Settings for retaining changes, analysis and scraped items. | Retention | |
| clickhouse | Specifies the list of Clickhouse configurations to scrape. | []Clickhouse | true |
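
For example, a sketch that combines these top-level fields (the schedule value is illustrative):

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-hourly
spec:
  schedule: "@every 1h" # cron-style schedule; illustrative value
  full: true            # also extract changes & access logs
  clickhouse:
    - clickhouseURL: "clickhouse://user:password@localhost:9000/ecommerce"
      query: SELECT order_id, status FROM orders LIMIT 100
      type: Order
      id: $.order_id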

Clickhouse

| Field | Description | Scheme |
|-------|-------------|--------|
| query* | SQL query to execute against Clickhouse. | string |
| awsS3 | AWS S3 configuration for cloud storage access. | AWSS3 |
| azureBlobStorage | Azure Blob Storage configuration for cloud storage access. | AzureBlobStorage |
| clickhouseURL | Clickhouse connection URL in the format clickhouse://user:password@host:port/database?param1=value1&param2=value2 | string |

Mapping

Custom scrapers require you to define an id and type for each scraped item. For example, when you scrape a file containing a JSON array where each element represents a config item, you must specify the id and type for those items. You do this with mappings in your custom scraper configuration, as sketched below.
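
A minimal sketch, assuming a hypothetical people.json whose top-level people field holds a JSON array of employee records:

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-mapping-example
spec:
  file:
    - paths:
        - fixtures/data/people.json # hypothetical file
      items: $.people               # extract each array element as its own item
      id: $.employee_id             # hypothetical field used as the ID
      name: $.full_name             # hypothetical field used as the name
      type: Person                  # static type applied to every item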

| Field | Description | Scheme |
|-------|-------------|--------|
| id* | A static value or JSONPath expression to use as the ID for the resource. | string or JSONPath |
| name* | A static value or JSONPath expression to use as the name for the resource. | string or JSONPath |
| type* | A static value or JSONPath expression to use as the type for the resource. | string or JSONPath |
| class | A static value or JSONPath expression to use as the class for the resource. | string or JSONPath |
| createFields | A list of JSONPath expressions used to identify the created time of the config. If multiple fields are specified, the first non-empty value is used. | []jsonpath |
| deleteFields | A list of JSONPath expressions used to identify the deleted time of the config. If multiple fields are specified, the first non-empty value is used. | []jsonpath |
| description | A static value or JSONPath expression to use as the description for the resource. | string or JSONPath |
| format | Format of the config item. Defaults to JSON; available options are JSON and properties. See Formats. | string |
| health | A static value or JSONPath expression to use as the health of the config item. | string or JSONPath |
| items | A JSONPath expression used to extract individual items from the resource. Items are extracted first, and then the ID, name, type, and transformations are applied to each item. | JSONPath |
| status | A static value or JSONPath expression to use as the status of the config item. | string or JSONPath |
| timestampFormat | A Go time format string used to parse timestamps in createFields and deleteFields. (Default: RFC3339) | string |
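
For instance, a sketch of timestamp mapping, assuming hypothetical created_at and deleted_at fields stored in a custom layout:

createFields:
  - $.created_at                 # the first non-empty value wins
  - $.metadata.creationTimestamp
deleteFields:
  - $.deleted_at
timestampFormat: "2006-01-02 15:04:05" # Go reference-time layout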

Formats

JSON

The scraper stores config items as jsonb fields in PostgreSQL.

Resource providers typically return JSON, e.g. kubectl get -o json or aws --output=json.

When you display the config, the UI automatically converts the JSON data to YAML for improved readability.

XML / Properties

The scraper stores non-JSON files as JSON using:

{ "format": "xml", "content": "<root>..</root>" }

You can still access non-JSON content in scripts using config.content.

The UI formats and renders XML appropriately.
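
As a minimal sketch, modeled on the transform in the first example on this page (the expression itself is hypothetical), a CEL transform can read the raw text via config.content:

transform:
  # config.content holds the raw XML/properties text for non-JSON items
  expr: "config.content.contains('<root>') ? [config].toJSON() : '[]'"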

Extracting Changes & Access Logs

Custom scrapers ingest changes & access logs from external systems when you enable the full option.

Every config is expected to have these three top-level fields:

  • config
  • changes
  • access_logs
info

A config can contain additional fields, or omit some of these; only these three fields are extracted.

Consider a file that contains the following JSON data:

{
  "reg_no": "A123",
  "config": {
    "meta": "this is the actual config that'll be stored."
  },
  "changes": [
    {
      "action": "drive",
      "summary": "car color changed to blue",
      "unrelated_stuff": 123
    }
  ],
  "access_logs": [
    {
      "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
      "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
      "created_at": "2025-01-01"
    },
    {
      "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
      "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
      "created_at": "2025-01-02"
    }
  ]
}

A regular scraper saves the entire JSON as a config. With the full option, however, the scraper extracts the config, changes, and access logs separately.

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-scraper
spec:
  full: true
  file:
    - type: Car
      id: $.reg_no
      paths:
        - fixtures/data/car_changes.json

The resulting config is:

{
  "meta": "this is the actual config that'll be stored."
}

and the scraper records the following new config change on that config:

{
  "action": "drive",
  "summary": "car color changed to blue",
  "unrelated_stuff": 123
}

and the access logs are saved as:

[
  {
    "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
    "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
    "created_at": "2025-01-01"
  },
  {
    "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
    "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
    "created_at": "2025-01-02"
  }
]

AWSS3

| Field | Description | Scheme |
|-------|-------------|--------|
| accessKey | AWS access key for authentication. | EnvVar |
| bucket | S3 bucket name. | string |
| endpoint | Custom S3 endpoint URL. | string |
| path | Path within the S3 bucket. | string |
| region | AWS region. | string |
| secretKey | AWS secret key for authentication. | EnvVar |
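
Because accessKey and secretKey are EnvVar fields, they can be populated from Kubernetes secrets with the same valueFrom pattern used in the Azure example below. A sketch, with illustrative secret names:

awsS3:
  bucket: analytics-data
  region: us-east-1
  accessKey:
    valueFrom:
      secretKeyRef:
        name: aws-credentials # illustrative secret name
        key: ACCESS_KEY_ID
  secretKey:
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: SECRET_ACCESS_KEY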

AzureBlobStorage

| Field | Description | Scheme |
|-------|-------------|--------|
| collection* | Name of the collection in Clickhouse. See Named Collections. | string |
| account | Azure storage account name. | string |
| container | Azure blob container name. | string |
| endpoint | Azure endpoint suffix (default: core.windows.net). | string |
| path | Path within the container. | string |

info

A Clickhouse server instance must be deployed for this scraper to function.

Mission Control can deploy an instance by setting config-db.clickhouse.enabled: true in the Helm chart.

An external Clickhouse server can also be used via the clickhouseURL parameter.
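
For example, a minimal Helm values sketch that enables the bundled instance:

# values.yaml for the Mission Control Helm chart
config-db:
  clickhouse:
    enabled: true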

Configuration Examples

Direct Clickhouse Connection

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-orders
spec:
  clickhouse:
    - clickhouseURL: "clickhouse://user:password@localhost:9000/ecommerce"
      query: |
        SELECT
          order_id,
          customer_id,
          status,
          total_amount,
          created_at
        FROM orders
        WHERE status = 'completed'
        ORDER BY created_at DESC
        LIMIT 1000
      type: Order
      id: $.order_id

AWS S3 Integration

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-s3-data
spec:
  clickhouse:
    - awsS3:
        bucket: analytics-data
        path: exports/
        connection: connection://aws
      query: |
        SELECT *
        FROM s3('s3://analytics-data/exports/*.parquet')
        LIMIT 100
      type: AnalyticsData
      id: $.record_id

Azure Blob Storage Integration

apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-azure-logs
spec:
  clickhouse:
    - azureBlobStorage:
        account: mystorageaccount
        container: logs
        path: application-logs/
        collectionName: ApplicationLogs
        connectionString:
          valueFrom:
            secretKeyRef:
              name: azure-storage
              key: connection-string
      query: |
        SELECT
          timestamp,
          level,
          message,
          application
        FROM azure_blob_storage_table
        WHERE level IN ('ERROR', 'WARN')
      type: LogEntry
      id: $.timestamp