comparison - loki (AGPL) #8

Open
opened 2024-12-31 16:59:15 +00:00 by unusualevent · 10 comments
Owner

Basically similar to https://grafana.com/oss/loki/ - index by timestamp, host, log file, and tags. The body of the log is un-indexed, but tags can be set based on dynamic regexes, and alerts fired based on rules during ingest. Otherwise, logs can be searched over a specific date range.

No horizontal scaling, no multiple backends, no AI/ML.

Basic Bayesian filtering, basic community library of rules.

Grafana in general has great marketing blogs/videos, and it'd be cool to get good enough at marketing to be like their DevRel: https://grafana.com/blog/2023/01/10/watch-5-tips-for-improving-grafana-loki-query-performance/

Could probably support LogQL, or a GUI version of it. It's fairly intuitive/simple.

https://grafana.com/docs/loki/latest/query/log_queries/#parser-expression?pg=blog&plcmt=body-txt
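
For a feel of the syntax, a couple of stock Grafana Loki LogQL queries (the host/source label names are made up for illustration, nothing specific to this project) - the label matcher + line filter + parser pipeline is the part that would need an equivalent here:

```logql
# all nginx logs from one host containing "error"
{host="web-01", source="nginx"} |= "error"

# pull a status code out of the line with the regexp parser, then filter on it
{host="web-01", source="nginx"} | regexp `status=(?P<status>\d+)` | status >= 500
```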

Author
Owner

It's not glamorous or endlessly scalable: it doesn't trust the S3 storage (age encrypted, minisign signed), uses sqlite indexing, $10/TB stored, search is throttled, and ingest is throttled (0.5 TB/month) but can be $10/TB if unthrottled. Both are billed in per-TB increments. Need to calculate the average EPS and size of events, e.g. 2,000 EPS per host (syslog default) at 1,500-byte logs = ~24 Mbit/s (3 MB/s), which saturates the 0.5 TB allowance in about 2 days (assuming no compression); ~200 KB/s would be about 0.5 TB over 30 days without compression. Other log volume estimators exist: https://logstail.com/log-volume-estimation/ - some log forwarders could be around 30 MB/s to 100 MB/s.

Starting at $10/month, which includes your first 1 TB of logs stored and 0.5 TB of ingest/month. Each additional TB stored is $10; unthrottled ingest is $20/TB. (That covers both the Vultr overage pricing and room for more nodes/scaling the instance.)
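
A quick back-of-the-envelope check on those numbers (same assumptions as above - 2,000 EPS at 1,500 bytes per event, no compression - nothing measured):

```go
package main

import "fmt"

func main() {
	const (
		eps       = 2000.0 // events/sec per host (syslog default assumption)
		eventSize = 1500.0 // bytes per event, uncompressed
		capTB     = 0.5    // monthly ingest allowance in TB
	)
	bytesPerSec := eps * eventSize
	fmt.Printf("throughput: %.1f MB/s (%.0f Mbit/s)\n", bytesPerSec/1e6, bytesPerSec*8/1e6)

	capBytes := capTB * 1e12
	daysToCap := capBytes / (bytesPerSec * 86400)
	fmt.Printf("0.5 TB allowance saturates in %.1f days\n", daysToCap)

	// inverse: what sustained rate fits in 0.5 TB over 30 days?
	fitRate := capBytes / (30 * 86400)
	fmt.Printf("rate that fits in 30 days: %.0f KB/s\n", fitRate/1e3)
}
```

Running it prints ~3 MB/s (24 Mbit/s), roughly 1.9 days to hit 0.5 TB, and ~193 KB/s as the sustained rate that fits the monthly allowance.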

Author
Owner

sqlite index: on the ingestion box, tag log batches and compress them in chunks (e.g. 15-minute increments, 2,000 events, or 128 MB, whichever is larger). An old project idea was Parquet, Drill, and MinIO (no ZooKeeper or k8s) - like this thing: https://www.jowanza.com/blog/2017/7/1/jathena https://github.com/josep2/Jathena - but it doesn't need as many columns.

Author
Owner

index format sqlite tables (for logs vs for user preferences/searches):

need to relate: ingest API key, ingest time, event time, host, log source, tags -> which chunk it's in -> raw log

folders: (each customer gets their own bucket)
/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7/chunk_ms.sqlite.gz
/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7.sqlite.gz

epoch indexes:

  • tags_observed: tag, chunk
  • hosts_observed: host_nick, chunk
  • sources_observed: source_nick, chunk

log_chunk.sqlite.gz:

  • legend: tag
  • hosts: common, host_nick
  • tag_links: tag, event (ingest uuid7)
  • events: event, host_nick, raw string
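
A minimal DDL sketch of those two sqlite files - table and column names come straight from the bullets above, but the types, keys, and indexes are guesses:

```go
package schema

// epochIndexDDL is the per-epoch index (epoch_ms_uuid7.sqlite.gz):
// it only maps tags/hosts/sources to the chunks they appear in.
const epochIndexDDL = `
CREATE TABLE tags_observed    (tag         TEXT NOT NULL, chunk TEXT NOT NULL);
CREATE TABLE hosts_observed   (host_nick   TEXT NOT NULL, chunk TEXT NOT NULL);
CREATE TABLE sources_observed (source_nick TEXT NOT NULL, chunk TEXT NOT NULL);
CREATE INDEX idx_tags    ON tags_observed(tag);
CREATE INDEX idx_hosts   ON hosts_observed(host_nick);
CREATE INDEX idx_sources ON sources_observed(source_nick);
`

// chunkDDL is the per-chunk file (chunk_ms.sqlite.gz) holding the raw events.
const chunkDDL = `
CREATE TABLE legend    (tag TEXT PRIMARY KEY);
CREATE TABLE hosts     (common TEXT, host_nick TEXT PRIMARY KEY);
CREATE TABLE tag_links (tag TEXT NOT NULL, event TEXT NOT NULL); -- event = ingest uuid7
CREATE TABLE events    (event TEXT PRIMARY KEY, host_nick TEXT, raw TEXT NOT NULL);
`
```

Keeping the raw rows in the chunk files and only the chunk pointers in the epoch index is what lets search pull down the small index files first and fetch chunks selectively.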

ingest process:

  • find host based on ingest API key
  • based on the parsed log's rules, pick a cache path (for searching recently)
  • based on the parsed log's rules, pick a retention path
  • if current epoch is full, make a new epoch index
  • if current chunk is full, make a new chunk
  • add events to chunk (while annotating epoch index with additions)
  • keep the last ~20 GB of chunks on the VPS, but still upload them
  • keep the last ~5 GB of epoch indexes on the VPS, but still upload them
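
A rough sketch of the middle of that loop (rolling chunks and noting additions in the epoch index) - every type and name here is a placeholder, and the compress/encrypt/sign/upload steps are left out:

```go
package ingest

import "time"

// Minimal in-memory stand-ins for the on-disk chunk/epoch sqlite files.
type Event struct {
	Host, Source, Raw string
	Tags              []string
}

type Chunk struct {
	ID     string
	Opened time.Time
	Bytes  int
	Events []Event
}

type Epoch struct {
	ID         string
	Chunks     []*Chunk
	IndexDirty bool // epoch index has un-uploaded additions
}

// full uses the default cache-policy numbers (128 MB / 2,000 events / 15 min).
func (c *Chunk) full(now time.Time) bool {
	return c.Bytes >= 128<<20 ||
		len(c.Events) >= 2000 ||
		now.Sub(c.Opened) >= 15*time.Minute
}

// Append covers the middle of the ingest list: roll the chunk if it's full,
// add the event, and note the addition in the epoch index. Resolving the host
// from the ingest API key and picking cache/retention paths happen before
// this; uploading finished chunks and trimming the local cache happen after.
func (e *Epoch) Append(ev Event, now time.Time) {
	var cur *Chunk
	if n := len(e.Chunks); n > 0 {
		cur = e.Chunks[n-1]
	}
	if cur == nil || cur.full(now) {
		cur = &Chunk{ID: newChunkID(now), Opened: now}
		e.Chunks = append(e.Chunks, cur)
	}
	cur.Events = append(cur.Events, ev)
	cur.Bytes += len(ev.Raw)
	e.IndexDirty = true // tags/hosts/sources observed -> epoch index
}

// newChunkID stands in for the chunk_ms / UUIDv7-style naming.
func newChunkID(now time.Time) string {
	return now.UTC().Format("20060102T150405.000000000")
}
```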

search process:

  • find host based on search API key
  • use date range to download epoch indexes (if not downloaded)
  • scan for hosts, tags, sources to find chunks needed
  • download required chunks (if not already downloaded)
  • if the local cache is full, pick a chunk that is not dirty and discard it
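
The "scan for hosts, tags, sources to find chunks needed" step could be a plain query against a downloaded epoch index (after it's been decrypted and gunzipped) - a sketch against the table layout from the earlier comment, assuming modernc.org/sqlite as the driver:

```go
package search

import (
	"database/sql"

	_ "modernc.org/sqlite" // pure-Go sqlite driver (assumption, not settled)
)

// chunksForTag scans one epoch index for the chunks that contain a given tag
// on a given host, per the tags_observed/hosts_observed tables sketched above.
func chunksForTag(indexPath, tag, hostNick string) ([]string, error) {
	db, err := sql.Open("sqlite", indexPath)
	if err != nil {
		return nil, err
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT DISTINCT t.chunk
		FROM tags_observed t
		JOIN hosts_observed h ON h.chunk = t.chunk
		WHERE t.tag = ? AND h.host_nick = ?`, tag, hostNick)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var chunks []string
	for rows.Next() {
		var c string
		if err := rows.Scan(&c); err != nil {
			return nil, err
		}
		chunks = append(chunks, c)
	}
	return chunks, rows.Err()
}
```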

on machine prune process:

  • look at cache policy folder (either a GB limit or a date limit)
  • require 1GB free space on machine
  • find epochs entirely past time, delete whole folder (require upload first)
  • for the last remaining folder, if the cache is still full, find chunks beyond the time range

prune process:

  • look at retention folders
  • find epochs entirely past time, delete the whole folder
  • find chunks of the oldest remaining epoch past time, remove
index format sqlite tables (for logs vs for user preferences/searches): need to relate: ingest api key, ingest time, event time, host, log source, tags -> which chunk it's in -> raw log folders: (each customer gets their own bucket) `/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7/chunk_ms.sqlite.gz` `/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7.sqlite.gz` epoch indexes: - tags_observed: tag, chunk - hosts_observed: host_nick, chunk - sources_observed: source_nick, chunk log_chunk.sqlite.gz: - legend: tag - hosts: common, host_nick - tag_links: tag, event (ingest uuid7) - event, host_nic, raw string ingest process: - find host based on ingest API key - based on parsed log's rules, pick a cache path (for searching recently) - based on parsed log's rules, pick a retention path - if current epoch is full, make a new epoch index - if current chunk is full, make a new chunk - add events to chunk (while annotating epoch index with additions) - keep last ~20 gb of chunks on vps, but still upload them - keep last ~5 gb of epoch indexes on vps, but still upload them search process: - find host based on search API key - use date range to download epoch indexes (if not downloaded) - scan for hosts, tags, sources to find chunks needed - download required chunks (if not already downloaded) - if full, ensure chunk is not dirty and discard one on machine prune process: - look at cache policy folder (either a GB limit or a date limit) - require 1GB free space on machine - find epochs entirely past time, delete whole folder (require upload first) - for the last folder, if still full, find chunk beyond time range prune process: - look at retention folders - find epoch's entirely past time, delete whole folder - find chunks of the oldest remaining epoch past time, remove
Author
Owner

I forgot the max amount of data that can be encrypted under one key before you need a new one, but ideally the recipient encryption should be readable on multiple boxes (e.g. ingest-A's key and ingest-B's key), or allow ingest-A-old plus ingest-A-new.
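
If I remember the age format right, each file gets a fresh random file key that's wrapped once per recipient, so the long-lived X25519 key only ever wraps file keys - rotation is about who can open future files, not a per-key data cap. Encrypting a chunk to several recipients is then just a list of recipients at write time; a sketch with filippo.io/age:

```go
package cryptostore

import (
	"io"

	"filippo.io/age"
)

// encryptChunk writes an age-encrypted copy of an (already compressed) chunk,
// readable by every listed recipient - e.g. ingest-A's and ingest-B's public
// keys, or an old and a new key during rotation. Keys are age X25519 public
// keys ("age1...").
func encryptChunk(dst io.Writer, plaintext io.Reader, recipientKeys []string) error {
	recipients := make([]age.Recipient, 0, len(recipientKeys))
	for _, k := range recipientKeys {
		r, err := age.ParseX25519Recipient(k)
		if err != nil {
			return err
		}
		recipients = append(recipients, r)
	}
	w, err := age.Encrypt(dst, recipients...)
	if err != nil {
		return err
	}
	if _, err := io.Copy(w, plaintext); err != nil {
		return err
	}
	return w.Close() // Close finalizes the age payload; don't skip it
}
```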

Per prefix, the target is 10,000 files (which won't happen, so it's okay to skip the g/it/ folder schema and just use epoch UUIDs and chunk UUIDs inside those - there may be a multi-page list operation if a prefix goes over 1,000 files). The name needs to be less than 1024 bytes long.

/cache_policy/retention_policy/host_key/epoch_ms/chunk_ms.sqlite.gz
/0000d/0000d/01941e30-460a-7171-807b-ef170a61fa12/01941e30-460a-7171-807b-ef170a61fa12/01941e30-460a-7171-807b-ef170a61fa12

123 characters
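
A tiny helper showing that key layout plus the 1024-byte name check (the 123-character example above counts the bare components; the .sqlite.gz suffix adds a little more):

```go
package keys

import "fmt"

// objectKey builds an S3 object key following the layout above. S3 key names
// are limited to 1024 bytes, so the UUIDv7-based example leaves plenty of
// headroom.
func objectKey(cachePolicy, retentionPolicy, hostKey, epochID, chunkID string) (string, error) {
	k := fmt.Sprintf("/%s/%s/%s/%s/%s.sqlite.gz", cachePolicy, retentionPolicy, hostKey, epochID, chunkID)
	if len(k) > 1024 {
		return "", fmt.Errorf("object key too long: %d bytes", len(k))
	}
	return k, nil
}
```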

Author
Owner

cache policy is a bounded-and:

  • one of {128 MB, 2,000 events, 15 minutes}
  • and one of {512 MB, 10,000 events, 1 hour}

^ allow them to tune that.
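
One way to represent that in code. My reading of "bounded-and" is an assumption: a soft bound picked from the first set plus a hard cap from the second, with either one forcing a chunk roll (the hard cap matters when the tuned soft bound is on a different axis, e.g. soft = 15 minutes but a volume spike hits 512 MB first):

```go
package policy

import "time"

// Bound is one tunable limit; zero values mean "unset".
type Bound struct {
	MaxBytes  int64
	MaxEvents int
	MaxAge    time.Duration
}

func (b Bound) hit(bytes int64, events int, age time.Duration) bool {
	return (b.MaxBytes > 0 && bytes >= b.MaxBytes) ||
		(b.MaxEvents > 0 && events >= b.MaxEvents) ||
		(b.MaxAge > 0 && age >= b.MaxAge)
}

// CachePolicy pairs a soft bound from the first set (128 MB / 2,000 events /
// 15 min) with a hard cap from the second (512 MB / 10,000 events / 1 hour).
type CachePolicy struct {
	Soft, Hard Bound
}

// ShouldRoll reports whether the current chunk should be closed out.
func (p CachePolicy) ShouldRoll(bytes int64, events int, age time.Duration) bool {
	return p.Soft.hit(bytes, events, age) || p.Hard.hit(bytes, events, age)
}
```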

  • web dash host (manages API keys, registers log processors, rotates log processor keys)
  • log processors (receive syslog, process rules/alerts, encrypt and upload, download and search)
Author
Owner

configurable: age key rotations and partitioning
key phases: a per-log-processor age key (rotated in multiples of 7 days, so that's 13 age keys in play over 90 days)
the log processor does not keep older age keys around unless asked to search (then the web dash will send over its key)
log processors encrypt data to their own keys (vs. multiple recipient keys)

e.g. if a log box is hacked, the attacker could decrypt the current 7 days of logs that that box uploaded (they would have to download some from S3 if the box only keeps ~30 GB of logs locally)

If logs are partitioned, then each upload host would have to be the one to search. If the keys are handed out by the web dash (e.g., an epoch collection key, and a chunk-encrypting key for that epoch), then the dash can coordinate letting multiple hosts search (by sending collection keys, doing it itself, or checking whether a host has a cached epoch file). The dash itself should be booted with a main age key to decrypt its own sqlite.

Author
Owner

Some log processors will only be uploaders, so they could be writing to a public key they can't decrypt with.

Author
Owner

90-day upload API key anyway? Automatically rotate age keys (auto-rotate syslog keys too? - syslog addresses are upload-only, but could have them expire after 2 years?). Maybe 90 min, 2 year max? The refresh (e.g., session) token would be 1 day, with automatic rotation and reuse detection (a refresh token can be renewed a certain number of times, and can also be revoked).

Author
Owner

Log collectors by default don't have the private keys they're encrypting to, so by default they can't read anything until a cache request condition comes in. That could be separated out by requiring a hot-standby search node that does have the keys and operates on a separate address from the other type (e.g., rules dash + key storage, ingest hosts (public keys only), search hosts (read-only), S3 bucket (encrypted data only)).

Author
Owner

If the prefix includes the processing node (/cache_policy/retention_policy/collector_id/host_id/epoch_ms/chunk_ms.sqlite.gz.age), then two simultaneous collectors could upload. When a new epoch is made, a .todo file is left in the epoch folder, and the epoch isn't finished until the index is uploaded (so a crash would re-scan the known chunks). Similarly, if a chunk isn't uploaded or isn't the same size as the one on the collector, it replaces the one on S3 and the epoch index is updated. Decrypt keys (used during search) live only in memory, not in swap, and are deleted after the search concludes. Unencrypted sqlite loaded in memory(?) while compressed and encrypted ones are on disk? Or use a LUKS or gocryptfs volume (password to unlock it given after the node boots?).
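
A crash-safety sketch around that .todo marker - an epoch directory only counts as finished once its index has been uploaded and the marker removed; names and layout here are illustrative, not settled:

```go
package epochs

import (
	"os"
	"path/filepath"
)

// startEpoch creates the epoch folder with a marker meaning
// "this epoch's index has not been uploaded yet".
func startEpoch(dir string) error {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, ".todo"), nil, 0o600)
}

// finishEpoch uploads the index, then clears the marker. If the upload fails,
// the marker stays and recovery will retry.
func finishEpoch(dir string, uploadIndex func(dir string) error) error {
	if err := uploadIndex(dir); err != nil {
		return err
	}
	return os.Remove(filepath.Join(dir, ".todo"))
}

// unfinishedEpochs is what a collector would scan on boot: any epoch dir that
// still has a .todo needs its chunks re-scanned and its index rebuilt/uploaded.
func unfinishedEpochs(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var dirty []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		if _, err := os.Stat(filepath.Join(root, e.Name(), ".todo")); err == nil {
			dirty = append(dirty, filepath.Join(root, e.Name()))
		}
	}
	return dirty, nil
}
```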
