comparison - loki (AGPL) #8
basically similar to https://grafana.com/oss/loki/ - index by timestamp, host, log file, and tags. the body of the log is un-indexed, but tags can be set based on dynamic regexes (sketched below), and alerts fired from rules during ingest. otherwise, logs can be searched over a specific date range.
no horizontal scaling, no multi-backends, no AI/ML,
basic Bayesian filtering, basic community library of rules,
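a minimal sketch of the regex-based tagging at ingest mentioned above - the rule names and tag fields here are hypothetical, just to show the shape of it:

```python
import re

# hypothetical tag rules: name -> compiled regex with named groups
# (the real rule format would come from the community library of rules idea above)
TAG_RULES = {
    "ssh_fail": re.compile(r"Failed password for (?P<user>\S+) from (?P<src_ip>\S+)"),
    "http_5xx": re.compile(r'" (?P<status>5\d{2}) \d+$'),
}

def tag_event(raw_line: str) -> dict:
    """Return a dict of tags extracted from an otherwise un-indexed log body."""
    tags = {}
    for rule_name, pattern in TAG_RULES.items():
        m = pattern.search(raw_line)
        if m:
            tags["rule"] = rule_name
            tags.update(m.groupdict())   # e.g. user=..., src_ip=...
    return tags

# only the extracted tags get indexed; the raw body stays un-indexed
print(tag_event("Failed password for root from 203.0.113.7 port 22 ssh2"))
```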
grafana in general has great marketing blogs/videos, and it'd be cool to get good enough at marketing to be like their DevRel: https://grafana.com/blog/2023/01/10/watch-5-tips-for-improving-grafana-loki-query-performance/
could probably support LogQL, or a GUI version of it. It's fairly intuitive/simple.
https://grafana.com/docs/loki/latest/query/log_queries/#parser-expression?pg=blog&plcmt=body-txt
it's not glamorous or endlessly scalable, and it doesn't trust the S3 storage (age encrypted, minisign signed). sqlite indexing, $10/TB stored, search is throttled, ingest is throttled (.5 TB/month) but can be $10/TB if unthrottled. Both billed in per-TB increments. need to calculate the average EPS and size of events, e.g. 2,000 EPS per host (syslog default) at 1,500-byte logs = 3 MB/s (~24 Mbit/s), which saturates the .5 TB/month quota in about 2 days (assuming no compression); a steady ~200 KB/s works out to roughly .5 TB over 30 days without compression. other log volume estimators exist: https://logstail.com/log-volume-estimation/ - some log forwarders could be around 30 MB/s to 100 MB/s,
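a quick back-of-the-envelope check of those numbers (assumes uncompressed logs and the .5 TB/month throttle above):

```python
# back-of-the-envelope ingest volume, no compression
eps = 2_000            # events/sec per host (syslog default above)
event_bytes = 1_500    # bytes per event

bytes_per_sec = eps * event_bytes                 # 3,000,000 B/s = 3 MB/s (~24 Mbit/s)
tb_per_day = bytes_per_sec * 86_400 / 1e12        # ~0.26 TB/day
days_to_half_tb = 0.5 / tb_per_day                # ~1.9 days to hit the .5 TB throttle

# the steady rate that fits inside .5 TB/month
sustainable_bps = 0.5e12 / (30 * 86_400)          # ~193 KB/s (~ the "200 KB" figure)

print(f"{bytes_per_sec/1e6:.1f} MB/s, {tb_per_day:.2f} TB/day, "
      f"~{days_to_half_tb:.1f} days to .5 TB, sustainable ~{sustainable_bps/1e3:.0f} KB/s")
```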
starting at $10/month, which includes your first 1 TB of logs stored and .5 TB of ingest/month. Each additional TB stored is $10; unthrottled ingest is $20/TB. (this covers both the Vultr overage pricing and room for more nodes/scaling the instance)
sqlite index: on the ingestion box, tag log batches and compress them in chunks (e.g. 15-minute increments, 2,000 EPS, or 128 MB, whichever is larger) - old project idea was parquet, drill, minio (no zookeeper or k8s) - like this thing: https://www.jowanza.com/blog/2017/7/1/jathena https://github.com/josep2/Jathena - but it doesn't need as many columns
index format: sqlite tables (for logs vs. for user preferences/searches):
need to relate: ingest api key, ingest time, event time, host, log source, tags -> which chunk it's in -> raw log
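a rough sketch of what the per-epoch index could look like - table and column names are assumptions; only the relations listed above (api key, ingest/event time, host, log source, tags -> chunk -> raw log) come from this note:

```python
import sqlite3

# hypothetical schema for the epoch-level index; the raw logs themselves
# stay inside the compressed chunk files, so the index only maps down to a chunk_id
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id   TEXT PRIMARY KEY,   -- chunk_ms UUID / object name on S3
    host       TEXT NOT NULL,
    api_key_id TEXT NOT NULL,      -- ingest api key that uploaded it
    log_source TEXT NOT NULL,      -- e.g. /var/log/syslog
    ingest_min INTEGER NOT NULL,   -- epoch-ms bounds of ingest time
    ingest_max INTEGER NOT NULL,
    event_min  INTEGER NOT NULL,   -- epoch-ms bounds of event time
    event_max  INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS chunk_tags (
    chunk_id TEXT NOT NULL REFERENCES chunks(chunk_id),
    tag_key  TEXT NOT NULL,
    tag_val  TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tags ON chunk_tags(tag_key, tag_val);
CREATE INDEX IF NOT EXISTS idx_event_time ON chunks(event_min, event_max);
"""

con = sqlite3.connect("epoch_index.sqlite")
con.executescript(SCHEMA)
con.close()
```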
folders: (each customer gets their own bucket)
/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7/chunk_ms.sqlite.gz
/cache_policy/retention_policy/host_api_key/(g/it/)epoch_ms_uuid7.sqlite.gz
epoch indexes:
log_chunk.sqlite.gz:
ingest process:
search process:
on machine prune process:
prune process:
I forget the max amount of data that can be encrypted under one key before you need a new one, but ideally each encrypted object should be readable on multiple boxes (e.g. ingest-A's key and ingest-B's key as recipients), or at least allow ingest-A-old and ingest-A-new
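age does let you encrypt to multiple recipients in one pass, which covers the ingest-A / ingest-B (or old key / new key) case - a sketch that shells out to the age CLI; the recipient strings are placeholders, not real keys:

```python
import subprocess

def encrypt_chunk(path: str, recipients: list[str]) -> str:
    """Encrypt a compressed chunk so any one of the recipient keys can decrypt it."""
    out = path + ".age"
    cmd = ["age", "-o", out]
    for pubkey in recipients:     # e.g. ingest-A's key and ingest-B's key, or old + new key
        cmd += ["-r", pubkey]
    cmd.append(path)
    subprocess.run(cmd, check=True)
    return out

# usage (the recipient strings below stand in for real age public keys):
# encrypt_chunk("chunk_ms.sqlite.gz", [INGEST_A_PUBKEY, INGEST_B_PUBKEY])
```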
per prefix the target would be 10,000 files (which won't happen, so it's okay to skip the g/it/ folders schema and just use epoch UUIDs and chunk UUIDs inside those - there may be a multi-list operation if a prefix goes over 1,000 files). the object name needs to be less than 1024 bytes long:
/cache_policy/retention_policy/host_key/epoch_ms/chunk_ms.sqlite.gz
e.g. /0000d/0000d/01941e30-460a-7171-807b-ef170a61fa12/01941e30-460a-7171-807b-ef170a61fa12/01941e30-460a-7171-807b-ef170a61fa12
= 123 characters
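a sketch of building and length-checking one of those object keys, assuming the flat layout without the g/it/ prefix folders; the minimal uuid7() below is a stand-in since the Python stdlib doesn't ship one:

```python
import os, time, uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit unix-ms timestamp, version/variant bits, random tail."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    value = (ms << 80) | (0x7 << 76) | (rand & ((1 << 76) - 1))
    value = (value & ~(0b11 << 62)) | (0b10 << 62)   # RFC 4122 variant bits
    return uuid.UUID(int=value)

def chunk_key(cache_policy: str, retention_policy: str, host_key: str,
              epoch_id: uuid.UUID, chunk_id: uuid.UUID) -> str:
    key = f"/{cache_policy}/{retention_policy}/{host_key}/{epoch_id}/{chunk_id}.sqlite.gz"
    assert len(key.encode()) < 1024, "S3 object keys need to stay under 1024 bytes"
    return key

print(chunk_key("0000d", "0000d", str(uuid7()), uuid7(), uuid7()))
```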
cache policy is a bounded-and
^ allow them to tune that.
configurable: age key rotations and partitioning
key phases: per-log-processor age key (in multiples of 7, so that's 13 age keys in play in 90 days)
log processor does not keep around older age keys unless asked to search (then the web dash will send over its key)
log processors encrypt data to their own keys (vs. multiple recipient keys)
e.g. if a log box is hacked, the attacker could decrypt the current 7 days of logs that box uploaded (they would have to download some from S3 if the box only keeps ~30 GB of logs locally)
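a small sketch of the 7-day key-phase arithmetic (the phase numbering is an assumption):

```python
from datetime import datetime, timezone
from math import ceil

KEY_PERIOD_DAYS = 7
RETENTION_DAYS = 90

def key_phase(ts: datetime) -> int:
    """Which 7-day key phase a timestamp falls into (phase 0 starts at the unix epoch)."""
    return int(ts.timestamp()) // (KEY_PERIOD_DAYS * 86_400)

def keys_in_retention() -> int:
    # ceil(90/7) = 13 keys in play over a 90-day window, as noted above
    # (a window that isn't aligned to a phase boundary can touch one more)
    return ceil(RETENTION_DAYS / KEY_PERIOD_DAYS)

now = datetime.now(timezone.utc)
print(f"current phase {key_phase(now)}, "
      f"~{keys_in_retention()} age keys cover a {RETENTION_DAYS}-day window")
```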
if logs are partitioned, then each upload host would have to be the one to search them. if the keys are handed out by the web dash (e.g., an epoch collection key and a chunk-encrypting key for that epoch), then the dash can coordinate letting multiple hosts search (by sending collection keys, doing it itself, or checking whether a host has a cached epoch file). the dash itself should be booted with a main age key to decrypt its own sqlite.
some log processors will only be uploaders, so could be writing to a public key they don't have a read on.
90-day upload API key anyway? automatically rotate age keys (auto-rotate syslog keys too? syslog addresses are upload-only, but could have them expire after 2 years?). maybe 90 min, 2 year max? refresh tokens (e.g., sessions) would be 1 day, with automatic rotation and reuse detection (a refresh token can be renewed a certain number of times, and can also be revoked),
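a sketch of the refresh-token rotation + reuse-detection idea; the in-memory store and limits are made up (the real store would be the dash's sqlite):

```python
import secrets, time

MAX_RENEWALS = 30
REFRESH_TTL = 86_400          # 1 day, per the note above
_tokens = {}                  # token -> {"family", "renewals", "exp", "used"}
_revoked_families = set()

def issue(family: str | None = None, renewals: int = 0) -> str:
    tok = secrets.token_urlsafe(32)
    _tokens[tok] = {"family": family or tok, "renewals": renewals,
                    "exp": time.time() + REFRESH_TTL, "used": False}
    return tok

def refresh(tok: str) -> str:
    rec = _tokens.get(tok)
    if rec is None or time.time() > rec["exp"] or rec["family"] in _revoked_families:
        raise PermissionError("invalid, expired, or revoked refresh token")
    if rec["used"]:
        # reuse detection: an already-rotated token came back -> revoke the whole family
        _revoked_families.add(rec["family"])
        raise PermissionError("refresh token reuse detected; family revoked")
    if rec["renewals"] >= MAX_RENEWALS:
        raise PermissionError("renewal limit reached; re-authenticate")
    rec["used"] = True
    return issue(rec["family"], rec["renewals"] + 1)

t1 = refresh(issue())          # normal rotation
try:
    refresh(t1); refresh(t1)   # second use of t1 trips reuse detection
except PermissionError as e:
    print(e)
```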
log collectors by default don't have the private keys they're encrypting to, so by default they can't read anything until a cache request condition comes in. that could be separated out by requiring a hot-standby search node that does have the keys and operates on a separate address from the other types (e.g., rules dash + key storage, ingest hosts (only public keys), searching hosts (read only), S3 bucket (only encrypted data))
if the prefix includes the processing node (/cache_policy/retention_policy/collector_id/host_id/epoch_ms/chunk_ms.sqlite.gz.age), then two simultaneous collectors could upload. when a new epoch is made, a .todo file is left in the epoch folder, and the epoch isn't finished until the index is uploaded (so a crash would re-scan the known chunks). similarly, if a chunk isn't already uploaded and the same size as the one on the collector, the collector replaces the one on S3 and updates the epoch index. decrypt keys (used during search) are only held in cache, never in swap, and are deleted after the search concludes. unencrypted sqlite loaded in memory(?) while the compressed and encrypted ones stay on disk? or use a LUKS or gocryptfs volume (with the password to unlock it given after the node boots?)
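a sketch of that crash-recovery pass: re-scan an epoch that still has its .todo marker, re-upload any chunk that's missing or the wrong size, then upload the index and clear the marker. The S3 calls use boto3; bucket, file suffixes, and the index file name are assumptions:

```python
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "customer-bucket"   # each customer gets their own bucket

def finish_epoch(local_epoch_dir: str, s3_epoch_prefix: str) -> None:
    """Finish an epoch whose .todo marker is still present (e.g. after a crash)."""
    todo = os.path.join(local_epoch_dir, ".todo")
    if not os.path.exists(todo):
        return  # epoch already finalized

    for name in os.listdir(local_epoch_dir):
        if not name.endswith(".sqlite.gz.age"):
            continue
        local_path = os.path.join(local_epoch_dir, name)
        key = f"{s3_epoch_prefix}/{name}"
        try:
            remote_size = s3.head_object(Bucket=BUCKET, Key=key)["ContentLength"]
        except s3.exceptions.ClientError:
            remote_size = -1
        # missing or size mismatch -> replace the copy on S3
        if remote_size != os.path.getsize(local_path):
            s3.upload_file(local_path, BUCKET, key)

    # the epoch isn't finished until the index is uploaded; then drop the marker
    index_name = "epoch_index.sqlite.gz.age"   # assumed name for the epoch index
    s3.upload_file(os.path.join(local_epoch_dir, index_name),
                   BUCKET, f"{s3_epoch_prefix}/{index_name}")
    os.remove(todo)
```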