Circling back to OnlyJunk.Fans development, I'm pondering all things ACME and sensitive data and whatnot.
I do not want to store anything on the filesystem if I can avoid it. So the serialized Let's Encrypt credentials and the obtained certificates will go in the database. But I can't store them in plaintext.
So I guess I will need another secret (there's already a session secret, and the database URL may contain secrets too), and do some AES-GCM-SIV or XChaCha20-Poly1305 magic to store them encrypted in the db.
This will be fun.
Went with XChaCha20-Poly1305, because it promised fewer footguns. Now I "only" need to figure out how to make using it easy.
@buherator There's chacha20poly1305, which is pure Rust, and tink-rust depends on it too. :)
For what I need, the raw library is fine.
@buherator oh, sorry. By "make it easy", I meant building an abstraction on top, so the majority of my software doesn't need to know that the data is stored encrypted.
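Roughly, the shape I have in mind is an API where callers read and write plaintext and only the storage layer ever sees ciphertext. A minimal sketch of that idea (all names made up; the `XorCipher` is a toy stand-in for where chacha20poly1305's `XChaCha20Poly1305`, with a random 24-byte nonce stored next to the ciphertext, would actually go):

```
use std::collections::HashMap;

// The seam: swap the toy cipher for a real AEAD without
// touching any of the code that uses SecretStore.
trait Cipher {
    fn seal(&self, plaintext: &[u8]) -> Vec<u8>;
    fn open(&self, ciphertext: &[u8]) -> Option<Vec<u8>>;
}

// TOY stand-in, NOT real crypto: single-byte XOR.
struct XorCipher {
    key: u8,
}

impl Cipher for XorCipher {
    fn seal(&self, plaintext: &[u8]) -> Vec<u8> {
        plaintext.iter().map(|b| b ^ self.key).collect()
    }
    fn open(&self, ciphertext: &[u8]) -> Option<Vec<u8>> {
        Some(ciphertext.iter().map(|b| b ^ self.key).collect())
    }
}

/// What the rest of the software talks to: get/put plaintext by id,
/// while the backing store (a HashMap here, the db in reality)
/// only ever holds ciphertext.
struct SecretStore<C: Cipher> {
    cipher: C,
    rows: HashMap<String, Vec<u8>>,
}

impl<C: Cipher> SecretStore<C> {
    fn new(cipher: C) -> Self {
        Self { cipher, rows: HashMap::new() }
    }
    fn put(&mut self, id: &str, plaintext: &[u8]) {
        self.rows.insert(id.to_string(), self.cipher.seal(plaintext));
    }
    fn get(&self, id: &str) -> Option<Vec<u8>> {
        self.rows.get(id).and_then(|ct| self.cipher.open(ct))
    }
}

fn main() {
    let mut store = SecretStore::new(XorCipher { key: 0x5a });
    store.put("acme/account", b"serialized credentials");
    // What's stored is not the plaintext...
    assert_ne!(store.rows["acme/account"], b"serialized credentials".to_vec());
    // ...but callers get the plaintext back.
    assert_eq!(store.get("acme/account").unwrap(), b"serialized credentials".to_vec());
}
```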
So... uhh. So I don't want to play with LE's staging server, because my desktop isn't visible from the internet, and I don't want to tunnel something home.
Buuut.. I'd need to install the root cert for that, because instant-acme doesn't seem to expose ways to augment what it trusts.
To install a CA, I'd need to set security.pki.certificateFiles, which would be fine, except I have not properly updated to NixOS 25.11 yet.
As in, my configuration builds, but it doesn't boot, because persistent stuff gets mangled up badly. I fixed that on the laptop and on the servers; I need to fix it on the desktop too.
Le sigh.
Fine. Let's do some WebDAV things meanwhile.
Rebased the hack onto main, and I'm feeling exhausted.
SELECT path FROM
(SELECT DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path
FROM blobs_meta
WHERE path LIKE 'foo%')
WHERE path IS NOT NULL;
Behold.
This yields:
path
----------------
foo/bar/
foo/index.html
foo/baz/
(3 rows)
In other words, the SQL query in the previous toot gives me a list of "directories" and "files" under a prefix, but only a single level of them. "Directories" have a / suffix, files don't.
Decades of exposure to regexps and Perl were not in vain: I can still write line noise!
Now, there's a problem: ideally, I'd also like to have modified timestamps, content types, and sizes.
And that's considerably harder to do this way!
I suppose two queries are acceptable here, one for directories, one for files. For directories, I need the modify time of the newest file under that path; for files, I need the file's own modification time.
The SQL monsters are multiplying.
@algernon
what does the query plan look like for these two?
@wolf480pl I have not written those two yet. I only have a plan for the one that returns the path only.
@wolf480pl No. My test database has like 5 files, so Postgres decided to do a sequential scan + filter instead.
ojf=# explain SELECT path FROM (SELECT DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path FROM blobs_meta WHERE host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e' and path LIKE 'foo%') WHERE path IS NOT NULL;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1.14..1.18 rows=3 width=32)
   Group Key: "substring"(blobs_meta.path, '^(?:){1,1}?(foo/[^/]+/?){1,1}(?:.*)$'::text)
   ->  Seq Scan on blobs_meta  (cost=0.00..1.13 rows=3 width=32)
         Filter: ((path ~~ 'foo%'::text) AND ("substring"(path, '^(?:){1,1}?(foo/[^/]+/?){1,1}(?:.*)$'::text) IS NOT NULL) AND (host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e'::uuid))
(4 rows)
@algernon yeah, not very indicative :/
anyway, what I'm expecting to happen on a large database is that it will:
1. first find the first occurrence of "foo/.*" via btree
2. then scan the index until the path doesn't start with "foo/", while checking every single path against the regex
3. deduplicate them
With 2 and 3 being the expensive part, because it goes through every single file under that directory
And the result will be much smaller than the amount of records this query goes through
1/
@algernon
So it'd make sense to only fetch metadata for the paths left after deduplication, except...
I just realized what you meant by "I need modify time for the newest file under that path"
oh god
@algernon
you don't have directories...
however!
for files directly in foo/, getting the modify time of the newest file under their path is still a correct result!
so you should be able to do it without two queries I think
sth like
```
SELECT substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#') AS path, max(modtime) AS modtime
FROM blobs_meta
WHERE ...
GROUP BY path
```
should handle both I think, without going through the table twice.
@wolf480pl Out of morbid curiosity, I went ahead and created a Monster.
SELECT * FROM
(SELECT
ROW_NUMBER() OVER (PARTITION BY path ORDER BY modified_at DESC) AS r,
t.*
FROM (SELECT
DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path,
content_type,
modified_at
FROM blobs_meta
WHERE
host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e' AND
path LIKE 'foo%') t
) x
WHERE
x.r = 1;
Which yields:
r | path | content_type | modified_at
---+----------------+--------------+-------------------------------
1 | foo/bar/ | text/html | 2026-02-04 23:38:56.226139+00
1 | foo/baz/ | text/html | 2026-02-04 23:51:54.911743+00
1 | foo/index.html | text/html | 2026-02-04 23:26:08.556731+00
The query plan for this is horrible, though. But it was fun.
@algernon oh, you also want the content_type of the last modified file?
@wolf480pl Yep. And the size, which is conveniently in another table [1].
[1] because blobs_meta is shared between the original item and the compressed variants - but the size is obviously different between those. [2]
[2] this might end up being a terrible idea. It might be easier to roll the blob & blob_meta tables together, and have a separate compressed_blobs table. I'll figure that out once I get to implementing compression.
@wolf480pl To be honest, it's probably easiest to do two queries: one for directories, where I need neither size nor content type, and I'm okay with having the modified time set to the Unix epoch.
Then, a reasonably simple query can take care of directories: SELECT DISTINCT(blah) AS path FROM blobs_meta WHERE path LIKE 'foo/%' AND host_id = $1.
And I can make a separate query for files:
SELECT * FROM blobs_meta
WHERE
path SIMILAR TO '%foo/[^/]+' AND
host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e';
No need to distinct here, because SIMILAR TO already limits it to the exact set I want.
@wolf480pl Also: this is something that's done rarely: only when using WebDAV (and later probably S3 or SFTP or whatever else I think of) to mess with one's files. It is not on any kind of hot path that runs a gazillion times a day. This is something that will run maybe a few thousand times on a busy day, I think.
@wolf480pl @algernon Depends entirely on what indexes are put on there - if there’s one on just host_id, it’s extremely likely that it will just scan all files for said host because that’s the most precise filter
What you’d hope to get is a range scan between host_id and the prefix, till the end of the prefix on a single index, with a streaming aggregation for deduplication.
Whether postgres will give that to you is another question
@ignaloidas
it should be able to do that if you have a btree index on (host_id, path), in that order.
Whether it'll think that this plan has the lowest cost is another question.
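For reference, a sketch of the index being discussed (table and column names taken from the thread; note that for Postgres to use a btree for a LIKE prefix scan, the path column needs either the C collation or the text_pattern_ops operator class):

```
-- Composite btree, host first, path second:
CREATE INDEX blobs_meta_host_path_idx
    ON blobs_meta (host_id, path text_pattern_ops);

-- With this, "host_id = $1 AND path LIKE 'foo/%'" can become a
-- range scan over one contiguous slice of the index, roughly:
--   host_id = $1 AND path >= 'foo/' AND path < 'foo0'
```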
@algernon