Circling back to OnlyJunk.Fans development, I'm pondering all things ACME and sensitive data and whatnot.

I do not want to store anything on the filesystem if I can avoid it. So the serialized LetsEncrypt credentials and the obtained certificates will go in the database. But I can't store them in plaintext.

So I guess I will need another secret (there's already a session-secret, and the database URL may contain secrets too), and do some AES-GCM-SIV or XChaCha20-Poly1305 magic to store them in the db encrypted.

This will be fun.

Went with XChaCha20-Poly1305, because it promised fewer footguns. Now I "only" need to figure out how to make using it easy.

@buherator There's chacha20poly1305, which is pure rust, and tink-rust depends on it too. :)

For what I need, the raw library is fine.

@algernon I'm recommending this because of the "how to make using it easy" part. The repos I linked are just examples; the APIs defined by these libraries are the gist.

@buherator oh, sorry. By "make it easy", I meant building an abstraction on top, so the majority of my software doesn't need to know that the data is stored encrypted.

So... uhh. I don't want to play with LE's staging server, because my desktop isn't visible from the internet, and I don't want to tunnel something home.

Buuut.. I'd need to install the root cert for that, because instant-acme doesn't seem to expose a way to augment what it trusts.

To install a CA, I'd need to set security.pki.certificateFiles, which would be fine, except I have not properly updated to NixOS 25.11 yet.

As in, my configuration builds, but it doesn't boot, because persistent stuff gets mangled up badly. I fixed that on the laptop and on the servers; I need to fix it on the desktop too.

Le sigh.

Fine. Let's do some WebDAV things meanwhile.

Rebased the hack onto main, and I'm feeling exhausted.

SELECT path FROM
    (SELECT DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path
     FROM blobs_meta
     WHERE path LIKE 'foo%')
  WHERE path IS NOT NULL;

Behold.

This yields:

      path
----------------
 foo/bar/
 foo/index.html
 foo/baz/
(3 rows)

In other words, the SQL query in the previous toot gives me a list of "directories" and "files" under a prefix, but only a single level of them. "Directories" have a / suffix, files don't.
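
For illustration, here is what that substring call extracts from a few hypothetical sample paths (the literals below are made up, so this is easy to try without touching blobs_meta):

```
-- Hypothetical sample paths, showing what the SIMILAR pattern captures:
SELECT p, substring(p SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#') AS extracted
FROM (VALUES ('foo/index.html'),
             ('foo/bar/a.txt'),
             ('foo/bar/deep/b.txt')) AS v(p);
-- 'foo/index.html'     -> 'foo/index.html'  (a "file", no trailing /)
-- 'foo/bar/a.txt'      -> 'foo/bar/'        (a "directory")
-- 'foo/bar/deep/b.txt' -> 'foo/bar/'        (collapses to the same "directory")
```

The `[^/]+` part cannot cross a slash, so everything deeper than one level collapses onto the same directory prefix, and DISTINCT then deduplicates those.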

Decades of exposure to regexps and Perl were not in vain: I can still write line noise!

Now, there's a problem: I'd like to have modified timestamps and content types too, ideally, and sizes.

And that's considerably harder to do this way!

I suppose two queries are acceptable here, one for directories, one for files. For directories, I need the modify time of the newest file under that path; for files, I need the file modification time.

The SQL monsters are multiplying.
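
A sketch of those two queries, assuming the same blobs_meta columns used elsewhere in this thread (host_id, path, content_type, modified_at) - not what actually got implemented:

```
-- Directories: the pattern requires a trailing '/', so plain files yield
-- NULL and are filtered out; take the newest modified_at per directory.
SELECT dir AS path, max(modified_at) AS modified_at
FROM (SELECT substring(path SIMILAR '#"foo/[^/]+/#"%' ESCAPE '#') AS dir,
             modified_at
      FROM blobs_meta
      WHERE host_id = $1 AND path LIKE 'foo/%') t
WHERE dir IS NOT NULL
GROUP BY dir;

-- Files: rows directly under foo/, i.e. with no further '/' after the prefix.
SELECT path, content_type, modified_at
FROM blobs_meta
WHERE host_id = $1 AND path LIKE 'foo/%' AND path NOT LIKE 'foo/%/%';
```

The file query needs no DISTINCT, since each matching row is already exactly one entry.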

@algernon
what does the query plan look like for these two?

@wolf480pl I have not written those two yet. I only have a plan for the one that returns the path only.

@wolf480pl No. My test database has like 5 files, so Postgres decided to do a sequential scan + filter instead.

ojf=# explain SELECT path FROM (SELECT DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path FROM blobs_meta WHERE host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e' and path LIKE 'foo%') WHERE path IS NOT NULL;
                                                                                       QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1.14..1.18 rows=3 width=32)
   Group Key: "substring"(blobs_meta.path, '^(?:){1,1}?(foo/[^/]+/?){1,1}(?:.*)$'::text)
   ->  Seq Scan on blobs_meta  (cost=0.00..1.13 rows=3 width=32)
         Filter: ((path ~~ 'foo%'::text) AND ("substring"(path, '^(?:){1,1}?(foo/[^/]+/?){1,1}(?:.*)$'::text) IS NOT NULL) AND (host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e'::uuid))
(4 rows)

@algernon yeah, not very indicative :/

anyway, what I'm expecting to happen on a large database is that it will:
1. first find the first occurrence of "foo/.*" via btree
2. then scan the index until the path doesn't start with "foo/", while checking every single path against the regex
3. deduplicate them

With 2 and 3 being the expensive part, because it goes through every single file under that directory

And the result will be much smaller than the amount of records this query goes through
1/

@algernon
So it'd make sense to only fetch metadata for the paths left after deduplication, except...

I just realized what you meant by "I need modify time for the newest file under that path"

oh god

@algernon
you don't have directories...

however!

for files directly in foo/, getting the modify time of the newest file under their path is still a correct result!

so you should be able to do it without two queries I think

sth like

```
SELECT substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#') AS path, max(modtime) AS modtime
FROM blobs_meta
WHERE ...
GROUP BY path
```
should handle both I think, without going through the table twice.

@wolf480pl Out of morbid curiosity, I went ahead and created a Monster.

SELECT * FROM
  (SELECT
     ROW_NUMBER() OVER (PARTITION BY path ORDER BY modified_at DESC) AS r,
     t.*
   FROM (SELECT
           DISTINCT(substring(path SIMILAR '#"foo/[^/]+/?#"%' ESCAPE '#')) AS path,
           content_type,
           modified_at
         FROM blobs_meta
         WHERE
           host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e' AND
           path LIKE 'foo%') t
  ) x
  WHERE
    x.r = 1;

Which yields:

 r |      path      | content_type |          modified_at
---+----------------+--------------+-------------------------------
 1 | foo/bar/       | text/html    | 2026-02-04 23:38:56.226139+00
 1 | foo/baz/       | text/html    | 2026-02-04 23:51:54.911743+00
 1 | foo/index.html | text/html    | 2026-02-04 23:26:08.556731+00

The query plan for this is horrible, though. But it was fun.

@algernon oh, you also want the content_type of the last modified file?

@wolf480pl Yep. And the size, which is conveniently in another table¹.


  1. because blobs_meta is shared between the original item and compressed variants - but the size is obviously different between those.² ↩︎

  2. this might end up being a terrible idea. It might be easier to roll the blob & blob_meta tables together, and have a separate compressed_blobs table. I'll figure that out once I get to implementing compression. ↩︎

@wolf480pl To be honest, it's probably easiest to do two queries: one for directories, where I need neither size nor content type, and I'm okay with having the modified time set to the Unix epoch.

Then, a reasonably simple query can take care of directories: SELECT DISTINCT(blah) AS path FROM blobs_meta WHERE path LIKE 'foo/%' AND host_id = $1.

And I can make a separate query for files:

SELECT * FROM blobs_meta
WHERE
  path SIMILAR TO '%foo/[^/]+' AND
  host_id = '9eafefe1-a2b0-4a5f-9810-f76b6c001d9e';

No need to distinct here, because SIMILAR TO already limits it to the exact set I want.

@wolf480pl Also: this is something that's done rarely: only when using WebDAV (and later probably S3 or SFTP or whatever else I think of) to mess with one's files. It is not on any kind of hot path that runs a gazillion times a day. This is something that will run maybe a few thousand times on a busy day, I think.

@wolf480pl @algernon Depends entirely on what indexes are put on there - if there’s one on just host_id, it’s extremely likely that it will just scan all files for said host because that’s the most precise filter

What you’d hope to get is a range scan between host_id and the prefix, till the end of the prefix on a single index, with a streaming aggregation for deduplication.

Whether postgres will give that to you is another question
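
One way to make that range scan possible at all is a composite btree index; the index name here is hypothetical, and whether the planner actually picks it is, as noted, another question:

```
-- Hypothetical composite index so (host_id, path-prefix) lookups can
-- become a single index range scan.
CREATE INDEX blobs_meta_host_path_idx
    ON blobs_meta (host_id, path text_pattern_ops);
```

The text_pattern_ops operator class (or a C collation on the column) is what lets Postgres turn `path LIKE 'foo/%'` into range bounds on the index; with a locale-aware collation, the LIKE prefix cannot be used for a range scan.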

@ignaloidas
it should be able to do that if you have a btree index on (host_id, path) in that order.

Whether it'll think that this plan has the lowest cost is another question.
@algernon
