Yep! Shader compilation is most likely the offender.
I need a unique shader for each instance, so "local to scene" must be enabled. That causes the same shader source to be compiled every time a new instance is created, and the lag adds up: the more scenes I instance at once, the more shaders are compiled and the larger the lag.
The solution in that video suggests creating all shaders at startup to avoid runtime compilation. In their case it's just a couple of shaders, but I need several hundred instances that gradually appear and disappear during execution. That would mean creating every instance I'd ever need at startup and then adding them to the main scene as needed.
Too bad a new shader with exactly the same source code is compiled for each instance just because the uniform configuration differs per instance.
I pushed a lot of mesh animation onto the vertex shader to avoid having thousands of animated spatial nodes. But as it turns out, thousands of nodes might actually be better performance-wise.
EDIT: Hm... what I said above may not be entirely true. I did some tests, and it looks like the shader is recompiled only if both the material and the shader are set local to scene. If the material is local to scene but the shader is not, the lag is significantly smaller.
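For anyone who would rather set this up from script than with the "local to scene" checkboxes, here is a rough sketch of the same idea: one shared Shader resource, one ShaderMaterial per instance. It assumes Godot 3.x, and the shader path, the `phase` uniform and the `MeshInstance` child name are all made up for illustration. I haven't measured whether it behaves exactly like the checkbox combination above, so treat it as something to test rather than a confirmed fix.

```gdscript
# Sketch for Godot 3.x. The shader path, the "phase" uniform and the
# "MeshInstance" child node are hypothetical placeholders.
extends Spatial

# Loaded once; every instance of this scene references the same Shader resource.
const SHARED_SHADER = preload("res://wobble.shader")

func _ready() -> void:
	# A fresh material per instance keeps the uniform values independent...
	var mat := ShaderMaterial.new()
	# ...while the shader itself stays shared between all instances.
	mat.shader = SHARED_SHADER
	# Per-instance uniform value (this method is set_shader_parameter() in Godot 4).
	mat.set_shader_param("phase", randf())
	$MeshInstance.material_override = mat
```

The point of the split is that uniform values live on the material, not on the shader, so only the material needs to be unique per instance.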