Skip to content

Conversation

milesziemer
Copy link
Contributor

Background

The model assembler can load Smithy files from a jar - model discovery. This works by looking for a 'META-INF/smithy/manifest' file within the jar, which contains the paths of smithy model files within the jar. You can give the model assembler a class loader to use for model discovery, but you can also add a jar to discover models from using a regular addImport. The problem is with the latter case. Because jars are like zip files, you can't read files within the jar using regular file system operations. Instead, you can either use the JarFile api to open the jar, and read specific entries (like a zip file), or you can directly read specific entries via a special URL with the format jar:file:/path/to/jar.jar!/path/in/jar. The latter is the way that our model discovery apis work. The problem was that URL doesn't perform any encoding or decoding itself, so callers/consumers have to take care of it, which we were not doing. For example, if the jar file has a path of /foo/bar%45baz.jar, we would create a URL like: jar:file:/foo/bar%45baz.jar!/META-INF/smithy/manifest, and when the JDK tried to use this URL to read from the file system, it would decode it and attempt to read /foo/[email protected]. Since we weren't encoding the URL to begin with, this was only a problem when the path contained %.

I tried to preserve the existing model discovery api, and just make it work with these paths. The URL class has a section in its javadoc about this footgun, and in more recent jdk versions, URL constructors are deprecated in favor of URI::toURL to avoid it. So that's what I did.

This technically changes the SourceLocation::filename for jars that are addImported, when the jar path has reserved characters (like spaces). Such jars were importable before because URL doesn't validate encoding, and the JDK decoding the URL would be a no-op since it wasn't encoded to begin with. So we could just encode %, meaning only the newly-allowed paths would have any encoding. Disregarding the fact that this side-steps the assumed pre-conditions of URL, it would actually exacerbate an existing inconsistency between SourceLocation::filename of models discovered on the classpath, and models discovered from imports. The JDK encodes the URLs of resources on the classpath, so their models' source locations will be encoded. Encoding URLs of imported jars means there's no difference. You can assume that a source location with jar:file: is a well-formed URL/URI.

Testing

Added new ModelAssembler tests for loading a jar via addImport when the jar's path has encoded characters.
I added one for adding the import via path, and via URL, in case the way we normalize import paths ever changes.

I also added a ModelDiscovery test to assert the new encoding behavior.

Links

Fixes #2576 , Fixes #2103


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Fixes smithy-lang#2576
and smithy-lang#2103, which are both
that the model assembler can't load jars if the jar's path has
percent-encoded characters. This may happen for artifacts resolved by
coursier (see linked issues).

The model assembler can load Smithy files from a jar - model discovery.
This works by looking for a 'META-INF/smithy/manifest'
file within the jar, which contains the paths of smithy model files
within the jar. You can give the model assembler a class loader to use
for model discovery, but you can also add a jar to discover models from
using a regular `addImport`. The problem is with the latter case.
Because jars are like zip files, you can't read files within the jar
using regular file system operations. Instead, you can either use the
JarFile api to open the jar, and read specific entries (like a zip
file), or you can directly read specific entries via a special URL with
the format `jar:file:/path/to/jar.jar!/path/in/jar`. The latter is the
way that our model discovery apis work. The problem was that URL doesn't
perform any encoding or decoding itself, so callers/consumers have to
take care of it, which we were not doing. For example, if the jar file
has a path of `/foo/bar%45baz.jar`, we would create a URL like:
`jar:file:/foo/bar%45baz.jar!/META-INF/smithy/manifest`, and when the
JDK tried to use this URL to read from the file system, it would decode
it and attempt to read `/foo/[email protected]`. Since we weren't encoding the
URL to begin with, this was only a problem when the path contained `%`.

I tried to preserve the existing model discovery api, and just make it
work with these paths. The URL class has a section in its javadoc about
this footgun, and in more recent jdk versions, URL constructors are
deprecated in favor of `URI::toURL` to avoid it. So that's what I did.

This technically changes the `SourceLocation::filename` for jars that
are `addImport`ed, when the jar path has reserved characters (like
spaces). Such jars were importable before because URL doesn't validate
encoding, and the JDK decoding the URL would be a no-op since it wasn't
encoded to begin with. So we could _just_ encode `%`, meaning only the
newly-allowed paths would have any encoding. Disregarding the fact that
this side-steps the assumed pre-conditions of URL, it would actually
exacerbate an existing inconsistency between `SourceLocation::filename`
of models discovered on the classpath, and models discovered from
imports. The JDK encodes the URLs of resources on the classpath, so
their models' source locations will be encoded. Encoding URLs of
imported jars means there's no difference. You can assume that a source
location with `jar:file:` is a well-formed URL/URI.
@milesziemer milesziemer requested a review from a team as a code owner August 28, 2025 00:56
@milesziemer milesziemer requested a review from yefrig August 28, 2025 00:56
I wasn't accounting for the fact that windows paths have a different
string representation than unix paths. This commit fixes that by
normalizing the path to a URI path (using '/') before building the
manifest URI.

While debugging this, I found another difference between the source
location filename of classpath and import discovered models. On windows,
the source location of import discovered models will look like this:
```
jar:file:C:\foo\bar.jar!/META-INF/smithy/foo.smithy
```
which looks pretty cursed to me, but it _still works_. As in, you can do
a `new URL(thatThing)`, and successfully read `foo.smithy`. I'm guessing
the reason this works is similar to why you can read files when the URL
isn't properly encoded, coincidental, or the JDK is doing some extra
work to normalize it. Especially considering that you can do the same
with the filename of classpath discovered models, which look like a
normal URL:
```
jar:file:/C:/foo/bar.jar!/META-INF/smithy/foo.smithy
```
PluginContext was relying on being able to compare the source
location filename with a stringified version of each configured source
path, in order to figure out which shapes/metadata are part of sources.
This was used by the sources plugin to generate the projected version of
the source model, for non-source projections. In my last commit, I
discovered that models in imported jars have a source location filename
like:
```
jar:file:\path\to\...
```
on windows. So they could be compared to source paths by stripping the
'jar:file:' part. Since my changes changed the filename to be a proper
url, this comparison no longer worked.

I updated PluginContext to compare filenames like 'jar:file:' to the URI
paths of sources, instead of the regular paths. These URI paths are
computed eagerly, so we don't have to do it on every comparison. Usually
there are a very low number of sources (they're usually a single
directory).

The fact that the specific format of the filename was being relied on is
mildly concerning, but considering the fact that models discovered on
the classpath already had a different format, I think this is ok. I
would also be surprised if there's a lot of code out there manually
importing jars, instead of providing a classloader.
Junit (or something, maybe not junit directly) seems to be failing to
delete a temp dir I was creating with the TempDir annotation. Maybe just
manually creating/deleting it will work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken Model loading if jar path contain url encoded characters createSmithyJarManifestUrl: incorrect handling of complex paths
1 participant