
Conversation

@decibel (Contributor) commented Dec 29, 2015

This is an attempt to make it easier to parse JSON.sh output from within bash.

The first thing I looked at was supporting direct assignment to associative arrays (declare -A, not declare -a), since the native output format is very close to what you need for that. That change essentially amounts to piping the original output through egrep -v '^\[]' | tr '\t' =.

Despite its usefulness, I'm not a fan of that approach because, as far as I can tell, the only way to actually interpret that output in bash is with an eval, which is dangerous. If someone finds a flaw in any of this, they could inject arbitrary code, which would then get eval'd. I'll leave it to the reader to figure out what something like `eval rm -rf /` would end up doing...
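Concretely, the pipeline looks something like this (a sketch; it assumes JSON.sh's usual `["path"]<TAB>value` output and uses a flat document so no intermediate-object lines sneak in):

# The associative-array idea; note the eval that the paragraph above warns about.
json='{"a":"b","c":"d"}'
declare -A parsed
eval "parsed=( $(echo "$json" | ./JSON.sh | egrep -v '^\[]' | tr '\t' =) )"
echo "${parsed[a]}"   # prints: b  (bash's quote removal strips the JSON quotes)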

The next thing I looked at was better support for read -r key value. Thanks to the tab delimiter, that was also pretty easy: I just stripped the [] surrounding the path. This seems pretty robust, so long as you change IFS to $'\t'. That works because literal tabs aren't valid inside JSON strings, and the script seems to detect that pretty reliably (though I didn't exhaustively test that).

The part I don't like about key-value mode is that everything stays wrapped in double quotes. That's probably not a big deal for keys, but I'm worried it might cause problems for values, especially values that contain escapes. Maybe there's a clean way to deal with that; see the sketch below.
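Here's roughly what consuming that mode looks like (a sketch; the bracket-stripping is emulated in bash since the new mode's flag name isn't settled, and -l is JSON.sh's existing leaf-only option):

echo '{"a":{"b":"json value"}}' | ./JSON.sh -l \
  | while IFS=$'\t' read -r key value; do
      key=${key#\[}; key=${key%\]}   # what the new mode would do for us
      printf 'key=<%s> value=<%s>\n' "$key" "$value"
    done
# key=<"a","b"> value=<"json value">  -- note the JSON quotes are still there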

The remaining three modes are simple: key-only and value-only produce one key or value per line, as you'd expect, and the default mode retains the same behavior as today.

I also added a script of examples. It's a bit verbose and ugly, but at least it gives a solid foundation for using the script. It does lean very heavily on the function interface, though, which is probably not a good thing to promote since JSON.sh makes heavy use of globals.

@decibel (Contributor) commented Dec 31, 2015

I just looked at using xargs to de-quote things when using read; it breaks on escaped quotes. I guess that would be fixable by running valid JSON escape sequences through sed, though that leaves the question of what the output should be. For example, "bad": "=\\\"" turns into bad = =\" when processed into an array, which isn't really correct either. And tab escapes (\t) certainly aren't handled correctly.

Just to be clear, all of these problems already existed; the new format option and the usage examples in example.sh just expose them.

The only reason I think any of this matters is because right now you get different results from reading into an array vs doing key-value assignment. I think it would be best if they were at least consistent (and hopefully without resorting to eval).

Thoughts?

@dominictarr (Owner):

The keys should be converted into reasonable bash variable names; weird values should probably be converted to something without spaces, hyphens, periods, etc. Normally this will be okay. It might also be a good idea to prefix all those variables with a string provided by the user, which would also keep names valid when a path starts with a digit.

you can use bash's prefix expansion on variable names, so you could iterate over all the items in an array by doing for i in ${!foo_*}; do ... done (it's something like that, at least)

what if we replaced all non-alphanumeric characters in paths with a _, and joined each item with a double underscore __? or maybe just stripped out non-alphanumeric characters? Something like the sketch below.
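A sketch of that mangling (the helper name and the comma-separated path representation are made up here), including the prefix iteration:

mangle() {
    local prefix=$1 path=$2         # path items comma-separated, e.g. user,first-name
    path=${path//[^a-zA-Z0-9,]/_}   # replace non-alphanumerics inside items
    path=${path//,/__}              # join items with a double underscore
    printf '%s%s' "$prefix" "$path"
}

declare "$(mangle JSON_ 'user,first-name')=Jim"   # JSON_user__first_name=Jim
for i in ${!JSON_*}; do                           # every variable with the prefix
    echo "$i = ${!i}"                             # indirect expansion reads the value
done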

values should be surrounded by single quotes, because then they won't be interpreted by bash. The user could still create a security hole by not using that variable safely, but at least that wouldn't be our fault.

@dominictarr (Owner):

oh yeah, if there are single quotes in the input, they would need to be replaced with '"'"': close the current single quote, open a double quote that surrounds a literal single quote, close the double quote, and then reopen the single quote. Escaping quotes in bash is weird!
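That replacement in code (a sketch; the helper name is made up):

squote() {
    local q="'\"'\"'"            # the five-character sequence '"'"'
    printf "'%s'" "${1//\'/$q}"  # wrap in single quotes, escaping embedded ones
}
squote "it's"                    # -> 'it'"'"'s'

bash's printf %q does an equivalent job natively, if bash-flavored quoting is acceptable.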

@decibel (Contributor) commented Jan 4, 2016

On 1/2/16 8:14 PM, Dominic Tarr wrote:

> The keys should be converted into reasonable bash variable names; weird values should probably be converted to something without spaces, hyphens, periods, etc. Normally this will be okay. It might also be a good idea to prefix all those variables with a string provided by the user, which would also keep names valid when a path starts with a digit.
>
> you can use bash's prefix expansion on variable names, so you could iterate over all the items in an array by doing for i in ${!foo_*}; do ... done (it's something like that, at least)

Oh, I hadn't thought about turning each path into a variable. Is there a way to do that without eval?

> what if we replaced all non-alphanumeric characters in paths with a _, and joined each item with a double underscore __? or maybe just stripped out non-alphanumeric characters?

That would be OK most of the time, but maybe not all the time.

If you're shoving the data into an associative array (see my example.sh) it's not necessary; bash dequotes most everything for you. There are a few exceptions though, like tabs.

> values should be surrounded by single quotes, because then they won't be interpreted by bash.

Right, but the problem with using read is that the value itself contains the quotes, and it's difficult to get rid of them:

decibel@decina:[16:12]$ test_var='"json value"'
decibel@decina:[16:12]$ echo $test_var
"json value"

What I think should happen is that the variable is NOT quoted. AFAICT, bash is smart enough to understand that a variable reference is just that: a variable, and not something that should be parsed.

> the user could still create a security hole by not using that variable safely, but at least that wouldn't be our fault.

I don't think they can, unless they do something like eval, or echo it inside a command substitution $(). But I suspect the command substitution case is game over anyway, even if the variable is quoted.
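To demonstrate the assignment side of that (a sketch): a variable's contents are expanded exactly once and never re-parsed, so nothing in the value executes unless someone explicitly evals it.

payload='$(rm -rf /)'   # hostile-looking string, never executed
declare -A arr
arr[key]="$payload"     # plain assignment; no command substitution happens
echo "${arr[key]}"      # prints the literal text: $(rm -rf /)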

Anyway, now that I've thought about it a bit, I think the best option is to provide an unquote function to the user and let them run the risk of using it. That way you could do:

JSON.sh | while IFS=$'\t' read -r key value; do
    key=$(unquote "$key")
    value=$(unquote "$value")
    associative_array[$key]="$value"
done

I think that should be safe, because bash understands "$key" is a variable:

echo "$test_var"
"json value"

You could also use keys[] and values[] arrays if you don't want to mess with associative arrays...

By having this as a separate function, we leave it up to the user what they want to do. In the future we could also have the function optionally unquote all the other oddball escapes JSON supports. (I actually looked up code to handle \u, and it's not that horrible.)

For right now though, my inclination is just to produce a simple unquote() and call it good.
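Something like this, maybe (a sketch; it only strips the surrounding quotes and decodes \" and \\, punting on \t, \n, and \uXXXX for now):

unquote() {
    local s=$1
    s=${s#\"}; s=${s%\"}   # drop the surrounding JSON quotes
    s=${s//\\\"/\"}        # \" -> "  (must run before the \\ pass)
    s=${s//\\\\/\\}        # \\ -> \
    printf '%s' "$s"
}
unquote '"=\\\""'          # -> =\"  (the "bad" example from my earlier comment)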

Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

@dominictarr (Owner):

I suspect most people would run this by doing $(command) or `command`, which is probably the same thing as eval. That's what you get with a stringly-typed language like bash, though.

If I've learnt anything from doing bash, it's this: don't try to be too clever. This all gets much simpler if the keys are alphanumeric, and that is usually the case, unless some total asshole created the JSON you need to parse.

I think the most important questions are: what do you need this for, and who controls the source of the data?

@decibel (Contributor) commented Mar 1, 2016

(getting back to this...)

In my particular case the keys are generally pretty clean and the values would be provided by the user, so I could probably manage with either approach.
