I'm involved in writing a (Java/Groovy) browser-automation app with Selenium 2 and the Firefox driver.
Currently there is an issue with some URLs we find in the wild that apparently use invalid URI syntax (specifically curly braces ({}), |'s and ^'s).
String url = driver.getCurrentUrl(); // http://example.com/foo?key=val|with^bad{char}acters
When trying to construct a java.net.URI from the string returned by driver.getCurrentUrl(), a URISyntaxException is thrown.
new URI(url); // java.net.URISyntaxException: Illegal character in query at index ...
Encoding the whole URL before constructing the URI will not work (as I understand it).
Everything gets encoded, including the delimiters, so none of the URL's pieces are preserved in a form that I can parse in any normal fashion. For example, with this URI-safe string, URI can't know the difference between an & used as the query-string parameter delimiter and a %26 (its encoded value) inside the content of a single query-string parameter.
String encoded = URLEncoder.encode(url, "UTF-8") // http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval%7Cwith%5Ebad%7Bchar%7Dacters
URI uri = new URI(encoded)
URLEncodedUtils.parse(uri, "UTF-8") // []
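To make the failure concrete, here is a minimal plain-Java sketch of what happens to the URI's components when the whole string is encoded first. The class name WholeUrlEncodingDemo and the helper encodeWhole are made up for this illustration; only java.net.URI and java.net.URLEncoder are real APIs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class: shows why encoding the entire URL loses its structure.
class WholeUrlEncodingDemo {

    // Encodes the complete URL string, then parses the result as a URI.
    public static URI encodeWhole(String url) {
        try {
            return new URI(URLEncoder.encode(url, "UTF-8"));
        } catch (UnsupportedEncodingException | URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = encodeWhole("http://example.com/foo?key=val|with^bad{char}acters");
        // The ':', '/' and '?' delimiters were percent-encoded too, so the parser
        // sees a single relative path and no scheme, host or query.
        System.out.println("scheme  = " + uri.getScheme());
        System.out.println("host    = " + uri.getHost());
        System.out.println("query   = " + uri.getRawQuery());
        System.out.println("rawPath = " + uri.getRawPath());
    }
}
```

Because there is no query component left, anything that parses query parameters from this URI has nothing to work with, which is consistent with the empty list above.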
Currently the solution is, before constructing the URI, running the following (groovy) code:
["|", "^", "{", "}"].each {
    url = url.replace(it, URLEncoder.encode(it, "UTF-8"))
}
But this seems dirty and wrong.
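For reference, the same workaround in plain Java might look roughly like the sketch below. The class LenientUri, its methods, and the OFFENDING character list are hypothetical names for this illustration, and the list only covers the four characters mentioned above, so it is an assumption rather than a complete set of characters that java.net.URI rejects.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: percent-encodes only the known offending characters,
// leaving delimiters such as ':', '/', '?', '&' and '=' untouched.
class LenientUri {

    // Assumption: only these four characters need escaping for the URLs in question.
    private static final String OFFENDING = "|^{}";

    public static String escapeOffending(String url) {
        StringBuilder sb = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (OFFENDING.indexOf(c) >= 0) {
                // All four characters are ASCII, so one %XX escape each is enough.
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static URI toUri(String url) {
        try {
            return new URI(escapeOffending(url));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = toUri("http://example.com/foo?key=val|with^bad{char}acters");
        System.out.println(uri.getHost());
        System.out.println(uri.getRawQuery());
        System.out.println(uri.getQuery());
    }
}
```

Unlike encoding the whole string, this keeps the scheme, host, path and query separable, and getQuery() decodes the escapes again. It still depends on a hand-maintained character list, which is the part that feels fragile.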
I guess my question is multi-part:
- Why does FirefoxDriver return a String rather than a URI?
- Why is this String malformed?
- What is best practice for dealing with this kind of thing?