It's terrific and unique that Azure Storage supports server-to-server transfers of data from arbitrary URLs on the internet to a blob; however, it cannot access some URLs because it erroneously unescapes characters that must stay escaped in the URL. As a small example, it turns "https://example.com/foo%2fbar.txt" into "https://example.com/foo/bar.txt". I'm looking for any help or guidance on getting Azure Storage to PUT from these URLs without unescaping them.
"%2F" may not be equivalent to "/" in URLs:
https://stackoverflow.com/questions/1957115/is-a-slash-equivalent-to-an-encoded-slash-2f-in-the-path-portion-of-a
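To make the distinction concrete, here is a minimal Go sketch (Go because the demo server is written in it) that parses the example URL and prints both the decoded path and the escaped path. The helper name pathForms is made up for this illustration; url.Parse and URL.EscapedPath are from Go's standard net/url package. A client that builds its outgoing request from the decoded form would lose the "%2f", which looks like what is happening here, though I can't see Azure Storage's internals.

```go
package main

import (
	"fmt"
	"net/url"
)

// pathForms is a hypothetical helper for this illustration. It returns the
// decoded path and the escaped path of a URL string.
func pathForms(raw string) (decoded, escaped string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	// u.Path holds the decoded form, where "%2f" has become "/".
	// u.EscapedPath() keeps the original escaping when it is a valid encoding.
	return u.Path, u.EscapedPath(), nil
}

func main() {
	decoded, escaped, err := pathForms("https://example.com/foo%2fbar.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println("decoded:", decoded) // decoded: /foo/bar.txt
	fmt.Println("escaped:", escaped) // escaped: /foo%2fbar.txt
}
```

The two forms name different resources on servers that treat "%2F" as data rather than as a path separator, which is the case for the Google endpoints described below.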
My story is that I'm looking to build a tool or toolkit to copy Google Takeout data to Azure Storage blobs quickly for archival storage if some human-created or robot-created disaster happens to the Google account. The Google Takeout URLs look like this (this URL is genuine but has long expired):
https://00f74ba44b071b761059aef3fd79738daea1be7829-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/o/20211113T212502Z%2F-4311693717716012545%2F498d83a5-1ab3-4a79-815f-e5cfda855e7a%2F1%2F869777c3-49ff-4d4e-a932-230a6b0b2a78?jk=AFshE3XT7l4gO3olRD23ASyAuaK-Lbi1Z4oc4eMBje8eLdA1mHPk-VeNNMCDno2sDlRKTKD2Nqau1HdkE9nX5f462yylgcSu5kmIknW0lU-1Xx3Mb8OnO5L-DMq3W8xslAI6vlKnqrKaTztfOKSQOfn-5XWf4OuiuDCTdstSSCcsNDMu8b4NX6cnuRhGRdVonqtH3lf9TV7fIBJMchxy3l-i3W_tiGHO7NP9B2Rnvo2uJP7-pgbfxH_ki0DLerQhKK4hRx6KeHWfXL2XT80lLVYwfS2dk5XVAplFIIV7Lp9H7x3HERQzR7_1JshhluQyoG6Vqv7gRYyav8S7PrwkKXStCho5fc85ErZ0dQqJXmvNqCtdWCB8-KzIA5-UgjlLcDzk_mVYMUfcr-_i-R-5tA_Rnb0MmavB94aIj9EfEh0g0B6yCRnAHAIuob6EYFTeCVTs7XXBlqlMKF-P0A5L2d47f0pSQrosQUNshoZKKieSl71vD3kiFDZ4OIg5K-yPlkniodFuyRr-hf5LeBIZhMFNozA2nfGOU3cW3i_sJZgNJNf68UK_l1beTDJ5ZKEZ5ot0jgaQ7w_KlLEonaGJM4Lw7oVby-GbqmlFYe2SI9wwxcXURdW88AW4zipqCMOz_N7cBYC0zm1t4TRSW2-_uvsQWLQRA_9g8avGn8RIKr8i-ISa7sfMaUQEkY4eOtsV7l3JHNeKjmJtxSOJPwg487Cv0htwGt_3Kd6IbyFOb1l0l9wKtkIxkQqliTvAK7VXZUGr1Cdsbbhq1qy3AF1aMVPA1vghV2TOOr5rOzVkRUmTLQzU5WfsYOoNcKjJ7mPvuOirFkKvSHzBQDvZ8_B2RgwT7zMZ7LsjAhG1zS3eDTijUMi9QEM_FYkugRpZ36eg9SZWrEbHCp36y0kL7QK8gZHVP6ePvOqujXG1BCryrxp5UQ9AhZS3szhe54MDf1877LTEmCH5_utBvQqF31dlinmEWiL4YTwiSEwwUToJ38H7gmI-CWErYJsJylmuOSfUoJFpELSRi4Qw4fF-figbaB3w_BNhXvEBdUsMeSNkBkU5u4nwAfG8IJ6TxkyZZKgK4uIhG1R7mr7QaRJ_bizIRVUl&isca=1
Azure Storage unescapes each "%2F" in the path (the part before the "?") to a "/", and the request then fails with a 404 error. Google's endpoint requires the "%2F" to stay escaped there.
My goal is to get Azure Storage to transfer data directly from Google Takeout's signed URLs massively in parallel. The benchmark I'm trying to achieve is moving 1TB from Google Takeout to Azure Storage in less than a minute. Unfortunately, this unwanted unescaping in Azure Storage's HTTP client puts a halt to that. While Google does seem to back Google Takeout with buckets, I have no access to the data in those buckets other than through the presumably signed URLs that Takeout's web interface gives me, so I cannot use Azure's current implementation of downloading from Google Cloud Storage buckets.
For testing, Google Takeout's UI prompts for authentication repeatedly and takes a long time to generate downloads. Alternatively, you can generate similar URLs by clicking the link for an "Authenticated URL" of an object in Google Cloud Storage's UI, but those are still temporary. I'm sure other Google services can generate similar URLs as well. As it is a pain to keep generating Google URLs to demonstrate the issue, I've set up a small demo server here. The server has also been helpful for diagnosing which requests Azure Storage is making:
https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app
The URLs of interest are:
https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red/blue.txt will show a 404 as that isn't the URL desired.
https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt will show a 200 as that is the URL desired.
https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/normal.txt will show a 200 as it's just a run-of-the-mill URL.
You can find the source for that Go server here:
https://github.com/nelsonjchen/put-block-from-url-esc-issue-demo-server/blob/master/main.go
I have my test server hosted on Google Cloud Run as it is cost-efficient for hosting scale-to-zero applications, but I see no reason you cannot run your own test server elsewhere. You could run it locally and expose a port to the public internet (and so to Azure Storage) with something like ngrok.com, or use any other platform that can run a containerized application or a Go application and expose it publicly.
The server also logs every request it receives to STDOUT.
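For readers who don't want to open the repository, the behavior described above can be sketched roughly as follows. This is a simplified illustration written for this post, not the linked main.go; the names statusFor and handler are made up here. It decides the status from the escaped path, so "/red%2Fblue.txt" and "/red/blue.txt" are treated as different resources, and it logs each request to STDOUT. To keep the sketch runnable without a network, main replays the three URLs of interest in-process with net/http/httptest instead of listening on a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusFor is a hypothetical helper: it picks the response status from the
// escaped request path, so an encoded slash is not the same as a literal one.
func statusFor(escapedPath string) int {
	switch escapedPath {
	case "/red%2Fblue.txt", "/normal.txt":
		return http.StatusOK
	default:
		return http.StatusNotFound
	}
}

// handler logs the escaped path it received to STDOUT and replies with the
// status chosen by statusFor.
func handler(w http.ResponseWriter, r *http.Request) {
	escaped := r.URL.EscapedPath()
	fmt.Printf("%s %s\n", r.Method, escaped)
	w.WriteHeader(statusFor(escaped))
}

func main() {
	// Replay the three demo paths against the handler without a network.
	for _, target := range []string{"/red/blue.txt", "/red%2Fblue.txt", "/normal.txt"} {
		rec := httptest.NewRecorder()
		handler(rec, httptest.NewRequest(http.MethodGet, target, nil))
		fmt.Printf("  -> %d\n", rec.Code)
	}
}
```

To serve real traffic, the same handler could be passed to http.ListenAndServe via http.HandlerFunc(handler); the port and hosting details are up to you.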
Here is the source for a small C# client counterpart that you can edit, compile, and run to demonstrate the issue:
https://github.com/nelsonjchen/PutBlockFromUrlEscIssueDemoClient
As mentioned, Azure Storage cannot store "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt" to a blob in a container, but any other URL served from the server mentioned above will work. If I have the demo client ask Azure Storage to fetch "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt", the server sees a request for "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red/blue.txt". I'm able to request "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/normal.txt" just fine.
Double-escaping the slash, as in https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%252Fblue.txt, does not work either.
All said, any help would be appreciated. Up to 180TB, Azure Storage is the most cost-efficient way to archive Google Takeout, and with this issue solved, possibly the fastest!