I can't get Azure Storage to support putting data from URLs that need "%2F" verbatim in them into blobs.

Nelson Chen 6 Reputation points
2021-11-26T04:17:27.293+00:00

It's terrific and unique that Azure Storage supports server-to-server transfers of data from arbitrary URLs on the internet to a blob. However, it cannot access some URLs because it erroneously unescapes characters that must stay escaped in the URL. As a small example, it unescapes "https://example.com/foo%2fbar.txt" to "https://example.com/foo/bar.txt". I'm looking for any help or guidance on getting Azure Storage to PUT from these URLs without unescaping them.

"%2F" may not be equivalent to "/" in URLs:

https://stackoverflow.com/questions/1957115/is-a-slash-equivalent-to-an-encoded-slash-2f-in-the-path-portion-of-a
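To make the distinction concrete, here is a minimal Go sketch (Go only because the demo server further down is written in it). It uses the standard net/url package to show that a URL has both a decoded path and an escaped, on-the-wire path, and that the two example URLs agree on the former but not the latter. The helper name pathForms is made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// pathForms (hypothetical helper) parses rawURL and returns its decoded path
// and its escaped path, i.e. the form that is sent in the request line.
func pathForms(rawURL string) (decoded, escaped string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	return u.Path, u.EscapedPath(), nil
}

func main() {
	for _, raw := range []string{
		"https://example.com/foo%2Fbar.txt",
		"https://example.com/foo/bar.txt",
	} {
		decoded, escaped, err := pathForms(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		// Both URLs decode to /foo/bar.txt, but only the first keeps %2F
		// in its escaped form.
		fmt.Printf("%s\n  decoded: %s\n  escaped: %s\n", raw, decoded, escaped)
	}
}
```

A client that rebuilds the request from the decoded path loses the difference between the two URLs; one that sends the escaped path keeps it.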

My story is that I'm looking to build a tool or toolkit to copy Google Takeout data to Azure Storage blobs quickly for archival storage if some human-created or robot-created disaster happens to the Google account. The Google Takeout URLs look like this (this URL is genuine but has long expired):

https://00f74ba44b071b761059aef3fd79738daea1be7829-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/o/20211113T212502Z%2F-4311693717716012545%2F498d83a5-1ab3-4a79-815f-e5cfda855e7a%2F1%2F869777c3-49ff-4d4e-a932-230a6b0b2a78?jk=AFshE3XT7l4gO3olRD23ASyAuaK-Lbi1Z4oc4eMBje8eLdA1mHPk-VeNNMCDno2sDlRKTKD2Nqau1HdkE9nX5f462yylgcSu5kmIknW0lU-1Xx3Mb8OnO5L-DMq3W8xslAI6vlKnqrKaTztfOKSQOfn-5XWf4OuiuDCTdstSSCcsNDMu8b4NX6cnuRhGRdVonqtH3lf9TV7fIBJMchxy3l-i3W_tiGHO7NP9B2Rnvo2uJP7-pgbfxH_ki0DLerQhKK4hRx6KeHWfXL2XT80lLVYwfS2dk5XVAplFIIV7Lp9H7x3HERQzR7_1JshhluQyoG6Vqv7gRYyav8S7PrwkKXStCho5fc85ErZ0dQqJXmvNqCtdWCB8-KzIA5-UgjlLcDzk_mVYMUfcr-_i-R-5tA_Rnb0MmavB94aIj9EfEh0g0B6yCRnAHAIuob6EYFTeCVTs7XXBlqlMKF-P0A5L2d47f0pSQrosQUNshoZKKieSl71vD3kiFDZ4OIg5K-yPlkniodFuyRr-hf5LeBIZhMFNozA2nfGOU3cW3i_sJZgNJNf68UK_l1beTDJ5ZKEZ5ot0jgaQ7w_KlLEonaGJM4Lw7oVby-GbqmlFYe2SI9wwxcXURdW88AW4zipqCMOz_N7cBYC0zm1t4TRSW2-_uvsQWLQRA_9g8avGn8RIKr8i-ISa7sfMaUQEkY4eOtsV7l3JHNeKjmJtxSOJPwg487Cv0htwGt_3Kd6IbyFOb1l0l9wKtkIxkQqliTvAK7VXZUGr1Cdsbbhq1qy3AF1aMVPA1vghV2TOOr5rOzVkRUmTLQzU5WfsYOoNcKjJ7mPvuOirFkKvSHzBQDvZ8_B2RgwT7zMZ7LsjAhG1zS3eDTijUMi9QEM_FYkugRpZ36eg9SZWrEbHCp36y0kL7QK8gZHVP6ePvOqujXG1BCryrxp5UQ9AhZS3szhe54MDf1877LTEmCH5_utBvQqF31dlinmEWiL4YTwiSEwwUToJ38H7gmI-CWErYJsJylmuOSfUoJFpELSRi4Qw4fF-figbaB3w_BNhXvEBdUsMeSNkBkU5u4nwAfG8IJ6TxkyZZKgK4uIhG1R7mr7QaRJ_bizIRVUl&isca=1

Azure Storage unescapes each "%2F" in the path (the part before the "?") to a "/", resulting in a 404 error. Google's endpoint requires the "%2F"s to stay escaped there.

My goal is to get Azure Storage to transfer data directly from Google Takeout's signed URLs massively in parallel. The benchmark I'm trying to achieve is moving 1TB from Google Takeout to Azure Storage in less than a minute. Unfortunately, Azure Storage's HTTP client limitations with this unwanted un-escaping put a halt to this. While it does seem Google uses buckets to back their Google Takeout offering, I do not have access to the data in Google's Takeout buckets other than the presumably signed URLs they give me through Takeout's web interface and cannot use Azure's current implementation of downloading from Google Cloud Storage buckets.

For testing, Google Takeout's UI asks for authentication too often and takes a long time to generate downloads. Alternatively, you can generate similar URLs by clicking the "Authenticated URL" link for an object in Google Cloud Storage's UI, but those are still temporary. I'm sure other Google services can generate similar URLs as well. As it is a pain to keep generating Google URLs to demonstrate the issue, I've set up a small demo server here. The server has also been helpful for diagnosing which requests Azure Storage is making:

https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app

The URLs of interest are:

https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red/blue.txt will show a 404 as that isn't the URL desired.

https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt will show a 200 as that is the URL desired.

https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/normal.txt will show a 200 as it's just a run of the mill URL.

You can find the source for that Go server here:

https://github.com/nelsonjchen/put-block-from-url-esc-issue-demo-server/blob/master/main.go

I have my test server hosted on Google Cloud Run, as it is cost-efficient for hosting scale-to-zero applications. I see no reason you could not instead run the test server locally and expose it to Azure Storage over the public internet with something like ngrok.com, or run it on any other platform that can host a containerized application or a Go application and expose it publicly.

The server also logs every request it receives to STDOUT.

Here is the source for a small C# client counterpart that you can edit, compile, and run to demonstrate the issue:

https://github.com/nelsonjchen/PutBlockFromUrlEscIssueDemoClient

As mentioned, you cannot store "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt" to a blob in a container, but any other URL served from the server mentioned above will work. If I request "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt" from the demo client, the server sees "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red/blue.txt". I'm able to request "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/normal.txt" just fine.

https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%252Fblue.txt does not work either.
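For reference, this is what the double-encoded form turns into at each level of percent-decoding, sketched with Go's url.PathUnescape. It only illustrates the encoding; it makes no claim about how many times Azure Storage decodes the URL, which the test above does not reveal. The helper name decodeOnce is made up.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeOnce (hypothetical helper) percent-decodes a path segment one time.
func decodeOnce(segment string) (string, error) {
	return url.PathUnescape(segment)
}

func main() {
	once, err := decodeOnce("red%252Fblue.txt")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	twice, err := decodeOnce(once)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(once)  // red%2Fblue.txt
	fmt.Println(twice) // red/blue.txt
}
```

Double-encoding can only work if the fetching side decodes exactly once and then sends the result as-is. Zero decodes delivers "%252F" and two decodes deliver "/", neither of which is the "%2F" the server expects.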

All said, any help would be appreciated. Up to 180TB, Azure Storage is the most cost-efficient way to archive Google Takeout, and with this issue solved, possibly the fastest!

Azure Storage Accounts
Globally unique resources that provide access to data management services and serve as the parent namespace for the services.
2,722 questions
C#
An object-oriented and type-safe programming language that has its roots in the C family of languages and includes support for component-oriented programming.
10,309 questions

2 answers

  1. deherman-MSFT 33,701 Reputation points Microsoft Employee
    2021-11-30T22:04:06.363+00:00

    @Nelson Chen
    Can you try adding an "@" in front of the URL string in your C# code to make it a verbatim string literal?
    https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/#declaring-and-initializing-strings

    If that doesn't resolve your issue let me know and I can work with our service team to take a deeper look at this.

    -------------------------------

    Please don’t forget to "Accept the answer" and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.


  2. Jodie Cunningham 0 Reputation points
    2023-02-21T23:25:33.54+00:00

    I got really invested in this and reproduced the issue with the Azure CLI's az rest command. It seems like the error is in the Azure REST API itself. From the az rest command you just receive the normal 404 exception. (Not Found(<?xml version="1.0" encoding="utf-8"?><Error><Code>CannotVerifyCopySource</Code><Message>Not Found)

    This is what I ran on Linux with the Azure CLI installed. I even used the latest API version:

    az login

    source_url="https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/red%2Fblue.txt"
    blob_url="https://storageaccountnamegoeshere.blob.core.windows.net/test/test-file.dat"
    access_token=$(az account get-access-token --resource https://storage.azure.com/ --query accessToken -o tsv)
    now=$(env LANG=en_US TZ=GMT date '+%a, %d %b %Y %T %Z')
    # Put Blob From URL: ask Azure Storage to fetch $source_url server-side.
    az rest --method put --uri "$blob_url" --headers Authorization="Bearer $access_token" x-ms-date="$now" x-ms-version=2021-12-02 \
        Content-Type="text/plain; charset=UTF-8" x-ms-blob-type=BlockBlob x-ms-copy-source="$source_url" Content-Length="0"
    # Read the blob back to check whether the copy succeeded.
    az rest --method get --uri "$blob_url" --headers Authorization="Bearer $access_token" x-ms-date="$now" x-ms-version=2021-12-02

    HTH.
