The First Problem
I was approached by a client looking to move about 25,000 files directly into Azure Blob Storage so they could use Azure's CDN offering. Normally that file count wouldn't be a big deal; however, the total amount of data was around 1 TB, most commercial products would choke just listing the files, and the files were stored on a Linux server. Most of the tools, such as CloudBerry, only run on Windows, and setting up a share would have degraded performance even further since the client didn't have the bandwidth to pull that off well.
The Second Problem
I was approached by a second client shortly thereafter that had a large magnetic tape storage setup and a massive amount of data that they no longer wanted to host in their own colo facility. They also wanted to utilize Azure's CDN offering.
Where it gets interesting for Client 2
I received notification from my superiors that they'd be shipping hardware to us for client #2, that management had secured me a co-located rack, a layer 3 switch and a 1 Gbps link to Azure (where's a 10 Gbps line when you really need one?), and that I should meet the contractor who'd be racking and stacking the equipment for us at the facility.
I got to the facility and supervised the unloading of the equipment. It arrived in a heavily shrink-wrapped 20U rolling rack. I cut off the shrink wrap and found a 6U robotic tape enclosure, a USB hub and a 1U server. No credentials, no documentation, nothing. After racking and stacking the equipment and connecting the tape drive to the server via InfiniBand (thankfully, they were thoughtful enough to provide cabling), I hooked up a crash cart and powered up the server.
...And I let out a deep sigh as I saw the Windows Server 2008 boot screen. This would be a lot more difficult than expected. I gave the client a call and was walked through re-connecting the tape drive, loading the tapes into the library and re-indexing everything. When I asked the client about the solution, they informed me it was somewhat proprietary: while the tape drive itself would work under Linux, the extensions required to actually get the media on and off would not. At that point I called it a day to think on a solution.
A solution (well sort of)
Flashing back a bit to the first client: having seen the abysmal performance of most GUI-based Azure CDN tools, I decided to do some research. I turned around one of my monitors, ordered Thai food, and got to work with a colleague: I'd write the code, he'd test and call out any major mistakes. I settled on the Azure Python SDK on Linux, as it would give good performance and let me work with the existing customer environment without making any changes to it.
We fired up a couple of test machines and got to work, quickly realizing this might be more difficult than we'd initially thought.
The first thing we did was peruse the Azure Python SDK documentation. We later found out some of it was incomplete, requiring some re-work.
Development went in steps: get the files into Blob Storage, get the extension working, test an upload. This part worked reasonably well; however, we ran into other issues. Azure does not set MIME types on its own!
Now I had a programmer's conundrum: how does one set the MIME type? This wasn't particularly well documented. In addition, we were getting doubled directory names on the blobs... strange! The solution for the double naming wasn't too straightforward, but by utilizing something similar to the following we were able to fix it, along with the MIME type problem.
Code:
import os
import magic  # python-magic; detects the MIME type from file contents

mime = magic.Magic(mime=True)

for subdir, dirs, files in os.walk(dirupload):
    for file in files:
        # Strip the top-level path so the blob name doesn't repeat the
        # directory -- this was the source of the double-naming issue.
        trimdir = subdir.replace(container + "/<top level path>/", "")
        uploadme = os.path.join(trimdir, file)          # blob name
        filetosetmimetype = os.path.join(subdir, file)  # full local path
        mimetypeoffile = mime.from_file(filetosetmimetype)
        blob_service.put_block_blob_from_path(
            container,
            uploadme,
            filetosetmimetype  # read from the full local path
        )
So now we at least had the file being uploaded and the MIME type in memory; we then had to set the blob's properties to get the MIME type applied. At the same scope and indent level:
        blob_service.set_blob_properties(
            container,
            uploadme,
            # 'null' isn't valid Python; pass the content type as a
            # keyword argument instead
            x_ms_blob_content_type=mimetypeoffile
        )
Lo and behold this all worked!
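As an aside, if python-magic isn't available, Python's standard library can guess a MIME type from the file extension alone. This is a sketch of a fallback, not the client code (which used content-based detection via python-magic):

```python
import mimetypes

# Guess a MIME type from the filename extension; returns (type, encoding).
# Unlike python-magic, this never inspects the file's contents.
content_type, _ = mimetypes.guess_type("assets/logo.png")
print(content_type)  # → image/png

# Unknown extensions come back as None; a sensible default for blob
# storage is application/octet-stream.
fallback = mimetypes.guess_type("archive.tape")[0] or "application/octet-stream"
print(fallback)  # → application/octet-stream
```

Extension-based guessing is less reliable than content sniffing, but it has no external dependencies, which matters when you can't install packages on a customer's server.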
Note: Since this code was written, Microsoft has GREATLY improved their documentation regarding the Azure Python SDK.
Note #2: The Python SDK has since been updated, and the MIME type can now be set directly in the put_block_blob_from_path call.
The Solution for the Magnetic Tape Issue
Now that I had working code from client #1, I could do something for client #2. Given the horrifically low network I/O of trying to share out magnetic tape storage, and the proprietary drivers on the tape drive (it can't be mapped as a disk!), I had to resort to more interesting methods.
Here's what I determined I could do: I had exactly enough room to stage two tapes at a time on the local HDD array of the '08 machine, meaning I'd have to manually copy two tapes at a time. Over RDP, I could spin up a machine in Azure on the remote end of the 1 Gbps connection and get around 750-850 Mbit between the '08 machine and that VM. Finally, I'd modify the Python script to write out the name of each file uploaded successfully and delete the source file.
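The log-and-delete modification can be sketched roughly like this (upload_and_purge and upload_fn are illustrative names; the actual SDK upload call is not shown):

```python
import os

def upload_and_purge(local_path, log_path, upload_fn):
    """Upload one file, record its name in a log, then delete the source.

    upload_fn stands in for the real SDK upload call; any exception it
    raises leaves the source file in place so the copy can be retried.
    """
    upload_fn(local_path)
    # Append-only log, so a crash mid-run never loses earlier entries.
    with open(log_path, "a") as log:
        log.write(local_path + "\n")
    os.remove(local_path)  # reclaim HDD space for the next tape
```

Deleting only after both the upload and the log write succeed means the worst-case failure mode is re-uploading a file, never losing track of one.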
The flow worked out a bit like this:
1. Copy tapes to the local HDD (2-4 at a time, depending on the tape).
2. Connect my Azure instance to the '08 machine via Samba.
3. Delete each source file off the local HDD as it's successfully uploaded, then compare the log report to the manifest on each tape.
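The final comparison step is just a set difference between the tape manifest and the upload log. A minimal sketch, assuming both are lists of one relative path per line (the actual manifest format isn't shown here):

```python
def missing_from_upload(manifest_lines, log_lines):
    """Return manifest entries that never appeared in the upload log."""
    uploaded = {line.strip() for line in log_lines if line.strip()}
    return [line.strip() for line in manifest_lines
            if line.strip() and line.strip() not in uploaded]

# Example: two files on the tape, one confirmed uploaded.
print(missing_from_upload(["a/b.mov", "a/c.mov"], ["a/b.mov"]))  # → ['a/c.mov']
```

An empty result means the tape is fully accounted for and its staging copy can be safely discarded.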
It was a somewhat manual process, requiring logging in about twice a week to start fresh uploads. However, it was far superior to the alternatives offered at the time.
Results:
The most important part of any IT project is obviously the result. Client #1 had a full upload in about two days with their data completely intact, the only hiccup being the MIME type issue. Client #2 took about eight months to fully upload their massive amount of data; a couple of tape and connectivity issues slowed the process intermittently, but all of their assets also made it into Blob Storage.