Week 10

May 28, 2017

Hi Team,

This week I’ve been busy evaluating options for working “in the cloud” along with continuing along the road of shot building.



Lots of New Features

The infrastructure required for the automatic building of shots involves a series of smaller but related features that are potentially useful outside the context of building shots.


  • Feature: Support for adding a comment to publishes
  • Feature: Support for viewing the published comment in Loader
  • Feature: Remember selection and window settings when opening the Loader multiple times
  • Feature: Publish History and Animation Curves
  • Feature: View and copy path to original source file from Loader
  • Feature: Group subsets together
  • Safety: Scene Saved validator now mandatory
  • Safety: Updating a rig now updates any added/removed meshes from rig
  • Internal: Loader API 2.0
  • Performance: Asset Database query performance improvements


Sync Delayed Further

Last week, I implemented the initial requirement for synchronisation of files between Mindbender and artists working remotely. Shortly thereafter I made a discovery and realisation.

  • Assuming everyone works in the cloud
  • Assuming all cloud machines share a cloud drive
  • There is no need for transfer or synchronisation of any kind

Each of the cloud providers also offers an option to host data in the cloud. When both machines and data are in the cloud, no transfer is made. It’s the equivalent of everyone working locally, sharing a local drive.

Except these drives are large and fast. Large as in infinitely scalable (at a price reflecting how much you need and use), and fast as in 400 MB/sec write and 1000 MB/sec read over a shared 10 Gbit network connection in a private network.


Security and Data Protection

Last week I spoke briefly about the issue of cloud storage in conjunction with our line of work. This week, the limitations of this approach, and the price we are likely to pay for it, have become clearer.

In short, the money in a job is proportional to the requirements of that job.

For Mindbender to increase its turnover, it will need to take on larger projects, and clients offering larger projects are more inclined to care about data protection.

What that means is that in order to increase turnover, Mindbender can either:

  1. Increase the number of smaller jobs
  2. Increase the size of jobs

Which effectively means that in order to 10x turnover, Mindbender would need to either 10x the number of jobs, or 10x the cost of accomplishing those jobs.

The question to ask ourselves then is, do we want quantity or quality?

Interviews

In order to better drive the point home, I ventured out to capture some data on the relevance of data protection. In the interest of anonymity I will refrain from naming names.

After a chat with a senior figure at an established visual effects facility in London, a studio of the size, and with the kind of work, that I expect Mindbender to reach within the next few years (50-100 employees, TV commercials and minor film work), a few things became clear.

  1. They tackle high-profile work
  2. Their clients care about confidentiality
  3. They care about confidentiality

For example, if even a single still frame of Groot were to leak before the movie was released, a client could face $40 million of financial losses due to damaged hype.

An extreme example from the perspective of Mindbender, but assuming film is where we are headed, perhaps not so extreme after all.

Due to increasing demands from one of their clients, security gates are installed between rooms within their facility, with mandatory pictures taken of people’s faces and a log kept of who went where and when. On top of that, you cannot enter the facility without a “scan”, and the entrance is deliberately arranged so that you cannot see the inside of the facility before passing this initial gate.

Another visual effects facility, in Stockholm, Sweden, with 10-15 employees and high-profile clients, requires that artists working on a given project gain no knowledge from artists working on other projects for the same client. The client was concerned enough about leaks that even co-workers were unable to discuss their projects with each other. Black-out curtains were drawn between the islands of desks within which a given project was being conceived.

Needless to say, their data was on-premises, and the mere idea of hosting anything “in the cloud” was completely out of the question.

The Result

If we do venture into the cloud and enable artists from any location to take part in our production, we accept one major limitation.

We cannot guarantee the environment within which work is performed.

This means we cannot guarantee to our clients that we take measures to protect their IP, nor can we manage from where their IP is accessed and by whom (a username and password is all it takes).

Ultimately, it means we cannot take on clients that protect their IP.

Here’s something from Microsoft to combat this.

  • https://azure.microsoft.com/en-gb/overview/azure-ip-advantage/


Shot Building

This week I’ve had a number of eurekas with regard to shot building.

The primary eureka was the fact that we are missing an ingredient. When building a shot, we need it built according to some criteria, such as what it is meant to be used for once built. Rendering? Animation?

Therefore, we cannot simply associate a number of subsets with a given shot. Some subsets are interchangeable, such as a pointcache and a rig + animation curves. The pointcache is relevant when building for rendering, whereas the animated rig is used for layout and continued animation.

So there’s a new type of asset, called Group Asset or Asset Group.

{
    "_id": "iou3r08uio3rt",
    "name": "heroGroup",
    "members": {
        "geometry": [
            "hero:modelDefault.ma",
            "hero:modelLow.ma",
            "hero:rigDefault.ma",
            "shot1:hero01.curves",
            "shot1:hero01.abc"
        ],
        "shader": [
            "hero:lookdevDefault.ma",
            "hero:lookdevAnimation.ma"
        ]
    },
    "presets": {
        "animation": {
            "geometry": "shot1:hero01.curves",
            "shader": "hero:lookdevAnimation.ma"
        },
        "rendering": {
            "geometry": "shot1:hero01.abc",
            "shader": "hero:lookdevDefault.ma"
        },
        "layout": {
            "geometry": "hero:rigDefault.ma",
            "shader": "hero:lookdevAnimation.ma"
        }
    }
}

With this content, I’m able to load an asset alongside things related to said asset, and have a registry of which other subsets are replaceable with any of the subsets that were loaded.

The ingredients are all there, the definition - {asset}:{subset}.{representation} - the presets and two slots, geometry and shader.
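
To make the idea concrete, here is a minimal sketch in Python of how a build step might pick subsets out of such a group for a given purpose. It is not the actual Loader API; the names parse_member, resolve_preset and alternatives are hypothetical, and the only things taken from the document above are the {asset}:{subset}.{representation} naming and the members/presets layout.

```python
import re
from typing import NamedTuple

# Matches "{asset}:{subset}.{representation}", e.g. "shot1:hero01.abc"
MEMBER_PATTERN = re.compile(
    r"^(?P<asset>[^:]+):(?P<subset>[^.]+)\.(?P<representation>.+)$"
)

# Abbreviated copy of the group document shown above
HERO_GROUP = {
    "name": "heroGroup",
    "members": {
        "geometry": [
            "hero:modelDefault.ma",
            "hero:modelLow.ma",
            "hero:rigDefault.ma",
            "shot1:hero01.curves",
            "shot1:hero01.abc",
        ],
        "shader": ["hero:lookdevDefault.ma", "hero:lookdevAnimation.ma"],
    },
    "presets": {
        "animation": {
            "geometry": "shot1:hero01.curves",
            "shader": "hero:lookdevAnimation.ma",
        },
        "rendering": {
            "geometry": "shot1:hero01.abc",
            "shader": "hero:lookdevDefault.ma",
        },
        "layout": {
            "geometry": "hero:rigDefault.ma",
            "shader": "hero:lookdevAnimation.ma",
        },
    },
}


class Member(NamedTuple):
    asset: str
    subset: str
    representation: str


def parse_member(name):
    """Split "{asset}:{subset}.{representation}" into its three parts."""
    match = MEMBER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not an asset:subset.representation name: {name!r}")
    return Member(**match.groupdict())


def resolve_preset(group, preset):
    """Return {slot: Member} for a preset, checking each pick is a member."""
    resolved = {}
    for slot, name in group["presets"][preset].items():
        if name not in group["members"].get(slot, []):
            raise ValueError(f"{name!r} is not a member of slot {slot!r}")
        resolved[slot] = parse_member(name)
    return resolved


def alternatives(group, slot, loaded):
    """List the other members of a slot that could replace what is loaded."""
    return [name for name in group["members"][slot] if name != loaded]


if __name__ == "__main__":
    print(resolve_preset(HERO_GROUP, "rendering"))
    print(alternatives(HERO_GROUP, "shader", "hero:lookdevDefault.ma"))
```

Building for rendering then comes down to resolve_preset(HERO_GROUP, "rendering"), while alternatives gives the registry of what a loaded subset may be swapped for.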

Whether these should all reside together like this, and whether they are published to an asset or a shot is still to be discovered and depends on what is the most intuitive during production. That’s one of the things we’ll be experimenting with next week!


First Encounter with a Slow Connection

This week Rafael joined from Brazil and was the first to experience connectivity issues with our always-online approach to serving and managing assets.

Once installed, the Launcher refused to boot, and narrowing down exactly why took some work.

Here’s what we knew.

  1. The Launcher requires a live connection to Mindbender in Gothenburg
  2. The Launcher successfully connects from London
  3. Rafael had a working Internet connection
  4. The maximum time that Launcher would try to make a connection was 500 ms. Any latency above that and the GUI would barely be usable.

Here’s what we found out.

  1. There was a firewall installed that blocked all Internet access to un-registered applications, such as the Python 3.6 installation shipped with Mindbender Setup.
  2. There was > 500 ms of latency when making a connection from his computer to Mindbender in Gothenburg, which exceeded the amount of time Launcher was set to expect a connection to have been made.

I can improve the responsiveness of Launcher by deferring its queries and showing a loading indicator to let the user know that things are happening but take time. But it raises further questions about whether remote artists in less developed areas would be able to take part in a workflow involving the cloud desktop.
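
As a rough illustration of what deferring a query could look like, here is a small Python sketch. It is not the Launcher's code; run_deferred and slow_query are hypothetical names, and the 0.2 second sleep merely stands in for a round trip over a slow connection. In a real GUI the polling loop would be replaced by the toolkit's own event loop or a timer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_deferred(query, on_tick, poll_interval=0.05):
    """Run `query` in a worker thread, calling `on_tick(n)` while it is pending.

    `on_tick` is where a GUI would advance its loading indicator.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(query)
        ticks = 0
        while not future.done():
            on_tick(ticks)
            ticks += 1
            time.sleep(poll_interval)
        return future.result()


def slow_query():
    """Hypothetical stand-in for a database query over a high-latency link."""
    time.sleep(0.2)
    return ["hero", "villain"]


if __name__ == "__main__":
    assets = run_deferred(slow_query, lambda n: print("loading" + "." * (n % 4)))
    print(assets)
```

The point of the design is only that the interface keeps updating while the query is in flight, rather than giving up after a fixed 500 ms.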


Cloud Desktop - Part 2

This week I’ve taken a deeper dive into the land of working in the cloud. Here are some of my findings.


Executive Summary

Environment    Read      Write     Upload    Download  GPU                 Gene rig
Local Network  54 MB/s   105 MB/s  72 Mbit   72 Mbit   -                   6.9 - 7.2 fps
Paperspace     270 MB/s  65 MB/s   900 Mbit  600 Mbit  Quadro M4000 8GB*   6.9 - 7.2 fps
Amazon         -         -         930 Mbit  500 Mbit  GRID 4GB**          3.3 - 3.9 fps
Google         -         -         236 Mbit  413 Mbit  GRID 4GB            3.3 - 3.9 fps

Read and write are measured by copying a large file from one disk to another via the local network.

*Dedicated per machine
**Shared amongst all users of the service, such as Amazon.


Every machine confined to the cloud implies the following.

  • We control the data
    • No copying, no synchronisation
  • Graphics performance
    • NVIDIA Quadro M4000 8GB
  • Disk performance
    • 400mb/sec read/write
  • Network performance
    • 500mbit/down, 900mbit/up
  • Scale
    • Add/remove n-number of workstations in < 10 minutes
  • Software
    • We control available software
    • No relying on, or debugging, an artist workstation
    • Pre-installed and configured pipeline
  • Monitoring on all workstations.
    • Dedicate more resources to those who need it
    • Take away resources from those who don’t (to save costs)

Open Question

  • If each connection requires 10mbit of downstream Internet access, then 10 artists working simultaneously should require 100mbit of downstream Internet access.
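
The arithmetic behind this question can be sketched as follows. The 10 Mbit per connection figure is the one stated above; the assumptions that demand scales linearly with no headroom, and that all artists share a single link comparable to the 72.65 Mbit/s download measured at the studio, are mine and are exactly what remains open.

```python
import math


def required_downstream(artists, mbit_per_stream=10.0):
    """Total downstream bandwidth in Mbit/s, assuming linear scaling."""
    return artists * mbit_per_stream


def max_artists(available_mbit, mbit_per_stream=10.0):
    """How many simultaneous streams fit on a link, with no headroom."""
    return math.floor(available_mbit / mbit_per_stream)


if __name__ == "__main__":
    print(required_downstream(10))  # 100.0
    print(max_artists(72.65))       # 7
```

Under those assumptions, a link like the studio's would serve 7 artists at once, not 10.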


Local Network

Before looking at the performance of alternatives, let’s establish a baseline from what is currently available at the local studio.

Performed on a Sunday morning, expecting no additional traffic.

# Disk performance
$ fallocate -l 5GB ~/5gbfile
$ rsync -avh --info=progress ~/5gbfile /m/5gbfile  # Write
sent 5.00G bytes  received 111 bytes  105.29M bytes/sec
$ rsync -avh --info=progress /m/5gbfile ~/5gbfile  # Read
sent 5.00G bytes  received 111 bytes  54.91M bytes/sec
# Internet performance
$ wget https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py
$ python speedtest.py
Download: 72.65 Mbit/s
Upload: 73.25 Mbit/s
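
Since the cloud machines below run Windows, where rsync is not available by default, a portable version of the same copy test could look like the Python sketch below. This is not what produced the numbers above; copy_throughput is a hypothetical helper, and the demo copies a small temporary file rather than 5 GB between network drives. Note that a source file that was just written may be served from the OS cache, which flatters the result.

```python
import os
import shutil
import tempfile
import time


def copy_throughput(src, dst):
    """Copy src to dst and return throughput in MB/s (1 MB = 10**6 bytes)."""
    size = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    # Force the copy to disk so the OS write cache does not flatter the result
    with open(dst, "rb+") as f:
        os.fsync(f.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size / elapsed / 1e6


def demo(size_bytes=16 * 10**6):
    """Create a temporary file, copy it, and report (throughput, copied size)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(size_bytes))
        rate = copy_throughput(src, dst)
        return rate, os.path.getsize(dst)


if __name__ == "__main__":
    rate, copied = demo()
    print(f"{copied} bytes copied at {rate:.1f} MB/s")
```

To reproduce the measurement above, src and dst would point at a local disk and the shared drive respectively, swapped for the read test.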


Amazon Workspaces Graphics

Initialized VP2.0 renderer {
  Version : 2016.3.18.11. Feature Level 5.
  Adapter : GRID K520/PCIe/SSE2
  Vendor ID: 4318. Device ID : 4490
  Driver : .
  API : OpenGL V.4.5.
  Max texture size : 16384 * 16384.
  Max tex coords : 32
  Shader versions supported (Vertex: 5, Geometry: 5, Pixel 5).
  Shader compiler profile : (Best card profile)
  Active stereo support available : 0
  GPU Memory Limit : 4096 MB.
  CPU Memory Limit: 14591.6 MB.
}

Internet performance

Down: 500mbit/sec Up: 930mbit/sec


Disk performance

C:\Windows\system32>winsat disk -drive c
...
> Disk  Random 16.0 Read                       5.35 MB/s          5.1
> Disk  Sequential 64.0 Read                   267.73 MB/s          7.6
> Disk  Sequential 64.0 Write                  184.07 MB/s          7.2
> Average Read Time with Sequential Writes     0.946 ms          7.7
> Latency: 95th Percentile                     1.224 ms          8.2
> Latency: Maximum                             22.575 ms          7.9
> Average Read Time with Random Writes         0.981 ms          8.3
> Total Run Time 00:00:27.17


Paperspace

  • Gene rig: 6.9-7.3fps
Initialized VP2.0 renderer {
  Version : 6.3.16.4. Feature Level 4.
  Adapter : Quadro M4000/PCIe/SSE2
  Vendor ID: 4318. Device ID : 5105
  Driver : .
  API : OpenGL V.4.5.
  Max texture size : 16384 * 16384.
  Max tex coords : 8
  Shader versions supported (Vertex: 4, Geometry: 4, Pixel 4).
  Shader compiler profile : (Best card profile)
  Active stereo support available : 0
  GPU Memory Limit : 8192 MB.
  CPU Memory Limit: 29176 MB.
}
OpenCL evaluator is attempting to initialize OpenCL.
Detected 1 OpenCL Platforms: 
 0: NVIDIA Corporation. NVIDIA CUDA. OpenCL 1.2 CUDA 8.0.0.
 Supported extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_khr_gl_event
OpenCL evaluator choosing OpenCL platform NVIDIA Corporation.
Choosing OpenCL Device Quadro M4000.  Device Type: GPU  Device is available.

Disk performance

C:\Windows\system32>winsat disk -drive c
...
> Disk  Random 16.0 Read                       188.59 MB/s       7.7
> Disk  Sequential 64.0 Read                   1068.87 MB/s      8.5
> Disk  Sequential 64.0 Write                  454.21 MB/s       8.1
> Average Read Time with Sequential Writes     0.453 ms          8.1
> Latency: 95th Percentile                     0.676 ms          8.5
> Latency: Maximum                             113.290 ms        7.6
> Average Read Time with Random Writes         0.507 ms          8.7
> Total Run Time 00:00:08.94


Google Compute Cloud

I only briefly experimented with Google Compute Cloud due to the limited GPU currently available, which is barely able to support OpenGL 2.0 (released in 2004). A more capable GPU (NVIDIA GRID) is rumored to be made available soon.

C:\Windows\system32>winsat disk -drive c
...
> Disk  Random 16.0 Read                       154.71 MB/s       7.5
> Disk  Sequential 64.0 Read                   1662.21 MB/s      8.9
> Disk  Sequential 64.0 Write                  292.06 MB/s       7.7
> Average Read Time with Sequential Writes     0.556 ms          7.9
> Latency: 95th Percentile                     0.829 ms          8.4
> Latency: Maximum                             3.757 ms          8.6
> Average Read Time with Random Writes         0.547 ms          8.6
> Total Run Time 00:00:06.55
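
To compare the three winsat transcripts side by side without reading them by eye, the throughput lines can be pulled out with a few lines of Python. This is a sketch that assumes the line layout shown in the transcripts above; parse_winsat is a hypothetical helper and the sample is the Paperspace output.

```python
import re

# Matches e.g. "> Disk  Sequential 64.0 Read    1068.87 MB/s    8.5"
LINE = re.compile(
    r">\s*Disk\s+(?P<pattern>Random|Sequential)\s+[\d.]+\s+"
    r"(?P<op>Read|Write)\s+(?P<rate>[\d.]+)\s+MB/s"
)

PAPERSPACE = """\
> Disk  Random 16.0 Read                       188.59 MB/s       7.7
> Disk  Sequential 64.0 Read                   1068.87 MB/s      8.5
> Disk  Sequential 64.0 Write                  454.21 MB/s       8.1
> Average Read Time with Sequential Writes     0.453 ms          8.1
"""


def parse_winsat(output):
    """Return {(pattern, operation): MB/s} for the throughput lines."""
    results = {}
    for line in output.splitlines():
        match = LINE.search(line)
        if match:
            key = (match["pattern"].lower(), match["op"].lower())
            results[key] = float(match["rate"])
    return results


if __name__ == "__main__":
    for key, rate in parse_winsat(PAPERSPACE).items():
        print(key, rate)
```

Latency lines are skipped on purpose, since they are reported in milliseconds rather than MB/s.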


Streaming Protocol Alternatives

One of the obstacles to working in the cloud, aside from IP protection issues, is latency. Both Google and Amazon offer Teradici support out of the box, whereas Paperspace wrote their own streaming protocol to combat latency on media-heavy material.

In addition to what is being provided out of the box, there are a number of third-party options.


USB Sharing Alternatives

As you can’t plug a Wacom tablet into your cloud computer, we need some other way of enabling use of dedicated devices with our cloud machines.

Paperspace provides an option to stream a USB device over the Internet and access it via your cloud computer as though it was local. The problem is performance.

I plugged a USB hard disk into it, and the read performance topped out at 200kbit/sec. That’s barely enough to play compressed video, let alone work against it.

My Wacom tablet, however, did not work. MetaPipe CTO Ethan Estrada confirmed that this is a difficult challenge, having experienced similar issues at Sony, where remote desktops are in active use daily.

Neither Teradici nor Amazon offers any option for USB forwarding, but there are alternatives.

  • https://www.flexihub.com
  • http://www.usb-over-network.com


Next Week

I’m coming to Sweden!

If we can overcome the latency of a cloud desktop workflow, we still have IP protection to worry about. However, IP protection is at risk even without cloud desktops, merely by employing artists outside of a protected environment. I fear Mindbender may never scale without confining resources and data to a protected space. So we need to have a serious conversation about where Mindbender is going, and whether the gain of working remotely and utilising cloud technology, such as Amazon Workspaces and Google Compute Cloud, is worth the cost of turning down clients who require their IP to be protected.

Whether or not we overcome the above issues, next week I also aim to tackle workflow optimisation in the form of shot building and a rendering and compositing workflow, including a Nuke integration.