ezyang’s blog

the arc of software bends towards understanding

A short note about functional linear maps

Some notes collected from a close read of Conal Elliott's Compiling to Categories and The Simple Essence of Automatic Differentiation.

A colleague of mine was trying to define a "tree structure" of tensors, with the hope of thereby generalizing the concept to also work with tensors that have "ragged dimensions." Let's take a look:

Suppose we have a (2, 3) matrix:

tensor([[1, 2, 3],
        [4, 5, 6]])

One way to think about this is that we have a "tree" of some sort, where the root of the tree branches to two subnodes, and then each subnode branches to three nodes:

       /- ROOT -\
  ROW 1          ROW 2
 /  |  \        /  |  \
1   2   3      4   5   6

Suppose you wanted to define this data structure in Haskell. One obvious way of going about doing this is to just say that a matrix is just a bunch of nested lists, [[Float]]. This works, true, but it isn't very illuminating, and it is certainly not type safe. Type safety could be achieved with sized vectors, but we are still left wondering, "what does it mean?"

Often, inductive definitions fall out of how we compose things together, in the same way that the inductive data structure for a programming language tells us how we take smaller programs and put them together to form a larger program. With matrices, we can think of a pictorial way of composing them, by either attaching matrices together vertically or horizontally. That gives us this vocabulary for putting together matrices, which would let us (non-uniquely) represent every matrix (Compiling to Categories, Section 8):

data Matrix
  = Scalar Float
  | Horizontal Matrix Matrix
  | Vertical Matrix Matrix

But what does it mean? Well, every matrix represents a linear map: if A : (n, m) is your matrix, the linear map is the function R^m -> R^n defined by f(x) = A x. (We'll call a linear map from a to b Linear a b.) So the question we ask now is, what does it mean to "paste" two matrices together? It's a way of composing two linear maps together into a new linear map:

-- A function definition does not a category make!  You have to
-- prove that the resulting functions are linear.

horizontal :: Linear a c -> Linear b c -> Linear (a, b) c
horizontal f g = \(a, b) -> f a + g b

-- In matrix form:
--              [ a ]
-- [ F  |  G ]  [ - ] = [ F a + G b ]
--              [ b ]

vertical :: Linear a c -> Linear a d -> Linear a (c, d)
vertical f g = \a -> (f a, g a)

-- In matrix form:
-- [ F ]         [ F a ]
-- [ - ] [ a ] = [  -  ]
-- [ G ]         [ G a ]

Now we're cooking! Notice that the pasting shows up in the type of the linear map: if we paste horizontally, that just means that the vectors this linear map takes in have to be pasted together (with the tuple constructor); similarly, if we paste vertically, we'll produce output vectors that are the pasted results.

Cool, so we can add some type indexes, and write Linear as a GADT to refine the indices when you apply the constructor:

data Linear a b where
  Scalar :: Float -> Linear Float Float
  Horizontal :: Linear a c -> Linear b c -> Linear (a, b) c
  Vertical :: Linear a c -> Linear a d -> Linear a (c, d)
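To check that this data type really does denote linear maps, here is a minimal sketch of an interpreter for it. The names apply, addOut and example are mine, not from the papers. addOut recovers addition on the output type from the shape of the map itself, which is one way to avoid putting a type class constraint on Horizontal; other encodings are certainly possible.

```haskell
{-# LANGUAGE GADTs #-}

data Linear a b where
  Scalar     :: Float -> Linear Float Float
  Horizontal :: Linear a c -> Linear b c -> Linear (a, b) c
  Vertical   :: Linear a c -> Linear a d -> Linear a (c, d)

-- Addition on the output type, read off from the structure of the map.
addOut :: Linear a b -> b -> b -> b
addOut (Scalar _)       = (+)
addOut (Horizontal f _) = addOut f
addOut (Vertical f g)   = \(c, d) (c', d') -> (addOut f c c', addOut g d d')

-- Interpret the syntax as the function it stands for.
apply :: Linear a b -> a -> b
apply (Scalar k)       = \x -> k * x
apply (Horizontal f g) = \(a, b) -> addOut f (apply f a) (apply g b)
apply (Vertical f g)   = \a -> (apply f a, apply g a)

-- The (2, 3) matrix from the top of the post: two rows pasted vertically,
-- each row being three scalars pasted horizontally.
example :: Linear (Float, (Float, Float)) (Float, Float)
example = Vertical (row 1 2 3) (row 4 5 6)
  where
    row x y z = Horizontal (Scalar x) (Horizontal (Scalar y) (Scalar z))
```

Applying example to the nested tuple (1, (1, 1)) sums each row, giving (6, 15). Note that the nesting of the tuple is a choice made by how the row was pasted together, which is the non-uniqueness mentioned earlier.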

Is this the end of the story? Not quite. There are many ways you can go about combining linear maps; for example, you could (literally) compose two linear maps together (in the same sense as function composition). It's true that you can paste together any matrix you like with the data type above; how do we decide what should and shouldn't go in our language of linear maps?

To this end, Conal Elliott calls on the language of category theory to adjudicate. A category should define identity and function composition:

identity :: Linear a a
identity a = a

-- In matrix form: the identity matrix

compose :: Linear b c -> Linear a b -> Linear a c
compose g f = \a -> g (f a)

-- In matrix form: matrix multiply

We find that Horizontal and Vertical are the elimination and introduction operations of cocartesian and cartesian categories (respectively).
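As a sanity check on the "matrix multiply" comment, here is a small sketch that represents a linear map directly by the function it denotes (an assumption made for this example only; rot and scaleXY are made-up maps) and composes two of them.

```haskell
-- Assumption for this sketch: a linear map is just its underlying function.
type Linear a b = a -> b

identity :: Linear a a
identity = \a -> a

compose :: Linear b c -> Linear a b -> Linear a c
compose g f = \a -> g (f a)

-- Matrix [[0, -1], [1, 0]]
rot :: Linear (Float, Float) (Float, Float)
rot (x, y) = (-y, x)

-- Matrix [[2, 0], [0, 3]]
scaleXY :: Linear (Float, Float) (Float, Float)
scaleXY (x, y) = (2 * x, 3 * y)

-- Corresponds to the product [[2, 0], [0, 3]] * [[0, -1], [1, 0]] = [[0, -2], [3, 0]]
both :: Linear (Float, Float) (Float, Float)
both = compose scaleXY rot
```

Feeding in the basis vectors (1, 0) and (0, 1) reads off the columns of the product matrix: (0, 3) and (-2, 0).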

But does this mean we should just slap Identity and Compose constructors onto our data type? Linear map composition is a computationally interesting operation: if we just keep it around as syntax (rather than doing what is, morally, a matrix multiply), then it will be quite expensive to do operations on the final linear map. Where do representable functors come in? I'm not exactly sure how to explain this, and I've run out of time for this post; stay tuned for a follow up.

  • May 15, 2019

Microsoft Surface Book 2

Long time readers of mine may be aware that I used a ThinkPad X61T for the past decade. After the hinge on my second instance of the machine broke, I decided it was finally time to get a new laptop. And I had my eye on one particular model, after Simon Peyton Jones showed me his new laptop at the last Haskell Implementors' Workshop: the Microsoft Surface Book 2. It fits my primary requirement for a laptop: it converts into tablet mode and has a digitizer pen. The pen is not Wacom branded but it has an eraser end and can magnetically attach to the laptop (no enclosure for the pen, but I think that for modern hardware that constraint is unsatisfiable.) Furthermore, there is a Linux enthusiast community around the device, which made me feel that it would be more likely I could get Linux to work. So a few weeks ago, I took the plunge, and laid down three grand for my own copy. It has worked out well, but in the classic Linux style, not without a little bit of elbow grease.

A quick review

The good:

  1. I've managed to get all of the "important" functionality to work. That's Xournal with XInput pen and hibernate (though with some caveats.)
  2. Linux support for other random features has pleasantly surprised me: I managed to get a working CUDA install and drivers (for PyTorch development), ability to boot my Linux partition bare metal as well as from a VM in Windows and I can even detach the screen while booted into Linux.
  3. The keyboard is nice; not as good as a classic ThinkPad keyboard, but it has real function keys (unlike the MacBook Pro I use at work.)
  4. Two standard USB ports as well as a USB-C port means I don't need dongles for most usage (unlike my Macbook Pro, which only has USB-C ports.)

The bad:

  1. (Updated on March 19, 2019) Suspend is really slow. Although jakeday's setup.sh suggests that suspend is not working, something is working, in the sense that if I close my laptop lid, the laptop goes into a low power state of some sort. But it takes quite a long time to suspend, an even longer time to restart, and you still have to click past the bootloader (which makes me seriously wonder if we are actually suspending).
  2. The laptop un-hibernates itself sometimes when I put it in my backpack. My current hypothesis is that the power button is getting pushed (unlike most laptops, the power button is unprotected on the top of the screen). Probably some fucking around with my ACPI settings might help but I haven't looked closely into it yet.
  3. It's a high DPI screen. There's nothing wrong with this per se (and you basically can't buy a non-high DPI laptop these days), but any software that doesn't understand how to do high DPI (VMWare and Xournal, I'm looking at you) looks bad. The support of Ubuntu Unity for high DPI has gotten much better since the last time I've attempted anything like it, however; if I stick to the terminal and browser, things look reasonable.
  4. The function key is hardwired to toggle fn-lock. This is somewhat irritating because you have to remember which setting it's on to decide if you should hold it to get the other toggle. I'm also feeling the loss of dedicated page-up/page-down keys.
  5. Apparently, the NVIDIA GPU downthrottles itself due to thermal sensor shenanigans (something something the fan is on the motherboard and not the GPU so the driver thinks the fan is broken and throttles? Mumble.)
  6. The speakers are... OK. Not great, just OK.
  7. It's too bad Microsoft opted for some custom charger for the Surface Book 2.

Linux setup

I did a stock install of the latest Ubuntu LTS (18.04) dual boot with Windows (1TB hard drive helps!), and then installed jakeday's custom Linux kernel and drivers. Some notes about the process:

  • I spent a while scratching my head as to why I couldn't install Linux dual-boot. Some Googling suggested that the problem was that Windows hadn't really shut down; it had just hibernated (for quick startup). I didn't manage to disable this, so I just resized the Windows partition from inside Windows and then installed Linux on that partition.

  • Don't forget to allocate a dedicated swap partition for Linux; you won't be able to hibernate without it.

  • The Surface Book 2 has secure boot enabled. You must follow the instructions in SIGNING.md to get signed kernels.

  • One consequence of generating signed kernels is that, if you have both the unsigned and signed kernels installed, update-initramfs -u will update the initrd for your unsigned kernel, meaning that you won't see your changes unless you copy the initrd over! This flummoxed me a lot about the next step...

  • If you want to use the NVIDIA drivers for your shiny NVIDIA GPU, you need to blacklist nouveau. There are plenty of instructions on the internet but I can personally vouch for remingtonlang's instructions. Make sure you are updating the correct initrd; see my bullet point above. Once this was fixed, a standard invocation of the CUDA installer got me a working nvidia-smi. Note that I manually signed the NVIDIA driver using the instructions here since I had already generated a private key, and it seemed silly to generate another one because NVIDIA's installer asked me to.

  • Once you install the NVIDIA drivers, you have to be careful about the opposite problem: Xorg deciding it wants to do all its rendering on the NVIDIA card! The usual symptom when this occurs is that your mouse input to Linux is very laggy. If you have working nvidia-smi, you can also tell because Xorg will be a running process on your GPU. In any case, this is bad: you do NOT want to use the dGPU for plain old desktop rendering; you want the integrated one. I found that uncommenting the sample Intel config in /etc/X11/xorg.conf.d fixes the problem:

    Section "Device"
        Identifier  "Intel Graphics"
        Driver      "intel"
    EndSection

    But this doesn't play too nicely with VMWare; more on this below.

  • Sound did not work (it was too soft, or the right speaker wasn't working) until I upgraded to Linux 5.0.1.

  • After enabling XInput on my fork of Xournal, it did not start working until I restarted Xournal. Eraser worked right out of the box.

  • Don't forget to make a swap partition (Ubuntu default installer didn't prompt me to make one, probably because I was installing as dual-boot); otherwise, hibernate will not work.

  • Sometimes, when waking up from hibernate, networking doesn't work. Mercifully, this can be fixed by manually reloading the WiFi kernel module: modprobe mwifiex_pcie and systemctl restart NetworkManager.service. More discussion on this issue.

  • Sometimes, when waking up from hibernate/suspend, I get a big thermometer icon. When I reboot again it goes away but I have lost my hibernate/suspend. Perplexing! I don't know why this happens.

Boot via VM

The sad truth of life is that the Windows tablet experience is much better than the Linux experience--to the point where many would just install Windows and then boot Linux from a virtual machine (or Windows Subsystem for Linux). This was a non-starter for me: a bare metal boot of Linux was necessary to get the best pen input experience. However, why not also make it possible to boot the Linux partition from VMWare running on Windows? This setup is explicitly supported by VMWare, but it took a few days of fiddling to get it to actually work.

  • First, you need VMWare Workstation Pro to actually configure a VM that accesses raw disk (although the resulting VM image can be run from the free VMWare Player). You can sign up for the thirty-day trial to get it configured, and then use Player from then on, if you like. VMWare will offer the raw disk as an option when setting up disk; pick that and select the Linux partitions on your machine.
  • The primary challenge of setting up this system is that a standard install of Linux on the Surface Book 2 doesn't have a traditional Linux boot partition; instead, it has an EFI partition. Most notably, this partition is permanently mounted by Windows on boot up, so you can't remount it for your VM. Your regular partition doesn't have a bootloader, which is why when you turn on your VM, you get kicked into network boot via PXE. The workaround I ended up applying is to make a new, fake disk (vmdk-backed) and install the boot partition onto that (you don't actually need any of the kernels or initrds, since they live on your root filesystem; only /boot/efi is mounted from the EFI partition). Of course, you have to actually set up this boot partition; the way I did it was to chroot into my partition on a rescue CD and then run grub-install /dev/sda1. In the course of fiddling, I also accidentally ran update-grub, which blew away my Windows boot option, but re-running this command when booted into Linux bare-metal fixed the problem (because the real /boot/efi will mount and thus Grub will find the Windows boot option.)
  • Some documentation about dual-boot is specific to VMWare Fusion. This is OS X specific, so not relevant to the Microsoft Surface Book 2.
  • Get yourself a bootable Linux CD (I personally use SystemRescueCd) to help debug problems in the installation process.
  • Make sure all of your /etc/fstab entries correspond to real disks, or your Ubuntu startup process will spend a while waiting for a disk that is never going to show up. I had this problem with the /boot/efi mount, because the mount was UUID based; I "fixed" it by changing the mount to be LABEL based and labeling my vmdk accordingly (I suppose it might also have been possible to change the UUID of my vmdk, but I couldn't find any reasonable instructions for doing so on Windows). Note that the volume doesn't actually have to successfully mount (mine doesn't, because I forgot to format it vfat); it just has to exist so the system doesn't wait to see if it connects at some later point in time.
  • I don't really understand how Unity decides to provide scaling options, but although it offers magnification on a bare metal boot, those options are not available when run under a VM. I get something tolerably sized (with only slight blurriness) by setting the resolution to 1680 x 1050; play around a bit with it. I have "Stretch Mode" enabled in VMWare.
  • Whether or not you can log into your account depends on your X11 configuration; if you're like me and uncommented the Intel configuration, I found this bricks my login (and you can unbrick it by commenting it out again.) How do you make both work? Don't ask me; I'm still figuring it out.

Window manager

I haven't gotten around to setting up xmonad; this is in no small part due to the fact that Unity appears to support a very rudimentary form of tiling: Windows-left and Windows-right will move windows to fill the left/right half of the display, while Windows-up will full screen a window. I might still try to get xmonad set up on 18.04, but for now it is nice not having to fight with trayer to get the standard icons.

What's next

My two top priorities for improving the state of Linux on the Surface Book 2:

  1. Rewrite Xournal with support for hDPI (how hard could it be lol)
  2. Figure out how to make suspend/hibernate work more reliably

Otherwise, I am very happy with this new laptop. One thing in particular is how much faster my mail client (still sup) runs; previously, scanning for new mail would be a crawl, but on this laptop they stream in like a flash. Just goes to show how much an upgrade going from a 1.6GHz processor to a 4.2GHz processor is :3

  • March 17, 2019

HIW’18: Let’s Go Mainstream with Eta!

My name is Rahul Muttineni, CTO of TypeLead, working on building services around a language named Eta. To get started, I'll give an overview of how the project started, and where it is now.

It started as a HSOC project. It was called GHCVM; back then we had plans of making it both on JVM and CLR... we don't think about CLR anymore. I was mentored by Edward Kmett. We got pretty good response on this, so Jo and I decided to take the risk and work on this full time.

Big thanks to the GHC team, really good work. We've worked with the codebase for two years, and the more and more we work with it, we see how much awesome stuff there is. I've learned a lot by working with the code.

What is Eta? Eta is a fork of GHC. During the GSOC project, it started off as a Haskell program that used the GHC API. Midway in the program, I found that there were certain things that I wanted to do that I couldn't do, and I spent 3-4 days setting up a fork. I'll talk about what those limitations are. Like Haskell, it's a ... language, but the key difference is that it runs on the JVM. That has its own set of challenges, primarily with respect to tail calls. The nice thing about Eta is that it runs on the JVM, and it can run a good chunk of projects just like that. lens... recently, in the last month, we got Yesod working... it's in good shape. The next really great feature of Eta is the strongly typed FFI. That works really well with the subtyping in JVM. A good chunk of the talk is about how we got that working. One of the main focuses of Eta is industrial use. GHC is focused on industrial use, and research, both. There's a tension between the two... the nice thing we have for Eta is we don't have to face that tension; it's easy to make decisions on how to add new features, because we just ask: will it help companies? If yes, we add it; otherwise we don't. (SPJ: That can be a hard question to answer!)

Haskell: Avoid success at all costs. We're not going to sacrifice core principles of language for benefit. Pursue success, at minimum cost. We want to make it successful as much as possible, but we want to make as little sacrifice as possible. That will be a little tricky...

What is Eta? What language features does it support? It started off as a fork of GHC 7.10.3. All extensions that work there, work with Eta as well. The only thing was TemplateHaskell and QuasiQuotes didn't work for a long time. We got them working 3-4 months ago. Biggest change is JavaFFI. It's GHC 7.10.3 minus the C FFI. We could have supported it: Java has JNI, but we tried to avoid it because we didn't want to make platform specific bindings to all the libraries.

Jo backported a bunch of GHC 8 features: StrictData, ApplicativeDo, OverloadedLabels. Backpack we got recently. There's a very particular reason we had to do it: it has to do with the fact that we don't have green threads by default, and we wanted to give the user a choice of threaded runtime versus blocking runtime.

The compiler? It's a fork of GHC, so all the compiler passes are the same. We just chopped off everything after STG; e.g., C-- is gone. We generate bytecode from STG. We don't do any optimizations right now, and won't need to for some time. We don't have to because in JVM, it's JIT compiled, so we don't have to optimize as much since JVM will remove a lot of the code that's not used anyway. And the driver: GHC generates object files... we decided to use JAR files. They're just zip files that package up a bunch of class files that store Java bytecodes. We also added one more mode for Uberjars. These are JAR files that are packaged up into one giant package.

I'll talk a little bit about how we implemented the REPL and Template Haskell. It works through the external-interpreter architecture. In GHC that's called iserv: it's the process that handles running the code. So the compiler will still do the typechecking and everything, but once it's done with all that stuff, GHC will generate a specific bytecode set for interpreting Haskell efficiently. Because we already generate JVM bytecodes, we didn't need that custom bytecode set; we just compile with optimizations off; that gives us JVM bytecodes, then we send them to the external process, load them up, and execute them. Implementing the REPL was pretty easy once we got all this working together. JVM has a mechanism called classloading, which is very flexible. You can download bytecodes from the network, get code at runtime. Once you load the class, it's statically compiled code, it's optimized the same, etc.

The build tool we use is Etlas. We didn't want to move too far off of GHC, so we stuck with Cabal. At the point we started using it, we forked off of Cabal 2.0. Main difference is that it lets you manage Eta versions. Etlas is almost like Stack, but it's much much closer to Cabal. We took the nice features of Stack and added them to Cabal. The other thing is that it does patch management. What we've been finding as we add more features and backport, Eta is not exactly GHC 7.10, nor is it GHC 8.0, it's a weird intermediate state, so certain packages won't compile without small changes, so we needed some system to apply those changes before we actually run the build. So we set up a GitHub repository that stores all the patch files. What etlas will do, it will get you the most recent set of patches. Then if you install a package, lens or something, it will download lens, apply the patch, and then it will build. Just recently, we were using base 4.8, and recently we upgraded to base 4.11. But we couldn't update to the new Generics infrastructure, because it slowed down compile times. So there were a bunch of packages that would check if they were GHC 8... and then use new generics. So we had to do a bunch of patching for that. But that's the kind of stuff we have to deal with.

The title of this talk is Let's Go Mainstream with Eta. I want to take a moment and say, what does that mean? "The ideas, attitudes, or activities that are shared by most people and regarded as normal or conventional." So at what point does a programming language become considered normal or conventional? It has to be used by a big company, solve a big real world problem, and people have to believe it works. That's a very complicated question, multifaceted; one part of that answer is, it should make it easier to solve real world problems than the status quo. Take for example PHP. PHP came out when there was nothing better to program dynamic web applications. It had just the minimum features required to make it useful to build these. Now everyone here is asking the question: Haskell clearly solves a lot of problems better than the status quo. So why isn't it moving forward? That's a big question, I'm going to talk about how we're approaching it.

The strategy we're using internally is that we put on a "Big Company Hat"; we pretend we're a big company with a lot of employees, millions or billions of lines, and try to figure out what problems they'll face. Some problems are crazy long build times, when trying to build huge software; dynamics where you have to make sure junior developers get up to speed... etc. That's a couple to get this conversation started.

After thinking about this a long time, we boiled it down to three basic principles, how we will develop Eta.

1. User Experience
2. Performance
3. Safety

User Experience is mainly, an emotional thing, how you feel when you use Eta technology, how you interact with it, what you feel when you get an error, psychologically. When something has good UX, we feel good. That's a very subjective thing, it can vary between different people, we have to figure out a way to standardize / make it universal. Something we forget as software and tool developers, the person developing the software is human. If they get errors persistently over time, they'll get frustrated. Machines will do what you tell them over and over again.

So what have we done in Eta to address this concern? We've done something very recently; it's not in master yet. Jo and I spent a week refactoring the error reporting part of the typechecker. It stores a list of strings; internally in GHC, there's a pretty printed data type, a list of those. The problem is we can't do postprocessing on that. So, what Jo did was make a giant data type with three hundred data constructors, one for every error message in GHC. That refactor took a week (SPJ: only a week?!) How it is now, it's decoupled: instead of storing strings in the typechecking monad, you store a data type that holds the relevant data to print out that error message. And then at the final point, you can traverse the data type; based on the presence of other errors, you can decide what to do. Now it's pattern matching on certain error patterns and reporting them nicely. This is one example. We talked about simple errors: refactoring, adding an argument, changing the type, that's one of the most common errors you'll get working with Haskell. So we focused on those first. This shows an example of a type error... 'checker', it's an IO action.

GHC would tell you, couldn't match Int -> IO () with IO (). The problem is, for people who don't know how the typechecker works, they won't be able to understand what the typechecker is doing: going argument by argument. Because of the refactor we've done, it was easy to pattern match on this particular case, and say, hey, if the user forgot to put an argument, you can print out an error message of this form. You print that an argument is missing, you highlight. (SM: You might have been missing the first argument, in this case!) That's true. It's tricky; sometimes the suggestion you give might not be right. We don't tell people what they did exactly wrong, because we don't know. This is not a perfect thing, but we try to give the best suggestion that we can. And an important feature of this: for most of how we decided this layout, we studied languages like Elm and PureScript, which have done good work in this area. PureScript and Elm both, what they do, for a certain type of error, when you're not sure what to do... e.g., our info is not complete, the user can go to a particular link and see other things that could have happened. So we don't have to flood the user with every suggestion, we just have to show the user what probably is the cause for it. And if it's a tricky case, not what we posted, in the link, we'll have the case as well.
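To make the idea concrete, here is a heavily simplified sketch of what "store data, render later" could look like. Ty, TcError and render are hypothetical names invented for this example; Eta's real data type reportedly has around three hundred constructors and its messages are worded differently.

```haskell
-- A toy type language: constructors such as Int or IO (), and functions.
data Ty = TyCon String | Fun Ty Ty
  deriving (Eq, Show)

-- Errors are kept as data instead of pre-rendered strings.
data TcError
  = Mismatch Ty Ty      -- expected type, actual type
  | NotInScope String
  deriving (Eq, Show)

showTy :: Ty -> String
showTy (TyCon c) = c
showTy (Fun a b) = showArg a ++ " -> " ++ showTy b
  where
    showArg t@(Fun _ _) = "(" ++ showTy t ++ ")"
    showArg t           = showTy t

-- Rendering can pattern match on the shape of the error. If the actual
-- type is a function whose result is the expected type, guess that one
-- argument was left out.
render :: TcError -> String
render (Mismatch expected (Fun arg res))
  | res == expected = "An argument of type " ++ showTy arg ++ " may be missing."
render (Mismatch expected actual) =
  "Couldn't match " ++ showTy expected ++ " with " ++ showTy actual
render (NotInScope v) = "Variable not in scope: " ++ v
```

For the example from the talk, render (Mismatch (TyCon "IO ()") (Fun (TyCon "Int") (TyCon "IO ()"))) produces the missing-argument hint, while an unrelated mismatch falls through to the generic message. As SM points out, the hint is only a guess.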

(BG: There's other information that might be relevant; expanding type synonyms, etc. Do you have this info?) We're still thinking about that. Probably we'll have extra flags and stuff. Eventually, we'll have a mode that prints out JSON for IDEs, then it's easier to parse on the IDE side. (BG: Incidentally, there's a ticket, a student working with Richard, trying to figure out something similar).

Another aspect of UX is we added the REPL. Tried to simplify the entry point, try to make it easy. You want types, kinds, and where to find out more information. This is a statically typed language: you always have to be thinking about types. So we :set +t: always print out the types when you print things. One more thing, one of the former Scala engineers, has been learning Haskell, and he made a critique of one aspect of the REPL experience. f is a function of two arguments. In a second statement of the REPL, I applied 1. Find instance, show instance, for a goes to a. He said that... no show instance found, just say that this is a function, and you can't print it. That's a change we did. This was very easy for us to do.

Performance: it can mean many things. We're talking about fast developer feedback loop. Compile time and develop time, reducing that feedback loop. Some work we've done in this direction is reproducible builds. As of now, we have bit-for-bit reproducibility in Eta. That amounted to... Nikita already did lots of work on reproducibility, he made Haskell interface reproducible; but the last mile of bit for bit is hard, there's many places. For our code generator, it was a lot simpler, we didn't have to do as much. It was 20 lines of code to make it deterministic. The main source of nondeterminism in GHC is the Unique data type, that changes between different runs depending on environment. What we did, was we added a counter. We used to print the uniques in the Java class name; that will make it nondeterministic. So we made a counter: the order in which the bindings make it to STG is the same.
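A rough sketch of the counter idea, under the assumption (stated in the talk) that bindings reach the code generator in the same order on every run. Binding and classNames are made-up names for this illustration and do not reflect Eta's actual code generator.

```haskell
-- Hypothetical stand-in for a binding: its source name plus a compiler
-- unique that may differ from run to run.
data Binding = Binding { bindName :: String, bindUnique :: Int }

-- Name each generated class from the binding's position in the (stable)
-- STG order, ignoring the unique entirely.
classNames :: [Binding] -> [String]
classNames bs = zipWith mk [1 :: Int ..] bs
  where
    mk i b = bindName b ++ "$" ++ show i
```

Two runs that assign different uniques to f and g still produce the same class names, because only the position in the list is used.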

GHCi is known to take up lots of memory, esp. with IDE. Simon Marlow has a bunch of fixes to that; we also backported those.

Another aspect of performance is the actual runtime performance. We're on the JVM, that puts us at a huge disadvantage. We don't have control over many things. The runtime system... this is Java. It's OO, so the runtime system is implemented in Java. We set up a hierarchy for values that are defined in Eta. We have Closure, it's a class, parent class of all values, thunks, WHNF. The Closure class has two methods. evaluate, evaluate to WHNF, and enter will actually enter... it's similar to GHC runtime system. The initial version was modeled exactly after GHC, except for tail calls. The terminology is similar. It's primarily used when you do the body of function. The main subclasses of Closure are Thunk and Value. Value will be the parent class of things like functions, partially applied functions, and data constructors. Thunk will be the superclass of things like CAFs, single entry thunks, and updatable thunks. CAFs don't have free variables, so there's a special case for that, and you create a blackholing entry every time, to avoid two threads evaluating the same thunk. UpdatableThunk pushes an update frame; when it's finished evaluating, it will update the thunk to point to the newly computed value. And SingleEntryThunk, they're evaluated only once, so you can just evaluate it directly without pushing an update frame. This terminology is borrowed from GHC as well.
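The difference between the two kinds of thunk can be modeled in a few lines. This is only an illustration written in Haskell with an IORef standing in for the mutable thunk object; the real runtime is a Java class hierarchy, and blackholing and update frames are not modeled here.

```haskell
import Data.IORef

-- A simplified model of the hierarchy: a closure is either already a
-- value, a thunk that overwrites itself with its result, or a thunk that
-- is known to be entered at most once.
data Closure a
  = Value a
  | UpdatableThunk (IORef (Either (IO a) a))
  | SingleEntryThunk (IO a)

evaluate :: Closure a -> IO a
evaluate (Value v) = return v
-- Entered at most once, so there is nothing to gain from remembering the result.
evaluate (SingleEntryThunk compute) = compute
evaluate (UpdatableThunk ref) = do
  state <- readIORef ref
  case state of
    Right v      -> return v            -- already updated: reuse the value
    Left compute -> do
      v <- compute
      writeIORef ref (Right v)          -- the "update" step
      return v
```

Evaluating the same UpdatableThunk twice runs its computation only once; the second call just reads the stored value.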

Values: DataCon, Function and PAPs. In the early days, and even now, every function call that was a tail call is just a method call. This is the only way to make it remotely efficient. (More on stack soon). For static tail recursive calls: singly recursive or mutually recursive, they get compiled to loops. In most cases, they get a nice tight loop. In the mutual case, what will happen is, we collect all of the SCC, and we make one giant method that goes into a loop. Let's say you're in the even/odd example, what will happen is, when even calls odd, there's a variable called target, an integer. Even will be assigned 0, odd is assigned 1, so then you set 1 and restart. (BG: Do you always have unfoldings available for functions you compiled?) This is mutually recursive functions defined in the same module. (SPJ: They might have very different type arguments.) We cat all the arguments into one. The main problem with this approach is parsers generated with Happy and Alex: we hit limits. (BG: Crash?) Not stack blowup. JVM has method size limit, so you can only have 65000 bytecodes. That's Eta compiled with itself. That's the only thing that's preventing us from using Eta with Eta. But all you need to do is split the method into smaller chunks.
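The even/odd transformation can be written out by hand. The sketch below is an illustration of the shape of the generated code, not Eta's output: the two mutually recursive functions become one self-recursive worker with an integer target selecting which body to run, which is what a loop with a target variable amounts to. It assumes non-negative inputs.

```haskell
-- target 0 means "we are in even", target 1 means "we are in odd".
evenOddLoop :: Int -> Int -> Bool
evenOddLoop = go
  where
    go :: Int -> Int -> Bool
    go 0 0 = True              -- even 0 = True
    go 0 k = go 1 (k - 1)      -- even k = odd (k - 1): set target to 1 and restart
    go 1 0 = False             -- odd 0 = False
    go 1 k = go 0 (k - 1)      -- odd k = even (k - 1): set target to 0 and restart
    go _ _ = error "unknown target"

isEven, isOdd :: Int -> Bool
isEven = evenOddLoop 0
isOdd  = evenOddLoop 1
```

Every call to go is a self tail call, so the whole strongly connected component runs as a single loop. The price is that all the bodies live in one method, which is where the method size limit bites.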

So how do we handle tail calls? That's when we know it's tail recursive; let's say you don't. Let's say you're using CPS. It's so common in Haskell, any fast parser uses CPS. In the early days, Aeson would just blow the stack, it was pretty bad. So, we explored trampolining by default, and it was just awful, it was slow, super slow. What we did is turn it off, and let the stack blow up. We found a better solution. The JVM has... the only way to unwind the stack is throwing an exception, or returning, and keep on returning until you return all the way down. It turns out, with exceptions, you can turn off the feature that captures the stack trace: that's the most expensive part of an exception. So we have a general exception. So this trampoline mechanism is optional. So, what we do, we have a function 'trampoline :: a -> a', a runtime primitive; what it does is activate a boolean in the context which says, I'm going to trampoline now, and it activates a codepath that turns on a counter, and once you reach a certain number, which is configurable, it will unwind the stack, and then continue where it needed to go. Our limit is 400, and then we unwind. It used to be in the 1000s, but with Happy and Alex, we needed a smaller number. (BG: Inside that context, how much does it cost?) But observably, it's faster. A couple months ago, we got PureScript to work in Eta, and it wasn't bad by default. (SPJ: So you could turn it on by default: all you're doing is counting.) The counting is how we know how big the stack is. In your main function, you could call trampolineIO, and trampoline your whole program. (SPJ: Maybe it's low overhead, and you can do it all the time.) If it's low, we will do it. (How do you resume? Once you raise the exception, what do you store?) The counter happens at the entry point, and it's guarded by the boolean. So, if the limit is exceeded, it will call another function that takes the context. So we store all the arguments in a context variable that gets passed to every Eta function. We stash all the arguments into a function that has the state, then when it unwinds, marked by this function, it will call that, with that function and those arguments.
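The mechanism described here can be sketched in Python. This is only an illustration under assumptions, not Eta's runtime: Context, Unwind, enter and count_down are hypothetical names, and the sketch is only valid when every counted call is a tail call, so that discarding the frames loses no pending work.

```python
class Unwind(Exception):
    """Raised only to unwind the stack; the resume state lives in the context."""


class Context:
    def __init__(self, limit=400):
        self.trampolining = False  # the guard boolean
        self.limit = limit         # configurable depth at which we unwind
        self.depth = 0             # the counter
        self.pending = None        # (function, args) stashed before unwinding

    def enter(self, fn, *args):
        """Called at the entry point of every function."""
        if self.trampolining:
            self.depth += 1
            if self.depth > self.limit:
                self.pending = (fn, args)
                raise Unwind()


def trampoline(ctx, fn, *args):
    """Run fn(ctx, *args), unwinding and resuming whenever the limit is hit."""
    ctx.trampolining = True
    ctx.depth = 0
    try:
        while True:
            try:
                return fn(ctx, *args)
            except Unwind:
                # The stack is now shallow again; continue where we left off.
                fn, args = ctx.pending
                ctx.depth = 0
    finally:
        ctx.trampolining = False


def count_down(ctx, n, acc):
    ctx.enter(count_down, n, acc)
    if n == 0:
        return acc
    return count_down(ctx, n - 1, acc + 1)  # a tail call Python will not optimize
```

With this in place, trampoline(Context(), count_down, 100000, 0) completes even though the same call without the trampoline would exceed Python's default recursion limit.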

As I mentioned, it's guarded by a boolean. The JVM has an optimization where, if it observes that the boolean stays false most of the time, it won't even compile that branch in the native code. So if you don't use trampolining, it doesn't affect you at all; the code for the counter will just not be there.

One nice thing I like about Eta is that you actually get stack traces for exceptions. This is because, to get good perf for Eta, you have to implement most primitives on the JVM stack. This is a sample stack. You have a schedule loop, and you are evaluating some IO action. applyV/applyN, these are partial applications. Execute an IO action. And another nice part, we've tried to encode it close to the original name. So you can tell this function call happened in statistics.Regression, rnfAll. If you look, you'll notice there are line numbers. This is not perfect, and we can definitely make it better later... GHC gives you a lot of debugging info at STG time, but because the JVM doesn't have much flexibility, we can only attach one line number to code, so we have to discard all that info. This will get better; we'll stash that debug information in the classfile itself, and then access it and render a better stack trace. (BG: This is Source Notes?) Yeah.

Concurrency: One nice part is, it's nice or not. If you're evaluating a long chain of thunks, you're going to blow the stack. This happily coincides with GHC also having a space leak. Neil Mitchell wrote a blog post about how to detect space leaks: restrict stack size and then figure out which thunk was being evaluated. If you see a stack trace like this, and you see a huge chain of evaluates, in a long chain, you probably have a space leak.

How do I do interop? The way we did interop was, we made a thing called the Java monad. It's supposed to give you the experience of programming Java. The basic implementation is inspired by the IO monad. Object# c is "this", the object that is being threaded through. Because of this encoding, you get the Java experience: you can call dot on the Java object. It's almost like working with Java inside. The argument is called... that's the type constructor that forced us to fork, instead of using the API. You can't declare primitive types in the API. And we had to introduce a new low level representation. Declare wrapper types, wrapping the iterable interface in Java. We've stolen better syntax, which was type applications... resolve it somehow. I'm declaring an Eta type that wraps a Java type, @java.lang.Iterable.

You use the java function to run the Java monad. All of these have to be imported. newArrayList, newInteger, but we brought some combinators that let you call methods. It worked out with the monad. This is sample code that does the same thing as the Java code. It just uses standard monadic combinators. If it's a fixed c, it's an instance.

You can use Eta as a better Java, with referential transparency! Unlike Kotlin or Scala.

How do we handle subtyping? We define built in type families. We have a typeclass named Extends. Any time you declare a function that takes a given class and any subtype of that class, instead of actually subtyping, we do it with constraints. Extends' takes the info from Inherits and figures it out. You can use the dot operator on anything that is a subclass of Iterator. We had to extend the typechecker just a little bit: a lot of times the type gets stuck in the form Extends' (List JString) (List a) where a is unconstrained.

Imports are tiresome, so we're setting up direct Java interop; actually use Java reflection to get info from class files, and generate imports. "import java java.lang.Math" works, but doesn't scale. Biggest priority for the rest of the year is Java interop, really good IDE support, documentation, language extensions: UnboxedSums, TypeApplications, DerivingVia, QuantifiedConstraints. We have some new language extensions in mind, AnonymousRecords, RowTypePolymorphism... We'll see how that goes.

I was thinking about ways... we work on the same codebase, how to collaborate? We're interested in compile performance, support for unboxed sums. Worker wrapper has some glitch, and no one got around to fixing it. At some point, maybe not any time soon, that and mutable fields. Pretty important for us. (BG: Do unboxed sums get a lot of usage? Why unboxed sums? Does Eta code make a lot of use?) No. But a lot of people on the JVM are annoyed that Maybe is boxed all the time. But if you have unboxed sums, you can represent it as null. (SPJ: Or you can say, just box it, and you won't notice it. If it's fast enough all the time, focus on what's going to make a difference.)

Q: Did you consider using Graal (it's a new virtual machine that supports partial evaluation and partial escape analysis, good for functional languages)?

A: We have looked into it, it's not completely there yet to use, and we're not sure if it's something we can invest time in. We're keeping up with it. (BG: But you lose the JVM!) That's what's preventing us from going there. Maybe if it gets integrated into a mainline VM we might look at it. (Mainline Java is planning to integrate Graal.)

Q: (SM) Are you keeping the fork up to date with master GHC?

A: One thing that is out of bounds for us, and for a long time, is all the dependent Haskell work. Everything else, we keep up. If there's any nice bugfixes... (SM: So you're selectively backporting).

Q: (BG) Have you considered unforking?

A: Not yet, no.

  • September 23, 2018

A year into Backpack

It's been a year since I got my hood and gown and joined Facebook (where I've been working on PyTorch), but while I've been at Facebook Backpack hasn't been sleeping; in fact, there's been plenty of activity, more than I could have ever hoped for. I wanted to summarize some of the goings on in this blog post.

Libraries using Backpack

There's been some really interesting work going on in the libraries using Backpack space. Here are the two biggest ones I've seen from the last few months:

unpacked-containers. The prolific Edward Kmett wrote the unpacked-containers package, which uses the fact that you can unpack through Backpack signatures to give you generic container implementations with hypercharged performance (15-45%), way better than you could get with the usual boxed representation. A lot of discussion happened in this Reddit thread.

hasktorch. hasktorch, by Austin Huang and Sam Stites, is a tensor and neural network library for Haskell. It binds to the TH library (which also powers PyTorch), but it uses Backpack, giving the post Backpack for deep learning from Kaixi Ruan new legs. This is quite possibly one of the biggest instances of Backpack that I've seen thus far.

Backpack in the Ecosystem

Eta supports Backpack. Eta, a JVM fork of GHC, backported Backpack support into their fork, which means that you can use Backpack in your Eta projects now. It was announced in this Twitter post and there was some more discussion about it at this post.

GSOC on multiple public libraries. Francesco Gazzetta, as part of Google Summer of Code, is working on adding support for multiple public libraries in Cabal. Multiple public libraries will make many use-cases of Backpack much easier to write, since you will no longer have to split your Backpack units into separate packages, writing distinct Cabal files for each of them.

Backpack in GHC and Cabal

By and large, we haven't changed any of the user facing syntax or semantics of Backpack since its initial release. However, there have been some prominent bugfixes (perhaps fewer than one might expect), both merged and coming down the pipe:

  • #13955: Backpack now supports non-* kinds, so you can do levity polymorphism with Backpack.
  • #14525: Backpack now works with the CPP extension
  • #15138: Backpack will soon support data T : Nat signatures, which can be instantiated with type T = 5. Thank you Piyush Kurur for diagnosing the bug and writing a patch to fix this.
  • A fix for Cabal issue #4754: Backpack now works with profiling

Things that could use help

Stack support for Backpack. In Stack issue #2540 I volunteered to implement Backpack support for Stack. However, over the past year, it has become abundantly clear that I don't actually have enough spare time to implement this myself. Looking for brave souls to delve into this; and I am happy to advise about the Backpack aspects.

Pattern synonym support for Backpack. You should be able to fill a signature data T = MkT Int with an appropriate bidirectional pattern synonym, and vice versa! This is GHC issue #14478. We don't think it should be too difficult; we have to get the matchers induced by constructors and check they match, but it requires some time to work out exactly how to do it.

  • July 14, 2018

A compile-time debugger that helps you write tensor shape checks

A run-time debugger allows you to see concrete values in a program, make changes to them, and continue running your program. A compile-time debugger allows you to see symbolic values in a program, reason about them, and write the rest of your program, e.g. filling in missing tensor size checks.

Here's an example of a compile-time debugger in action.

Let's suppose you are writing a simple program to read a pair of tensors from two files and do a matrix multiply on them. "Easy," you think, while writing the following program:

main() {
  x = load("tensor1.t")
  y = load("tensor2.t")
  return matmul(x, y)
}

However, there is a twist: this matrix multiply is an unchecked matrix multiply. If you pass it tensors which cannot be validly multiplied together, this is undefined behavior. Your compiler has cottoned on to this fact and is refusing to compile your program. You fire up your compile-time debugger, and it drops you to the point of your program right before the error:

# Ahoy there Edward!  I stopped your program, because I could not
# prove that execution of this line was definitely safe:

   main() {
     x = load("tensor1.t")
     y = load("tensor2.t")
->   return matmul(x, y)

# Here's what's in scope:

  _x_size : List(Nat)  # erases to x.size()
  _y_size : List(Nat)  # erases to y.size()
  x : Tensor(_x_size)
  y : Tensor(_y_size)

# I don't know anything else about these values

Let's take a careful look at the variables in scope. Our compile-time debugger tells us the type of a variable x by writing x : t, meaning that x has type t. We have all sorts of ordinary types like natural numbers (Nat) and lists of natural numbers (List(Nat)). More interestingly, a tensor is parametrized by a list of natural numbers, which specifies its size at each dimension. (For simplicity, the underlying field of the tensor is assumed to be fixed.)

Our debugger has a command line, so we can ask it questions about the types of things in our program (:t is for type):

> :t 1
# Here's the type of 1, which you requested:
Nat

> :t [1, 2, 0]
# Here's the type of [1, 2, 0], which you requested:
List(Nat)

> :t matmul
# Here's the type of matmul, which you requested:
forall (a, b, c : Nat). (Tensor([a, b]), Tensor([b, c])) -> Tensor([a, c])

The type of matrix multiplication should make sense. We say a matrix multiply takes two 2-D tensors of sizes AxB and BxC, and produces a tensor of size AxC. An equivalent way of phrasing, as was done in the type above, is to say, “for any natural numbers A, B and C, matrix multiply will take a tensor of size AxB and a tensor of BxC, and give you a tensor of size AxC”.
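One way to read this type is as a function on size lists. The following is a minimal Python sketch of that reading, checked at run time rather than compile time; matmul_shape is a hypothetical helper, not part of any real tensor library.

```python
from typing import List


def matmul_shape(x_size: List[int], y_size: List[int]) -> List[int]:
    """Return [a, c] when x_size == [a, b] and y_size == [b, c]; raise otherwise."""
    if len(x_size) != 2 or len(y_size) != 2:
        raise ValueError("matmul expects two 2-D tensors")
    a, b = x_size
    b2, c = y_size
    if b != b2:
        # No single b satisfies both [a, b] = x_size and [b, c] = y_size.
        raise ValueError(f"inner dimensions differ: {b} vs {b2}")
    return [a, c]
```

For example, matmul_shape([2, 3], [3, 4]) yields [2, 4], while matmul_shape([2, 3], [4, 5]) raises because no choice of b works for both arguments.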

It is also instructive to see what the type of load is:

> :t load
# Here's the type of load, which you requested:
String ~> exists (size : List(Nat)). Tensor(size)

We do not know what the dimensions of a tensor loaded from a file are; all we can say is that there exists some size (list of natural numbers), which describes the tensor in question. Our compile-time debugger has helpfully given us names for the sizes of our tensors in scope, _x_size and _y_size, and has also told us how to compute them at runtime (x.size() and y.size()).

Enough of this. Let's remind ourselves why our program has failed to typecheck:

> matmul(x, y)

# I'm sorry!  I was trying to find values of a, b and c which
# would make the following equations true:
#     [a, b] = _x_size
#     [b, c] = _y_size
# But I don't know anything about _x_size or _y_size (they are skolem
# variables), so I couldn't do it.  Cowardly bailing out!

The compiler is absolutely right. We don't know anything about the size of x or y; they could be 2D, or they could be 100D, or not have matching dimensions.

As an aside: sometimes, it's OK not to know anything about the sizes. Consider the case of adding a tensor to itself:

> :t add
# Here's the type of add, which you requested!
add : forall (size : List(Nat)). Tensor(size) -> Tensor(size) -> Tensor(size)

> add(x, x)

# This type-checked OK!  I set size = _x_size and all of the arguments
# checked out.  You're good to go.

We don't know anything about _x_size, but add doesn't care; it'll take any List(Nat), and _x_size is certainly one of those.

Back to business. We are going to insert dynamic checks which will refine our knowledge of x and y, until it is obvious that matrix multiply will succeed.

What is a dynamic check? Operationally, a dynamic check tests whether or not some condition is true, and aborts if it is not. If we successfully run the dynamic check, we now have new information about the symbolic types in our scope. So for example, after adding a runtime test that two numbers are equal, we can subsequently assume at compile time that the numbers are equal:

> :t assert_eq_nat!
(x : Nat) -> (y : Nat) ~> x = y

First things first, we'd like to assert that our tensors are 2D tensors:

> assert_eq_nat!(len(_x_size), 2)

# OK!  I added assert_eq_nat!(len(x.size()), 2) to your program, and
# here's what I know now:

  _x_size : List(Nat)
  _y_size : List(Nat)
  x : Tensor(_x_size)
  y : Tensor(_y_size)
  len(_x_size) = 2

# By the way, I can profitably destruct _x_size into its constituent
# parts; would you like to do this? (Y/n)

> Y

# OK, your new context is this:

  _x0, _x1 : Nat
  _y_size : List(Nat)
  x : Tensor([_x0, _x1])
  y : Tensor(_y_size)

# I don't know anything about the new variables _x0 and _x1, but I
# learned enough about _x_size that I was able to eliminate it from
# the context (_x_size = [_x0, _x1])

List length is a very helpful property to test against, since it greatly specifies the shape of the list in question. We can do the same for _y_size:

> assert_eq_nat!(len(_y_size), 2)

# OK!  I added assert_eq_nat!(len(y.size()), 2) to your program, and
# here's what I know now:

  _x0, _x1 : Nat  # erases to x.size(0), x.size(1)
  _y_size : List(Nat)
  x : Tensor([_x0, _x1])
  y : Tensor(_y_size)
  len(_y_size) = 2

# By the way, I can profitably destruct _y_size into its constituent
# parts; would you like to do this? (Y/n)

> Y

# OK, your new context is this:

  _x0, _x1 : Nat  # erases to x.size(0), x.size(1)
  _y0, _y1 : Nat  # erases to y.size(0), y.size(1)
  x : Tensor([_x0, _x1])
  y : Tensor([_y0, _y1])

# I don't know anything about the new variables _y0 and _y1, but I
# learned enough about _y_size  that I was able to eliminate it from
# the context (_y_size = [_y0, _y1])

We're very close now. All we need to do is assert that the inner dimensions are equal:

> assert_eq_nat!(_x1, _y0)

# OK!  I added assert_eq_nat!(x.size(1), y.size(0)) to your program.
# After doing this, I learned _x1 = _y0, so I replaced all occurrences
# of _y0 with _x1. Now the context looks like this.

  _x0, _x1 : Nat  # erases to x.size(0), x.size(1)
  _y1 : Nat  # erases to y.size(1)
  x : Tensor([_x0, _x1])
  y : Tensor([_x1, _y1])


> matmul(x, y)

# This type-checked OK!  I set a = _x0, b = _x1, c = _y1 and all of the
# arguments checked out.  You're good to go.

Extracting the contents of this session back into our code, we now have:

  main() {
    x = load("tensor1.t")
    y = load("tensor2.t")
    assert_eq_nat!(len(x.size()), 2)
    assert_eq_nat!(len(y.size()), 2)
    assert_eq_nat!(x.size(1), y.size(0))
    return matmul(x, y)
  }
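For comparison, here is a rough Python rendering of the same three checks guarding an unchecked multiply. The Tensor class, assert_eq_nat, unchecked_matmul and checked_matmul are toy definitions invented for this sketch; they only mimic the pseudocode above.

```python
class Tensor:
    """A toy dense tensor: a size list plus row-major flat data."""

    def __init__(self, size, data):
        self._size = list(size)
        self.data = list(data)

    def size(self, dim=None):
        return list(self._size) if dim is None else self._size[dim]


def assert_eq_nat(x, y):
    """Dynamic check: abort unless x == y."""
    if x != y:
        raise AssertionError(f"expected {x} == {y}")


def unchecked_matmul(x, y):
    """Assumes x is [a, b] and y is [b, c]; performs no checks of its own."""
    a, b = x.size()
    _, c = y.size()
    out = [
        sum(x.data[i * b + k] * y.data[k * c + j] for k in range(b))
        for i in range(a)
        for j in range(c)
    ]
    return Tensor([a, c], out)


def checked_matmul(x, y):
    assert_eq_nat(len(x.size()), 2)     # x is 2-D
    assert_eq_nat(len(y.size()), 2)     # y is 2-D
    assert_eq_nat(x.size(1), y.size(0))  # inner dimensions agree
    return unchecked_matmul(x, y)
```

The difference from the imagined debugger is that nothing here tells you the three checks are sufficient; in Python you find out only by running the code.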

At this point, I must come clean: the compile time debugger I've described above doesn't actually exist. But it is not all that different from the proof modes of interactive proof assistants the automated theorem proving community works with today. But unlike theorem proving, we have a secret weapon: when the going gets tough, the tough turns into a runtime check. Conventional wisdom says that automated theorem proving requires too idealized a setting to be useful in writing software today. Conventional wisdom is wrong.

  • April 6, 2018

Online/offline continuous integration

Raise your hand if you've ever put one of these commands in your continuous integration scripts:

  • apt install somepackage
  • pip install -r requirements.txt or pip install somepkg
  • conda install blah
  • cabal update or cabal install blah
  • git clone https://github.com/someguy/somerepo
  • wget http://some-website/thingy-latest.tgz

Can you tell what the problem is? These commands are not reproducible: depending on when you run them, they may give different results. More insidiously, most of the time they give you the same result (or, perhaps, a different result that still works for your use case).

I know, we need a reproducible build! The prevailing answer to this problem by tooling authors has been to seize the means of production and replace it with something that is reproducible. If you live in the npm/yarn ecosystem, lockfiles ensure all of your dependencies redownload the same way every time (except when it doesn't). If you live in the Stack ecosystem, Stackage distributions ensure that you get the same Hackage package every time you build (except when it doesn't...). If you live in the Nix ecosystem, it means literally replacing the packaging system for everything on your system to achieve reproducibility.

So, it seems:

  1. If you can live entirely within the walled garden of the tools you use, things are pretty reproducible, but you're still on your own when it comes to taking updates on your dependencies.
  2. As soon as you step outside of the garden, it's entirely up to you to ensure reproducibility. The "easy way" out is usually not reproducible.

What if we change the question? We have entered this discussion under the assumption that reproducibility is our terminal value. But it's not: it's the mechanism by which we can achieve other goals. In the setting of continuous integration, what we really care about is a system that gives us signal about whether or not a given change set is correct or breaks things. A non-reproducible build interferes with this goal only in the sense that it's harder to tell if a change set has broken things if some random dependency has updated itself and broken your build. If this happens, you are blocked: you won't get clean signal until you fix the dependency problem. Broken window theory demands you drop everything and fix the build.

Clearly, we don't care if our dependencies are getting silently upgraded as development proceeds; in fact, we might prefer it, because "automatic" is less friction than "manual", at least when it works. What we do care about is the ability to block the upgrade if it is known to break us or revert the upgrade if we find out later that it caused some breakage.

Online/offline continuous integration. We traditionally think of the continuous integration build as a single pipeline which, when run from beginning to end, gives us signal about whether or not our code works. But I think it is better to think of a CI pipeline as dividing into two phases:

  1. Online environment configuration. In this stage, you download all of the external software that depends on that fiddly third-party world, setting up a complete build environment. Once you are done, you snapshot this environment by some mechanism (e.g., taking a filesystem snapshot or making a Docker image).
  2. Offline actual build and test. In this stage, within the snapshotted environment from step (1), turn off your Internet connection and run the actual build and test.

The key is that you don't have to run step (1) every build (you didn't want to anyway, for performance reasons). Instead, the series of immutable snapshots of build environments generated by step (1) gives you the ability to revert or peg to a particular version of all of your dependencies, without having to go and make the universe reproducible. You can have a weekly cronjob rebuilding your environment, running the tests, and only deciding to push the active snapshot forward if everything passes. You don't have to actually turn off the Internet when you run step (2), but it might help keep you honest.
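The promote-only-if-green policy can be sketched in a few lines of Python. This is a toy model, not a real CI system: SnapshotStore, refresh and pin are hypothetical names, and build_env and run_tests stand in for whatever actually builds the environment (say, a Docker image) and runs the offline test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SnapshotStore:
    """Immutable environment snapshots plus a pointer to the active one."""

    snapshots: List[str] = field(default_factory=list)
    active: Optional[str] = None

    def refresh(self, build_env: Callable[[], str],
                run_tests: Callable[[str], bool]) -> bool:
        """Build a new snapshot online; promote it only if offline tests pass."""
        candidate = build_env()            # step (1): talks to the network
        self.snapshots.append(candidate)   # kept either way, for inspection
        if run_tests(candidate):           # step (2): offline build and test
            self.active = candidate
            return True
        return False

    def pin(self, snapshot: str) -> None:
        """Revert or peg to a known snapshot."""
        if snapshot not in self.snapshots:
            raise KeyError(snapshot)
        self.active = snapshot
```

A weekly job would call refresh; every ordinary build would simply run inside whatever active currently names.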

Think offline. In today's connected world, it's easy to build systems with the assumption that you are always connected to the Internet. Doing so, however, leaves your tool at the mercy of the sound and fury of the real world. By applying a simple principle: "what can I do offline; what must I do online?" we reverse-engineer a design for continuous integration that gives you something almost as good as reproducibility, without forcing you to rewrite the universe. Surely that's worth something.

  • March 12, 2018

Semantic Import Versioning in the wild

The best and worst thing about semantic import versioning is that it makes BC-breaking changes hard.

In the past few days, Russ Cox has made a splash in a series of white papers describing Go and Versioning. In them, he coins a new term, Semantic Import Versioning, distilling it to the following principle:

If an old package and a new package have the same import path, the new package must be backwards compatible with the old package.

I am very happy Russ has come up with a good name for semantic import versioning, because this concept has been out there for quite a long time, but without a concise name or formulation of its design. In fact, I would even say that semantic import versioning is inevitable when you take on the premise that you will never break user code. It is so inevitable that semantic import versioning is already practiced in the wild in a variety of places. Here are a few examples:

  • REST APIs often are versioned with explicit version numbers in the request (e.g., in the URI) to let clients specify what version of the API they want. If a client wishes to upgrade to a new version of the API, they must rewrite their API requests to a new URL. REST APIs are forced into semantic import versioning because the traditional mechanism for avoiding breakage, version bounds, is unavailable in this setting.
  • Stripe's REST API pins each of their customers to the version of their API at the time they subscribed; even if Stripe makes a BC-breaking change in the future, the API for a given customer never changes. In this case, the semantic import is still there, but it is implicit (associated with a customer account) rather than explicit (in the client code); consequently, Stripe is willing to break BC a lot more frequently than would otherwise be acceptable for a REST API. Stripe's blog post points out a very important aspect of maintaining libraries under semantic import versioning, which is that you need to put in the engineering effort to sustainably manage all of the semantic imports available to users.
  • Semantic import versioning is widely practiced in programming languages, in the form of language standards/epochs. In C++, the setting of -std=c++xx specifies a particular semantic version to be "imported". It would be unheard of for a compiler to unilaterally break backwards compatibility of -std=c++11 in a new revision of the compiler; similarly, a user must explicitly migrate to a new language standard to take advantage of any new features. Rust epochs have a similar tenor. The choice between Python 2 and Python 3 is another form of semantic import versioning.
  • Semantic imports don't have to just specify a number. Feature flags, such as {-# LANGUAGE #-} pragmas in GHC Haskell, let users opt into BC-breaking changes at their use-sites.
  • In the deep learning world, ONNX models declare a semantic import to a particular version of an operator set. Operator semantics can evolve in BC-compatible ways without bumping the version, but to take a BC-breaking change, you must update the import statement.

One insight I draw from these examples is that what we call an "import version" is really a specification for some series of implementations. To someone who has spent a lot of time thinking about module systems, this is really a step in the right direction: program against interfaces, not implementations.

Another thing we can observe from these examples is the real world consequences of semantic import versioning. One particular effect stands out: semantic import versioning is challenging for maintainers, because it pressures them to maintain multiple major release branches simultaneously (after all, who wants to use pkg/v2 only to have it immediately unmaintained when pkg/v3 comes out?). In the traditional release branch model, where one creates a release branch for each major version, only the most well-staffed software development teams can afford to maintain multiple, active release branches (backporting patches is a lot of work!). The friction involved with managing multiple implementations means that less well staffed projects will have a strong pressure to never break backwards compatibility.

This may not sound like such a bad thing to the "don't break my stuff" grumps in the audience, but a lot of bugs and security problems have stemmed from being literally unable to outlaw harmful and dangerous APIs with BC-breaking changes. The danger of moving the calculus further towards preserving backwards compatibility is a further entrenchment of bad "first try" APIs. So while I do not deny that a genius of Russ's framing is to describe semantic versioning as part of the package path, it also sets up a bad expectation for the feasibility of BC-breaking changes, when what we should be doing is improving the state of tooling so that making a BC-breaking change is "no big deal." To me, the most promising way to reduce the friction of a BC-breaking change is to organize your software development so that a single codebase, under a single build, implements multiple specifications (v1, v2 and v3). As we saw from the examples, compilers can manage this (GCC supports multiple C++ versions), but traditional programming languages make it hard for libraries to do the same thing.

I don't know exactly how to solve this problem, but I do have a few ideas:

  1. Treat specifications as data. This means you can write code that operates over a specification, and for example, automatically generate the boilerplate necessary to forward from one implementation to another.
  2. Don't ask programmers to manually write diffs. I would never ask you to make a source code change by writing a diff by hand, just because this is the best representation for a VCS to store. Instead, you would just make the edit, and expect the system to figure it out. BC-breaking changes to APIs follow the same principle; it is much simpler and easier to understand if you just make the change, rather than write a description of the change.
  3. Package level modularity. In a traditional package management system, I can't release a single bundle of source code which presents multiple "package interfaces". Even in vgo, even if I have a shared codebase implementing v1 and v2, I still have to make two releases to publish a new version of code. This is backwards; there is no reason a single unit of code cannot provide multiple interfaces, and package tools should make this possible.
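As a small illustration of the first idea, here is a Python sketch in which older specifications are plain data and the forwarding boilerplate is generated from them, so one implementation serves both v1 and v2. Everything here (area_v2, SPECS, make_forwarder, API) is invented for the example and is not a proposal for any particular package manager.

```python
# The single implementation, written against the newest specification.
def area_v2(width, height, unit="m"):
    return f"{width * height} {unit}^2"


# Each specification is data: how its parameter names and defaults map onto v2.
SPECS = {
    "v1": {"rename": {"w": "width", "h": "height"}, "defaults": {"unit": "m"}},
    "v2": {"rename": {}, "defaults": {}},
}


def make_forwarder(spec, impl):
    """Generate the adapter that presents `impl` under the given specification."""
    def forward(**kwargs):
        translated = {spec["rename"].get(k, k): v for k, v in kwargs.items()}
        for name, value in spec["defaults"].items():
            translated.setdefault(name, value)
        return impl(**translated)
    return forward


# One codebase, one build, several interfaces.
API = {version: make_forwarder(spec, area_v2) for version, spec in SPECS.items()}
```

A v1 caller writes API["v1"](w=2, h=3) and a v2 caller writes API["v2"](width=2, height=3, unit="cm"); both end up in the same function.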

These are maybe a bit too radical to expect Go to adopt them, but perhaps the next generation of programming languages will explore this design space further.

  • February 23, 2018

Systems ML workshop panel

  • JG: Joseph Gonzalez
  • GG: Garth Gibson (CMU)
  • DS: Dawn Song (UC Berkeley)
  • JL: John Langford (Microsoft NY)
  • YJ: Yangqing Jia (Facebook)
  • SB: Sarah Bird
  • M: Moderator
  • A: Audience

M: This workshop is bringing together ML and systems. Can you put your place on that spectrum? Who is your home community?

YJ: Right in the middle. I'd like to move more towards systems side, but Berkeley Parallel Labs kicked me out. ML is my home base.

JL: ML is where I come from, and where I will be, but I'm interested in systems. My home is NIPS and ICML.

DS: My area is AI and security, did computer security in the past, now moving into AI.

GG: Systems.

JG: I started out in ML, working on probabilistic methods. I basically, in middle of PhD, looked at systems. Now I'm moving to being a systems person that does ML.

M: We've seen a proliferation of deep learning / ML frameworks that require a lot of dev effort, money, time to put in. Question: what is the role of academia in doing research in this area? What kind of large scale ML learning can you do?

GG: I liked YJ's answer last time.

YJ: The thing that is astonishing is that academia is the source of so many innovations. With all due respect, we did very good work in Google, but then Alex came out with 2 GPUs and nuked the field. Academia is the amazing place where we find all of the new ideas, and industry scale it out.

JL: Some examples. If you're coming from academia, maybe you don't have research at a big company, but it's an advantage as you will spend time on the right algorithm for solving it efficiently. And that's what will win in the long run. Short term, they'll brute force with AutoML. Long run, the learning algorithms are going to be designed where they won't have parameters. A common ML paper is "we eliminate this hyperparameter". When they're more automatic, more efficient, great things will happen. There's an advantage in being resource constrained, as you will solve things in the right way.

Another example is, the study of machine learning tells us that in the future we will regard any model that you just learned and deployed as inherently broken and buggy, as data collection is not part of the process of training, deploying. It will decay and become irrelevant. The overall paradigm of ML where you're interacting with the world, and learning, that can be studied easily in academia, and that has huge implications about how you're going to design systems.

DS: People often talk about how in a startup, the best thing is to not raise a ton of money; if you're resource constrained you're more focused and creative. ML is really broad, there's lots of problems. Right now we learn from lots of data, but lots of talks at NIPS, humans have amazing ability to learn from very few examples. These are problems for academia to tackle, given unique resource constraints.

GG: I'll say, it's difficult to concentrate on top accuracy if you don't have enough data, and the data available to students is stuff like DAWNbench which tends to lag. In academia, we build relationships with industry, send students for internships, they get the ability to do big data, while exploring first principles in university. It's a challenge, but open publishing and open sharing of code make the world more bearable.

JG: The one thing I've struggled with is focusing on human resources. I have grad students; good students, focused on a key problem, can make a lot of progress. We struggle with a lot of data. Struggle with RL really is here, we can build simulators to build at this scale. Being able to use simulation to get data; be creative, find new and interesting problems.

M: Follow-up on process. I think a lot of you have tried to publish ML in your communities. Are they equipped to appreciate work properly; what is a common reason they don't appreciate it?

JG: Publishing ML in systems, or vice versa, is hard. It goes both ways. These communities are not equipped to evaluate work in other field. ML in systems, where if you saw here, it was surprising. Or vice versa, wouldn't have done well in systems venue as systems. The failure mode I see, is systems community doesn't appreciate extreme complexity. In ML, I have this very sophisticated thing, and reducing them to their essential components. ML tries to overextend their complexity as an innovation. More broadly, each of these communities has their own biases how they look at research. One thing I've noticed, it's gotten better. Systems is better at evaluating, and at this workshop, people are pushing research in an advanced way.

GG: I'm old, so I've seen creation of conference before. So, you start off with an overlap of areas. In my prior life, it was the notion of storage as a research area, rather than app of devices. You start off, send submission in. The PC has two people that know anything about it, and they aren't assigned, and the reviews are sloppy, and you get one conference that do a little better, but other conferences don't read it. I faced this with fault tolerance, database, OS communities, they don't read each other's stuff. You get enough mass, get a conference that focuses in the middle; reviewing and PC that have seen most of the good work in the area. That's hard, but we're on the edge of doing it in SysML. We're doing the right thing to do competitive, on top of state of the art.

M: Is that the only solution, or can we mix up PCs?

GG: I've seen a lot of experiments to try it. You can end up with permanently fractured communities.

JL: Joey and Dawn are area chairs at ICML. I have found the ML community to be friendly to system type things. There's an area chair for systems. Hopefully papers get assigned appropriately.

M: We're not good about that at systems.

DS: About ML and security, we have this problem. In security, we also have very small percentage of ML, and the committee, if you submit ML, it's very hard to find people who can review the paper, and as a consequence, the review quality varies highly. Similar in terms of security in ML, similar problems. It's interesting to think about why this happens and how to solve the problem. In general, sometimes the most interesting work is the interdisciplinary areas. ML and systems, security, and examples I see, including machine learning in systems... so, one thing I actually can understand is, within each community, even though the review quality varies, I can see from committee's perspective, really what they want is papers that are more meaningful to community, help people get exposed to this new area, fostering new exploration. That's part of natural progression. As time goes on, there's more cross-pollination.

JG: We are launching a SysML conference. I had a little bit of reservations: ML is getting better at systems, but now I have to decide where I'm going to send a paper. A lot of papers we see in ML is going to have systems.

GG: When you have a new conference area, not all work is sent there. Overlapping, you have a favorite conference, your heroes, and you'll send your most exciting work to that root conference. No problem.

YJ: SysML is great, and this is how it comes out. New fields, it warrants new conferences.

M: Do you think ML expert needs to also be a systems expert? Does such a person who lies at that intersection have a different way of looking? Or you come up with a nice algorithm, and you

JL: It's not OK to have a wall.

There's many way learning algorithms can be changed. The problem with having a wall, if you don't understand, throw engineer. But if you can bridge to understand, they're not artifacts, you can break open and modify. That can let you achieve much better solutions.

GG: Agreed, but what happens initially is you reach over to other side, you put it into system, and it's my innovation that redundancy makes fault tolerance, even though it's fairly pedestrian from the other side. If it is a substantial improvement, it is worth doing. We all grow up.

JG: We need a wall, but we're going to constantly tear it down. Matlab in grad school, we made jokes about it, and MKL community would make it fast. Then they said we are going to build ML for distributed computing algorithms, and ML would write class algorithms for system. That waned in the dev of pytorch, TF, etc., which leveled up abstraction. The stack is building up again; systems community to make more efficient. Well, fp could change, and that could affect algorithm. So we're tearing it down again. But systems is about designing the wall.

YJ: It's more like a bar stool. It's a barrier, but we don't have to be both to do anything, but you need it to make it efficient. A story: a training system we looked at, SGD. That person found a very nicely rounded number: 100. But people frown, you should round to 128. Understanding and improving the common core for CS and engineering, that helps a lot for people to have good sense for how to design ML algorithms.

M: There's a lot of talk about democratizing AI, and all of you have helped that process. What is a truly democratic AI landscape look like, and how far are we from that world.

YJ: I plead guilty in participating in framework wars. When reading CS history, one thing that's pretty natural, when a field is starting, there's all sorts of standards, protocols. FTP, Gopher, and now in the end HTTP took over, and everything runs on HTTP. Right now, there's all kinds of different abstractions; boiling it down, everyone is doing computation graph, optimization. I look forward to when we have one really nice graph representation, protocol for optimizing graphs. It's not a rosy dream, because in compilers we have that solution, LLVM. I don't know if we'll reach that state but I think one day we'll get there.

JL: You have AI/ML democratized when anyone can use it. What does that mean, a programmer has a library, or language constructs, which they use routinely and easily; no issues of data getting mismatched or confused or biased. All the bugs people worry about in data science; those are removed from the system because the system is designed right and easy to use. The level beyond that is when somebody is using a system, that system is learning to adapt to you. There's huge room for improvement in how people interact. I don't know how often there's a rewrite rule driving me crazy; why can't it rewrite the way I want. People can signal info to a learning algorithm, and when those can be used effectively to assist people, you have democratized AI.

DS: I have a very different view of democratizing AI. I think it's interesting to think about what democratization here really means. For systems people, it's about making it easier for people to do learning, to use these libraries, platforms. But that's really just providing them with tools. For me, I give talks on democratizing AI, we are looking at it from a completely different perspective. Code: even, whoever controls AI will control the world. So who controls AI? Even if you give everyone the tools, push a button, but they don't have the data to do the training. So who controls the AI today, and tomorrow? It's Facebook, Microsoft, Google... so for me, democratization means something totally different. Today, they collect data, train models, and they control who has access to the model, and users can get recommendations, but not direct access to models. We have a project to actually democratize AI, where users can control their data. Combining blockchain and AI, where users can donate their data to a smart contract, where the smart contract will specify the terms; e.g., if you train a model, the user can use the model, and if the model produces profits, the user can get part of the profits. The smart contract can specify various incentive terms; e.g., if the data is better than others, they can get more profits, and other mechanisms. A developer will supply the ML training algorithm, and get benefits when it is trained well. We are decentralizing the power of AI; users will be able to get direct access to models and use them. In this case, I hope for an alternate future, where big companies can continue with business, but users by pooling their data in a decentralized fashion, will see actual true democratization of AI; they will access the power of AI. Not just use tools.


GG: I think that a lot of what's meant in democratizing AI is how can you move from a small number of people innovating, to a large number. Tool development and standards. We're close to being there. An example from the past was VLSI paint boxes. Up until a certain point, only an EE could really develop hardware at all. They took a lot of effort and time to make sure it could make it through every part without very much crosstalk. A group came together and thought, well, there are some design rules. This lets you build hardware pretty easily. I could paint green/red boxes, hardware months later, worked. It never worked as fast as that EE guy, so there would always be a place for it, but it would let us build a RISC computer, and ship it. We were in the game, we could innovate, and do it. The tools we're trying to build right now can build on statistical.

JG: When I started PhD, we did integrals and derivatives by hand. Automatic differentiation was a huge step forward. I blame that for the explosion of papers. A first year can build something far more complex than what I could do. That's moving AI forward, on algorithms side.

The data side is interesting, and that is one where I think about in systems. There's a lot of opportunities to think about how security interacts, leveraging hardware to protect it, markets to sell/buy data from sources, and protect the data across a lot of places. I would argue we're making a substantial amount of progress in how we think about algorithms.

M: When I think about democratizing pervasive AI, recent questions that have been consuming our minds, interpretability, fairness, etc. Can you share... any experience where things like interpretability came up and became a problem, issue, do we have to worry about a lot more in ML, or systems-ML.

JG: My grad students come to me and say the models stop working. I don't know how to fix that; the process is very experimental. Tracking experiments is a big part of the process. We cared a lot about interpretable models, and that meant something very particular. Now it's explainable; we don't need to know what it did exactly, but there needs to be some connection to what we did. Interpretable, explain computation, it could be related or unrelated to the decision. That's two answers about explainability, and how we debug these systems.

GG: SOSP just happened, and they have ten years of... good copies of everything they submitted. At the end of the conference, Peter Chen took all the PDF files, and did a naive bayes classifier, and saw how well he would predict that it would be accepted. And half the things it predicted to be accepted, would be accepted.

So what did they do? They made a detector for popular authors. And so what you did is those who had succeeded, they will follow behind. I recognize this problem. You might think that you found a good way, but it's actually Nickolai Zeldovich's paper.

DS: There's a big debate. Some think it's really important, and sometimes, as long as the model works, it's fine. Our brain, we can't really explain how we arrive at certain decisions, but it works fine. And it depends on application. Some applications have stronger requirements for explainability; e.g., law and healthcare, whereas in others it's less required. Also as a whole community, there's a lot we don't understand. We can talk about causality, transparency, all related. As a whole community, we don't really understand what explainability means. Not a good definition. All these concepts are related, we're trying to figure out what's the real core. That's a really good open question.

JL: There's two different interpretations. Can you explain to a person? And that's limited; there's no explainable vision models. The other definition is debuggability. If you want to create complex systems, they need to be debuggable. This is nontrivial with a distributed system, it's nontrivial with ML. If you want to create nontrivial ML systems, you have to figure out why they're not behaving the way you want them to.

DS: Do we debug our brains?

JL: Evolution has done this the hard way for a very long way... a lot of people have bugs in their brains. I know I have bugs. I get an ocular migraine sometimes... very annoying. No, we don't debug our brains, and it's a problem.

YJ: I'm sure there's bugs in my brain; I chased chickens in my grandma's house; the chicken has one spot in its back that if you press it, it just ducks and sits there. It shuts off because of fear. We humans don't do that. But these bugs, are in our brain as well. Chasing for interpretability helps understand how things work. The old days, deep dream; this line of work started with figuring out what the gradients do, and we propagated back, and we found that direct gradient doesn't work; then we added L1 priors, and then we got pictures. This curiosity has led to the fact that convnets with random weights are codifying the local correlation; we are hardcoding the structured info in CNNs which we didn't know before. So maybe we will not achieve full interpretability, but some amount of interpretability and creativity will help.

(audience questions)

A: I'd really like to hear what Jeff said about ML for systems. As systems, I'm interested in it, but people have said, you can get far with heuristics.

JL: I think it's exciting.

GG: The index databases, when I read it for reviewing, I went, "Wow! Is that possible?" I think things like that will change the way we do systems. The novelty of the application opens a lot of people's minds. Right now we think of the machine learning tools as being expensive things that repeat what humans do easily that computers don't do well. But that's not what DB index is. We can execute it, but we're not better. But to get it half the size and twice the speed, throwing in another way of thinking about compression through a predictor is a fabulous insight.
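A minimal sketch of the idea GG describes, treating an index as a predictor: a model guesses where a key sits in a sorted array, and a bounded search around the guess corrects the model's error. This is an illustration only, assuming a single least-squares line as the model; the class name LinearLearnedIndex and its methods are hypothetical, and this is not the structure from the paper under discussion.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a fitted line predicts a key's position in a sorted list."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Least-squares fit of position as a linear function of key.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the window the model points at.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 108])
print(index.lookup(23), index.lookup(4))
```

The model replaces the upper levels of a tree with two numbers (slope and intercept); the stored maximum error is what keeps lookups correct even when the fit is poor, at the cost of a wider search window.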

JG: I tried to publish in this area for a while. For a while, systems didn't like the idea of complex algorithms in the middle of their system. Now, these days, Systems is like, "ML is cool." But where it's easier to have success is where a good prediction improves the system, but a bad prediction doesn't break the system. So scheduling, that's good. Where models can boost performance but not hurt. The work in ML to solve systems is successful.

DS: ML for systems is super exciting. I'm personally very excited about this domain, esp. for people who have done systems work, and are interested in AI. ML for systems is an amazing domain of ML. I wouldn't be surprised, I would hope to see, in five years, our systems are more ML driven. A lot of systems have a lot of knobs to tune, trial and error setting, where exactly ML can help. On these amazing techniques, RL, bandits, instead of using bandits to serve ads, we can try to autotune systems. Just like we are seeing AI transforming a lot of application domains, and a lot more intelligent system, old systems, the one we built, should be more intelligent. It's a prediction: I think we are going to see a lot of work in this domain. I think it will transform systems.

M: I work in this quite a bit. We have some successes with bandits in some settings, but there are settings that are really tough: stateful, choices, decisions influence the future, it makes it hard to apply RL, or the RL techniques take a lot of data. There are challenges, but there are successes. There are a lot of papers that apply RL in caching, resource allocation. The real question is why it's not used in production? I don't know if we have an answer to that, papers do it, it seems to be really good, but it's not that mainstream, esp. having RL all over the place. Why isn't it pervasive. That I don't see.

A: Isn't it because it's not verifiable. You want some kind of verification analysis.

GG: It's called a regression sweep. If you deploy on a lot of systems. There's a lot of money, it has to work. If it falls over, that's a lawsuit. I hired a VP of software. OK, now that I'm in charge, things are going to slow down. Every LoC is bugs, if I want low bug, I stop programmers from writing code, by making the bar very high. This is the thing Joy was talking about; they need a really compelling reason with no downsides, and then they have to pass tests before the pass. So anything stochastic has a high bar.

SB: Another thing that is happening, there aren't that many people who have understanding in both areas. It's really hard to do ML in systems without deep expertise in systems. You really need to understand to explain it.

GG: It wasn't that long since we didn't have hosted services.

M: Guardrails, you constrain the ML system to not suggest something bad. We have a scenario in MS, machines are unresponsive. How long to wait? You can do it in ML. The choices are reasonable, they're never more than the max you'd want to wait.
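The guardrail idea can be sketched in a few lines: the model proposes a value, and a fixed wrapper clamps it into a range the operators already trust. The function name, the bounds and the choice to fall back to the maximum on a NaN prediction are all assumptions made for illustration, not details from the system M mentions.

```python
import math


def guarded_wait_seconds(predicted_wait, min_wait=30.0, max_wait=600.0):
    """Clamp a model's suggested wait time into an operator-approved range.

    min_wait and max_wait are hypothetical bounds; max_wait stands in for
    "the max you'd want to wait" regardless of what the model says.
    """
    if math.isnan(predicted_wait):
        # Assumed policy: an unusable prediction falls back to the old fixed wait.
        return max_wait
    return min(max(predicted_wait, min_wait), max_wait)


print(guarded_wait_seconds(45.0), guarded_wait_seconds(1e6))
```

With this shape a bad prediction can cost some efficiency but cannot push the system outside behavior it already tolerated, which is the property JG points to for where ML in systems succeeds.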

A: On democratization. There's been a lot of talk about optimizing the models so they can bear the cost. Another is decentralizing data... but there's two very big constraints for systems and models. They cost a lot of money, and there's big variance. Because of cost, if some guy gets into programming, and does research, he won't have resources to do it. So they won't go into engineering; they'll intern at Amazon instead. So if there is some community going into lowering the barrier, democratizing, what solution is there to get people much more easily? Because there's huge economic costs. People are trying to make huge amounts of money, startups, but there's no... systems have faults with decentralization... there's just a big problem colliding and ML.

JG: We teach data, I teach data science at Berkeley. The summary is, what about the costs of getting into DL? There's cost to train models, GPUs, data, how do I get a freshman in college who is excited about this, chromebook, they can do research and explore opportunities. At Berkeley we have exactly this problem. I teach 200 students, a lot of them are freshmen, chromebook ipad as primary computer. We've built tools using Azure... we run a cloud in Azure, and on these devices they can experiment with models. They get to use pretrained models and appreciate how to ... Someone built a Russian Twitterbot detector, and saw value and opportunity in those. And then they got involved in research projects where they had more funds and tools.

JL: The right interfaces make a huge difference, because they prevent you from having bugs that prevent you from doing things. Also, DL is all the rage, but framing the problem is more important than the representation you do. If you have the right problem, and a dumb representation, you'll still do something interesting. Otherwise, it's just not going to work very well at all.

YJ: As industry, don't be afraid of industry and try it out. Back at Berkeley, when Berkeley AI was using GPUs, the requirement was that you have one project per GPU. We students, framed ten different projects, and we just asked for ten GPUs. NVIDIA came to us and asked, what are you doing? We'll just give you 40 GPUs and do research on that. Nowadays, FAIR has residency, and Google AI has residency, all of these things are creating very nice collaborations between industry and academia, and I want to encourage people to try it out. Industry has funds, academia has talent, marrying those together is an everlasting theme.

A: Going back to where do we go forward in terms of conferences, the future of this workshop; has any decision been made, where we go?

SB: This is work in progress. We're interested in feedback and what you think. We've had this workshop evolving for 10 yrs, with NIPS and ICML. Then we did one with SOSP, exciting. We are now doing a separate conference at Stanford in February. We think there's really an important role to play with workshops colocated with NIPS and ICML. We're still planning to continue this series of workshops. There's also a growing amount of systems work in ICML and NIPS, natural expansion to accept that work. The field is growing, and we're going to try several venues, and form a community. If people have ideas.

JG: More people should get involved.

M: We plan to continue this; audience is great, participation is great.

It's a panel, so I have to ask you to predict the future. Tell me something you're really excited... 50-100yrs from now. If you're alive then, I will find you and see if your prediction panned out. Or say what you hope will happen...

YJ: Today we write in Python. Hopefully, we'll write every ML model in one line. Classifier, get a cat.

JL: Right now, people are in a phase where they're getting more and more knobs in learning. ML is all about having less knobs. I believe the ML vision of less knobs. I also believe in democratizing AI. You are constantly turning ... around you, and devs can incorporate learning algorithms into systems. It will be part of tech. It's part of hype cycle. NIPS went through a phase transition. At some point it's gotta go down. When it becomes routine, we're democratizing things.

DS: It's hard to give predictions... I guess, right now, we see ML as an example, we see the waves. Not so long ago, there was the wave of NNs, graphical models, now we're back to NNs. I think... I hope that we... there's a plateauing. Even this year, I have been talking to a lot of great ML researchers, even though one can say there has been more papers written this year, when you hear what people talk about in terms of milestones, many people mentioned milestones from past years. AlexNet, ResNet, ... I do hope that we will see new innovation beyond deep learning. I do teach a DL class, but I hope that we see something beyond DL that can bring us... we need something more, to bring us to the next level.

GG: I'm tempted to point out DL is five years ago, and dotcom era was not more than five years... I think, I'm looking forward to a change in the way CS, science in general, does business, having learned from statistical AI. My favorite one is overfitting. I poorly understood overfitting, in vague stories, until ML hammered what this said. I look forward to the time when students tell me, they stopped writing code, because they were adding parameters... and they added a decent random, iid process for testing code. We're no where near there, but I think it's coming.

JG: I'm looking forward to the return of graphical models... actually not. When we're democratizing AI, but what ultimately happens, we're democratizing technology. I can walk up to Alexa and teach it. Or I can teach my Tesla how to park more appropriately. Tech that can adapt to us because it can learn; when I can explain to a computer what I want. (Star Trek but without a transporter.)

  • December 8, 2017

Accelerating Persistent Neural Networks at Datacenter Scale (Daniel Lo)

The below is a transcript of a talk by Daniel Lo on BrainWave, at the ML Systems Workshop at NIPS'17.

Deploy and serve accelerated DNNs at cloud scale. As we've seen, DNNs have enabled amazing applications. Architectures achieve SoTA on computer vision, language translation and speech recognition. But this is challenging to serve in large-scale interactive because there are latency, cost and power constraints. Also, DNNs are growing larger in size and complexity.

We've seen a Cambrian explosion in startups to solve this problem. Research groups have produced DNN processing units, DPUs, custom hardware solutions to prove high throughput efficient serving of DNNs. We categorize them into two categories: fast DPUs, where the algorithms and applications have to be fixed in at design time, because they're fabbing an ASIC, or a soft DPU, FPGA. But for soft DPUs, we haven't seen them deployed at scale.

To address this, we've been working on Project BrainWave. Solution to deploy large scale DNNs with FPGA-acceleration. We've designed it to be fast, flexible and friendly. High throughput, low latency acceleration using FPGAs. Flexibility with adaptive numerical precision, update to latest AI algorithms with reconfigurable FPGAs. And it's user friendly, because we have a full stack solution, compile CNTK/Caffe/TF and compile them down. This is deployed on our configurable cloud, an outer layer of CPUs, a data center that puts everything together, and a layer of reconfigurable FPGAs.

We've deployed DNN models. LSTM model that takes tens to hundreds of milliseconds on CPU. What we see is the 99th percentile for latency; even at 99 we are able to achieve sub-millisecond latencies. When you get to these levels of acceleration, it's negligible in the E2E pipeline.

Next I'll dive into details. It's a full stack solution, starting with a compiler and runtime that takes models in high level frameworks and compiles them down to our architecture. A flexible ISA for serving DNNs. We have high throughput, low latency serving. We do this all with persistency at scale, to keep models pinned in FPGA memories. Deployed on our wide deployment of Intel FPGAs using hardware microservices.

To begin with, let's talk about hardware microservices. This is something we presented at Micro. The architecture of reconfigurable cloud is FPGAs sit between CPU and network. CPU can use FPGA locally for acceleration, but because FPGAs are connected over network, they can distribute between them. We have a proprietary network protocol for low latency compute.

We've disaggregated FPGA compute plane from CPU. So we can aggregate FPGAs together to form larger accelerators, and you don't have to match the rate of FPGAs to CPUs. You can serve a large number of CPUs with a small cluster of FPGAs, or vice versa.

Next I'll talk about the compiler and runtime. Goal is to make it very easy for ML specialists to do this. The typical ML specialist doesn't know how to program this. Models developed in high level frameworks, compile them down to our architecture. We compile them down first into an intermediate graph based representation. We split them into portions that run on FPGAs, and portions on CPU. When we execute, we also have a runtime that handles orchestration and scheduling between the parts.

There are two main categories of DNNs we have to optimize for. DNNs that have very high compute to data ratio, convnets, these are well studied. I'm going to focus on the other class of DNNs, those with less compute to data ratio, e.g. dense layers and RNNs.

The conventional approach to accelerating DNNs on FPGAs, you keep all model parameters in DRAM. When a request comes in, you're going to stream the model parameters out of DRAM, and return a request. The issue with this is when you have DNN layers that are memory bandwidth bound, you're limited in how fast you can run this by memory bandwidth; you're not getting full compute capabilities of FPGA. Typically the way to solve this is with batching; you send a number of requests and use the model parameters for all requests. While you may achieve good throughput, latency will increase. For realtime services, this violates your SLA. What we want to do is provide high performance at low or no batching.
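The batching trade-off can be put into rough numbers with a roofline-style estimate: a dense layer takes at least as long as its arithmetic and at least as long as streaming its weights once. The sketch below is a back-of-the-envelope model with made-up hardware figures (1 TFLOP/s peak, 10 GB/s of DRAM bandwidth, 2-byte weights), not BrainWave measurements, and it ignores activation traffic and the time a request spends waiting for a batch to fill.

```python
def layer_time_s(rows, cols, batch, peak_flops, mem_bw_bytes, bytes_per_weight=2):
    """Lower bound on the time for one dense layer over a batch of inputs.

    Weights are streamed from memory once per batch and reused for every
    request in it; each weight contributes one multiply and one add per request.
    """
    flops = 2 * rows * cols * batch
    weight_bytes = rows * cols * bytes_per_weight
    compute_time = flops / peak_flops
    memory_time = weight_bytes / mem_bw_bytes
    return max(compute_time, memory_time)


# Hypothetical device: 1e12 FLOP/s peak, 1e10 bytes/s of memory bandwidth.
for batch in (1, 50, 200):
    t = layer_time_s(1000, 1000, batch, peak_flops=1e12, mem_bw_bytes=1e10)
    print(batch, t, batch / t)  # batch size, seconds per batch, requests per second
```

Under these assumed numbers a batch of 1 and a batch of 50 take the same time because both are bound by streaming the weights, so batching raises throughput for free until the layer becomes compute bound; keeping the weights in high-bandwidth on-chip memory attacks the memory term directly, which is what makes small batches fast.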

The way we do this is with persistent DNNs. FPGAs have lots of memory on chip: 10MB memory. Since they're on chip, it's high bandwidth. So we're going to keep the model parameters on the chip, so that when we get one request in, we distribute it across the entire FPGA chip.

The obvious question is, what happens if your model doesn't fit on chip? We take advantage of the hardware microservices. We'll distribute a single model over multiple FPGAs in the datacenter.

Let's look at the architecture and microarchitecture of the processing unit we developed. The BrainWave DPU is a software programmable processor, programmed in single-threaded C, but we've added a number of instructions for serving DNNs, e.g., matrix multiply, convolution, nonlinear activations, embeddings. The processor is designed to use narrow precision format (float16) and easily flexible for extending to newer algorithms.

The microarchitecture of the processor, main portion is dedicated to matrix vector unit; matrix vector multiply, consisting of a number of kernels on a tile of a larger matrix. Tiling gives us flexibility while maintaining performance. Other compute units are multifunction units; vector-vector operations, such as element-wise multiply, add and activation functions. Tying it all together is an on-chip network that lets us keep all the compute together at time.

Most of the chip is dedicated to matrix vector unit. It's composed of hundreds of multilane dot product units. Each of these dot product units consists of tens of adds and muls. To keep them fed with data, each dot product unit is fed by a set of dedicated block rams.
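As a software analogy for the tiling described above, here is a sketch of a matrix-vector multiply computed tile by tile, where each (row, tile) pair is one short dot product whose partial result is accumulated into the output. The function name and tile sizes are illustrative assumptions; the loops run sequentially here, whereas the hardware would run many dot product units in parallel.

```python
def tiled_matvec(matrix, vector, tile_rows=2, tile_cols=2):
    """Compute matrix @ vector by walking over fixed-size tiles of the matrix."""
    n_rows = len(matrix)
    n_cols = len(vector)
    result = [0.0] * n_rows
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            # Within one tile, each row is a short dot product against the
            # matching slice of the vector; edge tiles may be smaller.
            for r in range(r0, min(r0 + tile_rows, n_rows)):
                partial = 0.0
                for c in range(c0, min(c0 + tile_cols, n_cols)):
                    partial += matrix[r][c] * vector[c]
                result[r] += partial
    return result


print(tiled_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1]))
```

Because the tile shape is fixed and the matrix is cut to fit it, the same unit can serve layers of different sizes, which is one way to read "tiling gives us flexibility while maintaining performance".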

Next, I'd like to show performance results for this architecture. Two years ago, we had a deployment of Stratix V FPGAs. It shows the effective teraflops of this format. 16 bit integer.. we've been playing with our own format Microsoft Floating Point. 4.5Tflops at MSFP5.8. These Stratix are pretty old.

(Demo for latest generation of FPGAs)

Looking at throughput oriented DPU, the latency is 65.81ms. With brainwave, latency is 0.98ms. Under 1 millisecond.

This was done on initial engineering silicon. For production silicon, we're expecting to get 12TOps at 16-bit integer. 90TOps for MSFP8. One question is how numeric precision affects output. Here is the normalized accuracy for three in-house text models, using GRU and LSTM. The orange bar shows what happens when you go to MSFP9, but we've developed a way to fine tune networks for this precision, and you see we recover our accuracy. We're working with MSFP8 and see similar results.

Project BrainWave is our project for accelerating DNNs at cloud scale. We hope it will be fast, friendly and cloud-scale, and expand capabilities of AI in the cloud, providing a way to run higher dimensional RNN networks for NLP and other great applications. We're planning to release to third parties, stay tuned.

Q: When you decrease batch size, what hardware are you evaluating? Hardware utilization as we decrease?

A: We stay highly utilized even as we decrease batch size; even at high batch size, we're still sending requests one by one. (Only one step will be processed?) Right.

Q: Regarding the FP9 and FP8, nine and eight being the number of bits used? (Yes) Is it in any way related to Flexpoint at Intel?

A: We developed this independently of flexpoint, and I'm not able to talk about our numeric format.

Q: In MS, do you really write Verilog for your FPGA, or do you use high level synthesis tool?

A: For this, we are writing System Verilog

Q: Batchnorm layers, which require batch computation; how do you put that onto the FPGA?

A: Part of the work of the compiler is to do splitting between CPU and FPGA. So things that are not amenable to FPGA, including batchnorm, we're still running them on CPU.

  • December 8, 2017

MOCHA: Federated Multi-Tasks Learning (Virginia Smith)

The below is a transcript of a talk by Virginia Smith on MOCHA, at the ML Systems Workshop at NIPS'17.

The motivation for this work comes from the way we think about solving ML problems in practice is changing. The typical ML workflow looks like this. You start with a dataset and a problem to solve. Say you want to build a classifier to identify high quality news articles. Next step is to select an ML model to solve the problem. Under the hood, to fit the model to your data, you have to select an optimization algorithm. The goal is to find an optimal model that minimizes some function over your data.

In practice, there's a very important part of the workflow that is missing. For new datasets and interesting systems, the system and its properties play a large role in the optimization algorithm we select to fit the model. To give an example, in the past several years we have seen data so large that it must be distributed over multiple machines in a datacenter environment. I've been thinking about how to perform fast distributed optimization in this setting, when data is so large.

But more and more frequently, data is not coming nicely packaged in a datacenter. It's coming from mobile phones and devices, distributed across the country and the globe. Training ML in this setting is challenging. For one, whereas in a datacenter you have hundreds to thousands of machines, here you have millions and billions. Also, in a datacenter, devices have similar capability; here, you have phones that are old, low on battery, not connected to wifi. This can change the ability to perform computation at any given iteration.

Additionally, there's heterogeneity in the data itself. For privacy and computation reasons, data can become very unbalanced in the network. And it can be non-IID, so much so that there can be interesting underlying structure to the data at hand. I'm excited because these challenges break down into both systems and statistical challenges. The one second summary of this work is that we think about both systems and statistical challenges in this federated setting; the punchline is that the systems setting plays a role not only in the optimization algorithm but also in the model we select to fit. It plays a more important role in this overall workflow.

I'm going to go through how we holistically tackle systems and statistical challenges.

Starting with statistical. The goal is we have a bunch of devices generating data, could be unbalanced; some devices have more data than others. One approach used in past is fit a single model across all of this data. All of the data can be aggregated; you find one model that best achieves accuracy across all of the data simultaneously. The other extreme is you find a model for each of the data devices, and not share information. From systems point of view this is great, but statistically, you might have devices that are only ... that are poor in practice. What we're proposing is something between these two extremes. We want to find local models for each device, but share information in a structured way. This can be captured in a framework called multitask learning.

The goal is to fit a separate loss function for each device. These models can be aggregated in this matrix W, and the function of the regularizer, is to force some structure omega on it. This omega is a task relationship matrix, capturing interesting relationships, e.g., all the tasks are related and you want to learn weights, or most of the tasks are related and there are a few outliers, or there are clusters and groups, or there are more sophisticated relationships like asymmetric relationships. These can all be captured in multitask.
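As a rough sketch of what such an objective can look like, the snippet below sums a hinge loss over each device's data and adds a regularizer that couples the per-device weight vectors through Omega. The specific regularizer (lam1 * tr(W Omega W^T) + lam2 * ||W||_F^2), the hinge loss, and all function and parameter names are assumptions made for illustration; the talk does not spell them out.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def hinge(margin):
    """Hinge loss on a classification margin y * <w, x>."""
    return max(0.0, 1.0 - margin)


def mtl_objective(W, Omega, tasks, lam1=0.1, lam2=0.1):
    """Multitask objective: per-device losses plus a structured regularizer.

    W      -- list of m weight vectors, one per device (task)
    Omega  -- m x m task relationship matrix (list of lists)
    tasks  -- list of m datasets, each a list of (x, y) with y in {-1, +1}
    The regularizer form is an assumed, commonly used choice.
    """
    m = len(W)
    loss = sum(hinge(y * dot(w, x))
               for w, data in zip(W, tasks)
               for x, y in data)
    # tr(W Omega W^T) written as a sum over pairs of tasks.
    coupling = sum(Omega[s][t] * dot(W[s], W[t])
                   for s in range(m) for t in range(m))
    ridge = sum(dot(w, w) for w in W)
    return loss + lam1 * coupling + lam2 * ridge


# Two hypothetical choices of Omega for two tasks:
identity = [[1.0, 0.0], [0.0, 1.0]]        # tasks regularized independently
pull_together = [[1.0, -1.0], [-1.0, 1.0]]  # penalizes ||w_0 - w_1||^2
```

With `identity`, the coupling term reduces to an ordinary per-task penalty, which corresponds to the fully local extreme; with `pull_together`, devices are penalized for disagreeing, which moves toward the single global model.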

We developed a benchmarking set of real federated data. This includes trying to predict human activity from a mobile phone, predicting whether someone is eating or drinking, land mine detection, and vehicle sensors: distributed sensors that determine if a vehicle is passing by.

For these various datasets, we compared global, local and MTL. The goal is to fit an SVM model. For each data set, we looked at the average error across tasks, where each model is a task. What you can see is that the average error for MTL is significantly lower than for the global and local approaches. This makes sense because MTL is much more expressive; it lets you go between these extremes. What's interesting is that in these real data sets, it really helps. Reduction by half. This is a significant improvement in practice.

Given that we'd like to use multitask learning to model data in the federated environment, the next problem is to figure out how to train this in a massively distributed setting. In particular, the goal is to solve the following optimization objective. In looking at how to solve this objective, we note that it's common to solve for W and Omega in an alternating fashion. When you solve for Omega, it's done centrally; you just need access to the models. But the W update must be distributed because the data is stored across devices. The key component of solving this in practice is the W update. The challenge is that communication is extremely expensive. And because of heterogeneity, you may have massive problems with stragglers and fault tolerance; e.g., someone who turns their phone off.

The high level idea for how we're doing this, take a communication efficient method that works well in data center, and modify it to work in federated setting. It will handle MTL as well as stragglers and fault tolerance.

What is the method we're using? The method we're using is CoCoA, which is a state of the art method for empirical risk minimization problems. The thing that's nice about CoCoA is that it spans prior work on mini-batch and one-shot communication, by making communication a first class parameter of the method, so it is as flexible as possible. It does this by solving not the primal formulation, but the dual. The dual is nice because we can easily approximate it by forming a quadratic approximation to the objective, and this more easily decomposes across machines.

To distribute this to the federated setting, a key challenge is figuring out how to generalize it to the MTL framework. A second challenge: in CoCoA, the subproblems are assumed to be solved to some accuracy theta. This is nice because theta varies from 0 to 1, where 0 is an exact solve and 1 is inexact. This can be thought of as how much time you spend on local computation versus communication. However, this is not as flexible as it should be in the federated setting. There is only one theta, set for all iterations and all nodes. And because theta cannot be set exactly to one, it cannot handle fault tolerance, where no work is performed at some iteration. We want to make this communication parameter much more flexible in practice.
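To make the accuracy parameter concrete, here is a minimal sketch that measures theta for an approximate local solve. It assumes theta is the fraction of the subproblem's suboptimality gap that remains after the local solver stops, and it uses a toy separable quadratic with plain gradient descent as the local solver. The subproblem, the solver, and all names are hypothetical stand-ins, not the actual CoCoA subproblems.

```python
def G(d, a, b):
    """Toy local subproblem: G(d) = sum_i 0.5 * a_i * d_i^2 - b_i * d_i."""
    return sum(0.5 * ai * di * di - bi * di for ai, bi, di in zip(a, b, d))


def local_solve(a, b, steps):
    """Approximate solve: `steps` gradient descent steps starting from zero."""
    lr = 1.0 / max(a)  # safe step size for this quadratic
    d = [0.0] * len(a)
    for _ in range(steps):
        d = [di - lr * (ai * di - bi) for ai, bi, di in zip(a, b, d)]
    return d


def theta(a, b, steps):
    """Remaining fraction of the gap: 0 is an exact solve, 1 is no progress."""
    d_star = [bi / ai for ai, bi in zip(a, b)]  # exact minimizer
    g_star = G(d_star, a, b)
    g_zero = G([0.0] * len(a), a, b)            # value with no work done
    g_approx = G(local_solve(a, b, steps), a, b)
    return (g_approx - g_star) / (g_zero - g_star)


# Hypothetical subproblem coefficients.
a, b = [1.0, 4.0], [1.0, 2.0]
thetas = [theta(a, b, k) for k in (0, 1, 5, 50)]
```

A device that performs zero local steps has theta equal to 1, which is the case a fixed theta strictly below one cannot represent; more local steps push theta toward 0 at the cost of more local computation per round.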

How are we doing this? We developed MOCHA. The goal is to solve the multitask learning framework, W and Omega, in an alternating fashion. In particular, we're able to form the following dual formulation, similar to CoCoA, so it decomposes. In comparison, we make a much more flexible assumption on the subproblem parameter. This is important because of stragglers: for statistical reasons (unbalanced data, different distributions), subproblems can differ greatly in how difficult they are to solve. Additionally, there can be stragglers due to systems issues, and issues of fault tolerance. So this looks like a simple fix: we make this accuracy parameter more flexible, allowing it to vary by node and by iteration, and letting it be exactly 1. The hard thing is showing it converges to the optimal solution.

Under this new assumption, together with the condition that a device can't go down every single round, we show the following convergence guarantee. For L-Lipschitz losses, we get a convergence rate of 1/epsilon; for smooth models (logistic regression) we get a linear rate.

How does this perform in practice? The method is quite simple. The assumption is that we have data stored at m different devices. We alternate between solving for Omega and for W, which is stored across the devices. While we're solving the W update, it works by defining local subproblems for the machines, and calling a solver that computes an approximate solution. This is flexible because the accuracy can vary by node and iteration.
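The control flow described above can be sketched as follows. This is a deliberately simplified stand-in, not MOCHA itself: it works on a primal, mean-regularized least squares problem with one scalar weight per device, the shared mean plays the role of the central Omega step, and the local solver is plain gradient descent. The names `train`, `local_update`, and the `budget` callback are hypothetical. What it does illustrate is a per-node, per-iteration amount of local work, where a budget of zero models a device that dropped out for that round.

```python
import random


def local_update(w, w_bar, data, steps, lr=0.1, lam=0.5):
    """Approximate local solve: `steps` gradient steps on this device's data.

    The lam * (w - w_bar) term pulls the local model toward the shared mean.
    """
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        grad += lam * (w - w_bar)
        w -= lr * grad
    return w


def train(tasks, rounds, budget, seed=0):
    """Alternate a central step with distributed local updates.

    budget(t, h, rng) returns how many local steps node t performs in
    round h; returning 0 means the node did no work that round.
    """
    rng = random.Random(seed)
    W = [0.0] * len(tasks)
    for h in range(rounds):
        # Central step: needs only the current models, not the raw data.
        w_bar = sum(W) / len(W)
        # Distributed step: each node solves its subproblem approximately.
        W = [local_update(w, w_bar, data, budget(t, h, rng))
             for t, (w, data) in enumerate(zip(W, tasks))]
    return W


def flaky(t, h, rng):
    """Each node drops out half the time; otherwise does 1 + t steps."""
    return 0 if rng.random() < 0.5 else 1 + t


def node0_always_down(t, h, rng):
    """Node 0 never participates; the others do 5 steps per round."""
    return 0 if t == 0 else 5


# Hypothetical data: one (x, y) pair per device.
tasks = [[(1.0, 2.0)], [(1.0, 3.0)], [(1.0, 4.0)]]
W_flaky = train(tasks, 400, flaky)
W_stuck = train(tasks, 400, node0_always_down)
```

In this toy, random dropouts only slow things down, whereas the model of a node that never participates stays at its initial value, which is the situation the convergence assumption rules out.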

In terms of comparing this to other methods, what we've seen is the following. We compared MOCHA to CoCoA, and to Mb-SDCA and Mb-SGD. We ran a simulation with real data, to see what would happen if we did it on wifi. We plot simulated time against how close we are to optimal. What you can see is that MOCHA is converging much more quickly to the optimal solution, because MOCHA doesn't have the problem of statistical heterogeneity, and it's not bogged down by stragglers. This is true for all of the different types of networks: wifi, LTE and 3G. The blue line and MOCHA and CoCoA, they work well in high communication settings, because they are more flexible. But compared to CoCoA, MOCHA is much more robust to statistical heterogeneity.

What's interesting is what happens if we impose some systems heterogeneity, where some devices are slower than others. We looked at imposing low and high systems heterogeneity; for MOCHA, with this additional heterogeneity, there's a two orders of magnitude speedup to reach the optimal solution.

And for MOCHA in particular, we looked at the issue of fault tolerance. What we're showing here is that we're increasing the probability that a device will drop out at any iteration. Going up until half the devices drop out, MOCHA is still fairly robust, converging in almost the same amount of time. But what we see with the green dotted line is that if the same device drops out every iteration, it doesn't converge. This shows the assumption we made makes sense in practice.

The punchline is that in this new setting, training ML on these massive networks of devices, this is both a statistical and a systems issue. We've addressed it in a holistic manner. Code at http://cs.berkeley.edu/~vsmith. I also want to reiterate about the SysML conference in February.

Q: When you compare global and local? Why is it always better than global?

A: The motivation for using a local model over a global model is that if you have a lot of local data, you might perform better. It boosts the overall sample size. I have some additional experiments where we took the original data, and skewed it even further than it already was. We took the local data, and there was less data locally, and they have global approaches. That's just a function of the data in the devices.

Q: I really like how your method has guarantees, but I'm wondering about an approach where you create a metalearning algorithm locally and have it work locally?

A: That's worth looking into empirically, since you can do fine tuning locally. What we were trying to do first was converge to exact optimal solution, but you might want to just work empirically well, would be good to compare to this setting.

  • December 8, 2017