Existential Pontification and Generalized Abstract Digressions

## Systems ML workshop panel

• JG: Joseph Gonzalez
• GG: Garth Gibson (CMU)
• DS: Dawn Song (UC Berkeley)
• JL: John Langford (Microsoft NY)
• SB: Sarah Bird
• M: Moderator
• A: Audience

M: This workshop is bringing together ML and systems. Can you put your place on that spectrum? Who is your home community?

YJ: Right in the middle. I'd like to move more towards systems side, but Berkeley Parallel Labs kicked me out. ML is my home base.

JL: ML is where I come from, and where I will be, but I'm interested in systems. My home is NIPS and ICML.

DS: My area is AI and security, did computer security in the past, now moving into AI.

GG: Systems.

JG: I started out in ML, working on probabilistic methods. I basically, in middle of PhD, looked at systems. Now I'm moving to being a systems person that does ML.

M: We've seen a proliferation of deep learning / ML frameworks that require a lot of dev effort, money, and time to put in. Question: what is the role of academia in doing research in this area? What kind of large-scale ML can you do?

GG: I liked YJ's answer last time.

YJ: The thing that is astonishing is that academia is the source of so many innovations. With all due respect, we did very good work in Google, but then Alex came out with 2 GPUs and nuked the field. Academia is the amazing place where we find all of the new ideas, and industry scale it out.

JL: Some examples. If you're coming from academia, maybe you don't have the resources of a big company, but it's an advantage, as you will spend time thinking about the right algorithm for solving it efficiently. And that's what will win in the long run. Short term, they'll brute force with AutoML. Long run, the learning algorithms are going to be designed so that they won't have parameters. A common ML paper is "we eliminate this hyperparameter". When they're more automatic, more efficient, great things will happen. There's an advantage in being resource constrained, as you will solve things in the right way.

Another example is, the study of machine learning tells us that in the future we will regard any model that you just learned and deployed as inherently broken and buggy, as data collection is not part of the process of training and deploying. It will decay and become irrelevant. The overall paradigm of ML where you're interacting with the world, and learning, that can be studied easily in academia, and that has huge implications for how you're going to design systems.

DS: People often talk about in a startup, the best thing is to not raise a ton of money; if you're resource constrained you're more focused and creative. ML is really broad, there's lots of problems. Right now we learn from lots of data, but lots of talks at NIPS, humans have amazing ability to learn from very few examples. These are problems for academia to tackle, given unique resource constraints.

GG: I'll say, it's difficult to concentrate on top accuracy if you don't have enough data, and the data available to students is stuff like DAWNbench which tends to lag. In academia, we build relationships with industry, send students for internships, they get the ability to do big data, while exploring first principles in university. It's a challenge, but open publishing and open sharing of code make the world more bearable.

JG: The one thing I've struggled with is focusing on human resources. I have grad students; good students, focus on a key problem can make a lot of progress. We struggle with a lot of data. Struggle with RL really is here, we can build simulators to build at this scale. Being able to use simulation to get data; be creative, find new and interesting problems.

M: Follow-up on process. I think a lot of you have tried to publish ML in your communities. Are they equipped to appreciate work properly; what is a common reason they don't appreciate.

JG: Publishing ML in systems, or vice versa, is hard. It goes both ways. These communities are not equipped to evaluate work in the other field. ML in systems, where if you saw here, it was surprising. Or vice versa, wouldn't have done well in systems venue as systems. The failure mode I see, is systems community doesn't appreciate extreme complexity. In ML, I have this very sophisticated thing, and reducing them to their essential components. ML tries to overextend their complexity as an innovation. More broadly, each of these communities has their own biases how they look at research. One thing I've noticed, it's gotten better. Systems is better at evaluating, and at this workshop, people are pushing research in an advanced way.

GG: I'm old, so I've seen the creation of a conference before. So, you start off with an overlap of areas. In my prior life, it was the notion of storage as a research area, rather than app of devices. You start off, send submission in. The PC has two people that know anything about it, and they aren't assigned, and the reviews are sloppy, and you get one conference that does a little better, but other conferences don't read it. I faced this with fault tolerance, database, OS communities, they don't read each other's stuff. You get enough mass, get a conference that focuses in the middle; reviewing and PC that have seen most of the good work in the area. That's hard, but we're on the edge of doing it in SysML. We're doing the right thing to be competitive, on top of state of the art.

M: Is that the only solution, or can we mix up PCs?

GG: I've seen a lot of experiments to try it. You can end up with permanently fractured communities.

JL: Joey and Dawn are area chairs at ICML. I have found the ML community to be friendly to system type things. There's an area chair for systems. Hopefully papers get assigned appropriately.

M: We're not good about that at systems.

DS: About ML and security, we have this problem. In security, we also have very small percentage of ML, and the committee, if you submit ML, it's very hard to find people who can review the paper, and as a consequence, the review quality varies highly. Similar in terms of security in ML, similar problems. It's interesting to think about why this happens and how to solve the problem. In general, sometimes the most interesting work is the interdisciplinary areas. ML and systems, security, and examples I see, including machine learning in systems... so, one thing I actually can understand is, within each community, even though the review quality varies, I can see from committee's perspective, really what they want is papers that are more meaningful to community, help people get exposed to this new area, fostering new exploration. That's part of natural progression. As time goes on, there's more cross-pollination.

JG: We are launching a SysML conference. I had a little bit of reservations: ML is getting better at systems, but now I have to decide where I'm going to send a paper. A lot of papers we see in ML is going to have systems.

GG: When you have a new conference area, not all work is sent there. Overlapping, you have a favorite conference, your heroes, and you'll send your most exciting work to that root conference. No problem.

YJ: SysML is great, and this is how it comes out. New fields, it warrants new conferences.

M: Do you think ML expert needs to also be a systems expert? Does such a person who lies at that intersection have a different way of looking? Or you come up with a nice algorithm, and you

JL: It's not OK to have a wall.

There's many way learning algorithms can be changed. The problem with having a wall, if you don't understand, throw engineer. But if you can bridge to understand, they're not artifacts, you can break open and modify. That can let you achieve much better solutions.

GG: Agreed, but what happens initially is you reach over to other side, you put it into system, and it's my innovation that redundancy makes fault tolerance, even though it's fairly pedestrian from the other side. If it is a substantial improvement, it is worth doing. We all grow up.

JG: We need a wall, but we're going to constantly tear it down. Matlab in grad school, we made jokes about it, and MKL community would make it fast. Then they said we are going to build ML for distributed computing algorithms, and ML would write class algorithms for system. That waned in the dev of PyTorch, TF, etc., which leveled up abstraction. The stack is building up again; systems community to make more efficient. Well, fp could change, and that could affect algorithm. So we're tearing it down again. But systems is about designing the wall.

YJ: It's more like a bar stool. It's a barrier, but we don't have to be both to do anything, but you need it to make it efficient. A story: a training system we looked at, SGD. That person found a very nicely rounded number: 100. But people frown, you should round to 128. Understanding and improving the common core for CS and engineering, that helps a lot for people to have good sense for how to design ML algorithms.

M: There's a lot of talk about democratizing AI, and all of you have helped that process. What is a truly democratic AI landscape look like, and how far are we from that world.

YJ: I plead guilty in participating in framework wars. When reading CS history, one thing that's pretty natural, when a field is starting, there's all sorts of standards, protocols. FTP, Gopher, and now in the end HTTP took over, and everything runs on HTTP. Right now, there's all kinds of different abstractions; boiling it down, everyone is doing computation graph, optimization. I look forward to when we have one really nice graph representation, protocol for optimizing graphs. It's not a rosy dream, because in compilers we have that solution, LLVM. I don't know if we'll reach that state but I think one day we'll get there.

JL: You have AI/ML democratized when anyone can use it. What does that mean, a programmer has a library, or language constructs, which they use routinely and easily; no issues of data getting mismatched or confused or biased. All the bugs people worry about in data science; those are removed from the system because the system is designed right and easy to use. The level beyond that is when somebody is using a system, that system is learning to adapt to you. There's huge room for improvement in how people interact. I don't know how often there's a rewrite rule driving me crazy; why can't it rewrite the way I want. People can signal info to a learning algorithm, and when those can be used effectively to assist people, you have democratized AI.

DS: I have a very different view of democratizing AI. I think it's interesting to think about what democratization here really means. For systems people, it's about making it easier for people to do learning, to use these libraries, platforms. But that's really just providing them with tools. For me, I give talks on democratizing AI, we are looking at it from a completely different perspective. Code: even, whoever controls AI will control the world. So who controls AI? Even if you give everyone the tools, push a button, but they don't have the data to do the training. So who controls the AI today, and tomorrow? It's Facebook, Microsoft, Google... so for me, democratization means something totally different. Today, they collect data, train models, and they control who has access to the model, and users can get recommendations, but not direct access to models. We have a project to actually democratize AI, where users can control their data. Combining blockchain and AI, where users can donate their data to a smart contract, where the smart contract will specify the terms; e.g., if you train a model, the user can use the model, and if the model produces profits, the user can get part of the profits. The smart contract can specify various incentive terms; e.g., if the data is better than others, they can get more profits, and other mechanisms. A developer will supply the ML training algorithm, and get benefits when it is trained well. We are decentralizing the power of AI; users will be able to get direct access to models and use them. In this case, I hope for an alternate future, where big companies can continue with business, but users by pooling their data in a decentralized fashion, will see actual true democratization of AI; they will access the power of AI. Not just use tools.

(applause)

GG: I think that a lot of what's meant in democratizing AI is how can you move from a small number of people innovating, to a large number. Tool development and standards. We're close to being there. There was an example in the past, VLSI paint boxes. Up until a certain point, only an EE could really develop hardware at all. They took a lot of effort and time to make sure it could make it through every part without very much crosstalk. A group came together and thought, well, there are some design rules. This lets you build hardware pretty easily. I could paint green/red boxes, hardware months later, worked. It never worked as fast as that EE guy, so there would always be a place for it, but it would let us build a RISC computer, and ship it. We were in the game, we could innovate, and do it. The tools we're trying to build right now can build on statistical.

JG: When I started PhD, we did integrals and derivatives by hand. Automatic differentiation was a huge step forward. I blame that for the explosion of papers. A first year can build something far more complex than what I could do. That's moving AI forward, on algorithms side.

The data side is interesting, and that is one where I think about in systems. There's a lot of opportunities to think about how security interacts, leveraging hardware to protect it, markets to sell/buy data from sources, and protect the data across a lot of places. I would argue we're making a substantial amount of progress in how we think about algorithms.

M: When I think about democratizing pervasive AI, recent questions that have been consuming our minds, interpretability, fairness, etc. Can you share... any experience where things like interpretability came up and became a problem, issue, do we have to worry about a lot more in ML, or systems-ML.

JG: My grad students come to me and say the models stop working. I don't know how to fix that; the process is very experimental. Tracking experiments is a big part of the process. We cared a lot about interpretable models, and that meant something very particular. Now it's explainable; we don't need to know what it did exactly, but there needs to be some connection to what we did. Interpretable, explain computation, it could be related or unrelated to the decision. That's two answers about explainability, and how we debug these systems.

GG: SOSP just happened, and they have ten years of... good copies of everything they submitted. At the end of the conference, Peter Chen took all the PDF files, and did a Naive Bayes classifier, and saw how well he would predict that it would be accepted. And half the things it predicted to be accepted, would be accepted.

So what did they do? They made a detector for popular authors. And so what you did is those who had succeeded, they will follow behind. I recognize this problem. You might think that you found a good way, but it's actually Nickolai Zeldovich's paper.

DS: There's a big debate. Some think it's really important, and sometimes, as long as the model works, it's fine. Our brain, we can't really explain how we arrive at certain decisions, but it works fine. And it depends on application. Some applications have stronger requirements for explainability; e.g., law and healthcare, whereas in others it's less required. Also as a whole community, there's a lot we don't understand. We can talk about causality, transparency, all related. As a whole community, we don't really understand what explainability means. Not a good definition. All these concepts are related, we're trying to figure out what's the real core. That's a really good open question.

JL: There's two different interpretations. Can you explain to a person? And that's limited; there's no explainable vision models. The other definition is debuggability. If you want to create complex systems, they need to be debuggable. This is nontrivial with a distributed system, it's nontrivial with ML. If you want to create nontrivial ML systems, you have to figure out why they're not behaving the way you want them to.

DS: Do we debug our brains?

JL: Evolution has done this the hard way for a very long time... a lot of people have bugs in their brains. I know I have bugs. I get an ocular migraine sometimes... very annoying. No, we don't debug our brains, and it's a problem.

YJ: I'm sure there's bugs in my brain; I chased chickens in my grandma's house; the chicken has one spot in its back that if you press it, it just ducks and sits there. It shuts off because of fear. We humans don't do that. But these bugs are in our brain as well. Chasing for interpretability helps understand how things work. The old days, deep dream; this line of work started with figuring out what the gradients do, and we propagated back, and we found that direct gradient doesn't work; then we added L1 priors, and then we got pictures. This curiosity has led to the fact that convnets with random weights are codifying the local correlation; we are hardcoding the structured info in CNNs which we didn't know before. So maybe we will not achieve full interpretability, but some amount of interpretability and creativity will help.

(audience questions)

A: I'd really like to hear what Jeff said about ML for systems. As systems, I'm interested in it, but people have said, you can get far with heuristics.

JL: I think it's exciting.

GG: The index databases, when I read it for reviewing, I went, "Wow! Is that possible?" I think things like that will change the way we do systems. The novelty of the application opens a lot of people's minds. Right now we think of the machine learning tools as being expensive things that repeat what humans do easily that computers don't do well. But that's not what DB index is. We can execute it, but we're not better. But to get it half the size and twice the speed, throwing in another way of thinking about compression through a predictor is a fabulous insight.

JG: I tried to publish in this area for a while. For a while, systems didn't like the idea of complex algorithms in the middle of their system. Now, these days, systems is like, "ML is cool." But where it's easier to have success is where a good prediction improves the system, but a bad prediction doesn't break the system. So scheduling, that's good. Where models can boost performance but not hurt. The work in ML to solve systems is successful.

DS: ML for systems is super exciting. I'm personally very excited about this domain, esp. for people who have done systems work, and are interested in AI. ML for systems is an amazing domain of ML. I wouldn't be surprised, I would hope to see, in five years, our systems are more ML driven. A lot of systems have a lot of knobs to tune, trial and error setting, where exactly ML can help. On these amazing techniques, RL, bandits, instead of using bandits to serve ads, we can try to autotune systems. Just like we are seeing AI transforming a lot of application domains, and a lot more intelligent system, old systems, the one we built, should be more intelligent. It's a prediction: I think we are going to see a lot of work in this domain. I think it will transform systems.
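To make the "bandits for autotuning" idea concrete, here is a minimal epsilon-greedy sketch. Everything in it is an illustrative assumption rather than something described on the panel: the knob (a cache size in MB), the candidate settings, and the `simulated_reward` function standing in for a real measurement such as hit rate.

```python
import random

def epsilon_greedy_tune(settings, measure_reward, rounds=2000, epsilon=0.1, seed=0):
    """Pick the knob setting with the best average reward via epsilon-greedy."""
    rng = random.Random(seed)
    counts = {s: 0 for s in settings}
    means = {s: 0.0 for s in settings}
    for _ in range(rounds):
        if rng.random() < epsilon or not any(counts.values()):
            choice = rng.choice(settings)                   # explore a random setting
        else:
            choice = max(settings, key=lambda s: means[s])  # exploit the best so far
        reward = measure_reward(choice, rng)
        counts[choice] += 1
        # Incremental update of the running mean reward for this setting.
        means[choice] += (reward - means[choice]) / counts[choice]
    best = max(settings, key=lambda s: means[s])
    return best, means, counts

def simulated_reward(cache_mb, rng):
    # Hypothetical system: noisy reward that peaks at a 256 MB cache.
    true_mean = {64: 0.50, 128: 0.65, 256: 0.80, 512: 0.70}[cache_mb]
    return true_mean + rng.gauss(0, 0.05)

best, means, counts = epsilon_greedy_tune([64, 128, 256, 512], simulated_reward)
print("best setting:", best)
```

This sketch assumes each trial is independent. The stateful settings the moderator raises next, where a choice influences future rewards, are exactly where this simple formulation stops applying.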

M: I work in this quite a bit. We have some successes with bandits in some settings, but there are settings that are really tough: stateful, choices, decisions influence the future, it makes it hard to apply RL, or the RL techniques take a lot of data. There are challenges, but there are successes. There are a lot of papers that apply RL in caching, resource allocation. The real question is why it's not used in production? I don't know if we have an answer to that, papers do it, it seems to be really good, but it's not that mainstream, esp. having RL all over the place. Why isn't it pervasive. That I don't see.

A: Isn't it because it's not verifiable. You want some kind of verification analysis.

GG: It's called a regression sweep. If you deploy on a lot of systems. There's a lot of money, it has to work. If it falls over, that's a lawsuit. I hired a VP of software. OK, now that I'm in charge, things are going to slow down. Every LoC is bugs, if I want low bug, I stop programmers from writing code, by making the bar very high. This is the thing Joey was talking about; they need a really compelling reason with no downsides, and then they have to pass tests before the pass. So anything stochastic has a high bar.

SB: Another thing that is happening, there aren't that many people who have understanding in both areas. It's really hard to do ML in systems without deep expertise in systems. You really need to understand to explain it.

GG: It wasn't that long since we didn't have hosted services.

M: Guardrails, you constrain the ML system to not suggest something bad. We have a scenario in MS, machines are unresponsive. How long to wait? You can do it in ML. The choices are reasonable, they're never more than the max you'd want to wait.
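A guardrail of this kind can be as simple as clamping the model's suggestion into an operator-approved range. The sketch below is an assumption-laden illustration: the function name, the bounds, and the fallback for an invalid prediction are all hypothetical, not the system described by the moderator.

```python
import math

def guarded_wait_seconds(predicted_wait, min_wait=1.0, max_wait=600.0):
    """Clamp a model's suggested wait time into an operator-approved range."""
    if math.isnan(predicted_wait):
        # Assumed policy: if the model output is unusable, fall back to the
        # longest wait an operator would have accepted anyway.
        return max_wait
    return max(min_wait, min(max_wait, predicted_wait))

print(guarded_wait_seconds(30.0))      # within bounds, passed through
print(guarded_wait_seconds(10_000.0))  # too long, capped at max_wait
```

The model can still improve the average case, but its worst case is bounded by limits chosen without reference to the model.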

A: On democratization. There's been a lot of talk about optimizing the models so they can bear the cost. Another is decentralizing data... but there's two very big constraints for systems and models. They cost a lot of money, and there's big variance. Because of cost, if some guy gets into programming, and does research, he won't have resources to do it. So they won't go into engineering; they'll intern at Amazon instead. So if there is some community going into lowering the barrier, democratizing, what solution is there to get people much more easily? Because there's huge economic costs. People are trying to make huge amounts of money, startups, but there's no... systems have faults with decentralization... there's just a big problem colliding and ML.

JG: We teach data, I teach data science at Berkeley. The summary is, what about the costs of getting into DL? There's cost to train models, GPUs, data, how do I get a freshman in college who is excited about this, chromebook, they can do research and explore opportunities. At Berkeley we have exactly this problem. I teach 200 students, a lot of them are freshmen, chromebook ipad as primary computer. We've built tools using Azure... we run a cloud in Azure, and on these devices they can experiment with models. They get to use pretrained models and appreciate how to ... Someone built a Russian Twitterbot detector, and saw value and opportunity in those. And then they got involved in research projects where they had more funds and tools.

JL: The right interfaces make a huge difference, because they prevent you from having bugs that prevent you from doing things. Also, DL is all the rage, but framing the problem is more important than the representation you do. If you have the right problem, and a dumb representation, you'll still do something interesting. Otherwise, it's just not going to work very well at all.

YJ: As industry, don't be afraid of industry and try it out. Back at Berkeley, when Berkeley AI was using GPUs, the requirement was that you have one project per GPU. We students framed ten different projects, and we just asked for ten GPUs. NVIDIA came to us and asked, what are you doing. We'll just give you 40 GPUs and do research on that. Nowadays, FAIR has residency, and Google AI has residency, all of these things are creating very nice collaborations between industry and academia, and I want to encourage people to try it out. Industry has funds, academia has talent, marrying those together is an everlasting theme.

A: Going back to where do we go forward in terms of conferences, the future of this workshop; has any decision been made, where we go?

SB: This is work in progress. We're interested in feedback and what you think. We've had this workshop evolving for 10 yrs, with NIPS and ICML. Then we did one with SOSP, exciting. We are now doing a separate conference at Stanford in February. We think there's really an important role to play with workshops colocated with NIPS and ICML. We're still planning to continue this series of workshops. There's also a growing amount of systems work in ICML and NIPS, natural expansion to accept that work. The field is growing, and we're going to try several venues, and form a community. If people have ideas.

JG: More people should get involved.

M: We plan to continue this; audience is great, participation is great.

It's a panel, so I have to ask you to predict the future. Tell me something you're really excited... 50-100yrs from now. If you're alive then, I will find you and see if your prediction panned out. Or say what you hope will happen...

YJ: Today we write in Python. Hopefully, we'll write every ML model in one line. Classifier, get a cat.

JL: Right now, people are in a phase where they're getting more and more knobs in learning. ML is all about having less knobs. I believe the ML vision of less knobs. I also believe in democratizing AI. You are constantly turning ... around you, and devs can incorporate learning algorithms into systems. It will be part of tech. It's part of hype cycle. NIPS went through a phase transition. At some point it's gotta go down. When it becomes routine, we're democratizing things.

DS: It's hard to give predictions... I guess, right now, we see ML as an example, we see the waves. Not so long ago, there was the wave of NNs, graphical models, now we're back to NNs. I think... I hope that we... there's a plateauing. Even this year, I have been talking to a lot of great ML researchers, even though one can say there has been more papers written this year, when you hear what people talk about in terms of milestones, many people mentioned milestones from past years. AlexNet, ResNet, ... I do hope that we will see new innovation beyond deep learning. I do teach a DL class, but I hope that we see something beyond DL that can bring us... we need something more, to bring us to the next level.

GG: I'm tempted to point out DL is five years ago, and dotcom era was not more than five years... I think, I'm looking forward to a change in the way CS, science in general, does business, having learned from statistical AI. My favorite one is overfitting. I poorly understood overfitting, in vague stories, until ML hammered what this said. I look forward to the time when students tell me, they stopped writing code, because they were adding parameters... and they added a decent random, iid process for testing code. We're nowhere near there, but I think it's coming.

JG: I'm looking forward to the return of graphical models... actually not. When we're democratizing AI, but what ultimately happens, we're democratizing technology. I can walk up to Alexa and teach it. Or I can teach my Tesla how to park more appropriately. Tech that can adapt to us because it can learn; when I can explain to a computer what I want. (Star Trek but without a transporter.)

• December 8, 2017

## Accelerating Persistent Neural Networks at Datacenter Scale (Daniel Lo)

The below is a transcript of a talk by Daniel Lo on BrainWave, at the ML Systems Workshop at NIPS'17.

Deploy and serve accelerated DNNs at cloud scale. As we've seen, DNNs have enabled amazing applications. Architectures achieve SoTA on computer vision, language translation and speech recognition. But this is challenging to serve in large-scale interactive because there are latency, cost and power constraints. Also, DNNs are growing larger in size and complexity.

We've seen a Cambrian explosion in startups to solve this problem. Research groups have produced DNN processing units, DPUs, custom hardware solutions to provide high throughput, efficient serving of DNNs. We categorize them into two categories: hard DPUs, where the algorithms and applications have to be fixed at design time, because they're fabbing an ASIC, or soft DPUs, built on FPGAs. But for soft DPUs, we haven't seen them deployed at scale.

To address this, we've been working on Project BrainWave. Solution to deploy large scale DNNs with FPGA-acceleration. We've designed it to be fast, flexible and friendly. High throughput, low latency acceleration using FPGAs. Flexibility with adaptive numerical precision, update to latest AI algorithms with reconfigurable FPGAs. And it's user friendly, because we have a full stack solution, compile CNTK/Caffe/TF and compile them down. This is deployed on our configurable cloud, an outer layer of CPUs, a data center that puts everything together, and a layer of reconfigurable FPGAs.

We've deployed DNN models. An LSTM model that takes tens to hundreds of milliseconds on CPU. What we see is the 99th percentile for latency; even at the 99th percentile we are able to achieve sub-millisecond latencies. When you get to these levels of acceleration, it's negligible in the E2E pipeline.

Next I'll dive into details. It's a full stack solution, starting with a compiler and runtime that takes models in high level frameworks and compiles them down to our architecture. A flexible ISA for serving DNNs. We have high throughput, low latency serving. We do this all with persistency at scale, to keep models pinned in FPGA memories. Deployed on our wide deployment of Intel FPGAs using hardware microservices.

To begin with, let's talk about hardware microservices. This is something we presented at Micro. The architecture of reconfigurable cloud is FPGAs sit between CPU and network. CPU can use FPGA locally for acceleration, but because FPGAs are connected over network, they can distribute between them. We have a proprietary network protocol for low latency compute.

We've disaggregated the FPGA compute plane from the CPU. So we can aggregate FPGAs together to form larger accelerators, and you don't have to match the rate of FPGAs to CPUs. You can serve a large number of CPUs with a small cluster of FPGAs, or vice versa.

Next I'll talk about the compiler and runtime. Goal is to make it very easy for ML specialists to do this. The typical ML specialist doesn't know how to program this. Models are developed in high level frameworks, and we compile them down to our architecture. We compile them down first into an intermediate graph based representation. We split them into portions that run on FPGAs, and portions that run on CPU. When we execute, we also have a runtime that handles orchestration and scheduling between the parts.
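As a rough illustration of splitting a model into FPGA and CPU portions, the sketch below groups a linear sequence of operators into contiguous per-device segments. This is a simplification under stated assumptions, not the BrainWave compiler: a real intermediate representation is a graph rather than a list, and the operator names and the `FPGA_OPS` set are hypothetical.

```python
from itertools import groupby

# Assumption: operator kinds the accelerator can run (hypothetical names).
FPGA_OPS = {"matmul", "conv", "activation", "embedding"}

def device_of(op):
    """Assign an operator to the accelerator if supported, else to the CPU."""
    return "fpga" if op in FPGA_OPS else "cpu"

def partition(ops):
    """Group a linear op sequence into contiguous (device, [ops]) segments."""
    return [(device, list(group)) for device, group in groupby(ops, key=device_of)]

model_ops = ["decode_input", "embedding", "matmul", "activation", "beam_search"]
for device, segment in partition(model_ops):
    print(device, segment)
```

Keeping consecutive accelerator-friendly operators in one segment reduces the number of hand-offs the runtime has to schedule between devices.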

There are two main categories of DNNs we have to optimize for. DNNs that have very high compute to data ratio, convnets, these are well studied. I'm going to focus on the other class of DNNs, those with less compute to data ratio, e.g. dense layers and RNNs.

The conventional approach to accelerating DNNs on FPGAs, you keep all model parameters in DRAM. When a request comes in, you're going to stream the model parameters out of DRAM, and return a response. The issue with this is when you have DNN layers that are memory bandwidth bound, you're limited in how fast you can run this by memory bandwidth; you're not getting the full compute capabilities of the FPGA. Typically the way to solve this is with batching; you send a number of requests and use the model parameters for all requests. While you may achieve good throughput, latency will increase. For realtime services, this violates your SLA. What we want to do is provide high performance at low or no batching.
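The trade-off can be seen with a back-of-the-envelope roofline estimate for one dense layer. The sketch below is a simplified model under assumptions of my own: the peak compute and DRAM bandwidth figures are hypothetical, only weight traffic is counted, and the time spent waiting to collect a batch is not modeled.

```python
def dense_layer_time(n, batch, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Roofline-style time (seconds) for one n x n dense layer on a batch."""
    flops = 2 * n * n * batch               # one multiply and one add per weight per request
    weight_bytes = n * n * bytes_per_param  # weights are streamed from DRAM once per batch
    return max(flops / peak_flops, weight_bytes / mem_bandwidth)

PEAK_FLOPS = 10e12     # hypothetical accelerator peak, FLOP/s
MEM_BANDWIDTH = 100e9  # hypothetical DRAM bandwidth, bytes/s

for batch in (1, 50, 200):
    t = dense_layer_time(2048, batch, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch={batch:4d}  time={t * 1e6:8.1f} us  throughput={batch / t:12.0f} req/s")
```

With these numbers, a batch of 1 and a batch of 50 take the same time because both are limited by streaming the weights, so throughput grows with batch size while a single request leaves most of the compute idle. Past the point where compute becomes the limit, per-batch time grows too, on top of any delay spent waiting for the batch to fill.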

The way we do this is with persistent DNNs. FPGAs have lots of memory on chip: 10MB of memory. Since it's on chip, it's high bandwidth. So we're going to keep the model parameters on the chip, so that when we get one request in, we distribute it across the entire FPGA chip.

The obvious question is, what happens if your model doesn't fit on chip? We take advantage of the hardware microservices. We'll distribute a single model over multiple FPGAs in the datacenter.
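A quick sizing sketch shows how many chips a model would need if every parameter must stay in on-chip memory. The 10 MB figure comes from the talk; the parameter count, the 8-bit width, and the row-wise split are assumptions made for illustration, since the talk does not say how a model is partitioned.

```python
import math

def fpgas_needed(num_params, bits_per_param, on_chip_bytes):
    """Number of chips needed to hold every parameter in on-chip memory."""
    model_bytes = math.ceil(num_params * bits_per_param / 8)
    return max(1, math.ceil(model_bytes / on_chip_bytes))

def shard_rows(num_rows, num_shards):
    """Split the row indices of a weight matrix into contiguous, near-equal shards."""
    base, extra = divmod(num_rows, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards take one more row
        shards.append(range(start, start + size))
        start += size
    return shards

ON_CHIP_BYTES = 10 * 1024 * 1024  # roughly the 10 MB mentioned in the talk
n = fpgas_needed(num_params=50_000_000, bits_per_param=8, on_chip_bytes=ON_CHIP_BYTES)
print("FPGAs needed:", n)
print("rows per shard:", [len(s) for s in shard_rows(4096, n)])
```

Narrower number formats shrink `bits_per_param`, which directly reduces the number of chips a persistent model occupies.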

Let's look at the architecture and microarchitecture of the processing unit we developed. The BrainWave DPU is a software programmable processor, programmed in single-threaded C, but we've added a number of instructions for serving DNNs, e.g., matrix multiply, convolution, nonlinear activations, embeddings. The processor is designed to use a narrow precision format (float16) and to be flexible enough to extend to newer algorithms.

In the microarchitecture of the processor, the main portion is dedicated to the matrix vector unit, which does matrix vector multiplies; it consists of a number of kernels, each working on a tile of a larger matrix. Tiling gives us flexibility while maintaining performance. The other compute units are multifunction units, for vector-vector operations such as element-wise multiply, add and activation functions. Tying it all together is an on-chip network that lets us keep all the compute units busy at the same time.

Most of the chip is dedicated to the matrix vector unit. It's composed of hundreds of multilane dot product units. Each of these dot product units consists of tens of adders and multipliers. To keep them fed with data, each dot product unit is fed by a set of dedicated block RAMs.
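As a rough software analogy for the tiling described above, here is a minimal sketch of a matrix vector multiply that works one tile at a time, with each row of a tile handled by a small dot product. The function names and tile sizes are hypothetical; the real unit runs its dot product units in parallel in hardware, which plain Python does not model.

```python
def dot(a, b):
    """One dot product unit: multiply the lanes pairwise, then add them up."""
    return sum(x * y for x, y in zip(a, b))

def tiled_matvec(matrix, vector, tile_rows, tile_cols):
    """Multiply matrix by vector by accumulating results tile by tile."""
    rows, cols = len(matrix), len(vector)
    out = [0.0] * rows
    for r0 in range(0, rows, tile_rows):          # walk tiles down the rows
        for c0 in range(0, cols, tile_cols):      # walk tiles across the columns
            v_tile = vector[c0:c0 + tile_cols]
            for r in range(r0, min(r0 + tile_rows, rows)):
                # Each row of the tile contributes a partial sum to out[r].
                out[r] += dot(matrix[r][c0:c0 + tile_cols], v_tile)
    return out

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
v = [1, 0, -1, 2]
print(tiled_matvec(m, v, tile_rows=2, tile_cols=3))
```

The tile shape does not have to divide the matrix evenly; the last tile in each direction is simply smaller, which is part of the flexibility tiling buys.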

Next, I'd like to show performance results for this architecture. Two years ago, we had a deployment of Stratix V FPGAs. This shows the effective teraflops at each format. There's 16 bit integer, and we've been playing with our own format, Microsoft Floating Point. 4.5Tflops at MSFP5.8. These Stratix are pretty old.

(Demo for latest generation of FPGAs)

Looking at a throughput oriented DPU, the latency is 65.81ms. With BrainWave, latency is 0.98ms. Under 1 millisecond.

This was done on initial engineering silicon. For production silicon, we're expecting to get 12TOps at 16-bit integer, and 90TOps for MSFP8. One question is how the numeric format affects accuracy. Here is the normalized accuracy for three in-house text models, using GRU and LSTM. The orange bar shows what happens when you go to MSFP9, but we've developed a way to fine tune networks for this precision, and you see we recover our accuracy. We're working with MSFP8 and see similar results.

Project BrainWave is our project for accelerating DNNs at cloud scale. We hope it will be fast, friendly and cloud-scale, and expand capabilities of AI in the cloud, providing a way to run higher dimensional RNN networks for NLP and other great applications. We're planning to release to third parties, stay tuned.

Q: When you decrease batch size, what hardware are you evaluating? Hardware utilization as we decrease?

A: We stay highly utilized even as we decrease batch size; even at high batch size, we're still sending requests one by one. (Only one step will be processed?) Right.

Q: Regarding the FP9 and FP8, nine and eight being the number of bits used? (Yes) Is it in any way related to Flexpoint at Intel?

A: We developed this independently of Flexpoint, and I'm not able to talk about our numeric format.

Q: In MS, do you really write Verilog for your FPGA, or do you use high level synthesis tool?

A: For this, we are writing SystemVerilog.

Q: Batchnorm layers, which require batch computation; how do you put that onto the FPGA?

A: Part of the work of the compiler is to do splitting between CPU and FPGA. So things that are not amenable to FPGA, including batchnorm, we're still running them on CPU.

• December 8, 2017

## MOCHA: Federated Multi-Task Learning (Virginia Smith)

The below is a transcript of a talk by Virginia Smith on MOCHA, at the ML Systems Workshop at NIPS'17.

The motivation for this work is that the way we think about solving ML problems in practice is changing. The typical ML workflow looks like this. You start with a dataset and a problem to solve. Say you want to build a classifier to identify high quality news articles. The next step is to select an ML model to solve the problem. Under the hood, to fit the model to your data, you have to select an optimization algorithm. The goal is to find an optimal model that minimizes some function over your data.

In practice, there's a very important part of the workflow that is missing. For new and interesting datasets and systems, the system and its properties play a large role in the optimization algorithm we select to fit the model. To give an example, in the past several years we've seen data so large that it must be distributed over multiple machines in a datacenter environment. I've been thinking about how to perform fast distributed optimization in this setting, when data is so large.

But more and more frequently, data is not coming nicely packaged in a datacenter. It's coming from mobile phones and devices, distributed across the country and the globe. Training ML in this setting is challenging. For one, whereas in a datacenter you have hundreds to thousands of machines, here you have millions or billions of devices. Also, in a datacenter, devices are of similar capability; here, you have phones that are old, low on battery, not connected to wifi. This can change the ability to perform computation at any given iteration.

Additionally, there's heterogeneity in the data itself. For privacy and computation reasons, data can become very unbalanced in the network. And it can be non-IID, so much so that there can be interesting underlying structure to the data at hand. I'm excited because these challenges break down into both systems and statistical challenges. The one second summary of this work is thinking about both systems and statistical challenges in this federated setting; the punchline is that the systems setting plays a role not only in the optimization algorithm but also in the model we select to fit. It plays a more important role in this overall workflow.

I'm going to go through how we holistically tackle systems and statistical challenges.

Starting with statistical. The goal is we have a bunch of devices generating data, could be unbalanced; some devices have more data than others. One approach used in past is fit a single model across all of this data. All of the data can be aggregated; you find one model that best achieves accuracy across all of the data simultaneously. The other extreme is you find a model for each of the data devices, and not share information. From systems point of view this is great, but statistically, you might have devices that are only ... that are poor in practice. What we're proposing is something between these two extremes. We want to find local models for each device, but share information in a structured way. This can be captured in a framework called multitask learning.

The goal is to fit a separate model for each device, each with its own loss function. These models can be aggregated in this matrix W, and the function of the regularizer is to force some structure omega on it. This omega is a task relationship matrix, capturing interesting relationships, e.g., all the tasks are related and you want to learn weights, or most of the tasks are related and there are a few outliers, or there are clusters and groups, or there are more sophisticated relationships like asymmetric relationships. These can all be captured in multitask learning.
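Written out, an objective of the kind described here takes roughly the following shape. This is a sketch: the talk names only $W$ and $\Omega$, so the symbols $m$ (number of devices), $n_t$ (number of data points on device $t$), $(x_t^i, y_t^i)$ (the $i$-th example on device $t$), $\ell_t$ (the loss on device $t$) and $\mathcal{R}$ (the regularizer) are notation assumed for illustration.

$$
\min_{W,\,\Omega} \;\; \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t\!\left(w_t^{\top} x_t^{i},\; y_t^{i}\right) \;+\; \mathcal{R}(W, \Omega),
\qquad W = [\,w_1, \dots, w_m\,]
$$

Each column $w_t$ is the local model for one device, and the regularizer $\mathcal{R}(W, \Omega)$ is the only term that couples the devices, through the task relationship matrix $\Omega$.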

We developed a benchmarking set of real federated data. This includes predicting human activity from mobile phone data, predicting whether someone is eating or drinking, land mine detection, and a vehicle sensor dataset: distributed sensors determining whether a vehicle is passing by.

For these various datasets, we compared global, local and MTL. The goal is to fit an SVM model. For each data set, we looked at the average error across tasks, where each model is a task. What you can see is that the average error for MTL is significantly lower than for the global and local approaches. This makes sense because MTL is much more expressive; it lets you go between these extremes. What's interesting is that in these real data sets, it really helps. Reduction by half. This is a significant improvement in practice.

Given that we'd like to use multitask learning to model data in a federated environment, the next problem is to figure out how to train this in a distributed setting, thinking about massively distributed. In particular, the goal is to solve the following optimization objective. In looking at how to solve this objective, we note that it's common to solve for W and omega in an alternating fashion. When you solve for omega, it can be done centrally; you just need access to the models. But the W update must be distributed because the data is stored across devices. The key component of how to solve this in practice is the W update. The challenge of doing this is that communication is extremely expensive. And because of heterogeneity, you may have massive problems with stragglers and fault tolerance; e.g., someone who turns their phone off.

The high level idea for how we're doing this, take a communication efficient method that works well in data center, and modify it to work in federated setting. It will handle MTL as well as stragglers and fault tolerance.

What is the method we're using? The method we're using is CoCoA, which is a state of the art method for empirical risk minimization problems. The thing that's nice about CoCoA is that it spans prior work of mini-batch and one-shot communication, by making communication a first class parameter of the method. Make it as flexible as possible. It does this by not solving the primal formulation, but the dual. The dual is nice because we can easily approximate it by forming a quadratic approximation to the objective; and this more easily decomposes across machines.

To bring this to the federated setting, a key challenge is figuring out how to generalize it to the MTL framework. A second challenge: in CoCoA, the subproblems are assumed to be solved to some accuracy theta. This is nice because theta varies from 0 to 1, where 0 is exact solve, and 1 is inexact. This can be thought of as how much time you spend on local computation versus communication. However, in fact, this is not as flexible as it should be in the federated setting. There is only one theta that is set for all iterations, all nodes. And because theta cannot be set exactly to one, it cannot handle fault tolerance, where no work is performed at some iteration. So we want to make this communication parameter much more flexible in practice.

How are we doing this? We developed MOCHA. The goal is to solve the multitask learning framework, W and Omega, in an alternating fashion. In particular, we're able to form the following dual formulation, similar to CoCoA, so it decomposes. In comparison, we make a much more flexible assumption on the subproblem parameter. This is important because of stragglers: for statistical reasons (unbalance, different distributions), it can be very different how difficult it is to solve each subproblem. Additionally, there can be stragglers due to systems issues. And issues of fault tolerance. So this looks like a simple fix: we make this accuracy parameter more flexible: allow it to vary by node and iteration t, and let it be exactly 1. The hard thing is showing it converges to the optimal solution.

Under this new assumption, plus the condition that a device can't go down every single round, we show the following convergence guarantee. For L-Lipschitz loss, we get a convergence rate of 1/epsilon; for smooth models (logistic regression) we get a linear rate.

How does this perform in practice? The method is quite simple. The assumption is we have data stored at m different devices. We alternate between solving for Omega, and for W stored on each device. While we're solving the W update, it works by defining these local subproblems for the machines, and calling a solver that produces an approximate solution. This is flexible because the accuracy can vary by node and iteration.

In terms of comparing this to other methods, what we've seen is the following. We compared MOCHA to CoCoA, Mb-SDCA and Mb-SGD. We ran a simulation with real data to see what would happen if we do it on wifi. We have simulated time and how close we are to optimal. What you can see is that MOCHA is converging much more quickly to the optimal solution, because MOCHA doesn't have the problem of statistical heterogeneity, and it's not bogged down by stragglers. This is true for all of the different types of networks; LTE and 3G. The blue line and MOCHA and CoCoA, they work well in high communication settings, because they are more flexible. But compared to CoCoA, MOCHA is much more robust to statistical heterogeneity.

What's interesting is that if we impose some systems heterogeneity, some devices are slower than others, we looked at imposing low and high systems heterogeneity, MOCHA with this additional heterogeneity, it's a two orders of magnitude speedup to reach optimal solution.

And for MOCHA in particular, we looked at the issue of fault tolerance. What we're showing here is that we're increasing the probability that a device will drop out at any iteration. Going up until half the devices drop out, MOCHA is still fairly robust, converging in almost the same amount of time. But what we see with the green dotted line is that if the same device drops out every iteration, it doesn't converge. This shows the assumption we made makes sense in practice.

The punchline is that in terms of thinking about this new setting, training ML on these massive networks of devices, this is both a statistical and a systems issue. We've addressed it in a holistic manner. Code at http://cs.berkeley.edu/~vsmith. I also want to reiterate about the SysML conference in February.

Q: When you compare global and local? Why is it always better than global?

A: The motivation for why you'd want to use a local model over a global model is that if you have a lot of local data, you might perform better. It boosts the overall sample size. I have some additional experiments where we took the original data, and skewed it even further than it already was. We took the local data, and there was less data locally, and they have global approaches. That's just a function of the data in the devices.

Q: I really like how your method has guarantees, but I'm wondering about an approach where you create a metalearning algorithm locally and have it work locally?

A: That's worth looking into empirically, since you can do fine tuning locally. What we were trying to do first was converge to exact optimal solution, but you might want to just work empirically well, would be good to compare to this setting.

• December 8, 2017

## A Machine Learning Approach to Database Indexes (Alex Beutel)

The below is a transcript of a talk by Alex Beutel on machine learning database indexes, at the ML Systems Workshop at NIPS'17.

DB researchers think about their research differently. You have a system that needs to work for all cases. Whereas in ML, we have a unique circumstance: I'll build a model that works well. In DB, you have to fit all.

An example of this is a B-tree. A B-tree works for range queries. We have records with keys, and we want to find all records for a range of keys, say 0-1000. You build a tree on top of the sorted array, to quickly look up the starting point of the range. What if all my data, all of the keys, run from zero to a million... it becomes clear, you don't need the whole tree above. You can use the key itself as an offset into the array. Your lookup is O(1), with O(1) memory, no need for an extra data structure.
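A small sketch of the dense-key case described above, with a binary search over the sorted keys standing in for the tree. The record contents and helper names are hypothetical.

```python
import bisect

records = [f"record-{k}" for k in range(1000)]  # one record per key
keys = list(range(1000))                        # dense, sorted keys 0..999

def start_by_search(key):
    """General case: binary search over the sorted keys, O(log n)."""
    return bisect.bisect_left(keys, key)

def start_by_offset(key, first_key=0):
    """Dense-key case: the key itself is the array offset, O(1)."""
    return key - first_key

def range_query(lo, hi, locate):
    """Return all records with lo <= key <= hi, using `locate` to find positions."""
    return records[locate(lo):locate(hi) + 1]

print(range_query(10, 12, start_by_search))
print(range_query(10, 12, start_by_offset))
```

Both calls return the same three records; the second needs neither the `keys` array nor any index structure, but it is only correct because the keys happen to be contiguous.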

Now, we can't go app by app and make a custom implementation that exploits some pattern. Databases have to scale to any application; we don't want to rebuild them every time.

But ML excels in this situation. It works well for a wide variety of distributions, learn and make use of them effectively.

This is the key insight we came to. Traditional data structures make no assumptions about your data. They work under any distribution, and generally scale O(n). Interestingly, learning these data distributions can offer a huge win. What we're trying to get to is, instead of scaling with the size of the data, we scale with its complexity. With linear data, it's O(1). For other distributions, can we leverage this?

There are three data structures underlying databases. There are B-trees, for range queries and similarity search; this is the main index. Hash maps, for point lookups of individual records; this is more common throughout CS. And bloom filters, which are really common for set-inclusion queries: do I have a key? If your record is stored on disk, checking first if there's a record with that key is worthwhile. We're going to focus entirely on B-trees.

B-trees take a tree like structure with a high branching factor. What makes it really effective is that it's cache efficient. You can store top level nodes in your cache where it's fast to look them up, maybe others in main memory, and the actual memory on disk. Caching the hierarchy appropriately makes it efficient. At a high level, a B-tree maps a key to a page, some given place in memory. Once it finds that page, it will do some local search to find the particular position of that key. That could be a scan or binary search; we know the range will be the position from start of page to page size.

At an abstract level, the B-tree is just a model. It's taking the key, and trying to estimate the position. What we have in this case is that we want to search in this error range to find the ultimate record. At a high level, it would seem that we can't use just any model: we need err_min and err_max. But we have all the data. At index construction time, you know all the data you're executing against, and you can calculate what the model's min and max error is.
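To make the err_min / err_max idea concrete, here is a minimal sketch that uses a least-squares line as the model. The class name and the choice of a single linear model are assumptions for illustration, not the models from the talk; keys are assumed to be sorted and unique.

```python
import bisect

class LinearLearnedIndex:
    """Predict a key's position with a line, then search only the error window."""

    def __init__(self, keys):
        self.keys = keys  # must be sorted and unique
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # All the data is known at construction time, so the worst-case
        # under- and over-prediction can be computed exactly.
        errors = [p - self._predict(k) for p, k in enumerate(keys)]
        self.err_min, self.err_max = min(errors), max(errors)

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(n, max(0, guess + self.err_min))
        hi = min(n, max(0, guess + self.err_max + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

index = LinearLearnedIndex([x * x for x in range(1000)])
print(index.lookup(250_000), index.err_min, index.err_max)
```

For every stored key the true position lies inside `[guess + err_min, guess + err_max]` by construction, so the bounded binary search cannot miss it. The quadratic keys here are deliberately a poor fit for a line, which shows up as a wide error window.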

One interesting thing is this is just a regression problem. What you're really modeling is just the CDF. On this plot here, the X axis is your keys, and the Y axis is your position. This is modeling where your probability mass is located; where your data is in the keyspace. CDFs are studied somewhat, but not a ton, in the literature. This is a nice new implication of research.

We thought, OK, let's try this out straightaway. Train a model, see how fast it is. We looked at 200M server logs, with timestamp as the key, and a 2 layer NN, 32-width, relatively small by ML standards. We train to predict position, with squared error. A B-tree executes in 300ns. Unfortunately, with the model, it takes 80000ns. By most ML model speeds, this is great. If you're looking at executing on a server, great. But this doesn't work for a database.

There are a bunch of problems baked into this. TF is really designed for large models. Think about translation or superresolution images; these are hefty tasks. We need to make this fast for database level speed. Second, b-trees are great for overfitting. There's no risk of over-fitting in this context. They're also cache efficient; that's not looked at in ML. The last thing is local search in the end. Is that really the most effective way of ultimately finding that key? I'm skipping that part because it's fairly detailed, I'll focus on first three.

The first part is just the raw speed of execution of the ML model. This was built really by Tim, this Learning Index Framework program. What it does is it lets you create different indexes under different configurations. For one thing, it lets you do code compilation for TF, with ideas from Tupleware, where you can take a linear model and execute it extremely quickly. We can also train simple models. Use TF for more complex gradient descent based learning; extract the weights, and have the inference graph be codegenned. And we can do a lot of autotuning, to find what the best model architecture is. We know ahead of time what the best training is. We can make pretty smart decisions about what works best.

The next problem is accuracy and speed. If I have 100M records, I narrow down quickly from 1.5M to 24K, with each step down this tree. Each one of those steps is 50-60 cycles to look through that page, and to find what the right branch is. So we have to get to an accuracy of 12000, within 500 mul/add, to beat these levels of hierarchy, which are in cache. This is a steep task. The question is what is the right model? A really wide network? Single hidden layer? This scales nicely, we can fit in 256 layer reasonably. We could go deeper... the challenge is we have width^2, which needs to be parallelized somehow. The challenge is, how do we effectively scale this. We want to add capacity to the model, make it more and more accurate, with increased size, without becoming too...

We took a different approach, based on mixture of experts. We'll have a key, and a really simple classifier. We get an estimate. Then we can use that estimate to pick a model at the next stage. Narrow down the CDF range, and try to be more accurate in the subset of space. It will still get the key as input; given key, give position, but over a narrower space of keys. We build this down, and we'll walk down this hierarchy. This decouples model size and complexity. We have a huge model, overfitting, but we don't have to execute all of the sparsity that you would have to do from a pure ML view. We can decouple it usefully. The nice thing we can do is fall back to B-trees for subsets that are difficult to learn in a model. The LIF framework lets us substitute it in easily. In the worst case, B-tree. Best case, more efficient.
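A minimal two-stage sketch of the hierarchy described above: a root model routes each key to a leaf, and the leaf predicts the position within a narrower key range. The names, the use of linear models at both stages, and the `max_span` threshold are assumptions for illustration; a plain binary search over the leaf's range stands in for the B-tree fallback. Keys are assumed sorted and unique.

```python
import bisect

def fit_linear(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx

class TwoStageIndex:
    def __init__(self, keys, num_leaves=16, max_span=64):
        self.keys, self.num_leaves = keys, num_leaves
        n = len(keys)
        # Stage 1: coarse model of the CDF, scaled so its output picks a leaf.
        self.root = fit_linear(keys, [p * num_leaves / n for p in range(n)])
        buckets = [[] for _ in range(num_leaves)]
        for pos, key in enumerate(keys):
            buckets[self._route(key)].append((key, pos))
        # Stage 2: one small model per leaf, with its exact error bounds.
        self.leaves = []
        for bucket in buckets:
            if not bucket:
                self.leaves.append(None)
                continue
            first, last = bucket[0][1], bucket[-1][1]
            slope, icpt = fit_linear([k for k, _ in bucket], [p for _, p in bucket])
            errs = [p - int(slope * k + icpt) for k, p in bucket]
            if max(errs) - min(errs) > max_span:
                # Hard-to-learn leaf: fall back to searching its whole range
                # (constant guess `first`, window covering first..last).
                slope, icpt, errs = 0.0, float(first), [0, last - first]
            self.leaves.append((slope, icpt, min(errs), max(errs)))

    def _route(self, key):
        slope, icpt = self.root
        return min(self.num_leaves - 1, max(0, int(slope * key + icpt)))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        leaf = self.leaves[self._route(key)]
        if leaf is None:          # no stored key routes to this leaf
            return None
        slope, icpt, err_min, err_max = leaf
        n, guess = len(self.keys), int(slope * key + icpt)
        lo = min(n, max(0, guess + err_min))
        hi = min(n, max(0, guess + err_max + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None
```

Only one root model and one leaf model run per lookup, no matter how many leaves exist, which is the sense in which total model size is decoupled from execution cost.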

The quick results version here is that we have four different data sets. Most are integer data sets; the last one is a string data set. We're trying to save memory and speed; we save memory hugely; these are really simple models. Linear with simple layer, with possibly two stages. We're able to get a significant speedup in these cases. The server logs one is interesting. It looks at a high level very linear, but there are actually daily patterns to how this data is accessed. Maps is more linear; it's longitudes of spaces. We created synthetic data that's log normal, and here we see we can model it effectively. Strings is an interesting challenge going forward; your data is larger and more complicated, building models that are efficient over a really long string is different; the overall patterns are harder to have intuition about. One thing really worth noting here, it's not using GPUs or TPUs; it's a purely CPU comparison. Apples-to-apples.

This is mostly going into the B-tree part. This is a regression model looking at CDF of data. We can use these exact same models for hash maps. With bloom filters, you can use binary classifiers. I have a bunch of results in the poster in the back.

A few minutes to talk about room for improvement. There are a bunch of directions that we're excited to explore. The obvious one is GPUs/TPUs. It's CPUs because that's where B-trees are most effective; but scaling is all about ML. Improving throughput and latency for models with GPUs is exciting going forward. The models themselves; there's no reason to believe a hierarchy of models is the right or best choice; it's interesting to build model structures that match your hardware. Memory efficient, matching the underlying architecture of GPUs. At the scale of ns we need for a database. Multidimensional indexes; ML excels in high numbers of dimensions; most things are not looking at a single integer feature. There's an interesting question about how you map to multidimensional indexes that are difficult to scale. If we have a CDF, you can approximately sort it right there. And inserts and updates; we assumed read-only databases. That's a large class of systems, but we get more data. How do we balance overfitting with accuracy; can we add some extra auxiliary data structures to balance this out?

Q: One thing is that when... this problem, we solved pretty well without ML. When we introduce ML, we should introduce new metrics. We shouldn't make our system more fragile, because distribution changes. What would be the worst case when distribution changes?

A: As the data becomes updated... in the case of inserts and updates, there's a question about generalization. I think you could look at it from the ML point of view: statistically, test the model today on tomorrow's inserts. (It's a method. If I use this method, and then train it with data that I don't yet have... and do.) The typical extrapolation to future generalization of ML. Guarantees are hard. There will be a worst case that is awful... but the flip side, that's the ML side... generalization. There's also another point of view: we coupled modeling with classic data structures, in the search and bloom filter cases, so you don't actually have this worst case. You catch the worst case.

Let me add to that. If you assume that the inserts follow the same distribution as trained model, then the inserts become all one operation. They're even better. Suppose they don't follow the same distribution? you can still do delta indexing. Most systems do do delta indexing. So inserts are not a big problem.

Q: (Robert) Most of the inputs were one or two real numbers, and outputs are a single real number. How does it work if you use a low degree polynomial, or a piecewise linear classifier on the different digits?

A: In the case of strings, it's not a single input. (Treat it as integer?) Well, it's possibly a thousand characters long. It's not the best representation. Different representations work really well. The last thing I want to say, piecewise linear could work, but when you run 10k, 100k submodels, it's slow. Hierarchy helps. Polynomials are interesting, depends on data source.

Q: Can you comment how bad your worst case is? Average numbers?

A: We specifically always have a spillover. The worst case is defaulting to typical database. We haven't had a case where you do worse, because we'll default to B-tree. (Deterministic execution?) Not inference time.

• December 8, 2017

## Ray: A Distributed Execution Framework for Emerging AI Applications (Ion Stoica)

The below is a transcript of a talk by Ion Stoica on Ray, at the ML Systems Workshop at NIPS'17.

We've been working on it at Berkeley for more than one year. Over the past years, there's been tremendous progress in AI. Ad targeting, image and speech, many more. Many applications are based on supervised learning with DNNs. Supervised plus unsupervised are the two dominant approaches.

However, the next generation of AI applications will be very different. They're deployed in mission critical scenarios, need to continually learn from a rapidly changing env. Robotics, self driving cars, unmanned drones, dialogue systems. Implementing this new generation of AI applications requires a broader range of techniques. Stochastic optimization, parallel simulations, many more.

Ray provides a unified platform for implementing these approaches. To motivate Ray, I'll use reinforcement learning. RL learns by interacting with the env. It learns a policy, mapping from state/observation to action, that maximizes a certain reward. What are the reqs of RL? First, many applications exhibit nested parallelism: search, which uses data parallel SGD, which then calls a component that does policy evaluation with a model to simulate, which runs in parallel on multiple CPUs. Second, these workloads can be highly heterogeneous in hardware and time. Many of these computations require not only CPUs, but GPUs, TPUs and FPGAs. Also, this computation can take wildly different times. Simulate a chess game: 3 moves to lose, or 50 moves to win or draw. And in robotics, we need to process in real time, processing the data from sensors in parallel, in tens of ms.

Meeting these requirements is not easy. To meet these requirements, you need a system that is flexible and performant. Flexible: it should create and schedule tasks dynamically, and support arbitrary dependencies. Perf: it should scale to hundreds of nodes, sub-millisecond latency, millions of tasks, and efficiently share numeric data.

Next, I'm going to say how we meet these requirements. Flexibility? We provide a very flexible model: dynamic task graphs. On top of this, we give two models: parallel tasks and actors.

To talk about parallel tasks, here is Python code with two functions: one reads an array from a file, and the other adds two arrays. The code is simple: it creates two arrays a and b from file1 and file2, and sums them up. Parallelizing this program is quite easy. To parallelize a function, we add a ray.remote decorator to it. When we invoke these functions, we invoke their remote method. Remote doesn't return the object itself, just the object id. This is very similar to the futures abstraction. To get the actual object, you must invoke ray.get on the object id.

To get a better idea of how Ray is executing, let's execute a simple program. Assume the files are stored on different nodes. When you call read_array on file1, it schedules read_array on the appropriate node. The remote call returns immediately, before the actual read finishes. This allows the driver to run the second task in parallel, on the node that has file2, and to launch the add remote function. All functions have been scheduled remotely, but none of them have finished. To actually get the result, you have to call ray.get on the result. This is a blocking call; you'll wait for the entire computation graph to be executed.
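The slide code is not in the transcript. Since the talk compares Ray's object ids to futures, here is a sketch of the same read/read/add flow using the standard library's `concurrent.futures` as a stand-in, not Ray itself: `pool.submit(...)` plays the role of the `.remote(...)` call and `.result()` plays the role of `ray.get`. In-memory lists stand in for the two files.

```python
from concurrent.futures import ThreadPoolExecutor

def read_array(values):
    """Stand-in for reading an array from a file."""
    return list(values)

def add(a_future, b_future):
    """Wait for both inputs, then add them element-wise."""
    a, b = a_future.result(), b_future.result()
    return [x + y for x, y in zip(a, b)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each submit returns immediately with a future, before the work finishes.
    id1 = pool.submit(read_array, [1, 2, 3])
    id2 = pool.submit(read_array, [10, 20, 30])
    # The add task is launched before either read is known to be done.
    id3 = pool.submit(add, id1, id2)
    # Blocking call: waits for the whole three-task graph to finish.
    result = id3.result()

print(result)
```

One difference worth noting: here `add` has to unwrap its inputs explicitly, and everything runs in threads on one machine, whereas the talk describes tasks being scheduled on the nodes that hold the data.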

Tasks are very general, but they are not enough. Consider that you want to run a simulator, and this simulator is closed source. In this case, you do not have access to the state: you have state, action, simulations, but you cannot set up the state in the simulator yourself. There is another use case, where the state is too expensive to create. For example, with DNNs on GPUs, you want to initialize the state once, and not reinitialize it for each simulation.

In order to address these use cases, we add the Actor abstraction. An actor is just a remote class. If you have a Counter, you mark it ray.remote, and then when you create the class or invoke methods, you use the remote keyword. This is a computation graph for this very simple example. Notice the method invocations also return object identifiers. To get the results, you need to call ray.get on the object identifiers. Ray also allows you to specify the number of resources, for actors and tasks.

To put things together and provide a more realistic example, consider evolution strategies, a scalable form of RL, by Salimans et al. at OpenAI. In a nutshell, evolution strategies tries lots of policies, and sees which runs best. This is highly parallel. So here is pseudocode for parallel evolution strategies. A worker that does simulation and returns the reward, create twenty workers, and then 200, do 200 simulations, update policy. Again, if you want to parallelize this code, we have to add a bunch of remote, and now on the right hand side, you'll notice I'm also showing the computation graph. When you invoke Worker.remote, you create 20 remote workers to do it in parallel. And you invoke their methods with the remote keyword. Again, notice that in this case, the results are not the rewards themselves, but ids of the reward objects. In order to get the rewards to update the policy, you have to call ray.get.
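The pseudocode from the slide is not in the transcript, so the following is a sequential sketch of the general shape of evolution strategies, written in plain Python rather than Ray. The toy reward, the step sizes, and the reading of "twenty workers" as 20 simulations per policy update over 200 updates are all assumptions. The list comprehension that calls `simulate` is the part the talk would hand out to remote workers.

```python
import random

def simulate(policy, noise_seed, sigma):
    """Hypothetical worker task: perturb the policy, run it, return the reward."""
    rng = random.Random(noise_seed)
    noise = [rng.gauss(0.0, 1.0) for _ in policy]
    candidate = [p + sigma * e for p, e in zip(policy, noise)]
    reward = -sum((c - 3.0) ** 2 for c in candidate)  # toy reward, best at 3.0
    return noise, reward

def evolution_strategies(policy, iterations=200, population=20,
                         sigma=0.1, lr=0.05, seed=0):
    policy = list(policy)
    rng = random.Random(seed)
    for _ in range(iterations):
        # These simulations are independent of each other, so this is the
        # step that can run in parallel on many workers.
        results = [simulate(policy, rng.getrandbits(32), sigma)
                   for _ in range(population)]
        mean_reward = sum(r for _, r in results) / population
        # Move the policy toward perturbations that scored above average.
        for i in range(len(policy)):
            grad = sum((r - mean_reward) * noise[i]
                       for noise, r in results) / (population * sigma)
            policy[i] += lr * grad
    return policy

print(evolution_strategies([0.0, 0.0]))
```

Only the noise seed and a scalar reward need to cross between the driver and a worker in this formulation, which is part of what makes the method easy to scale out.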

This hopefully gives you a flavor of how to program in Ray. Next, I'll switch gears and present the system design of Ray; how Ray gets high performance and scalability.

Like many classic computing frameworks, it has a driver, and a bunch of workers. The driver runs a program, the workers run tasks remotely. You can also run a bunch of actors. The driver and actors on the same node share data through shared memory, and the workers and actors across nodes share data through a distributed object store we built. Each node has a local scheduler, so when a driver wants to run another task, the local scheduler tries to schedule it locally. If it cannot schedule it locally, it invokes the global scheduler, which will schedule it on another node that has resources. Actor, remote method. Finally, one essential part of the design is that we have a Global Control State. It takes all of the state of the system, and centralizes it. The metadata for the objects, in objects table, function. This allows the rest of the system to be stateless. All these other components can fail, you can bring them up, and they get the most recent data from the global control state. It also allows us to parallelize the global scheduler, because these replicas are going to share the same state in the GCS.

Another nice effect of having a GCS is that it makes it easy to build a bunch of profiling and debugging tools.

This design is highly scalable. Let me try to convince you why this is. To make the GCS scalable, we just shard it. All these keys are pseudorandom, so it's easy to shard and load balance. The scheduler, as you see, is distributed; each node has a local scheduler, and Ray tries to schedule tasks which are spawned by a worker/driver locally, on the same node. If the global scheduler becomes a bottleneck, we can also replicate it. Finally, in some systems, even if the scheduler is super scalable, as in Spark, there's another bottleneck: only the driver can launch new tasks. In order to get around that, we allow in Ray the workers and actors to launch tasks. Really, there is no single bottleneck point.

A few words about implementation. The GCS is implemented with Redis. For object store, we leverage Apache Arrow. For fault tolerance, we use lineage based fault tolerance like Spark. Actors are part of task graph; methods are treated as tasks, so we have a uniform model for providing fault tolerance.

So now some evaluation results. This plot represents the number of tasks per second against the number of nodes; it scales linearly. You can schedule over 1.8M tasks/s. Latency of local task execution is 300us; the latency of a remote task is 1ms. This plot illustrates fault tolerance. You may ask why you care about fault tolerance? The problem is that, without it, you need to handle in your program that a simulation may not finish; this makes the program far more complicated, even if you're willing to ignore some results. Here, on this axis, you have the time in seconds, and you have two y axes: number of nodes in the system, and the throughput. As you can see, the number of nodes starts at 50, then goes to 25, then to 10, and goes back to 50. In the red area, you see the number of tasks per second; it follows, as you may expect, the number of nodes in the system. If you look a little closer, there are some drops; every time, you have a drop in the number of tasks. It turns out this is because of the object reconstruction. When some nodes go away, you lose the objects on the node, so you have to reconstruct them. Ray and Spark reconstruct them transparently. In blue, you can see the re-executed tasks. If you add them, you get a very nice filling curve.

Finally, for evolution strategies, we compared with reference ES from... we followed the OpenAI, and on the X axis, you have number of CPUs, mean time to solve the particular problem; simulator, learning to run, there are three points to notice. One is, as expected, as you add more CPUs, the time to solve goes down. The second is that Ray is actually better than the reference ES, better results, even though the reference ES is specialized for beating. Third, for a very large number of CPUs, ref couldn't do it, but Ray could do better and better. I should add that Ray takes half the amount of code, and was implemented in a couple of hours.

Related work: look, in this area, there are a huge number of systems, that's why you are here, lots of systems. Ray is complementary to TF, MXNet, PyTorch, etc. We use these systems to implement DNNs. We integrate with TF and PyT. There are more general systems, like MPI and Spark; these have limited support for nested parallelism and a more restricted computation model, and they have much coarser grained tasks.

To conclude, Ray is a system for high performance and flexibility and scalability. We have two libraries on top of Ray: RLlib and Ray Tune. It's open source, please try, we'd love your feedback. Robert, Philip, Alex, Stephanie, Richard, Eric, Heng, William, and many thanks to my colleague Michael Jordan.

Q: In your system, you also use actor; actor is built up on shared memory. Do you have separate mailbox for actors? How do you do that?

A: No, the actors communicate by passing the argument to the shared object store.

Q: What is the granularity of parallelism? Is it task atomic, or do you split task?

A: The task granularity is given by the overhead for launching a task and scheduling the task. The tasks we are targeting are as low as a few ms. A task is not implementing something like an activation function; we leave that job to much better frameworks. And a task executes atomically; methods in the actors are serialized.

Q: Question about fault tolerance: in Spark, when you don't have a response for some time, it says this node died. Here, the task is much more, because NN, something like that. So we don't have the same time.

A: We do not do speculation; implicit speculation in Ray, for the reason you mentioned.

Q: Can you give me more details on the reference implementation, doesn't scale

A: The reference implementation, it's the OpenAI implementation, Robert here can provide you a lot more detailed answers to that question.

• December 8, 2017

## Backpack for deep learning

This is a guest post by Kaixi Ruan.

Backpack is a module system for Haskell, released recently in GHC 8.2.1. As this is a new feature, I wanted to know how people use it. So I searched Twitter every day, and the other day I saw this tweet:

Are there other examples than String/Bytestring/Text? So far I haven’t seen any; it seems like backpack is just for glorified string holes.

There were a number of good responses, but I want to give another use case from deep learning.

In deep learning, people are interested in doing computations on tensors. Tensors can have different value types: int, float, double etc. Additionally, tensor computations can be done on the CPU or GPU. Although there are many different types of tensor, the computations for each type of tensor are the same, i.e., they share the same interface. Since Backpack lets you program against one interface which can have multiple implementations, it is the perfect tool for implementing a tensor library.

Torch is a widely used library, implemented in C, for deep learning. Adam Paszke has a nice article about Torch. We can write some Haskell bindings for Torch, and then use Backpack to switch between implementations of float and int tensors. Here is a program that uses tensors via a Backpack signature:

```
unit torch-indef where
  signature Tensor where
    import Data.Int
    data Tensor
    data AccReal
    instance Show AccReal
    instance Num AccReal
    read1dFile :: FilePath -> Int64 -> IO Tensor
    dot :: Tensor -> Tensor -> IO AccReal
    sumall :: Tensor -> IO AccReal
  module App where
    import Tensor
    app = do
      -- Read the two input tensors; the file names and the length 10
      -- are illustrative assumptions.
      x <- read1dFile "x" 10
      y <- read1dFile "y" 10
      d <- dot x y
      s <- sumall x
      print (d + s)
      return ()
```

We have a simple main function which reads two 1D tensors from files, takes the dot product of the two, sums all entries of the first tensor, and then finally prints out the sum of these two values. (This program is transcribed from Adam’s article; the difference is that Adam’s program uses Float Tensor, while we keep the Tensor type abstract so that with Backpack we can do both float and int.) The program uses functions like dot, which are declared in the signature.

Here is an implementation of dot and types for float tensors. The C functions are called using Haskell’s FFI:

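```
import Foreign
import Foreign.C.Types
import Foreign.C.String
import Foreign.ForeignPtr

-- FloatTensor (with constructor FT wrapping a ForeignPtr) and the opaque
-- C type CTHFloatTensor are assumed to be defined elsewhere in the bindings.

-- Raw binding to the Torch C function computing a dot product.
foreign import ccall "THTensorMath.h THFloatTensor_dot"
  c_THFloatTensor_dot :: (Ptr CTHFloatTensor) -> (Ptr CTHFloatTensor) -> IO CDouble

-- Fill the abstract types from the signature with concrete ones.
type Tensor = FloatTensor
type AccReal = Double

dot :: Tensor -> Tensor -> IO AccReal
dot (FT f) (FT g) = withForeignPtr f $ \x ->
  withForeignPtr g $ \y -> do
    d <- c_THFloatTensor_dot x y
    return (realToFrac d)
```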

As you can see, Backpack can be used to structure a deep learning library which has multiple implementations of operations for different types. If you wrote bindings for all of the functions in Torch, you would have a deep learning library for Haskell; with Backpack, you could easily write models that were agnostic to the types of tensors they operate on and the processing unit (CPU or GPU) they run on.

You can find the full sample code on GitHub.

• August 17, 2017

## Proposal: Suggest explicit type application for Foldable length and friends

tl;dr If you use a Foldable function like length or null, where instance selection is solely determined by the input argument, you should make your code more robust by introducing an explicit type application specifying which instance you want. This isn't necessary for a function like fold, where the return type can cross-check if you've gotten it right or not. If you don't provide this type application, GHC should give a warning suggesting you annotate it explicitly, in much the same way it suggests adding explicit type signatures to top-level functions.

Recently, there has been some dust kicked up about Foldable instances causing "bad" code to compile. The prototypical example is this: you've written length (f x), where f is a function that returns a list [Int]. At some future point in time, a colleague refactors f to return (Warnings, [Int]). After the refactoring, will length (f x) continue to type check? Yes: length (f x) will always return 1, no matter how long the inner list is, because it is using the Foldable instance for (,) Warnings.
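To make the pitfall concrete, here is a minimal sketch of the before and after. The names `Warnings`, `fOld` and `fNew` are illustrative stand-ins for the types and the function f in the story above, not code from any real project.

```haskell
module Main where

-- Hypothetical stand-in for the Warnings type mentioned in the text.
type Warnings = [String]

-- Before the refactoring: f returns a plain list.
fOld :: Int -> [Int]
fOld n = [1 .. n]

-- After the refactoring: f also returns warnings alongside the list.
fNew :: Int -> (Warnings, [Int])
fNew n = (["something looked odd"], [1 .. n])

main :: IO ()
main = do
  print (length (fOld 5))  -- uses the Foldable instance for []
  print (length (fNew 5))  -- uses the Foldable instance for (,) Warnings
```

Both calls typecheck, but the second one counts the single "element" of the pair rather than the elements of the inner list.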

The solution proposed in the mailing list was to remove Foldable for Either, a cure which is, quite arguably, worse than the disease. But I think there is definitely merit to the complaint that the Foldable instances for tuples and Either enable you to write code that typechecks, but is totally wrong.

Richard Eisenberg described this problem as the tension between the goals of "if it compiles, it works!" (Haskell must exclude programs which don't work) and general, polymorphic code, which should be applicable in as many situations as possible. I think there is some more nuance here, however. Why is it that Functor polymorphic code never causes problems for being "too general", but Foldable does? We can construct an analogous situation: I've written fmap (+2) (f x), where f once again returns [Int]. When my colleague refactors f to return (Warnings, [Int]), fmap now makes use of the Functor instance (,) Warnings, but the code fails to compile anyway, because the type of (+2) doesn't line up with [Int]. Yes, we can still construct situations with fmap where code continues to work after a type change, but these cases are far more rare.

There is a clear difference between these two programs: the fmap program is redundant, in the sense that the type is constrained by the input container, the function mapping over it, and the context which uses the result. Just as with error-correcting codes, redundancy allows us to detect when an error has occurred; when you reduce redundancy, errors become harder to detect. With length, the only constraint on the selected instance is the input argument; if you get it wrong, we have no way to tell.

Thus, the right thing to do is reintroduce redundancy where it is needed. Functions like fold and toList don't need extra redundancy, because they are cross-checked by the use of their return values. But functions like length and null (and arguably maximum, which only weakly constrains its argument to have an Ord instance) don't have any redundancy: we should introduce redundancy in these places!

Fortunately, GHC 8.0 provides a very easy way of introducing this redundancy: an explicit type application. (This was also independently suggested by Faucelme.) In this regime, rather than write length (f x), you write length @[] (f x), saying that you wanted length for lists. If you wanted length for maps, you write length @(Map _) (f x). Now, if someone changes the type of f, you will get a type error since the explicit type application no longer matches.
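As a small sketch of what this looks like in a complete module (the function `f` and the helper `count` are hypothetical), the only additions are the `TypeApplications` extension and the `@[]` at the call site:

```haskell
{-# LANGUAGE TypeApplications #-}
module Main where

-- Hypothetical function under refactoring; today it returns a list.
f :: Int -> [Int]
f n = [1 .. n]

-- The type application pins the Foldable instance to lists.
count :: Int -> Int
count x = length @[] (f x)

main :: IO ()
main = print (count 5)
```

If `f` were later changed to return a pair such as `([String], [Int])`, the definition of `count` would be rejected by the typechecker instead of silently returning 1, because the pair type does not match `[] a`.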

Now, you can write this with your FTP code today. So there is just one more small change I propose we add to GHC: let users specify the type parameter of a function as "suggested to be explicit". At the call-site, if this function is used without giving a type application, GHC will emit a warning (which can be disabled with the usual mechanism) saying, "Hey, I'm using the function at this type, maybe you should add a type application." If you really want to suppress the warning, you could just type apply a type hole, e.g., length @_ (f x). As a minor refinement, you could also specify a "default" type argument, so that if we infer this argument, no warning gets emitted (this would let you use the list functions on lists without needing to explicitly specify type arguments).

That's it! No BC-breaking flag days, no poisoning functions, no getting rid of FTP, no dropping instances: just a new pragma, and an opt-in warning that will help people who want to avoid these bugs. It won't solve all Foldable bugs, but it should squash the most flagrant ones.

What do people think?

• March 21, 2017

## Prio: Private, Robust, and Scalable Computation of Aggregate Statistics

I want to take the opportunity to advertise some new work from a colleague of mine, Henry Corrigan-Gibbs (in collaboration with the venerable Dan Boneh) on the subject of preserving privacy when collecting aggregate statistics. Their new system is called Prio and will be appearing at this year's NSDI.

The basic problem they tackle is this: suppose you're Google and you want to collect some statistics on your users to compute some aggregate metrics, e.g., averages or a linear regression fit:

A big problem is how to collect this data without compromising the privacy of your users. To preserve privacy, you don't want to know the data of each of your individual users: you'd like to get this data in completely anonymous form, and only at the end of your collection period, get an aggregate statistic.

This is an old problem; there are a number of existing systems which achieve this goal with varying tradeoffs. Prio tackles one particularly tough problem in the world of private aggregate data collection: robustness in the face of malicious clients. Suppose that you are collecting data for a linear regression, and the inputs your clients send you are completely anonymous. A malicious client could send you a bad data point that could skew your entire data set; and since you never get to see the individual data points of your data set, you would never notice:

Thus, Prio looks at the problem of anonymously collecting data, while at the same time being able to validate that the data is reasonable.

The mechanism by which Prio does this is pretty cool, and so in this post, I want to explain the key insights of their protocol. Prio operates in a regime where a client secret shares their secret across a pool of servers which are assumed to be non-colluding; as long as at least one server is honest, nothing is revealed about the client's secret until the servers jointly agree to publish the aggregate statistic.

Here is the problem: given a secret share of some hidden value, how can we efficiently check if it is valid? To answer this question, we first have to explain a little bit about the world of secret sharing.

A secret sharing scheme allows you to split a secret into many pieces, so that the original secret cannot be recovered unless you have some subset of the pieces. There are amazingly simple constructions of secret sharing: suppose that your secret is the number x in some field (e.g., integers modulo some prime p), and you want to split it into n parts. Then, let the first n-1 shares be random numbers in the field, and let the last share be x minus the sum of the previous shares. You reconstruct the secret by summing all the shares together. This scheme is information theoretically secure: with only n-1 of the shares, you have learned nothing about the underlying secret. Another interesting property of this secret sharing scheme is that it is homomorphic over addition. Let your shares of x and y be $[x]_i$ and $[y]_i$: then $[x]_i + [y]_i$ form secret shares of x + y, since addition in a field is commutative and associative (so I can regroup the pairwise sums into the sum for x and the sum for y).
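The construction is short enough to write down. The following is a minimal sketch, not code from Prio: the modulus `p`, the toy generator `lcg` and the function names are all assumptions made for illustration, and the generator is deliberately simplistic (a real implementation needs a cryptographically secure random source and a much larger field).

```haskell
module Main where

-- Illustrative prime modulus (an assumption; real systems use a large field).
p :: Integer
p = 2147483647  -- 2^31 - 1

-- Toy linear congruential generator standing in for a secure random source.
-- NOT cryptographically secure; it only keeps the sketch dependency-free.
lcg :: Integer -> Integer
lcg s = (1103515245 * s + 12345) `mod` p

-- Split x into n additive shares: n-1 pseudorandom values, plus a final
-- share chosen so that all shares sum to x modulo p.
share :: Integer -> Int -> Integer -> [Integer]
share seed n x = randoms ++ [lastShare]
  where
    randoms   = take (n - 1) (drop 1 (iterate lcg seed))
    lastShare = (x - sum randoms) `mod` p

-- Reconstruct the secret by summing all shares modulo p.
reconstruct :: [Integer] -> Integer
reconstruct = (`mod` p) . sum

-- Homomorphic addition: each party adds its own two shares locally.
addShares :: [Integer] -> [Integer] -> [Integer]
addShares = zipWith (\a b -> (a + b) `mod` p)

main :: IO ()
main = do
  let xs = share 1 3 20
      ys = share 2 3 22
  print (reconstruct xs)
  print (reconstruct ys)
  print (reconstruct (addShares xs ys))
```

Note that `addShares` never looks at more than one party's pair of shares at a time, which is why no communication is needed for additions.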

Usually, designing a scheme with homomorphic addition is easy, but having a scheme that supports addition and multiplication simultaneously (so that you can compute interesting arithmetic circuits) is a bit more difficult. Suppose you want to compute an arithmetic circuit on some secret shared value: additions are easy, but to perform a multiplication, most multiparty computation schemes (Prio uses Beaver's MPC protocol) require you to perform a round of communication.

While you can batch up multiplications on the same "level" of the circuit, so that you only need to do as many rounds as the maximum depth of multiplications in the circuit, for large circuits, you may end up having to do quite a bit of communication. Henry tells me that fully homomorphic secret sharing has been the topic of some ongoing research; for example, this paper about homomorphic secret sharing won best paper at CRYPTO last year.

Returning to Prio, recall that we had a secret share of the user provided input, and we would like to check if it is valid according to some arithmetic circuit. As we've seen above, we could try using a multi-party computation protocol to compute shares of the output of the circuit, reveal the output of the circuit: if it says that the input is valid, accept it. But this would require quite a few rounds of communication to actually do the computation!

Here is one of the key insights of Prio: we don't need the servers to compute the result of the circuit--an honest client can do this just fine--we just need them to verify that a computation of the circuit is valid. This can be done by having the client ship shares of all of the intermediate values on each of the wires of the circuit, having the servers recompute the multiplications on these shares, and then comparing the results with the intermediate values provided to us by the client:

When we transform the problem from a computation problem to a verification one, we now have an embarrassingly parallel verification circuit, which requires only a single round to multiply each of the intermediate nodes of the circuit.

There is only one final problem: how are we to check that the recomputed multiplies of the shares and the client provided intermediate values are consistent? We can't publish the intermediate values of the wire (that would leak information about the input!) We could build a bigger circuit to do the comparison and combine the results together, but this would require more rounds of communication.

To solve this problem, Prio adopts an elegant trick from Ben-Sasson'12 (Near-linear unconditionally-secure multiparty computation with a dishonest minority): rather than publish all of the intermediate wires, treat them as polynomials and publish the evaluation of each polynomial at a random point. If the servers behave correctly, they reveal nothing about the original polynomials; furthermore, with high probability, if the original polynomials are not equal, then the evaluation of the polynomials at a random point will also be not equal.
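The "with high probability" part rests on a standard fact: two distinct polynomials of degree at most d over a field agree on at most d points, so a uniformly random evaluation point exposes the difference except with probability at most d divided by the field size. The sketch below only illustrates that fact over a tiny field by brute force; it is not Prio's construction, and the modulus and polynomials are made up for the example.

```haskell
module Main where

-- Small illustrative prime, so we can enumerate the whole field.
p :: Integer
p = 101

-- Evaluate a polynomial (coefficients listed lowest degree first) at r,
-- using Horner's rule modulo p.
evalPoly :: [Integer] -> Integer -> Integer
evalPoly coeffs r = foldr (\c acc -> (c + r * acc) `mod` p) 0 coeffs

-- Count the points of the field on which two polynomials agree.
agreements :: [Integer] -> [Integer] -> Int
agreements f g =
  length [r | r <- [0 .. p - 1], evalPoly f r == evalPoly g r]

-- f(x) = x^2 and g(x) = 1 differ, but agree wherever x^2 = 1.
squareP, oneP :: [Integer]
squareP = [0, 0, 1]
oneP    = [1, 0, 0]

main :: IO ()
main = do
  print (agreements squareP squareP)  -- equal polynomials agree everywhere
  print (agreements squareP oneP)     -- distinct ones agree on few points
```

Here the two distinct degree-2 polynomials agree on only 2 of the 101 field elements, so a random point would catch the mismatch in all but about 2% of cases; with a cryptographically sized field the failure probability becomes negligible.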

This is all very wonderful, but I'd like to conclude with a cautionary tale: you have to be very careful about how you set up these polynomials. Here is the pitfall: suppose that a malicious server homomorphically modifies one of their shares of the input, e.g., by adding some delta. Because our secret shares are additive, adding a delta to one of the shares causes the secret to also be modified by this delta! If the adversary can carry out the rest of the protocol with this modified share, when the protocol finishes running, he finds out whether or not the modified secret was valid. This leaks information about the input: if your validity test was "is the input 0 or 1", then if you (homomorphically) add one to the input and it is still valid, you know that it definitely was zero!
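A tiny worked example shows the leak. This is only an illustration under made-up numbers: the modulus, the concrete shares and the helper names are assumptions, and the sketch reconstructs the secret directly to show the effect of the tampering, which no party would do in the real protocol.

```haskell
module Main where

-- Small illustrative prime modulus.
p :: Integer
p = 101

-- Sum of additive shares modulo p.
reconstruct :: [Integer] -> Integer
reconstruct = (`mod` p) . sum

-- The validity predicate from the text: "is the input 0 or 1".
valid :: Integer -> Bool
valid x = x == 0 || x == 1

-- A malicious server adds delta to its own share (here, the first one).
tamper :: Integer -> [Integer] -> [Integer]
tamper _ [] = []
tamper delta (s : rest) = ((s + delta) `mod` p) : rest

-- Hand-picked shares: 17 + 60 + 24 = 101, which is 0 mod 101,
-- and 17 + 60 + 25 = 102, which is 1 mod 101.
sharesOfZero, sharesOfOne :: [Integer]
sharesOfZero = [17, 60, 24]
sharesOfOne  = [17, 60, 25]

main :: IO ()
main = do
  -- Shifting a share of 0 gives 1, which still passes the check.
  print (valid (reconstruct (tamper 1 sharesOfZero)))
  -- Shifting a share of 1 gives 2, which fails the check.
  print (valid (reconstruct (tamper 1 sharesOfOne)))
```

Whether the tampered input passes validation depends entirely on the original secret bit, so observing the outcome reveals it.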

Fortunately, this problem can be fixed by randomizing the polynomials, so that even if the input share is shifted, the rest of the intermediate values that it computes cannot be shifted in the same way. The details are described in the section "Why randomize the polynomials?" I think this just goes to show how tricky the design of cryptographic systems can be!

In any case, if this has piqued your interest, go read the paper! If you're at MIT, you can also go see Henry give a seminar on the subject on March 22 at the MIT CSAIL Security Seminar.

• March 17, 2017

## Designing the Backpack signature ecosystem

Suppose you are a library writer interested in using Backpack. Backpack says that you can replace a direct dependency on a function, type or package with one or more signatures. You typecheck against a signature and your end user picks how they want to eventually implement the signature.

Sounds good right? But there's a dirty little secret: to get all of this goodness, you have to write a signature--you know, a type signature for each function and type that you want to use in your library. And we all know how much Haskellers hate writing signatures. But Backpack has a solution to this: rather than repeatedly rewrite signatures for all your packages, a conscientious user can put a signature in a package for reuse in other packages.

For the longest time, I thought that this was "enough", and it would be a simple matter of sitting down and writing some tutorials for how to write a signature package. But as I sat down and started writing signature packages myself, I discovered that there was more than one way to set things up. In this post, I want to walk through two different possible designs for a collection of signature packages. They fall out of the following considerations:

• How many signature packages for, e.g., bytestring, should there be? There could be exactly one, or perhaps a separate package for each API revision?
• Should it be possible to post a new version of a signature package? Under what circumstances should this be allowed?
• For developers of a library, a larger signature is more convenient, since it gives you more functionality to work with. For a client, however, a smaller signature is better, because it reduces the implementation burden. Should signature packages be set up to encourage big or small signatures by default?

### A signature package per release

Intuitively, every release of a package is also associated with a "signature" specifying what functions that release supports. One could conclude, then, that there should be a signature package per release, each package describing the interface of each version of the package in question. (Or, one could reasonably argue that GHC should be able to automatically infer the signature from a package. This is not so easy to do, for reasons beyond the scope of this post.)

However, we have to be careful how we perform releases of each of these signatures. One obvious but problematic thing to do is this: given bytestring-0.10.8.1, also release a bytestring-sig-0.10.8.1. The problem is that in today's Haskell ecosystem, it is strongly assumed that only one version of a package is ever selected. Thus, if I have one package that requires bytestring-sig == 0.10.8.1, and another package that requires bytestring-sig == 0.10.8.2, this will fail if we try to dependency solve for both packages at the same time. We could make this scheme work by teaching Cabal and Stack how to link against multiple versions of a signature package, but at the moment, it's not practical.

An easy way to work around the "multiple versions" problem is to literally create a new package for every version of bytestring. The syntax for package names is a bit irritating (alphanumeric characters plus hyphens only, and no bare numbers between hyphens), but you could imagine releasing bytestring-v1008, bytestring-v1009, etc., one for each version of the API that is available. Once a signature package is released, it should never be updated, except perhaps to fix a mistranscription of a signature.

Under semantic versioning, packages which share the same major version are supposed to only add functionality, not take it away. Thus, these successive signature packages can also be built on one another: for example bytestring-v1009 can be implemented by inheriting all of the functions from bytestring-v1008, and only adding the new functions that were added in 0.10.9.

### A signature package per major release series

There is something very horrible about the above scheme: we're going to have a lot of signature packages: one per version of a package! How awful would it be to have in the Hackage index bytestring-v900, bytestring-v901, bytestring-v902, bytestring-v1000, bytestring-v1002, bytestring-v1004, bytestring-v1006 and bytestring-v1008 as package choices? (With perhaps more if there exist patch releases that accidentally changed the API.) Thus, it is extremely tempting to try to find ways to reduce the number of signature packages we need to publish.

Here is one such scheme which requires a signature package only for major releases; e.g., for bytestring, we would only have bytestring-v9 and bytestring-v10:

• The latest version of bytestring-v9 should correspond to the "biggest" API supported by the 0.9 series. Thus, for every minor version release of bytestring, there is a new release of bytestring-v9: e.g., when bytestring-0.9.1.0 is released, we release bytestring-v9-1.0. Each of the releases increases the functionality recorded in the signature, but is not permitted to make any other changes.
• When depending on the signature package, we instead provide a version bound specifying the minimum functionality of the signature required to build our package; e.g., bytestring-v9 >= 1.0. (Upper bounds are not necessary, as it is assumed that a signature package never breaks backwards compatibility.)

There is one major difficulty: suppose that two unrelated packages both specify a version bound on bytestring-v9. In this case, the ultimate version of the signature package we pick will be one that is compatible with both ranges; in practice, the latest version of the signature. This is bad for two reasons: first, it means that we'll always end up requiring the client to implement the full glory of bytestring-v9, even if we are compatible with an earlier version in the release series. Second, it means that whenever bytestring-v9 is updated, we may bring more entities into scope: and if that introduces ambiguity, it will cause previously compiling code to stop compiling.

Fortunately, there is a solution for this problem: use signature thinning to reduce the required entities to precisely the set of entities you need. For example, suppose that bytestring-v9-0.0 has the following signature:

```
signature Data.ByteString where
  data ByteString
  empty :: ByteString
  null :: ByteString -> Bool
```

As a user, we only needed ByteString and empty. Then we write in our local ByteString signature:

```
signature Data.ByteString (ByteString, empty) where
```

and now no matter what new functions get added to bytestring-v9-0.0, this signature will only ever require ByteString and empty. (Another way of thinking about signature thinning is that it is a way to centralize explicit import lists.) Notice that this scheme does not work if you don't have a separate package per major release series, since thinning can't save you from a backwards incompatible change to the types of one of the functions you depend on.

These signature thinning headers can be automatically computed; I've written a tool (ghc-usage) which does precisely this. Indeed, signature thinning is useful even in the first design, where it can be used to reduce the requirements of a package; however, with a signature package per major release, it is mandatory; if you don't use it, your code might break.

### Conclusion

So, what design should we adopt? I think the first scheme (a signature package per release) is more theoretically pure, but I am very afraid of the "too many packages" problem. Additionally, I do think it's a good idea to thin signatures as much as possible (it's not good to ask for things you're not going to use!) which means the signature thinning requirement may not be so bad. Others I have talked to think the first scheme is just obviously the right thing to do.

Which scheme do you like better? Do you have your own proposal? I'd love to hear what you think. (Also, if you'd like to bikeshed the naming convention for signature packages, I'm also all ears.)

### Appendix

After publishing this post, the comments of several folks made me realize that I hadn't motivated why you would want to say something about the API of bytestring-0.10.8; don't you just want a signature of strings? So, to address this comment, I want to describe the line of reasoning that led me down this path.

I started off with a simple goal: write a signature for strings that had the following properties:

1. Be reasonably complete; i.e., contain all of the functions that someone who wanted to do "string" things might want, but
2. Be reasonably universal; i.e., only support functions that would be supported by all the major string implementations (e.g., String, strict/lazy Text, strict/lazy Word8/Char8 ByteString and Foundation strings.)

It turned out that I needed to drop quite a number of functions to achieve universality; for example, transpose, foldl1, foldl1', mapAccumL/R, scanl, replicate, unfoldr, group, groupBy, inits, tails are not implemented in Foundation; foldr', foldr1', scanl1, scanr, scanr1, unfoldN, spanEnd, breakEnd, splitOn, isInfixOf are not implemented by the lazy types.

This got me thinking that I could provide bigger signatures, if I didn't require the signature to support all of the possible implementations; you might have a signature that lets you switch between only the strict variants of string types, or even a signature that just lets you swap between Word8 and Char8 ByteStrings.

But, of course, there are combinatorially many different ways one could put signatures together and it would be horrible to have to write (and name) a new signature package for each. So what is the minimal unit of signature that one could write? And there is an obvious answer in this case: the API of a specific module (say, Data.ByteString) in a specific version of the package. Enter the discussion above.

### Appendix 2

Above, I wrote:

But, of course, there are combinatorially many different ways one could put signatures together and it would be horrible to have to write (and name) a new signature package for each. So what is the minimal unit of signature that one could write? And there is an obvious answer in this case: the API of a specific module (say, Data.ByteString) in a specific version of the package.

I think there is an alternative conclusion to draw from this: someone should write a signature containing every single possible function that all choices of modules could support, and then have end-users responsible for paring these signatures down to the actual sets they use. So, everyone is responsible for writing big export lists saying what they use, but you don't have to keep publishing new packages for different combinations of methods.

I'm pursuing this approach for now!

• March 11, 2017

## How to integrate GHC API programs with Cabal

GHC is not just a compiler: it is also a library, which provides a variety of functionality useful to anyone interested in doing any sort of analysis on Haskell source code. Haddock, hint and ghc-mod are all packages which use the GHC API.

One of the challenges for any program that wants to use the GHC API is integration with Cabal (and, transitively, cabal-install and Stack). The most obvious problem is that, when building against packages installed by Cabal, GHC needs to be passed appropriate flags telling it which package databases and actual packages should be used. At this point, people tend to adopt some hacky strategy to get these flags, and hope for the best. For commonly used packages, this strategy will get the job done, but for the rare package that needs something extra--preprocessing, extra GHC flags, building C sources--it is unlikely that it will be handled correctly.

A more reliable way to integrate a GHC API program with Cabal is inversion of control: have Cabal call your GHC API program, not the other way around! How are we going to get Cabal/Stack to call our GHC API program? What we will do is replace the GHC executable with a wrapper which passes through all commands to an ordinary GHC, except for ghc --interactive, which we will instead pass to the GHC API program. Then, we will call cabal repl/stack repl with our overloaded GHC, and where we would have opened a GHCi prompt, our API program gets run instead.

With this, all of the flags which would have been passed to the invocation of ghc --interactive are passed to our GHC API program. How should we go about parsing the flags? The most convenient way to do this is by creating a frontend plugin, which lets you create a new major mode for GHC. By the time your code is called, all flags have already been processed (no need to muck about with DynFlags!).

Enough talk, time for some code. First, let's take a look at a simple frontend plugin:

```haskell
module Hello (frontendPlugin) where

import GhcPlugins
import DriverPhases

frontendPlugin :: FrontendPlugin
frontendPlugin = defaultFrontendPlugin {
    frontend = hello
  }

hello :: [String] -> [(String, Maybe Phase)] -> Ghc ()
hello flags args = do
    liftIO $ print flags
    liftIO $ print args
```

This frontend plugin is taken straight from the GHC documentation (but with enough imports to make it compile ;-). It prints out the arguments passed to it.

Next, we need a wrapper program around GHC which will invoke our plugin instead of regular GHC when we are called with the --interactive flag. Here is a simple script which works on Unix-like systems:

```haskell
import GHC.Paths
import System.Posix.Process
import System.Environment

main = do
  args <- getArgs
  let interactive = "--interactive" `elem` args
      args' = do
        arg <- args
        case arg of
          "--interactive" ->
            ["--frontend", "Hello",
             "-plugin-package", "hello-plugin"]
          _ -> return arg
  executeFile ghc False (args' ++ if interactive then ["-user-package-db"] else []) Nothing
```
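The argument rewriting in the wrapper above is pure, so it can be pulled out into its own function and checked in GHCi without exec'ing GHC. The following is a minimal sketch of that refactoring, not part of the original example: the name `rewriteArgs` and the choice to take the plugin module and package as parameters are assumptions made for illustration (the wrapper above hard-codes "Hello" and "hello-plugin").

```haskell
-- Hypothetical helper: rewrite a GHC command line so that --interactive
-- selects a frontend plugin instead of starting GHCi. Every other
-- argument is passed through unchanged.
rewriteArgs :: String -> String -> [String] -> [String]
rewriteArgs pluginModule pluginPackage args =
    concatMap rewrite args ++ extra
  where
    rewrite "--interactive" =
        ["--frontend", pluginModule, "-plugin-package", pluginPackage]
    rewrite arg = [arg]
    -- As in the wrapper, the user package database is only added when
    -- the invocation was actually redirected to the plugin.
    extra
      | "--interactive" `elem` args = ["-user-package-db"]
      | otherwise                   = []
```

With this in place, the wrapper's last line would become something like `executeFile ghc False (rewriteArgs "Hello" "hello-plugin" args) Nothing`.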

Give this a Cabal file, and then install it to the user package database with cabal install (see the second bullet point below if you want to use a non-standard GHC via the -w flag):

```
name:                hello-plugin
version:             0.1.0.0
author:              Edward Z. Yang
maintainer:          ezyang@cs.stanford.edu
build-type:          Simple
cabal-version:       >=1.10

library
  exposed-modules:     Hello
  build-depends:       base, ghc >= 8.0

executable hello-plugin
  main-is:             HelloWrapper.hs
  build-depends:       base, ghc-paths, unix
```

Now, to run your plugin, you can do any of the following:

• cabal repl -w hello-plugin
• cabal new-repl -w hello-plugin
• stack repl --system-ghc --with-ghc hello-plugin

To run the plugin on a specific package, pass the appropriate flags to the repl command.

The full code for this example can be retrieved at ezyang/hello-plugin on GitHub.

Here are a few miscellaneous tips and tricks:

• To pass extra flags to the plugin, add --ghc-options=-ffrontend-opt=arg as necessary (if you like, make another wrapper script around this!)
• If you installed hello-plugin with a GHC that is not the one from your PATH, you will need to put the correct ghc/ghc-pkg/etc executables first in the PATH; Cabal's autodetection will get confused if you just use -w. If you are running cabal, another way to solve this problem is to pass --with-ghc-pkg=PATH to specify where ghc-pkg lives (Stack does not support this.)
• You don't have to install the plugin to your user package database, but then the wrapper program needs to be adjusted to be able to find wherever the package does end up being installed. I don't know of a way to get this information without writing a Custom setup script with Cabal; hopefully installation to the user package database is not too onerous for casual users.
• cabal-install and Stack differ slightly in how they go about passing home modules to the invocation of GHCi: cabal-install will call GHC with an argument for every module in the home package; Stack will pass a GHCi script of things to load. I'm not sure which is more convenient, but it probably doesn't matter too much if you already know which module you want to look at (perhaps you got it from a frontend option.)
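If you do pass frontend options, the plugin receives them as plain strings in its first argument (the `flags` printed by `hello` above), so it is up to you to give them structure. Here is a minimal sketch of one way to do that, assuming a `key=value` convention for each option; that convention, the function names and the `module` key are all hypothetical, not something GHC or the example repository defines.

```haskell
-- Hypothetical convention: each frontend option is "key=value".
-- An option without '=' is treated as a switch with an empty value.
parseFrontendOpts :: [String] -> [(String, String)]
parseFrontendOpts = map parseOne
  where
    parseOne opt = case break (== '=') opt of
      (key, '=' : value) -> (key, value)  -- split at the first '=' only
      (key, _)           -> (key, "")

-- Look up the module the user asked about, if one was given.
-- The option name "module" is made up for this sketch.
targetModule :: [String] -> Maybe String
targetModule = lookup "module" . parseFrontendOpts
```

Inside the plugin, `targetModule flags` would then tell you which module to look at, regardless of whether cabal-install or Stack set up the session.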
• February 8, 2017