MVars are an amazingly flexible synchronization primitive: they can serve as locks, one-place channels, barriers, etc., or be used to form higher-level abstractions. As far as flexibility is concerned, MVars are the superior choice of primitive for the runtime system to implement, as opposed to just implementing, say, a lock.
However, I was recently thinking about GHC’s BlockedIndefinitelyOnMVar exception, and it occurred to me that a native implementation of locks could allow perfect deadlock detection, as opposed to the approximate detection for MVars we currently provide. (I must emphasize that here I define deadlock to mean a circular waits-for graph, and not “thread cannot progress further.”)
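As a refresher, here is a minimal program that trips the existing approximate detection: no thread can ever fill the MVar, so once the garbage collector notices it is unreachable, the RTS throws the exception at the blocked thread.

import Control.Concurrent.MVar

main :: IO ()
main = do
    m <- newEmptyMVar :: IO (MVar ())
    takeMVar m  -- no thread can ever fill m; when the GC notices the
                -- MVar is unreachable, the RTS throws
                -- BlockedIndefinitelyOnMVar at this thread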
Read more...
One of the appealing things about GHC is that the compiler is surprisingly hackable, even when you don’t want to patch the compiler itself. This hackability comes from compiler plugins, which let you write custom optimization passes on Core, as well as foreign primops, which let you embed low-level C-- to manipulate the representation of various primitives. These hooks let people implement and distribute features that would otherwise be too unstable or speculative to put into the compiler proper.
Read more...
GHC’s block allocator is a pretty nifty piece of low-level infrastructure. It offers a much more flexible way of managing a heap than trying to jam it all into one contiguous block of memory, and it should be of general interest to anyone who implements low-level code like a runtime. The core idea behind it is quite old (BIBOP: Big Bag of Pages), and is useful for any situation where you have a number of objects tagged with the same descriptor and you don’t want to pay the cost of the tag on each object.
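To give a flavor of the idea, here is a toy sketch (the 4096-byte block size and the descriptor-at-block-start layout are assumptions for illustration): if blocks are aligned to their size, masking off the low bits of an object’s address recovers the block it lives in, so the descriptor is paid for once per block instead of once per object.

import Data.Bits (complement, (.&.))
import Data.Word (Word64)

-- Assumed layout: 4096-byte blocks, aligned on 4096-byte boundaries,
-- with a shared descriptor stored at the start of each block.
blockSize :: Word64
blockSize = 4096

-- Masking off the low bits of an object's address yields the base of
-- the block (and hence the location of its descriptor).
blockBase :: Word64 -> Word64
blockBase addr = addr .&. complement (blockSize - 1)

main :: IO ()
main = print (blockBase 0x12345)  -- prints 74496, i.e. 0x12000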
Read more...
Here at ICFP, the so-called “hallway track” is sometimes just as important as the ordinary track. Johan Tibell wanted to avoid an out-of-line call to the allocate function in GHC when a small array of statically known size is allocated. But he found the way that GHC’s new code generator handles heap allocation a bit confusing, so we skipped out of one session today to work it out. In this post, I would like to explain how the code generation monad figures out what the heap offsets in the code are, by way of a kind of cute (and also slightly annoying) trick involving a “monadic” fixpoint.
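As a teaser, the flavor of the trick can be sketched with MonadFix. This is a toy reconstruction, not GHC’s actual code generation monad: offsets handed out during generation mention the total allocation of the block, even though that total is only known after the whole block has been processed, so the knot is tied with mdo.

{-# LANGUAGE RecursiveDo #-}
import Control.Monad.State

-- Toy code generation monad: the state is the number of words
-- allocated so far (a "virtual Hp").
type Gen = State Int

-- Allocate n words; the offset handed back is relative to the *final*
-- heap pointer, i.e. it mentions the total allocation of the whole
-- block, which arrives "from the future."
alloc :: Int -> Int -> Gen Int
alloc n total = do
    virtHp <- get
    put (virtHp + n)
    return (virtHp + n - total)

block :: Gen (Int, Int)
block = mdo
    o1 <- alloc 2 total   -- offsets lazily refer to total...
    o2 <- alloc 3 total
    total <- get          -- ...which is only known here
    return (o1, o2)

main :: IO ()
main = print (evalState block 0)  -- prints (-3,0)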
Read more...
One of the problems with academic publishing is that it’s hard to keep old papers up-to-date. This is certainly the case for this 1995 Sansom paper on profiling non-strict, higher-order functional languages. While the basic ideas of the paper still hold, the actual implementation of cost centers in GHC has changed quite a bit, perhaps the most dramatic change being the introduction of cost center stacks. So while the old paper is good for giving you the basic idea of how profiling in GHC works, if you really want to know the details, it offers little guidance.
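If you just want to see cost centers in action before diving in, here are the present-day mechanics as I understand them (not something from the paper): annotate an expression with an SCC pragma, compile with profiling, and the report attributes costs to the stack of enclosing cost centers (e.g. MAIN/main/fib).

-- Compile with: ghc -prof -rtsopts Fib.hs
-- Run with:     ./Fib +RTS -p     (writes a Fib.prof report)

fib :: Int -> Integer
fib n = {-# SCC "fib" #-}
        (if n < 2 then fromIntegral n else fib (n - 1) + fib (n - 2))

main :: IO ()
main = print (fib 25)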
Read more...
One day, you’re strolling along fields of code, when suddenly you spot a syntax construct that you don’t understand.
Perhaps you’d ask your desk-mate, who’d tell you in an instant what it was.
Perhaps your programming toolchain can tell you. (Perhaps the IDE would tell you if you moused over the construct, or you’re using Coq, which lets you Locate custom notations.)
Perhaps you’d pull up the manual (or, more likely, one of many tutorials) and scan through looking for the syntax construct in question.
Read more...
Adam Belay (of Dune fame) was recently wondering why Haskell’s MVars are so slow. “Slow?” I thought, “aren’t Haskell’s MVars supposed to be really fast?” So I did some digging into how MVars work, to see if I could explain.
Let’s consider the operation of the function takeMVar in Control.Concurrent.MVar. This function is very simple: it unpacks the MVar to get the underlying MVar# primitive value, and then calls the primop takeMVar#:
takeMVar :: MVar a -> IO a
takeMVar (MVar mvar#) = IO $ \ s# -> takeMVar# mvar# s#
This primop results in the invocation of stg_takeMVarzh in PrimOps.cmm, which is where the magic happens. For simplicity, we consider only the multithreaded case.
Read more...
I’d like to talk about some nitty-gritty details of GHC’s thread scheduling, discovered over the course of working on stride scheduling for GHC. Most of these choices are merely implementation details and are not part of any specification. While they shouldn’t be relied upon, they are worth knowing, since many of them were accreted over the course of many performance bugs, benchmark tests and other battles. In this post, I’ll attempt to give some historical insight into why many choices were made. These insights should generalize to any system that would like to implement green threads: lightweight threads that use less memory than traditional operating system threads. For space reasons, I’m not going to talk about STM or sparks (though they are also quite interesting).
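To make “lightweight” concrete, here is a small demonstration: each forkIO thread starts with on the order of a kilobyte of stack, so forking ten thousand of them is unremarkable. (The thread count is arbitrary; the MVar-per-thread pattern is just one simple way for main to wait for them all.)

import Control.Concurrent
import Control.Monad (replicateM)

main :: IO ()
main = do
    let n = 10000 :: Int
    dones <- replicateM n newEmptyMVar
    -- fork n green threads, each signalling completion on its own MVar
    mapM_ (\d -> forkIO (putMVar d ())) dones
    -- wait for all of them
    mapM_ takeMVar dones
    putStrLn ("ran " ++ show n ++ " green threads")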
Read more...
There is a bit of scaffolding involved with making Core-to-Core transforming GHC plugins, so I made a little project, based off of Max Bolingbroke’s examples, which is a nice, clean template you can use to create your own GHC plugins. In particular, it has documentation and pointers to the GHC source, as well as a handy-dandy shell script, rename.sh MyProjectName, which will let you easily rename the template project into whatever name you want. You can find it on GitHub. I’ll probably be adding more to it as I go along; let me know about any bugs too.
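If you just want to see the shape of the thing, a Core plugin is essentially a module exporting a value named plugin; here is a minimal sketch (the module name is made up, and the exact API may drift between GHC versions):

module SayHello.Plugin (plugin) where

import GhcPlugins

plugin :: Plugin
plugin = defaultPlugin { installCoreToDos = install }

-- Prepend our pass to the Core-to-Core pipeline.
install :: [CommandLineOption] -> [CoreToDo] -> CoreM [CoreToDo]
install _ todos = return (CoreDoPluginPass "Say hello" pass : todos)

-- A pass that does nothing except print a message.
pass :: ModGuts -> CoreM ModGuts
pass guts = do
    putMsgS "Hello from the plugin!"
    return guts

You would then compile a program against it with -fplugin=SayHello.Plugin.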
Read more...
Hac Phi was quite productive (I managed to get two blog posts out of it!). On Saturday I committed a new module GHC.Stats to base, which implements a modified subset of the API I proposed previously. Here is the API; to use it you’ll need to compile GHC from Git. Please test and let me know if things should get changed or clarified!
-- | Global garbage collection and memory statistics.
data GCStats = GCStats
    { bytes_allocated :: Int64 -- ^ Total number of bytes allocated
    , num_gcs :: Int64 -- ^ Number of garbage collections performed
    , max_bytes_used :: Int64 -- ^ Maximum number of live bytes seen so far
    , num_byte_usage_samples :: Int64 -- ^ Number of byte usage samples taken
      -- | Sum of all byte usage samples, can be used with
      -- 'num_byte_usage_samples' to calculate averages with
      -- arbitrary weighting (if you are sampling this record multiple
      -- times).
    , cumulative_bytes_used :: Int64
    , bytes_copied :: Int64 -- ^ Number of bytes copied during GC
    , current_bytes_used :: Int64 -- ^ Current number of live bytes
    , current_bytes_slop :: Int64 -- ^ Current number of bytes lost to slop
    , max_bytes_slop :: Int64 -- ^ Maximum number of bytes lost to slop at any one time so far
    , peak_megabytes_allocated :: Int64 -- ^ Maximum number of megabytes allocated
      -- | CPU time spent running mutator threads. This does not include
      -- any profiling overhead or initialization.
    , mutator_cpu_seconds :: Double
      -- | Wall clock time spent running mutator threads. This does not
      -- include initialization.
    , mutator_wall_seconds :: Double
    , gc_cpu_seconds :: Double -- ^ CPU time spent running GC
    , gc_wall_seconds :: Double -- ^ Wall clock time spent running GC
      -- | Number of bytes copied during GC, minus space held by mutable
      -- lists held by the capabilities. Can be used with
      -- 'par_max_bytes_copied' to determine how well parallel GC utilized
      -- all cores.
    , par_avg_bytes_copied :: Int64
      -- | Sum over all GCs of the number of bytes copied by the most
      -- active GC thread in each GC. The ratio of 'par_avg_bytes_copied'
      -- to 'par_max_bytes_copied' approaches 1 for a maximally sequential
      -- run and approaches the number of threads (set by the RTS flag
      -- @-N@) for a maximally parallel run.
    , par_max_bytes_copied :: Int64
    } deriving (Show, Read)
-- | Retrieves garbage collection and memory statistics as of the last
-- garbage collection. If you would like your statistics as recent as
-- possible, first run a 'performGC' from "System.Mem".
getGCStats :: IO GCStats
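For example, a minimal program exercising the module might look like this (the workload is arbitrary, and depending on your build you may need to run with +RTS -T to enable statistics collection):

import GHC.Stats
import System.Mem (performGC)

main :: IO ()
main = do
    print (sum [1 .. 1000000 :: Integer])  -- generate some allocation
    performGC                              -- make the statistics as recent as possible
    stats <- getGCStats
    putStrLn ("bytes allocated: " ++ show (bytes_allocated stats))
    putStrLn ("num GCs:         " ++ show (num_gcs stats))
    putStrLn ("max bytes used:  " ++ show (max_bytes_used stats))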