How to pick your string library in Haskell
August 2, 2010

Notice. Following a critique from Bryan O’Sullivan, I’ve restructured the page.
“How do the different text handling libraries compare, and when should we use which package?” asks Chris Eidhof. The latter question is easier to answer. Use bytestring for binary data—raw bits and bytes with no explicit information as to semantic meaning. Use text for Unicode data representing human written languages, usually represented as binary data equipped with a character encoding. Both (especially bytestring) are widely used and are likely to become—if they are not already—standards.
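As a small sketch of that division of labor (the byte values below are my own example; module names are from the bytestring and text packages):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8)

main :: IO ()
main = do
  -- bytestring: raw bytes, no character semantics attached
  let raw = B.pack [104, 97, 115, 107, 101, 108, 108]
  print (B.length raw)       -- counts bytes: 7
  -- text: the same bytes interpreted as UTF-8-encoded Unicode
  let txt = decodeUtf8 raw
  print (T.toUpper txt)      -- "HASKELL"
```

Note that `decodeUtf8` is where the semantic commitment happens: only once you declare an encoding do the bytes become text.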
There are, however, a lot more niche string handling libraries on Hackage. Having not used all of them in substantial projects, I will refrain from judging them on stability or implementation; instead, we’ll categorize them by the niche they fill. There are several axes along which a string library or module may be categorized:
- Binary or text? Binary is raw bits and bytes: it carries no explicit information about what a `0` or `0x0A` means. Text is meant to represent human language and is usually binary data equipped with a character encoding. This is the most important distinction for a programmer to know about.
- If text, ASCII, 8-bit or Unicode? ASCII is simple but English-only; 8-bit (e.g. Latin-1) is ubiquitous and frequently necessary for backwards compatibility; Unicode is the “Right Way” but somewhat complicated. Unicode further asks: what in-memory encoding? UTF-16 is easy to process, while UTF-8 can be twice as memory efficient for English text. Most languages pick Unicode with UTF-16 for the programmer.
- Unpacked or packed? Unpacked strings, the native choice, are just linked lists of characters. Packed strings are classic C arrays, allowing efficient processing and memory use. Most languages use packed strings: Haskell is notable (or perhaps notorious) in its usage of linked lists.
- Lazy or strict? Laziness is more flexible, allowing for things like streaming. Strict strings must be held in memory in their entirety, but can be faster when the whole string would have needed to be computed anyway. Packed lazy representations tend to use chunking to reduce the number of generated thunks. Strict strings are the classic interpretation, but laziness is what makes streaming possible.
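A quick sketch of what laziness buys you: a lazy ByteString can be conceptually infinite, as long as you only ever demand a finite prefix of it. The strict variant has no analogue of this.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

main :: IO ()
main = do
  -- cycle produces an infinite lazy ByteString; take only forces
  -- the chunks it needs, so this runs in constant space
  let infinite = BL.cycle (BL.pack "ab")
  BL.putStrLn (BL.take 10 infinite)   -- prints "ababababab"
```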
Based on these questions, here are where the string libraries of Hackage fall:
- Binary:
  - Packed:
    - Lazy: Data.ByteString.Lazy
    - Strict: Data.ByteString
- Text:
  - ASCII or 8-bit:
    - Packed and lazy: Data.ByteString.Lazy.Char8
    - Packed and strict: Data.ByteString.Char8, Data.CompactString.ASCII or Data.CompactString with Latin1
  - Unicode:
    - UTF-32, unpacked and lazy: [Char] (not a library, but if it were, this is where it would fall)
    - UTF-16:
      - Packed and lazy: Data.Text.Lazy
      - Packed and strict: Data.Text or Data.CompactString.UTF16
    - UTF-8:
      - Unpacked and lazy: Codec.Binary.UTF8.Generic contains generic operations that can be used to process [Word8]
      - Packed and lazy: Data.ByteString.Lazy.UTF8
      - Packed and strict: Data.CompactString.UTF8 or Data.ByteString.UTF8
    - Compact (UTF-8-like), packed and strict: Data.CompactString with Compact
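One caveat worth making concrete: the Char8 modules are only safe for ASCII or 8-bit text, because `pack` simply truncates each code point to its low eight bits. The string below is my own example:

```haskell
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
  -- '\x2026' is '…' (HORIZONTAL ELLIPSIS); Char8.pack keeps only
  -- the low byte, 0x26, which is '&' -- silent data corruption
  print (BC.pack "\x2026" == BC.pack "&")   -- True
```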
Beyond in-memory encoding, there is also a question of source and target encodings: hopefully something normal, but occasionally you get Shift_JIS text and you need to do something to it. You can convert it to Unicode with encoding (handles String or strict/lazy ByteString with possibility for extension with ByteSource and ByteSink) or iconv (handles strict/lazy ByteString).
Unicode joke.
Well done, mortal! But now thou must face the final Test...--More--
Wizard the Evoker St:10 Dx:14 Co:12 In:16 Wi:11 Ch:12 Chaotic
Dlvl:BMP $:0 HP:11(11) Pw:7(7) AC:9 Xp:1/0 T:1
Alt text. Yeah, I got to the Supplementary Special-purpose Plane, but then I got killed by TAG LATIN CAPITAL LETTER A. It looked like a normal A so I assumed it was just an Archon…
Do you see any sign of convergence in Unicode support on Text (or CompactString, which I have never heard of before)?
In my experience, a lot of current Haskell projects use ByteString for Unicode when by your article they should be using something else.
Hmmm. I notice that the home page for CompactString:
http://twan.home.fmf.nl/compact-string/
states that:
The library is not finished yet.
and the last news update on that page is dated March 2007. Could it be that CompactString has been abandoned?
Why did you choose [Char] for ASCII/8-bit, unpacked, lazy text and [Word8] for UTF-8, unpacked, lazy text? Either you are concerned for efficiency in the size of the character or not (since a Char is 32 bits). Can’t have it both ways. ;)
Also, isn’t it a bit misleading to suggest using [Word8] and [Word16] for UTF-8 and UTF-16? It’s not as simple as using the list operations, since one would still have to write an entire library to deal with the variable-length character encoding.
Kevin, I’m not sure what you mean by convergence. Text is all about Unicode support. Whether or not this is enough support for an international application is another question (one that, in my opinion, can only be answered by trying to build something like MediaWiki in Haskell.)
Sean, I was planning to go into these (strange) choices in more detail in my next post. The short answer is that Char is about as pure Unicode as you can get, since we’re devoting a whole 32 bits to representing the character. UTF-8, however, is an encoding, and thus if we want to deal with this representation, we need to deal with a binary representation; it turns out the most convenient grouping is 8-bit words. I’m not sure if GHC inflates unpacked 8-bit words into 32 bits, so I should check that.
It is a bit misleading to recommend [Word8] and [Word16] for unpacked lazy strings encoded in UTF-8 and UTF-16. But it works, and it has the properties written on the tin (one corollary: if you care about performance, you never want unpacked strings!). As for not being able to use list operations, you’d be surprised. I wrote a Unicode-aware library in PHP that used UTF-8 as its internal string representation (since PHP only gives you strict binary strings). I didn’t need to deal with the variable length: the operation I cared about, strpos, is guaranteed to work on well-formed UTF-8. More on this later.
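The strpos observation carries over to Haskell: because UTF-8 is self-synchronizing, a byte-level substring search on well-formed UTF-8 can only match at genuine character boundaries. A hedged sketch using bytestring’s `isInfixOf` (the `utf8Search` helper is my own name for the idea):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Search for a needle in a haystack by comparing their raw UTF-8
-- bytes; no decoding or character-boundary bookkeeping is needed.
utf8Search :: String -> String -> Bool
utf8Search needle haystack =
  encodeUtf8 (T.pack needle) `B.isInfixOf` encodeUtf8 (T.pack haystack)

main :: IO ()
main = do
  print (utf8Search "é" "café")   -- True
  print (utf8Search "e" "café")   -- False: 'e' matches no byte of 'é'
```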
I think we will converge on Text for (Unicode) text processing. Bryan O’Sullivan and I will look into getting rid of some performance problems as soon as we find some time.
For binary data I’d go with ByteString.
What about using Text for UTF-8, etc.? utf8-string? Rope libraries?
Come on, be exhaustive! ;-)
Edward,
Thanks very much for writing this up. It certainly clears up a lot of things for me.
Edward,
For an article that purports to give advice, I am taken aback at the unsoundness of the criteria and the eccentricity of the guidance that you give.
For instance, your opener “You should default to using [Char] for text strings and [Word8] for binary strings” begins with a clanger, because (a) those types give simply atrocious performance and memory usage and (b) some of the standard list manipulation functions give incorrect results when used on [Char]. All of your subsequent advice to use [Char] or some kind of [Word] for text data can be disregarded on both bases, a point to which I shall return.
On a related note, your admonition to care about the internal representation is also in poor form, except for the very basic binary/text distinction. As an example, advising someone to (a) care about UTF-16 is marginal, but (b) then to use [Word16] is simply weird. If someone was to follow that [Word16] notion, they could easily end up accepting or generating garbage, and in fact would be very likely to do so. This is not to mention the substantial amount of time needed to cook up a library that could even do a sub-par job handling [Word16] as text.
I could go on, but I’m sure you get my point. It is not the best of form to try to strike an authoritative pose, and then produce something quite so bafflingly misguided.
I appreciate your enthusiasm for Haskell, but ardour is a dish best flavoured with good sense.
Hello Bryan,
I guess I did get carried away combining the different possibilities of binary/text, lazy/strict and packed/unpacked, and so thus thought “If for some very odd reason I wanted an unpacked lazy UTF-16 representation, what would I use?” and didn’t say, “You don’t actually want that.” Perhaps it is better not to mention the possibility at all.
You’re right that the tenor of the article is all wrong. The list is factually correct but interesting mostly for only academic purposes. For most people, better advice would be simply: “text for Unicode, bytestring for Binary.”
I will update the text accordingly.
Yours humbly, Edward
P.S. I’m curious to know what standard list manipulation functions mess up with [Char]. Since each Char is essentially a 32-bit integer, I imagine it can’t be anything involving codepoint divisions.
Edward, an example would be case mapping: lowercasing a character sometimes produces two characters (the same applies to uppercasing). Similar problems crop up when you want to compare two strings, as they generally need to be converted to some canonical form before comparison (the same string can be created from different sequences of code points).
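This point can be made concrete: Data.Char’s `toUpper` maps one Char to one Char, so a full case mapping like ß → “SS” is unrepresentable, and `map toUpper` silently gives a wrong answer. The example string is my own:

```haskell
import Data.Char (toUpper)

main :: IO ()
main = do
  -- The correct uppercase of "straße" is "STRASSE", but the
  -- one-to-one simple case mapping leaves 'ß' untouched.
  putStrLn (map toUpper "straße")   -- STRAßE, not STRASSE
```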
Sean, I rephrased it; hopefully it’s better now.
Johann, that’s a good catch. Equality and canonicalization are definitely important problems (http://www.mediawiki.org/wiki/Unicode_normalization_considerations), although as I understand it ‘text’ will not perform those operations out of the tin either; you need to use ‘text-icu’ explicitly.
Need more type-classes!
Having different types and different (qualified) function names for lazy/strict, or even for different UTF encodings of unicode is exactly the opposite of what I’d expect of Haskell.
In Haskell, instead of encoding-agnostic/laziness-agnostic code, most code is horribly specific, because these libraries don’t share type-classes.
FWIW FYI (2018)
Data.CompactString is DEPRECATED in favor of Data.Text
http://hackage.haskell.org/package/compact-string