Monday, September 22, 2008

Laziness redux

Okay, so last time around I made reference to a little program to generate subscriber profiles for an e-mail system I'm testing and how it was blowing up when trying to generate over 10k profiles. Well, like a big kid I actually managed to figure this one out for myself. I'll include a simplified version of the problem code to illustrate the point.

So the problem is that I wanted to randomly generate these user profiles, so I ended up making this data type an instance of Random which is entirely uninteresting & boilerplate code.

Now, essentially the logic for generating the list of profiles looked like, exaggerated for full effect,

import Control.Monad
import Random
filepath = "crud.csv"
generateProfiles :: Int -> IO [Int]
generateProfiles i = foldM aux [] [1..i]
where aux is i = do
sub <- randomIO
return (sub:is)
main = generateProfiles 10000000 >>= (writeFile filepath . unlines . map show)

Run this! Run this and weep for your feeble RAM as it is withers under the gaze of my space leak!

Alright, so this is pretty awful isn't it? It came from trying to think about the problem as "well, I want to generate 'i' many subscribers so I'm going to iterate over a list of i length, generating a subscriber each time". Bzzt, wrong. That's trying to (badly) emulate a C style for-loop in Haskell, not writing Haskell, and is a very bad idea.

Well, I took a look at this code with a fresh eye, was terrified, and then realized what would probably be a more idiomatic Haskell program: generate an infinite list of subscribers, and just write the first 'i' to a file. After all, if I'm truly being lazy then only as many as I need to write should be created. So the new approach looks something like

betterGenerate :: (RandomGen g) => g -> [Int]
betterGenerate = randoms

main = do
std <- newStdGen
let subs = betterGenerate std
(writeFile filepath . unlines . map show . take 10000000) subs

which takes a few seconds, but doesn't eat RAM all to Hell. The main point here, for me, was to stop trying to write C# in Haskell and just write Haskell. There's probably still ways I could improve the logic of my little utility, but I can now generate 1 million subscribers, or about 120 MB of data, in roughly 2 minutes which is the upper limit of what I need.


Anonymous said...

The head of generateProfiles' list is sub_i, so it has to build the entire list in memory to write out the first line. Of course (is ++ [sub]) doesn't do much better, and mapM (const randomIO) [1..i] is only a little better.

Creighton Hogg said...

Right, because all of those ways still force the data. Before I came up with the proper solution I tried mapM, but when that still failed I looked at the definition of the function and saw that it was sequence . map.
Well, looking at the definition of sequence it didn't look like it would play nicely with space constraints. As a test, I tried
take 10 $ sequence $ repeat $ return ()
at the command line, blowing the stack.

Jedaï said...

Yes, by default, everything you do in the IO monad is strict (because lazyness don't play well with side-effects), it's not the fault of sequence or mapM (which are lazy in lazy monad, you can try with the State monad or Maybe monad or...).

To introduce lazyness in a IO context you have to use unsafeInterleaveIO, which as its name imply should only be used with precaution since it can have nasty effects (look at lazy IO for some of the dangers) though in your case it would have been safe.

Much better in your case (as you've probably done) was to write an instance of Random for Profile and call randoms.