Time Info

In Fall 2019, students self-reported spending a median of 6.0 hours on this assignment.

Compared to that version of the assignment, the current version adds the mapreduce-related programming tasks.

Tasks

1. Lexical Scope and Closures (20 points)

Draw lexical scope contours and the environments produced by evaluating the following sequence of Racket bindings as we did in class. (Note this may happen after the assignment is posted.)
```
(define f
    (lambda (g)
        (let ([x 3])
            (g 2))))
(define x 4)
(define h
    (lambda (y)
        (+ x y)))
(define z (f h))
```
Choose one of these options to submit this part:
- git add and git commit a PDF file named exactly env.pdf containing your drawings in your repository; OR
- Upload photos of the lexical scope contours and the environment diagram to the CS 251 course on GradeScope. Log in with your Wellesley email address. (See Gradescope help and how to scan / upload to Gradescope with a smartphone camera.)

2. Functional Idioms in the Large (10 points)

Read MapReduce: Simplified Data Processing on Large Clusters, by Dean and Ghemawat. The MapReduce system is a good example of how an idiom developed with a particular language or style of language can be applied even far outside that language. Consider that (a) this paper was published in 2004, when the system had already been in use for some time, (b) the scale of workloads at major software services companies has probably increased exponentially since, and (c) the MapReduce system itself is no longer used, but many of its lessons live on in successor systems.

For the following questions, write a paragraph or so in the file mapreduce.txt.

The map and reduce functions used in MapReduce have names that are similar to familiar higher-order Racket functions. (fold is also sometimes called reduce.) Discuss similarities and differences between MapReduce’s programming model and the programming idioms we have used with familiar higher-order functions in Racket. What familiar Racket functions are similar to the primitives of the MapReduce framework? Is there an exact match? It may help to consider the mapreduce function you are asked to implement in the programming section below.
In class (possibly after the release of this assignment), we consider pros and cons of immutability vs. mutability in programming languages. How do internet-scale design constraints affect the (im)mutability discussion? Discuss how this choice is important for at least 2 design problems that MapReduce aims to solve.

3. Higher-Order Data-Crunching (50 points)

This part explores several functions that use or provide higher-order operations over lists and related structures. Many of the functions will have similarities. Write your answers to this part in datacrunch.rkt.

A corpus (for the purposes of this assignment) is a list of documents, where each document is a pair (cons cell) of a name and the contents (i.e., list of words of the document) of the document. Names and words may be any simple values, such as symbols, strings, or numbers.

A corpus may also be considered to be:
- an association list from document name to document contents; or
- a list of lists where the first element of each list is the document name and the remaining elements are the document words.
The following example corpus is written with a careful choice of quoted notation that helps illustrate the form as we will interpret it. As we have seen before, Racket may give a different display notation. Consider why.
```
(define corpus
  '((hw  . (hello world))
    (hw2 . (hi world))
    (wc  . (world champion of the hi fi world))))
```
Evaluate corpus in DrRacket. How is it displayed? Why does DrRacket display it differently than the above? Why are both displays accurate and equivalent? Recall that '(A . B) denotes a cons cell holding 'A in the car and 'B in the cdr. You do not need to submit an answer for this part.
Write a function count that takes a corpus of docs and a word and returns the total number of occurrences of word in the corpus. Use higher-order functions to your advantage. Minimize the number of intermediate data structures (lists / cons cells) that your solution produces. An ideal solution never creates a fresh cons cell.
```
> (count corpus 'hi)
2
> (count corpus 'world)
4
> (count corpus 'hello)
1
> (count corpus 'hooligans)
0
> (count null 'hooligans)
0
> (count corpus 'hw)
0
```
Write a function grep that takes a corpus docs and a word and returns an association list mapping document name to the number of occurrences of word in that document. ONLY documents in docs with at least 1 occurrence of word should appear as keys in the association list. Use higher-order functions to your advantage. Minimize the number of intermediate data structures (lists / cons cells) that your solution produces that are not part of the final result. Order does not matter within the result.
```
> (grep corpus 'hi)
'((wc . 1)
  (hw2 . 1))
> (grep corpus 'world)
'((wc . 2)
  (hw2 . 1)
  (hw . 1))
> (grep corpus 'hooligans)
'()
> (grep null 'world)
'()
```
For the purposes of this task, a pile is either:
- a non-null atom (a number, boolean, symbol, etc.); or
- a list (empty or non-empty) of piles.
A pile:
- never contains cons cells that are not lists;
- may be arbitrarily deep; and
- typically contains only one type of atom.
The following example binds pile1 through pile5 to piles:
```
(define pile1 (list 1))
(define pile2 2)
(define pile3 null)
(define pile4 (list 2 3 4))
(define pile5 (list
                (list 1 2 3 4)
                (list 2 (list 4 5) (list 9 2 1) 3)
                (list 2 8 7)))
```
A pile may be flattened into a single-dimensional list with elements in their original pile order by applying the built in Racket function flatten:
```
> (flatten pile5)
'(1 2 3 4 2 4 5 9 2 1 3 2 8 7)
```
Your task is to write a function foldl-pile that takes these parameters:
- a combiner function combine;
- an initial accumulator value init; and
- a pile pile;
and returns a result equivalent to the result of
```
(foldl combine init (flatten pile))
```
Implementation rules:
- You may not invoke flatten or any equivalent function as part of your foldl-pile implementation.
- You must minimize the number of intermediate data structures (lists / cons cells) that your solution produces and discards as by-products. The best implementations of foldl-pile do not create any cons cells that are not part of the final result value of foldl-pile.
Examples:
```
> (foldl-pile + 0 pile1)
1
> (foldl-pile + 0 pile2)
2
> (foldl-pile + 0 pile3)
0
> (foldl-pile + 0 pile4)
9
> (foldl-pile + 0 pile5)
53
> (foldl-pile cons null pile5)
'(7 8 2 3 1 2 9 5 4 2 4 3 2 1)
> (foldl-pile - 1 pile4)
2
> (foldl-pile - 1 pile5)
6
```
Hints:
- The sample solution adds 3 lines below the define line.
- Only one of those 3 lines is longer than the define line.

Write a matching function foldr-pile that acts just like foldl-pile and is subject to the same constraints, except that foldr-pile folds from the right instead from the left, just as foldr complements foldl.

> (foldr-pile + 0 pile1)
1
> (foldr-pile + 0 pile2)
2
> (foldr-pile + 0 pile3)
0
> (foldr-pile + 0 pile4)
9
> (foldr-pile + 0 pile5)
53
> (foldr-pile cons null pile5)
'(1 2 3 4 2 4 5 9 2 1 3 2 8 7)
> (foldr-pile - 1 pile4)
2 ; there was a typo previously
> (foldr-pile - 1 pile5)
-4

Write a function mapreduce that takes three arguments:
- a mapping function, mapper, that takes one element and returns one mapping result;
- a reducing function, reducer, that takes one mapping result and one accumulator and returns one accumulator;
- an initial accumulator, init; and
- a list of elements, elems
The mapreduce function applies the mapper function to each element of elems and uses the reducer function to combine these results of the mapper. One possible implementation is:
```
(define (mapreduce mapper reducer init elems)
    (foldl reducer init (map mapper elems)))
```
This implementation creates an intermediate list (the result of map) that is used only for folding and then discarded. Your task is to implement a version that creates zero temporary cons cells.

While the reducer step in this style of computation often has no ordering requirements (it is commutative), we will enforce left-to-right ordering of reductions for the purposes of this task to simplify testing. In other words, your implementation should produce results identical to those produced by the above map/foldl implementation. To demonstrate that ordering, some examples below use non-commutative (order matters) reducers, like cons and append.
```
> (mapreduce (lambda (x) (* x x)) cons null (list 1 2 3 4))
'(16 9 4 1)
> (mapreduce length + 0 corpus)
14
> (mapreduce cadr cons null corpus)
'(world hi hello)
> (mapreduce
    (lambda (assoc) (filter (lambda (x) (not (equal? x 'world))) (cdr assoc)))
    append
    null
    corpus)
'(champion of the hi fi hi hello)
```
Write a function grep-mapreduce that acts just like grep, but is implemented via mapreduce. The top-level expression in the body of grep-mapreduce must be a call to mapreduce. Any number of inner helper function definitions may precede this call to mapreduce, but neither grep-mapreduce nor any helper function you write for it may use explicit recursion. Helper functions may use mapreduc or other higher-order functions.
Write a function wordfreqs that takes a corpus and returns a word-frequency association list that counts how often each word appears across all documents in the corpus. Order within the result does not matter. Each word from the corpus must appear as the key of exactly one association in the result list. There is no ordering/sorting requirement. Only document contents should be counted; document titles should not be counted. The implementation must use a call to mapreduce as its top-level body expression. Any number of inner helper function definitions may precede this call to mapreduce, but neither grep-mapreduce nor any helper function you write for it may use explicit recursion. Helper functions may use mapreduce or other higher-order functions.
```
> (wordfreqs corpus)
; one legitimate result
'((world . 4)
  (hello . 1)
  (champion . 1)
  (hi . 2)
  (fi . 1)
  (of . 1)
  (the . 1))
; another legitimate result
'((hello . 1)
  (hi . 2)
  (fi . 1)
  (champion . 1)
  (of . 1)
  (world . 4)
  (the . 1))
```
There are many potential ways to approach this. Aim for a solution that is easy to express, even if it is not the most efficient.
[Optional extra, not graded] Can foldl-pile be implemented with mapreduce? How many distinct implementations can you come up with? Do they use both the mapping and reducing aspects?

Submission

Submit: The course staff will collect your work directly from your hosted repository. To submit your work:

Test your source code files one last time. Make sure that, at a minimum, submitted source code is free of syntax errors and any other static errors (such as static type errors or name/scope errors). In other words: the code does not need to complete the correct computation when invoked, but it must be a valid program. We will not grade files that do not pass this bar.
Make sure you have committed your latest changes.
```
$ git add ...
$ git commit ...
```
Run the command cs251 sign to sign your work and respond to any assignment survey questions.
```
$ cs251 sign
```
Push your signature and your latest local commits to the hosted repository.
```
$ git push
```

Confirm: All local changes have been submitted if the output of git status shows both:

Your branch is up to date with 'origin/master', meaning all local commits have been pushed
nothing to commit, meaning all local changes have been committed

Resubmit: If you realize you need to change something later, just repeat this process.

Join the Fold

Contents