Assignment 2

# Due: Friday, February 23 by 5:00pm

You can turn in your assignment up until 5:00pm on 2/23/18. You should hand in both a hardcopy and electronic copy of your solutions. Your hardcopy submission should include printouts of four code files: ```redsox.m, colonTest.m, smartPhones.m``` and `analyzeData.m`. To save paper, you can cut and paste all of your code files into one script, but your electronic submission should contain the separate files. Your electronic submission is described in the section How to turn in this assignment.

You will begin working on the programming exercises for this assignment in lab, working with a new partner. For this assignment, we ask that you again complete both the programming exercises and problems with the same partner. (Starting with Assignment 3, there will be more flexibility in choosing partners for your assignment work.)

The following material from the fifth or sixth edition of the text is especially useful to review for this assignment: pages 42-43, 111-113, 157-158, and 175-186. You should also review notes and examples from Lectures #4 and #5 and Lab #3.

Use Cyberduck to download the `assign2_exercises` folder. This folder contains two code files for the exercises in this assignment, `redsox.m` and `colonTest.m`. In MATLAB, set the Current Directory to the `assign2_exercises` folder on your Desktop.

When starting the problems on this assignment, download the `assign2_problems` folder onto your Desktop. This folder contains two code files, `smartPhones.m` and `analyzeData.m`, and a data file named `depthData.mat`.

## Exercise 1: The Red Sox roster

In this exercise, you'll work with the following subset of data from the Boston Red Sox baseball team roster from the 2007 (World Series winning!) season:

Player Name Player Number Weight At Bats Home Runs Batting Average 2007 Salary Runs Batted In Runs Stolen Bases
Jason Varitek 33 230 435 17 .255 11,000,000 68 57 1
David Ortiz 34 230 549 35 .332 13,250,000 117 116 3
Manny Ramirez 24 200 483 20 .296 17,016,381 88 84 0
J.D. Drew 7 200 466 11 .270 14,400,000 64 84 4
Mike Lowell 25 210 589 21 .324 9,000,000 120 79 3
Julio Lugo 23 175 570 8 .237 8,250,000 73 71 33
Kevin Youkilis 20 220 528 16 .288 424,500 83 85 4
Coco Crisp 10 180 526 6 .268 3,833,333 60 85 28
Dustin Pedroia 15 180 520 8 .317 380,000 50 86 7

#### Generating some Red Sox statistics:

The file `redsox.m` in the `assign2_exercises` folder creates nine separate vectors, one for each numerical statistic. Look closely at the code in `redsox.m` and note that the names are stored in a cell array called `names`. Think of a cell array as a special kind of vector that allows us to store strings. Note that `names` is created using curly braces { } rather than the square brackets [ ] that we use for numerical vectors. Although a cell array is created using curly braces, you do not need to use curly braces to access its contents. For example, here is a clip of MATLAB code accessing the contents of `names`:

```>> powerHitter = names(homeRuns > 20)
powerHitter =
'Ortiz'   'Lowell'```

Note that you can place a semi-colon at the end of the initial assignment statements in `redsox.m` to suppress the printout of the statistics.

In the exercise below, you may use MATLAB's   ` mean, sum, length,` and ` any`.

#### Write MATLAB code to do the following:

1. Create a variable totalStolen with the total number of stolen bases in 2007.
2. Create a variable avgWt with the average weight of a Red Sox player.
3. Create a variable bestBatters with the name(s) of player(s) whose batting average is greater than or equal to .300.
4. Create a variable expensiveHomer that is true if any player costs more than \$500,000 per homerun.
5. Create a variable bigHitter that is true if any players hit more than 10 homeruns with batting averages less than .290.
6. Create a vector highRBI with the name(s) of player(s) whose Runs Batted In is greater than or equal to Runs.
7. Create a variable bigBatter that contains the number(s) of the player(s) with more than 550 at bats or more than 20 stolen bases.
8. Create a variable weightInGold that contains the number of players who were paid more than \$25,000 per pound of body weight in the 2007 season.

Add comments to your code so that it is clear and easy to read. Also add comments at the top of each file with the names of you and your partner, and the date. Save your final version of `redsox.m` in your `assign2_exercises` folder to upload to the CS server.

## Exercise 2: Indexing with colon notation

In lecture you learned how to use colon notation to specify a sequence of regularly spaced numbers. You also learned how to use indexing to read and store values in specific locations of a vector. This exercise combines these two concepts. Colon notation can be used to specify an evenly spaced sequence of vector indices or contents, as shown in the following examples:

```>> nums = 1:8
nums =
1   2   3   4   5   6   7    8
>> nums(1:2:5) = 10
nums =
10   2   10    4   10   6   7   8
>> nums(2:3:8) = [13 9 16]
nums =
10   13   10   4   9    6   7   16
>> nums2 = nums([1:3 6:8])
nums2 =
10   13   10   6   7   16
>> nums([1:3 6:8]) = 12:-2:2
nums =
12   10   8   4   9   6   4   2```

The following program, `colonTest.m` is contained in your `assign2_exercises` folder. Follow the instructions in the comments to rewrite the existing code statements and add five additional statements that use colon notation:

```% colonTest.m
% program that provides practice with colon notation and indexing

% rewrite each of the next 4 statements using colon notation
nums1 = [10 9 8 7 6 5 4 3 2 1]
nums2 = nums1([2 4 6 8 10 7 4 1])
nums1([3 4 5 6]) = [9 6 3 0]
nums3 = [1 2 3 1 2 3 1 2 3]

% replace the next 3 statements with a single assignment statement
% that uses colon notation
nums2(6) = 10
nums2(7) = 20
nums2(8) = 30

% for each of the following examples, use "end" in the colon
% notation, for example: nums8 = nums1(3:end)

% write a statement that assigns nums4 to a vector that contains
% the odd-indexed elements of nums1

% write a statement that assigns nums5 to a vector of the
% elements contained in the top half (higher indices) of nums2

% write a statement that assigns nums6 to a vector that contains
% every 3rd element of nums1, starting with index 2

% write a statement that places the value 0 in all of the
% evenly indexed locations of nums2

% write a statement that places the numbers 8 12 16 20 in the
% successive odd-indexed elements of nums2
```

If you write each code statement with no semi-colon at the end, so that the value generated is printed out during execution of the code, then your program should generate the following printout:

```>> colonTest
nums1 =
10  9  8  7  6  5  4  3  2  1
nums2 =
9  7  5  3  1  4  7  10
nums1 =
10  9  9  6  3  0  4  3  2  1
nums3 =
1  2  3  1  2  3  1  2  3
nums2 =
9  7  5  3  1  10  20  30
nums4 =
10  9  3  4  2
nums5 =
1  10  20  30
nums6 =
9  3  3
nums2 =
9  0  5  0  1  0  20  0
nums2 =
8  0  12  0  16  0  20  0```

Place comments at the top of your file with the names of you and your partner, and the date. Your final submission should include a copy of your final `colonTest.m` code file.

## Problem 1: Smartphones

With her curiosity piqued by recent smartphone market share data posted by the International Data Corporation, and article on the use of smartphones by healthcare professionals, Wendy Wellesley decided to collect some data on smartphone preferences and uses among Wellesley students. Wendy conducted a survey of smartphone owners with the following questions:

• Which brand of smartphone do you own? (1) Apple iPhone, (2) Google Android, (3) Samsung Galaxy, (4) Other
• If you were to buy a new smartphone in the next 6 months, which brand would you buy? (1) Apple iPhone, (2) Google Android, (3) Samsung Galaxy, (4) Other
• How many minutes per day do you spend doing the following activities on your smartphone: talking? texting? e-mailing? using the internet? listening to music? playing games?

The file `smartPhones.m` in the `assign2_problems` folder contains Wendy's survey data. The file creates two vectors, `currentPhones` and `newPhones` that contain the integers 1-4 indicating the smartphone brand owned and desired by each of the 150 students who completed the survey. The file also creates six vectors that each contain the number of minutes per day spent on each smartphone activity, for each survey participant.

Add code to the `smartPhones.m` code file to perform the following tasks:

• Display two bar graphs showing the percentage of each brand currently owned (iPhone, Android, Samsung, other) and percentage of each brand that students would buy in the future
• Display a third bar graph showing the average number of minutes per day spent on each of the six smartphone activities
• Print three messages to the user indicating whether the percentage of iPhones, Androids and Samsung Galaxies would increase or decrease

To complete these tasks, consider the following background, tips and guidelines:

• The `bar` function can be used to create a bar graph, as shown in the example below that displays the party affiliations of Massachusetts voters. This example prints strings on the X axis using the `XTickLabel` property. The three strings are stored in a cell array, designated by the surrounding curly braces {}, similar to the names of the Redsox players in Exercise 1. The final statement sets the `Position` property for the current figure window, which allows the user to specify the location and dimensions of the figure window. The four numbers in the vector [100 100 500 250] specify, in order, the distance from the left side of the figure window to the left side of the computer screen, the distance from the bottom of the figure window to the bottom of the screen, the width of the window, and the height of the window.

```>> voterParty = [37.1 11.4 51.2];
>> bar(voterParty)
>> set(gca, 'XTickLabel', {'Democrats' 'Republicans' 'Independents'})
>> ylabel('% of registered voters')
>> title('Party Affiliation of Massachusetts Voters')
>> set(gcf, 'Position', [100 100 500 250])```

• Use `subplot`, described here, to display the three bar graphs in one figure window in a `3 x 1` configuration.
• Create two vectors to store (1) the percentage of each phone brand currently in use and (2) the percentage of each phone brand that students will buy in the future. The `zeros` function can be used to create a vector to store the percentages of iPhones, Androids, Galaxies and other smartphones, and the percentages of each brand can then be calculated and stored in each location of the vector. Your program should calculate these percentages and use them to determine whether the percentage of each phone type will increase or decrease.
• For each of the three main phone brands, print a message something like this:

`iPhones would decrease from 46% to 34%`

Remember that numbers need to be converted to strings when using `disp`:

```>> slugs = 32.15;
>> disp(['there are ' num2str(slugs) ' slugs in a pound']);
there are 32.15 slugs in a pound```

Your final submission should include a copy of your final `smartPhones.m` code file. Be sure to add comments at the top of the file with the name(s) of the authors(s), and any collaborators other than a partner.

## Problem 2: Cleaning up the data

Unreliable measurement instruments or unpredictable environments can sometimes yield data that is clearly erroneous. To obtain a reliable assessment of simple properties like the mean value of the data, it may be desirable to remove data samples that are clearly outside the expected range. Such samples are sometimes referred to as outliers. An advantage to analyzing data in MATLAB, with its general programming language, is that we can easily write a program to preprocess the data in a customized way. In this problem, you will complete a program that removes outlying data samples, using the mean and standard deviation of the data.

Imagine that you collected sonar data on the depth of the ocean floor over a large region that is essentially flat. Due to instrument problems and the occasional large marine animal, some measurements are clearly invalid. For simplicity, assume that all of the erroneous measurements are underestimates of the true depth of the ocean floor. The file `analyzeData.m` in your `assign2_problems` folder uses the `load` command to load 1000 depth measurements from a file named `depthData.mat` into a vector named `depthData`, and creates a plot of this initial data:

Most of the data is at a depth of around 10,000 feet. The erroneous data samples appear as downward spikes in the data, at depths that are significantly less than 10,000 feet. One principled way to go about removing outlying data is to remove samples whose value is far from the mean value, using the standard deviation to determine the range of values to remove. The standard deviation captures how spread out the data values are, and is given by the following formula:

N is the number of samples in the data, vi is the ith data sample, and v is the mean value of the data. If the distribution of the data follows a bell-shaped curve (which is not really the case here), almost all of the data should lie within three standard deviations of the mean value.

For the ocean floor depth data, we could just remove all samples that are more than three standard deviations away from the mean depth value. A problem with this strategy is that an initial calculation of the mean and standard deviation of all of the data will be biased by the presence of the outlying data samples. Thus, we will instead use a more conservative approach that removes data in two stages, as described in the following steps, which also print, display and save the data:

1. calculate and print the mean and standard deviation of the original data
2. modify the data so that samples whose value is more than 4 standard deviations away from the mean value are removed
3. calculate and print the mean and standard deviation of the modified data, and plot the modified data
4. ask the user if she would like to remove more outliers from the data. If so,
• 4a. modify the data again so that samples whose value is more than 3 standard deviations away from the mean are removed, using the newly calculated mean and standard deviation from step 3
• 4b. calculate and print the mean and standard deviation of the newly modified data, and plot this new data
5. save the final modified data and its mean and standard deviation in a file named `newData.mat`

Expand the `analyzeData.m` code file to perform the above steps. When completing this program, keep in mind the following background, tips and guidelines:

• The built-in `mean` function can be used to calculate the mean value of the data, but you must compute the standard deviation using the above formula. It is OK to use the built-in `std` function to check your calculation - you should obtain similar results. Keep in mind that the number of samples in the data changes after each modification.
• Logical vectors can be used to represent the locations of values in a vector that satisfy a certain logical condition. Steps 2 and 4a describe which data values to remove - how can you rephrase these steps to describe which data values to preserve? This will facilitate your thinking about how to construct a logical vector that has a 1 in locations of data to preserve.
• Create two variables to store the mean and standard deviation of the original data stored in the `depthData` vector, and re-use these two variable names to store the new mean and standard deviation in steps 3 and 4b. When modifying the data in steps 2 and 4a, re-use the same variable name, `depthData`.
• Calculating the standard deviation and the modified data each involve multiple conceptual steps. Try to incorporate all of these steps into a single assignment statement for each of these two variables. Be careful about the placement of parentheses in these statements!
• In general the values of variables can be printed by omitting the semi-colon at the end of assignment statements or by using `disp`. However, using disp can produce more beautiful and informative output.
• When asking the user whether more outliers should be removed, assume that the user replies with a string, such as `yes` or `no`. Two strings can be compared with the `strcmp` function that returns true (logical value 1) if the two strings are equal:
```name = input('Enter your name: ', 's');
if strcmp(name, 'Sohie')
disp('I know you! You''re our Lab Instructor!')
else
disp('I don''t know you')
end```
• Click here for information about using `.mat` files to store and retrieve variables using `save` and `load`.
• Multiple plots can be displayed within a single figure window using the `subplot` function, described here. The `analyzeData.m` code file contains a call to `subplot` that specifies that the figure window should have one row of three plotting areas, and that the first plot should appear in the leftmost area. Use `subplot` to place the plot of the intermediate data in the center plotting area and the final data (if further analysis is requested by the user) in the rightmost area. You can expand the window in the horizontal direction by dragging the borders of the window horizontally. Note that the range of values plotted on the axes will differ for the three plots.
• Place comments at the beginning of the code file with your name(s) and date, and place a few comments throughout your program briefly describing what is accomplished by small groups of code statements.
• At the end of your program, add comments that record the values of the mean and standard deviation after steps 1, 3, and 4b, and answer the following question: How did the mean and standard deviation change between the initial, intermediate and final data, and why did these quantities change in this way?

Your final submission should include a copy of your final `analyzeData.m` code file. Again, be sure to add comments at the top of the file with the name(s) of the authors(s), and any collaborators other than a partner.

## How to turn in this assignment

Step 1. Complete this online form.
The form asks you to estimate your time spent on the problems. We use this information to help us design assignments for future versions of CS112. Completing the form is a requirement of submitting the assignment.

Step 2. Upload your final programs to the CS server. When you have completed all of the work for this assignment, your `assign2_exercises` folder should contain two code files named `redsox.m` and `colonTest.m`. Your `assign2_problems` folder should contain two code files named `smartPhones.m` and `analyzeData.m` and the data file named `depthData.mat`. Use Cyberduck to connect to your personal account on the server and navigate to your `cs112/drop/assign2` folder. Drag your `assign2_exercises` and `assign2_problems` folders to this drop folder. More details about this process can be found on the webpage on Managing Assignment Work.

Step 3. Hardcopy submission.
Your hardcopy submission should include printouts of four code files: `redsox.m, colonTest.m, smartPhones.m` and `analyzeData.m`. To save paper, you can cut and paste your four code files into one script, and you only need to submit one hardcopy for you and your partner. If you cannot submit your hardcopy in class on the due date, please slide it under Ellen's office door.