When to clean the data?

You might remember the following snippet of code, which we discussed in the Organizing the Data section on Tuesday.

var cleanedCourses = {}; // global variable

function processCourses(allCourses){   
  // 1. create an object that will store each course by its CRN
  for (var i in allCourses){
    var course = allCourses[i];
    var crsObj = {'CRN': course["gsx$crn"]["$t"],
                  'Name': course["gsx$course"]["$t"],
                  'Title': course["gsx$title"]["$t"],
                  // more fields here
                  'Days': course["gsx$days"]["$t"].split("\n")[1], // clean up days
                  //continue with fields
                  };
    //2. Add it to the cleanedCourses object, using the CRN value as the property name.
    cleanedCourses[crsObj.CRN] = crsObj;
  }
  // other things
}

The problem with this code is that it tries to clean the data before we have a sense of what data the feed contains. That is, I used only partial information about what the field looks like, e.g., "2\nT" or "1\nMT", and missed other (rarer) values such as "2\nTF\nW" or "1\nMTH\nT". Therefore, I changed my code to remove the cleanup step (the .split("\n")[1] call on the Days line above, originally highlighted in red), and reran the analysis of the values for this field:

//3. Calculate all different values for the field Days
var days = {};
for (var crn in cleanedCourses){
  var d = cleanedCourses[crn]["Days"]; // get value
  if (days.hasOwnProperty(d)) {days[d] += 1;}
  else {days[d] = 1;}   
}
console.log("Counts of day strings in the data: ", days);       

While in the previous notes we got 17 different values, this time the variability is larger: there are 31 different values, shown in the screenshot below.

distribution of values for 'days'

There are several things we will need to clean up in this scenario:

  1. Remove (or delete) all the leading and trailing digits
  2. Remove the newline characters
  3. Remove multiple appearances of the same day letter.

While we can think of many string manipulation methods to do this task, a tool used by advanced programmers (and hackers) is a regex (regular expression) engine. Luckily, most programming languages, JavaScript included, have such an engine: given a string and a pattern, it can find the pattern in the string and perform operations on the matches, such as match or replace.

Very short intro to Regular Expressions

The best way to understand regular expressions is to look at a lot of examples and then consider the theory. Our CS235 Languages and Automata course is a good place to look for the mathematical theory underlying regexes. In this short section, we will only look at some examples that are useful for cleaning our data.

Suppose we have some text and want to find out whether it contains any numbers. We could certainly use a for loop and check whether each character is a digit, or we can use regular expressions. The pattern that matches a single digit looks like this: \d. It's known as a metacharacter. To try it out in JavaScript, we put the pattern between two slash symbols, for example /\d/. Then we need a string that contains some digits, on which we call the methods match() or replace(), depending on whether we want to find or replace digits.

The screenshot below shows a string, and the use of the regular expression with the methods match() and replace().

screenshot for regex example

In the first use of match, only the first occurrence of a digit is found. However, by adding the modifier g in the second try (meaning: perform a global match to find all matches), we get back the list of all digits. Since most of the time we want to find things in order to remove or replace them, the last example shows how to replace all digits with a * character.
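Since the screenshot may not be at hand, here is a minimal sketch of the same three calls that you can paste into the console. The sample string is made up for illustration; it is not necessarily the one used in class.

```javascript
// Hypothetical sample string; any text containing digits would do.
var sample = "CS 249 meets in room 16 at 9";

// Without the g modifier, match() stops at the first digit it finds.
var firstDigit = sample.match(/\d/);

// With the g modifier, match() returns an array of every single digit.
var allDigits = sample.match(/\d/g);

// replace() with the g modifier swaps every digit for a * character.
var masked = sample.replace(/\d/g, "*");

console.log(firstDigit[0]); // "2"
console.log(allDigits);     // ["2", "4", "9", "1", "6", "9"]
console.log(masked);        // "CS *** meets in room ** at *"
```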

Notice how every digit was matched individually. If we want to treat 16 as a whole number, instead of two individual digits, we can use a quantifier character to specify how many occurrences of the pattern we want. The quantifier + signifies one or more occurrences. See the example below:

screenshot for regex example
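A small sketch of the + quantifier in action, again on a made-up sample string (an assumption, not the string from the screenshot):

```javascript
// Hypothetical sample string containing multi-digit numbers.
var sentence = "CS 249 meets in room 16 at 9";

// \d+ matches a run of one or more digits, so each number comes back whole.
var wholeNumbers = sentence.match(/\d+/g);

// Each whole number is now replaced by a single * character.
var maskedNumbers = sentence.replace(/\d+/g, "*");

console.log(wholeNumbers);  // ["249", "16", "9"]
console.log(maskedNumbers); // "CS * meets in room * at *"
```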

There are many metacharacters and quantifiers, and mastering regular expressions requires time and constant practice (or a regular need to use them). A summary of the vocabulary necessary for writing regexes can be found at the W3Schools page. I have also added an O'Reilly book to our library of class books in the Google Drive. It is written for beginners, so if you are interested, you should check it out.

To clean up the different strings that we found for the "Days" field, we can create a few patterns and chain them together, as below:

.replace(/\d/g, "").replace(/\n/g, "").replace(/Th/g, "R").replace(/(.)(?=.*\1)/g, "")

You should try this in the console with some real data from the Days field. The very last pattern is the most difficult to understand and create (thanks Jamie & Priscilla for finding it on SO). It removes every character that appears again later in the string, and it uses concepts such as positive lookaheads (?=) and backreferences (\1).
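To experiment with the chain, it can help to wrap it in a small function. The sketch below does that; the function name cleanDays is made up for illustration, and the test strings are modeled on the values quoted earlier in these notes rather than copied from the feed.

```javascript
// Hypothetical helper that applies the chained patterns to one raw Days value.
function cleanDays(raw) {
  return raw
    .replace(/\d/g, "")            // 1. drop all digits
    .replace(/\n/g, "")            // 2. drop the newline characters
    .replace(/Th/g, "R")           // write Thursday as the single letter R
    .replace(/(.)(?=.*\1)/g, "");  // 3. drop a character if it shows up again later
}

console.log(cleanDays("2\nT"));       // "T"
console.log(cleanDays("1\nMT"));      // "MT"
console.log(cleanDays("2\nTF\nW"));   // "TFW"
console.log(cleanDays("1\nMTh\nT"));  // "MRT"
console.log(cleanDays("2\nTF\nT"));   // "FT"
```

Note that the last pattern keeps the final occurrence of a repeated letter, not the first one, which is why "2\nTF\nT" becomes "FT" and not "TF". If the order of the day letters matters for your app, you may want to re-sort them afterwards.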

After cleaning all the days, we should again organize them in a data structure. Here is what the result of that organization looks like:

screenshot structure of days

More on Data Variability

We can write a generic function that displays the distribution of values for any field (as we did for days). Below is such a function. Notice that it is the same code we saw before for "Days", converted into a parameterized function.

/* Helper function to find out the different value distributions for a field.*/
function explore_values(field){
  var values = {};
  for (var crn in cleanedCourses){
    var v = cleanedCourses[crn][field];
    if (values.hasOwnProperty(v)) {values[v] += 1;}
    else {values[v] = 1;}   
  }
  console.log("Counts of ", field, "values in the data: ", values);
} 

explore_values("Distributions");
explore_values("Meeting Times");

The results for the field "Distributions" are shown in the screenshot below:

screenshot dist requirements

The results for the field "Meeting Times" are shown in multiple screenshots below:

screenshot meeting times screenshot meeting times screenshot meeting times

jQuery UI

jQuery UI is a library that provides user interface elements that work with the jQuery library. It has common elements such as tabs, accordions, dialog boxes, draggable and droppable elements, and many effects.

Here are just a few examples I created for the CS 110 lectures. The lecture/lab notes that accompany these examples are here and here.

One example that is not shown in that material, but can be useful for your project, is the autocomplete search field. Try out the example below. Notice how you get results even for short phrases such as "ch" or "ma".

To make this example work, we need the following pieces:

  1. The jQuery library (the JS file)
  2. The jQuery UI library (the JS file)
  3. One of the jQuery UI CSS files (jQuery UI has several themes, choose one)
  4. A list of strings that you want to search over
  5. An HTML input element that will serve for searching. This should be part of a container that can be styled with a jQuery UI class.
  6. Two lines of jQuery that bind the input element to the method .autocomplete

Below is the HTML and JS code. It assumes that you have linked to the JS and CSS files somewhere else.

<div class="ui-widget">
  <label for="subject">Start typing a subject: </label>
  <input id="subject" size="20">
</div>

var subjects = [
   "Africana Studies", 
   "American Studies", 
   "Anthropology", 
   "Arabic",   
   //... 
   ];

$("#subject").autocomplete({
    source: subjects
}); 

The only thing you do in the JS code is select the input field through the $() jQuery wrapper method and invoke the method .autocomplete() with one argument. In this case the argument is an object with a single property, source, which holds the data in which to search for matching strings.

Can you make a guess of how the .autocomplete() method works?
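One plausible guess, sketched in plain JavaScript: every time the user types, keep only the strings from the source that contain the typed text anywhere, ignoring case. This is an illustration of the idea, not jQuery UI's actual source code, and the function name filterSuggestions is made up.

```javascript
// Hypothetical sketch of the filtering step behind an autocomplete field.
// source: array of strings to search; term: what the user has typed so far.
function filterSuggestions(source, term) {
  var needle = term.toLowerCase();
  return source.filter(function (item) {
    // Keep the item if the typed text occurs anywhere in it, ignoring case.
    return item.toLowerCase().indexOf(needle) !== -1;
  });
}

var sampleSubjects = ["Africana Studies", "American Studies", "Anthropology", "Arabic"];
console.log(filterSuggestions(sampleSubjects, "an")); // the first three subjects
console.log(filterSuggestions(sampleSubjects, "ar")); // ["Arabic"]
```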

More Calendar methods

AM4 requires you to create a calendar for your app that holds the whole course schedule. Last time we showed how to do that using the newCalendar() function.

If you create a calendar in one session, that calendar stays in the user's Google Calendar, and you want to avoid creating a new one the next time the user comes back to the app. That means your code should first check whether your app's calendar is already in the list of the user's calendars. The following code snippet shows the API call to get the list of all calendars. You can then iterate over the elements of this list to find out whether your calendar is there. For example, the calendar I created is in position 3.

screenshot calendar list
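The iteration over the list can be sketched as a small helper. This is a sketch under assumptions: the function name findCalendarId and the calendar name "Course Schedule" are made up, and the helper expects the array of calendar entries from the list response (each entry having an id and a summary, which is the calendar's title).

```javascript
// Hypothetical helper: look for a calendar with the given title in the list
// of calendar entries and return its id, or null if it is not there.
function findCalendarId(items, summary) {
  for (var i = 0; i < items.length; i++) {
    if (items[i].summary === summary) {
      return items[i].id;
    }
  }
  return null;
}

// Assumed usage with the Calendar API client (not runnable without gapi loaded):
// var listRequest = gapi.client.calendar.calendarList.list();
// listRequest.execute(function (resp) {
//   var calendarId = findCalendarId(resp.items || [], "Course Schedule");
//   console.log(calendarId === null ? "need to create the calendar" : calendarId);
// });

// Made-up entries, only to show the helper at work.
var fakeItems = [
  { id: "abc@group.calendar.google.com", summary: "Birthdays" },
  { id: "xyz@group.calendar.google.com", summary: "Course Schedule" }
];
console.log(findCalendarId(fakeItems, "Course Schedule")); // "xyz@group.calendar.google.com"
console.log(findCalendarId(fakeItems, "Missing"));         // null
```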

If you want to delete the calendar you have created (which is much easier than deleting, one by one, all the courses added to the calendar), you should store the calendarId value in a global variable. Then, somewhere in your code (or from a button in the page that lets the user do this), you can send a delete request to the API:

  var deleteRequest = gapi.client.calendar.calendarList.delete({'calendarId': "YOUR CALENDAR ID"});
  deleteRequest.execute(function(r){console.log("calendar was deleted", r)});

To summarize, your strategy for dealing with the calendar and its entries should be:

  1. Get the list of calendars and check if you have previously created a calendar for your app. If yes, store the calendarId in a global variable.
  2. If you didn't find your calendar, then send an API request to create it, and store the returned calendarId in a global variable.
  3. Whenever you create a new course event, always use the calendarId of the calendar you created, not "primary".
  4. Have a button to ask the user if they want to delete the created calendar. This button will execute an API request to delete the calendar.

Date Calculations

To prepare "resources" (the events you will add to the calendar), you will need to create Date objects for different days of the week. If you store the days by their day indexes, you can easily create different dates using the methods of the Date() object. The screenshot below shows two examples. If you missed class when we discussed Date(), look at the W3Schools tutorial on the Date object.

screenshot date operations
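In case the screenshot is not available, here is a minimal sketch of one such calculation: finding the first date on or after a starting date that falls on a given day of the week. The function name dateForDay and the sample date and time are made up for illustration; the day index follows getDay(), where 0 is Sunday and 6 is Saturday.

```javascript
// Hypothetical helper: first date on or after startDate that falls on dayIndex
// (0 = Sunday ... 6 = Saturday), with the time set to hours:minutes.
function dateForDay(startDate, dayIndex, hours, minutes) {
  var d = new Date(startDate.getTime());         // copy, so startDate is not changed
  var diff = (dayIndex - d.getDay() + 7) % 7;    // days to move forward, from 0 to 6
  d.setDate(d.getDate() + diff);                 // setDate rolls over into the next month
  d.setHours(hours, minutes, 0, 0);
  return d;
}

// Monday, March 30, 2015 (months are zero-based, so 2 means March).
var semesterDay = new Date(2015, 2, 30);

// The first Thursday (index 4) on or after that date, at 9:50.
var thursdayClass = dateForDay(semesterDay, 4, 9, 50);
console.log(thursdayClass.toString()); // a Thursday: April 2, 2015, 09:50 local time
```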