CSV Files

Earlier, we learned how to dump the contents of a database to a file, and SQL dumps are still a good way to transfer data from one RDBMS to another.

However, lots of other systems create and use tabular data:

  • Excel spreadsheets (and other spreadsheets)
  • Public data repositories like data.gov and others
  • Business databases, such as WorkDay
  • Learning Management Systems like Sakai

Let's take WorkDay as an example. If I download the roster for this class, it is downloaded in a very common format that I can easily load into Excel, Numbers, or Google Sheets. Similarly, if I download the results of one of our Sakai quizzes, it's usually downloaded in this common format.

That format is called Comma Separated Values or CSV.

WorkDay Example

Here's an example of data for CS 304, downloaded from Workday, when Harry, Ron and Hermione all took the course. (Hermione was my tutor the next semester; Ron had to repeat.) This is really the format; I just changed the student names for privacy reasons.

Instructors,"Course Title","Term Code","Student Name",BannerID,Email,"Reg Status","Class Year",Majors,Confidential,"Reg Seq No",Subject,"Course No",Section,"Sort Name","Reg/Waitlist Time"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Hermione Granger",B12341234,hgranger@wellesley.edu,,"Class of 2021","Computer Science",,,CS,304,T1-01-Remote,"Granger Hermione","07/30/2020 10:00:09"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Harry Potter",B11223344,hpotter@wellesley.edu,,"Class of 2022","Computer Science",,,CS,304,T1-01-Remote,"Potter Harry","08/11/2020 11:54:49"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Ron Weasley",B22334455,rweasley@wellesley.edu,,"Class of 2022","Computer Science",,,CS,304,T1-01-Remote,"Weasley Ron","08/11/2020 10:19:24"

You can get that CSV file from this link to CS_304-hogwarts.csv

The Format

The format is, as you might have guessed, that each line of the file is a different row of the table. Each column is separated from other columns (delimited is the jargon) by commas. Hence "comma separated values".

Notice the column headers in the first row. That's optional, but very common.

Line Endings

The line-ending is determined by an sequence of characters that is specific to a particular operating system:

  • Unix uses a line feed character (ASCII 10 or control-j(^J) ; often denoted \n)
  • Mac uses a carriage return character (ASCII 13 or control-m (^M) ; often denoted \r)
  • Windows uses \r\n

I mention the line ending only because sometimes problems can arise when we copy a CSV file from one OS to another. This is shown in the video below. At the end of this reading, there is information about diagnosing and solving the line-ending problem, but let's not digress on that now.

Delimiters

The delimiter is usually a comma, but can be a semi-colon, colon, or even a tab character (ASCII 9 or control-i).

If we use tabs, we call them TSV files, but sometimes they are still considered a kind of CSV, since that's a general term for delimiter-separated values.

Importing CSV into Spreadsheets

Here is a screenshot of importing the CSV file above into OpenOffice:

import CSV file into OpenOffice

Notice the variety of delimiters on offer. The software has guessed to use a comma as a delimiter (correctly this time, but it doesn't always guess correctly). It's also shown you how it determines the columns given the specified delimiters.

I imported the same CSV file into Numbers (the Apple spreadsheet program; their answer to Excel). Numbers assumes it has guessed correctly, but it offers you a chance to adjust your settings. Here is a screenshot of what is shown when I click on the button to "adjust settings":

import CSV file into Numbers

If you just double-click on a CSV file in Excel, it'll open it and guess at everything (like Numbers). Alternatively, you can create new spreadsheet and do File > Import and choose CSV from the format list. The first two screens you see are these:

import CSV file into Excel, step 1 import CSV file into Excel, step 2

Notice how in step 2, you are offered a choice of delimiters.

Finally, if you open a new spreadsheet in Google Sheets and do File > Import and choose this CSV file, you'll see a screen like this:

import CSV file into Google Sheets

In short, all the major spreadsheet programs understand CSV very well.

Exporting CSV from Spreadsheets

Exporting from OpenOffice is done by doing Save As and specifying "Text CSV" as the file type. If you click on "Edit filter settings" (optional), you can specify the delimiter and whether text fields are enclosed with quotation marks or some such.

Exporting from Numbers and from Excel don't seem to offer a way to change the delimiters. You can export to CSV or to TSV.

CSV and MySQL

We want to be able to do two things with CSV files and MySQL. We want to be able to import data from a CSV file into MySQL. We also want to be able to export our MySQL data to CSV, so that we can process it in Excel or some other software.

Despite their symmetry, these are done very differently. We'll look at each in turn.

Importing CSV to MySQL

Importing CSV data to MySQL is like a long INSERT statement, which means that the table has to already exist (you did a create table earlier). The table is not created; it's just filled. The statement is LOAD DATA

You can specify line endings, delimiters and characters enclosing fields (like quotation marks around strings).

Here is an example:

load data local infile 'foo.csv' 
into table bar
fields terminated by '\t' 
lines terminated by '\n';

The keyword local means that the file is read by the mysql client program and the data sent to the server. If you omit this, you can still use the load data statement, but it's much harder, so I suggest you always use it.

The infile keyword just precedes the name of the file to be loaded.

The columns of table bar should match the columns of the CSV file: the same number of columns and the same types.

If your CSV file has a "header" row (as our Hogwarts data does), you often don't want that to be imported as a row. You can skip one or more rows:

load data local infile 'foo.csv' 
into table bar
fields terminated by '\t' 
lines terminated by '\n'
ignore 1 lines;

It turns out that the default for fields and for lines are tab (\t) and newlines (\n) so if you want TSV input and you don't have a header row, you can just do:

load data local infile 'foo.csv' 
into table bar;

This is sufficient for our purposes. If you have more unusual needs, read the reference page above.

Exporting Data from MySQL

There are two ways. The easy way, in my opinion, is to write a batch file that prints the data you want:

use wmdb;

select nm, name from person limit 10;

Then, run that batch file from the shell:

[cs304@tempest csv]$ mysql < first-ten-names.sql 
nm  name
451148  Aamir Khan
1093951 Aaron Taylor-Johnson
3117836 Aaron Tveit
541902  Aasif Mandvi
45393   Abhishek Bachchan
1113550 Abigail Breslin
23551   Abrar Alvi
284 Adam Baldwin
111845  Adam Brooks
3485845 Adam Driver

Believe it or not, the output is in TSV format: there are TAB characters between columns, instead of the vertical bar or pipe: | symbol. So, we just need to capture that data in a file, like this:

mysql < first-ten-names.sql > first-ten-names.tsv

That's it! You can then scp that file to wherever your spreadsheet program is and import it.

SELECT INTO

The more complex way is to use the SELECT INTO statement. I've always found the TSV output technique simple and easy, so I've rarely had to use SELECT INTO.

TSV versus CSV

In my opinion, TSV is much better than CSV, both because it's easier to get output from MySQL, but also because it's much easier for programs like Excel and Numbers to parse into columns. With commas and semi-colons, you have to worry about whether any of the fields have a comma or semi-colon in them. If so, it'll mess things up.

I remember once making a mess of the diabetes data because the person I was working with and I were sharing data using CSV, but there were commas in a field, but only rarely, so we forgot about it, until 99 percent of our rows had 9 columns and the other 1%, hidden at random locations, had 12 or 13 columns. Ugh.

But tab characters almost never occur in real data. They don't appear in the titles of movies, unlike a comma. They don't appear in people's names, like an apostrophe. (O'Brian and such). The main disadvantage of tab characters is that they are invisible, so you want to avoid editing a TSV file by hand (VSCode, Atom, Vim, Emacs). It can be done, but it's tricky. Instead, import to a spreadsheet, edit the data, and export to TSV.

Video

I created a video showing a lot of this, including both import an TSV file to MySQL and exporting a TSV file from MySQL. It's in Sakai Media Resources and it's called csv-files.

Diagnosing and Solving the Line Endings Trouble

Excel on a Mac will use the existing line endings in a file that it opens, so if you copy a TSV file from Tempest (a Unix machine) and edit it in Excel, and save to the same file, the line endings stay the same (I'm pretty sure). But if you export to a new file, it will use Mac line endings, which are a carriage return (\r or ^M).

If you transfer the Mac-format file back to Tempest, it will retain those Mac line endings. SCP does not modify line endings, as explained in this Stack Overflow post about SCP and line endings.

How can you know if this has happened? The Linux word-count program wc counts the number of lines, words and characters in any text file. If you use wc on the file and it reports zero lines, that means the line endings probably need to be fixed.

$ wc car-sales.tsv 
  0  95 533 car-sales.tsv

You can fix the line endings with the mac2unix command mac2unix man page:

$ mac2unix car-sales.tsv
$ wc car-sales.tsv
  15  95 533 car-sales.tsv

There's also a dos2unix command to fix files transferred from Windows machines. (DOS was a predecessor of Windows, both by Microsoft.)

Actually, the mac2unix program is idempotent, meaning that it doesn't hurt to run it unnecessarily. If there aren't any carriage returns in the file, then it's not modified. So you might just run mac2unix as a matter of course.

Summary

  • CSV is a common and useful data interchange format
  • CSV can be imported into all the major spreadsheet programs
  • CSV can be exported from all the major spreadsheet programs
  • CSV can be imported and exported from MySQL
  • newline characters can be difficult
  • Embedded commas and even newlines can be a nightmare
  • Tabs are often easier