
CSV Files
Earlier, we learned how to dump the contents of a database to a file, and SQL dumps are still a good way to transfer data from one RDBMS to another.
However, lots of other systems create and use tabular data:
- Excel spreadsheets (and other spreadsheets)
- Public data repositories like data.gov and others
- Business databases, such as WorkDay
- Learning Management Systems like Sakai
Let's take WorkDay as an example. If I download the roster for this class, it is downloaded in a very common format that I can easily load into Excel, Numbers, or Google Sheets. Similarly, if I download the results of one of our Sakai quizzes, it's usually downloaded in this common format.
That format is called Comma Separated Values or CSV.
WorkDay Example¶
Here's an example of data for CS 304, downloaded from Workday, when Harry, Ron and Hermione all took the course. (Hermione was my tutor the next semester; Ron had to repeat.) This is really the format; I just changed the student names for privacy reasons.
Instructors,"Course Title","Term Code","Student Name",BannerID,Email,"Reg Status","Class Year",Majors,Confidential,"Reg Seq No",Subject,"Course No",Section,"Sort Name","Reg/Waitlist Time"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Hermione Granger",B12341234,hgranger@wellesley.edu,,"Class of 2021","Computer Science",,,CS,304,T1-01-Remote,"Granger Hermione","07/30/2020 10:00:09"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Harry Potter",B11223344,hpotter@wellesley.edu,,"Class of 2022","Computer Science",,,CS,304,T1-01-Remote,"Potter Harry","08/11/2020 11:54:49"
"Scott Anderson","CS 304 T1-01-Remote",202009,"Ron Weasley",B22334455,rweasley@wellesley.edu,,"Class of 2022","Computer Science",,,CS,304,T1-01-Remote,"Weasley Ron","08/11/2020 10:19:24"
You can get that CSV file from this link to CS_304-hogwarts.csv
The Format¶
The format is, as you might have guessed, that each line of the file is a different row of the table. Each column is separated from other columns (delimited is the jargon) by commas. Hence "comma separated values".
Notice the column headers in the first row. That's optional, but very common.
Line Endings
The line-ending is determined by an sequence of characters that is specific to a particular operating system:
- Unix uses a line feed character (ASCII 10 or control-j(^J) ; often denoted \n)
- Mac uses a carriage return character (ASCII 13 or control-m (^M) ; often denoted \r)
- Windows uses \r\n
I mention the line ending only because sometimes problems can arise when we copy a CSV file from one OS to another. This is shown in the video below. At the end of this reading, there is information about diagnosing and solving the line-ending problem, but let's not digress on that now.
Delimiters
The delimiter is usually a comma, but can be a semi-colon, colon, or even a tab character (ASCII 9 or control-i).
If we use tabs, we call them TSV files, but sometimes they are still considered a kind of CSV, since that's a general term for delimiter-separated values.
Importing CSV into Spreadsheets¶
Here is a screenshot of importing the CSV file above into OpenOffice:
Notice the variety of delimiters on offer. The software has guessed to use a comma as a delimiter (correctly this time, but it doesn't always guess correctly). It's also shown you how it determines the columns given the specified delimiters.
I imported the same CSV file into Numbers (the Apple spreadsheet program; their answer to Excel). Numbers assumes it has guessed correctly, but it offers you a chance to adjust your settings. Here is a screenshot of what is shown when I click on the button to "adjust settings":
If you just double-click on a CSV file in Excel, it'll open it and
guess at everything (like Numbers). Alternatively, you can create new
spreadsheet and do File > Import
and choose CSV from the format
list. The first two screens you see are these:
Notice how in step 2, you are offered a choice of delimiters.
Finally, if you open a new spreadsheet in Google Sheets and do File >
Import
and choose this CSV file, you'll see a screen like this:
In short, all the major spreadsheet programs understand CSV very well.
Exporting CSV from Spreadsheets¶
Exporting from OpenOffice is done by doing Save As
and specifying
"Text CSV" as the file type. If you click on "Edit filter settings"
(optional), you can specify the delimiter and whether text fields are
enclosed with quotation marks or some such.
Exporting from Numbers and from Excel don't seem to offer a way to change the delimiters. You can export to CSV or to TSV.
CSV and MySQL¶
We want to be able to do two things with CSV files and MySQL. We want to be able to import data from a CSV file into MySQL. We also want to be able to export our MySQL data to CSV, so that we can process it in Excel or some other software.
Despite their symmetry, these are done very differently. We'll look at each in turn.
Importing CSV to MySQL¶
Importing CSV data to MySQL is like a long INSERT statement, which
means that the table has to already exist (you did a create table
earlier). The table is not created; it's just filled. The statement
is LOAD DATA
You can specify line endings, delimiters and characters enclosing fields (like quotation marks around strings).
Here is an example:
load data local infile 'foo.csv'
into table bar
fields terminated by '\t'
lines terminated by '\n';
The keyword local
means that the file is read by the mysql client
program and the data sent to the server. If you omit this, you can
still use the load data
statement, but it's much harder, so I
suggest you always use it.
The infile
keyword just precedes the name of the file to be loaded.
The columns of table bar
should match the columns of the CSV file:
the same number of columns and the same types.
If your CSV file has a "header" row (as our Hogwarts data does), you often don't want that to be imported as a row. You can skip one or more rows:
load data local infile 'foo.csv'
into table bar
fields terminated by '\t'
lines terminated by '\n'
ignore 1 lines;
It turns out that the default for fields
and for lines
are tab
(\t
) and newlines (\n
) so if you want TSV input and you don't have
a header row, you can just do:
load data local infile 'foo.csv'
into table bar;
This is sufficient for our purposes. If you have more unusual needs, read the reference page above.
Exporting Data from MySQL¶
There are two ways. The easy way, in my opinion, is to write a batch file that prints the data you want:
use wmdb;
select nm, name from person limit 10;
Then, run that batch file from the shell:
[cs304@tempest csv]$ mysql < first-ten-names.sql
nm name
451148 Aamir Khan
1093951 Aaron Taylor-Johnson
3117836 Aaron Tveit
541902 Aasif Mandvi
45393 Abhishek Bachchan
1113550 Abigail Breslin
23551 Abrar Alvi
284 Adam Baldwin
111845 Adam Brooks
3485845 Adam Driver
Believe it or not, the output is in TSV format: there are TAB
characters between columns, instead of the vertical bar or pipe: |
symbol. So, we just need to capture that data in a file, like this:
mysql < first-ten-names.sql > first-ten-names.tsv
That's it! You can then scp that file to wherever your spreadsheet program is and import it.
SELECT INTO
The more complex way is to use the SELECT INTO statement. I've always found the TSV output technique simple and easy, so I've rarely had to use SELECT INTO.
TSV versus CSV¶
In my opinion, TSV is much better than CSV, both because it's easier to get output from MySQL, but also because it's much easier for programs like Excel and Numbers to parse into columns. With commas and semi-colons, you have to worry about whether any of the fields have a comma or semi-colon in them. If so, it'll mess things up.
I remember once making a mess of the diabetes data because the person I was working with and I were sharing data using CSV, but there were commas in a field, but only rarely, so we forgot about it, until 99 percent of our rows had 9 columns and the other 1%, hidden at random locations, had 12 or 13 columns. Ugh.
But tab characters almost never occur in real data. They don't appear in the titles of movies, unlike a comma. They don't appear in people's names, like an apostrophe. (O'Brian and such). The main disadvantage of tab characters is that they are invisible, so you want to avoid editing a TSV file by hand (VSCode, Atom, Vim, Emacs). It can be done, but it's tricky. Instead, import to a spreadsheet, edit the data, and export to TSV.
Video¶
I created a video showing a lot of this, including both import an TSV
file to MySQL and exporting a TSV file from MySQL. It's in Sakai Media
Resources and it's called csv-files
.
Diagnosing and Solving the Line Endings Trouble¶
Excel on a Mac will use the existing line endings in a file that it opens, so if you copy a TSV file from Tempest (a Unix machine) and edit it in Excel, and save to the same file, the line endings stay the same (I'm pretty sure). But if you export to a new file, it will use Mac line endings, which are a carriage return (\r or ^M).
If you transfer the Mac-format file back to Tempest, it will retain those Mac line endings. SCP does not modify line endings, as explained in this Stack Overflow post about SCP and line endings.
How can you know if this has happened? The Linux word-count program
wc counts the
number of lines, words and characters in any text file. If you use
wc
on the file and it reports zero lines, that means the line
endings probably need to be fixed.
$ wc car-sales.tsv
0 95 533 car-sales.tsv
You can fix the line endings with the mac2unix
command mac2unix man page:
$ mac2unix car-sales.tsv
$ wc car-sales.tsv
15 95 533 car-sales.tsv
There's also a dos2unix
command to fix files transferred from
Windows machines. (DOS was a predecessor of Windows, both by
Microsoft.)
Actually, the mac2unix
program is idempotent, meaning that it
doesn't hurt to run it unnecessarily. If there aren't any carriage
returns in the file, then it's not modified. So you might just run
mac2unix
as a matter of course.
Summary¶
- CSV is a common and useful data interchange format
- CSV can be imported into all the major spreadsheet programs
- CSV can be exported from all the major spreadsheet programs
- CSV can be imported and exported from MySQL
- newline characters can be difficult
- Embedded commas and even newlines can be a nightmare
- Tabs are often easier