My ugly mug

Hi, I'm Thomas Cannon!

I work on Freckle with Amy Hoy and Thomas Fuchs, and I teach people how to add an extra layer of polish to their work. Previously I worked for BMW and Clemson University, and have a bunch of smaller apps I tinker with on the side. Follow me on Twitter!

Importing CSVs is a necessary pain: you can’t turn your back on customers who have data they want to import into your application, but debugging problems with CSV imports is a nightmare.

The biggest problem is that there are so many different flavors of CSVs that you can run into. Even the name “Comma-separated Values” is a misnomer: CSVs often have different separator characters depending on where the CSV originated from. You also have problems with incorrect MIME types, and even line endings can cause major headaches when trying to figure out why a CSV will not import.

At Freckle, we get a lot of support requests about imports, and we’ve gotten pretty good at diagnosing the most common CSV problems. I’ve taken the time to write up some quick tips on how to check for these problems and ways you can fix them.

Unexpected Delimiters

Did you know that almost 25% of the world’s population uses a “,” as a decimal separator? This means that as more international users start to use your application, they might run into weird bugs.

For example: consider that you allow users to import a CSV of their recent purchases. You’re used to seeing imports like:

Date, Purchase, Price
2013-10-12, Freeze Ray, 129.99

If your Import only recognizes “,” as a valid separator character, then users in Germany, Italy, Sweden, or various other countries around the globe won’t be able to import their CSVs:

Date; Purchase; Price
2013-10-12; Freeze Ray; 129,99

How to detect it

If you notice that your CSV parser is combining fields and returning “nonsense” fields, open the file in a text editor visually inspect it to determine the separator.

How to fix it

The best way to fix this problem is to update your Import to allow for custom separator characters to be used. If you want to be fancy, you can count the number of occurrences of common separator characters and auto-suggest the separator character with the highest count. From my experience, the most common CSV separators are:

The most important thing is to make sure your users can manually set the separator character. Otherwise, you prevent users who use custom separator characters (or even strings) from importing your data.

MIME typing problems

Generally, files are given an extension in order to indicate what type of data is contained within the file. MIME-types work in a similar fashion, but are “invisible” because they are used in the headers of HTTP traffic to indicate the format of the message. The field we focus on most is: Content-Type, which tells us the Internet Media Type of the file.

With a Content-Type, servers can make a “first impression” of the file, and can accept or reject it as necessary. Of course, a file’s Content-Type and extension can be spoofed, so you can’t just blindly trust files based on their Content-Type or extension. You also have to inspect the file to ensure it isn’t malicious. Usually bad files are unable to be handled by your CSV parser but you must make absolutely sure that uploaded files cannot be executed.

The official Content-Type for CSV files is: text/csv. Unfortunately, there are over 4 different types that I’ve encountered in the wild: * text/plain * text/comma-separated-values * text/csv * application/csv * application/excel

How to detect it

Detecting the MIME type for a particular file is pretty easy. You can just pass the --mime argument to the file command:

tcannon@localhost:~$ file --mime import.csv
import.csv: text/csv; charset=us-ascii

How to fix it

MIME typing problems are a real pain, because all you can really do is add support for edge cases as they come up. Be careful about what MIME types you add, otherwise you may compromise security by allowing unwanted files.

Windows Line Endings

Fun fact: Windows uses different file endings than OS X or Linux! Instead of just using newlines (\n) for file endings, Windows incorporates carriage returns (\r) as part of its line endings. Usually, this means the line endings for the file will be: \r\n, but I’ve also run into files with just \r for the line ending.

This makes it especially fun to debug CSV files that come from Windows machines. since the parser will just consider these files as one long line and most text editors compensate for this bug and hide these line endings from you.

How to detect it

The best way I’ve found for figuring out the line endings for a CSV is to use the Octal dump command with the -c argument to output C-style escaped characters:

Let’s say I have the following CSV file: ~~~ Name, nickname, profession Thomas Cannon, Cannon, developer ~~~

What an Octal dump of the file’s contents with C-style escaped characters looks like:

tcannon@localhost:~$ od -c import.csv 
0000000    N   a   m   e   ,   n   i   c   k   n   a   m   e   ,   p   r
0000020    o   f   e   s   s   i   o   n  \n   T   h   o   m   a   s    
0000040    C   a   n   n   o   n   ,   C   a   n   n   o   n   ,   d   e
0000060    v   e   l   o   p   e   r  \n                                
0000070

It’s a little messy to look at, but a close visual inspection will show you exactly what’s being used for line endings. What you’re usually looking for are \n and \r

How to fix it

The most effective way to fix this problem is to find any abnormal line endings and replace them with \n. Thankfully, you can use a REGEX find and replace method to do this, just make two calls: 1. replace \r\n with \n 1. replace any remaining \r with \n

You could probably optimize this to replace both cases at the same time, but I like to keep them separate so we explicitly know what cases we’re fixing.

Conclusion

Now you know how to handle just about any edge case your users might run into when using your CSV import (and more importantly, you know how to fix every one!)