Comma-separated values (CSV) is a serialization format that has been used for exchanging and converting data between various spreadsheet programs for quite some time.
It can be described simply:
Other separators are also used in practice, e.g.,
This is what CSV data looks like.
id,R,G,B,color
1,105,30,255,blue
2,255,183,17,orange
3,255,53,221,violet
4,244,254,100,yellow
5,251,255,48,yellow
6,158,255,155,green
7,255,106,108,red
8,159,150,255,blue
9,71,228,109,green
10,255,45,178,violet
11,120,255,30,green
12,214,50,250,violet
13,255,176,40,orange
14,255,79,79,red
15,255,162,22,orange
16,255,247,114,yellow
17,61,180,254,blue
18,255,97,12,red
CSV has never been officially standardized:
Back in 2005, a request for comment (RFC) was published by the Internet Engineering Task Force.
RFC 4180 is not a standard:
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
Each record is located on a separate line, delimited by a line break (CRLF).
For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
The last record in the file may or may not have an ending line break.
For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file.
For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.
For example:
aaa,bbb,ccc
Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
In popular usage, "CSV" is not a single, well-defined format. As a result, in practice, the term "CSV" might refer to any file that:
Within these general constraints, many variations are in use.
The tabular data below:
Year | Make | Model | Description | Price |
---|---|---|---|---|
1997 | Ford | E350 | ac, abs, moon | 3000.00 |
1999 | Chevy | Venture "Extended Edition" | 4900.00 | |
1999 | Chevy | Venture "Extended Edition, Very Large" | 5000.00 | |
1996 | Jeep | Grand Cherokee | MUST SELL!air, moon roof, loaded | 4799.00 |
Is represented in CSV as follows:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!air, moon roof, loaded",4799.00
Note how we dealt with:
"
in the values (lines 3 and 4)USA/UK CSV:
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
European CSV:
Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
Parsing is the process of transforming input data (usually text) into a data structure:
"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"
"1.0.1.0","1.0.3.255","16777472","16778239","CN","China"
"1.0.4.0","1.0.7.255","16778240","16779263","AU","Australia"
"1.0.8.0","1.0.15.255","16779264","16781311","CN","China"
"1.0.16.0","1.0.31.255","16781312","16785407","JP","Japan"
"1.0.32.0","1.0.63.255","16785408","16793599","CN","China"
"1.0.64.0","1.0.127.255","16793600","16809983","JP","Japan"
"1.0.128.0","1.0.255.255","16809984","16842751","TH","Thailand"
String csvFile = "./src/test.csv";
String line = "";
String separator = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] token = line.split(separator);
System.out.println("Country [code= " + token[4] + " , name=" + token[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Country [code= "AU" , name="Australia"]
Country [code= "CN" , name="China"]
Country [code= "AU" , name="Australia"]
Country [code= "CN" , name="China"]
Country [code= "JP" , name="Japan"]
Country [code= "CN" , name="China"]
Country [code= "JP" , name="Japan"]
Country [code= "TH" , name="Thailand"]
In a text editor create a CSV file (test.csv
) with the data below:
1,105,30,255,blue
2,255,183,17,orange
3,255,53,221,violet
4,244,254,100,yellow
5,251,255,48,yellow
Using the previous example, write a program that, for each line, prints the sum of the values of the first two columns.
Write a CSV parser that can handle:
Follow the conventions defined in the RFC 4180.
You don't have to write your own CSV parser, as there are several available: