Re: Data munging
Edited for style, to update links, and to improve the code after publication. Thanks to Mark Janssen for suggesting improvements.
Not long ago I read a curious blog post titled “Data munging in Perl 6 vs Perl 5”. I liked how each individual Perl 6 code snippet for each data manipulation looked. I understood that the purpose of the exercise was to highlight this particular part of the language. And yet, in the end I couldn’t shake the thought this was not the right way to solve this type of problem. I have come to suspect that complex dictionary manipulation is mostly an antipattern that appears in scripts as they evolve into complex programs over time.
This antipattern developed in my own project, a static site generator that grew from a simple shell script. At one point it became too complex and slow. I realized something was wrong when I found myself using ordered dictionaries a lot and worrying about keeping them correctly sorted. Build times were getting long enough to distract the user once they had a couple hundred pages in their static website, because it was expensive to query and update data. Finally replacing ordered dictionaries with SQLite improved the performance and made it possible to throw away the complex and brittle dictionary-manipulating code.
Thanks to this experience and others, I have made a mental note I can now phrase as seriously consider using an embedded database from the start when you begin a new data munging project.
Tcl is my go-to scripting language these days and one great thing about it is the ease with which it interoperates with SQLite (not surprising given SQLite’s early history as a Tcl extension). Using any database is an upfront investment; however, this investment will pay if you later have to deal with
- generating multiple views of the data;
- a larger dataset.
The impedance mismatch between SQL (if your database uses SQL) and your programming language is also a cost to consider. Luckily, with Tcl and SQLite the impedance mismatch feels quite small; in other languages it can be reduced with things like LINQ.
To demonstrate the principle in practive I have reimplemented the task from the post in Tcl with SQLite. The result is longer than the corresponding Perl 6 code, in part due to the differences between the languages. However, a real-world application written this way is more likely to grow without becoming unmaintainable, and growing is something real-world applications are very prone to.
If you have suggestions on how to improve the following code, you can contact me.
#! /usr/bin/env tclsh
package require fileutil
package require sqlite3
sqlite3 db :memory:
db eval {
CREATE TABLE grades(
name TEXT PRIMARY KEY,
grade TEXT
)
}
db transaction {
foreach {name grade} [::fileutil::cat grades.txt] {
if {![regexp {[A-F][+-]?} $grade]} {
puts stderr "Can't parse pair '$name $grade'"
exit 1
}
db eval {
INSERT INTO grades
VALUES (:name, :grade)
}
}
}
puts -nonewline {Zsófia's grade: }
puts [db eval {
SELECT grade
FROM grades
WHERE name='Zsófia'
}]
puts -nonewline "List of students with a failing grade:\n "
puts [join [
db eval {
SELECT name
FROM grades
WHERE grade >= 'E'
ORDER BY grade
}
] {, }]
puts {Distribution of grades by letter:}
db eval {
SELECT
trimmed_grade AS grade,
count(name) as count
FROM (
SELECT
trim(grade, '+-') AS trimmed_grade,
name
FROM grades
)
GROUP BY trimmed_grade
ORDER BY trimmed_grade
} group {
set noun student[expr { $group(count) > 1 ? {s} : {} }]
puts " $group(grade): $group(count) $noun"
}
Peter B
Celine A-
Zsófia B+
João F
Maryam B+
秀英 B-
Finn D+
Aarav A
Emma F
Omar B
Zsófia's grade: B+
List of students with a failing grade:
João, Emma
Distribution of grades by letter:
A: 2 students
B: 5 students
D: 1 student
F: 2 students