Overview

Artifact ID: be219eb67bf3bfcce52d56cda5068d8214d2a257acc9ff917540664320e4a0bd
Page Name:Re: Data munging
Date: 2020-07-04 16:33:53
Original User: dbohdan
Mimetype:text/x-markdown
Parent: de587a54aca670f7c89f1c8d3a949b1a747ef3042c695e38135691af4fee27c1 (diff)
Content

Re: Data munging

Published .

A short while ago I read a curious blog post titled "Data munging in Perl 6 vs Perl 5". I liked how each individual Perl 6 code snippet for each data manipulation looked. I understood that the purpose of the exercise was to highlight this particular part of the language. And yet, in the end I couldn't shake off the thought that this was not the right way to solve the kind of problems of which the toy problem at hand was an example. I have come to suspect that complex dictionary manipulation is mostly an antipattern that appears in scripts as they evolve into complex programs over time.

This antipattern developed in my own project, a static site generator that grew from a simple shell script, at one point resulting in both excessive complexity and suboptimal performance. I realized that something was wrong when I found myself using ordered dictionaries extensively and worrying about keeping them correctly sorted. At the same time the compile delays were getting long enough to distract the user once you had a couple hundred pages in your static website because it was expensive to query and update data. Finally replacing ordered dictionaries with SQLite improved the performance and made it possible to throw away the complex and brittle dictionary-manipulating code.

As a result of the above experience, as well as others, I have made a mental note that I can now phrase as follows: seriously consider using an embedded database from the start when you begin a new data munging project.

Tcl is my go-to scripting language these days and one great thing about it is the ease with which it can interoperate with SQLite (not surprising given SQLite's early history as a Tcl extension). Using any database is an upfront investment; however, this investment will pay if you later have to deal with

  • generating multiple views of the data;
  • an increase in dataset size.

The impedance mismatch between SQL (if your database uses SQL) and your programming language of choice is also a cost to consider. Luckily, with Tcl and SQLite the impedance mismatch feels quite small; in other languages it can be reduced with things like LINQ.

As a practical demonstration of the above principle I have reimplemented the task from the linked post in Tcl with SQLite in the following GitHub Gist. The result is longer than the corresponding Perl 6 code, in part due to the differences between the languages, but a real-world application written this way is more likely to grow without becoming unmaintainable, and growing is something real-world applications are very prone to.

If you have suggestions on how to improve the following code you can issue a pull request or contact me using the link at the top of the page.

Code

GitHub gist.

grades.tcl

#!/usr/bin/env tclsh

package require sqlite3
package require fileutil

sqlite3 db :memory:
db eval {CREATE TABLE grades(name TEXT PRIMARY KEY, grade TEXT)}

foreach {name grade} [::fileutil::cat grades.txt] {
    if {![regexp {[A-F][+-]?} $grade]} {
        puts "Can't parse pair '$name $grade'"
        exit 1
    }
    db eval {INSERT INTO grades VALUES ($name, $grade)}
}

puts "Zsófia's grade: [db eval {SELECT grade FROM grades WHERE name='Zsófia'}]"

puts "List of students with a failing grade:"
puts "  [join [
    db eval {SELECT name FROM grades WHERE grade >= 'E' ORDER BY grade}
] {, }]"

set grades [db eval {
    SELECT DISTINCT trim(grade, '+-') FROM grades ORDER BY grade
}]
puts "Distribution of grades by letter:"
foreach grade $grades {
    puts -nonewline "  $grade: "
    set count [db eval {
        SELECT count(name) FROM grades WHERE trim(grade, '+-')=$grade
    }]
    puts -nonewline "$count student"
    if {$count > 1} { puts -nonewline s }
    puts {}
}

grades.txt

Peter	B
Celine	A-
Zsófia	B+
João	F
Maryam	B+
秀英	B-
Finn	D+
Aarav	A
Emma	F
Omar	B

output.txt

Zsófia's grade: B+
List of students with a failing grade:
  João, Emma
Distribution of grades by letter:
  A: 2 students
  B: 5 students
  D: 1 student
  F: 2 students

Tags: Perl, Tcl, SQLite, SQL, comparison, programming, old blog