Nushell for data exploration #1
Sep 30, 2023Background
I’ve been working with Nushell as a substitute for both a Unix shell and my usual tools for commandline data manipulation. All of my examples are at work though, so I’m writing up a few demonstrations.
Working with the US CPSC Data
I’ll collect a public dataset and work with it. In this case, the history of product recalls from the US Consumer Product Safety Commission. See https://catalog.data.gov/dataset/recalls-api. It’s public, large, has dirty data, and doesn’t require an API key to access.
I’ll use nushell’s built in http
handler to get the dataset:
mut recalls = (http get "http://www.saferproducts.gov/RestWebServices/Recall?format=json")
I’ve chosen to use “mut” rather than “let” for this variable because I’m going to modify it. I could just as easily overlay a new declaration with the same name.
Now to explore the data a bit…
# Quite a few records...
$recalls | length
8759
# What fields (columns) does it have?
$recalls | columns
╭────┬───────────────────────╮
│ 0 │ RecallID │
│ 1 │ RecallNumber │
│ 2 │ RecallDate │
│ 3 │ Description │
│ 4 │ URL │
│ 5 │ Title │
│ 6 │ ConsumerContact │
│ 7 │ LastPublishDate │
│ 8 │ Products │
│ 9 │ Inconjunctions │
│ 10 │ Images │
│ 11 │ Injuries │
│ 12 │ Manufacturers │
│ 13 │ Retailers │
│ 14 │ Importers │
│ 15 │ Distributors │
│ 16 │ SoldAtLabel │
│ 17 │ ManufacturerCountries │
│ 18 │ ProductUPCs │
│ 19 │ Hazards │
│ 20 │ Remedies │
│ 21 │ RemedyOptions │
╰────┴───────────────────────╯
# Examining a single record, what are the types for each column?
$recalls | first | describe
record<RecallID: int, RecallNumber: string, RecallDate: date, Description: string, URL: string, Title: string, ConsumerContact: string, LastPublishDate: string,
Products: table<Name: string, Description: string, Model: string, Type: string, CategoryID: string, NumberOfUnits: string>, Inconjunctions: list<any>, Images:
table<URL: string, Caption: string>, Injuries: table<Name: string>, Manufacturers: table<Name: string, CompanyID: string>, Retailers: table<Name: string,
CompanyID: string>, Importers: list<any>, Distributors: list<any>, SoldAtLabel: nothing, ManufacturerCountries: table<Country: string>, ProductUPCs: list<any>,
Hazards: table<Name: string, HazardType: string, HazardTypeID: string>, Remedies: table<Name: string>, RemedyOptions: table<Option: string>>
The data is all strings with some tables. I’ll examine the last one in the dataset. I don’t know yet if it’s the oldest because dates are stored as strings and I’m not familiar with the API’s output.
$recalls | last
╭───────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ RecallID │ 3722 │
│ RecallNumber │ 73003 │
│ RecallDate │ 1973-06-08T00:00:00 │
│ Description │ Tappan Built-In Ovens Warning NEWS from CPSC U.S. Consumer Product Safety Commission Office of Information and Public Affairs │
│ │ Washington, DC 20207 FOR IMMEDIATE RELEASE June 8, 1973 Release # 73-003 Consumer Product Safety Commission Says Some Tappan │
│ │ Built-In Ovens May Pose Hazard To Consumers WASHINGTON, D.C. (June 8) - The Consumer Product Safety Commission Friday said it has │
│ │ learned from the Tappan Company of a defect in certain Tappan Company built-in gas ovens manufactured for installation primarily in │
│ │ mobile homes and other newly constructed homes. The Commission said the defect --a possible loose fitting connection-- could result │
│ │ in a hazard to consumers from gas leakage. The ovens in question were manufactured between January 23, 1973, and May 31, 1973, by │
│ │ the F.H. Lawson Company, Springfield, Tennessee, for the Tappan Company. While the majority of the ovens carry the "Tappan" name, │
│ │ some were produced for Montgomery Ward and carry the brand-name "Signature." According to Tappan, the problem occurred in some │
│ │ 15,185 ovens, of which 11,974 have been shipped from the place of manufacture. The ovens carry model numbers AZ12-1239; AZ12-1249; │
│ │ AZ12-1441; AKTM-3850; and AKTM-3893. Some 8,172 of the ovens-- model number AZ12-1239-- are used exclusively in mobile homes. The │
│ │ Montgomery Ward ovens--"Signature"--carry the "AKTM" model number prefix. CPSC Chairman Richard 0. Simpson said the Commission is │
│ │ asking wholesalers, distributors, and retailers who may have the affected ovens in stock or installed in mobile homes or newly │
│ │ constructed homes immediately to contact the Tappan Company. In addition, Simpson said consumers who may have purchased a Tappan or │
│ │ "Signature" built-in oven with one of the model numbers in question should contact their local gas company or the nearest Tappan │
│ │ dealer regarding possible gas leakage between the shut-off valve and the pressure regulator. The Commission was told by Tappan that │
│ │ consumers can locate the model number on the oven by opening the top oven door and looking at the metal strip at the top of the oven │
│ │ which carries the model number on it. CPSC investigators have been sent to the Tappan headquarters, Mansfield, Ohio, and to the │
│ │ F.H. Lawson Company to determine whether the remedial action proposed by Tappan is appropriate and to determine the degree of │
│ │ potential hazard to consumers. │
│ URL │ https://www.cpsc.gov/Recalls/1973/Consumer-Product-Safety-Commission-Says-Some-Tappan-Built-In-Ovens-May-Pose-Hazard-To-Consumers │
│ Title │ Tappan Built-In Ovens Warning │
│ ConsumerContact │ │
│ LastPublishDate │ 2015-02-03T00:00:00 │
│ │ ╭───┬──────────────────────────────────────────────┬─────────────┬───────┬────────────────────────────────┬────────────┬─────╮ │
│ Products │ │ # │ Name │ Description │ Model │ Type │ CategoryID │ ... │ │
│ │ ├───┼──────────────────────────────────────────────┼─────────────┼───────┼────────────────────────────────┼────────────┼─────┤ │
│ │ │ 0 │ │ │ │ Ovens/Stoves/Ranges/Microwaves │ 2056 │ ... │ │
│ │ │ 1 │ Montgomery Ward Signature built-in gas ovens │ │ │ │ │ ... │ │
│ │ │ 2 │ Tappan built-in gas ovens │ │ │ │ │ ... │ │
│ │ ╰───┴──────────────────────────────────────────────┴─────────────┴───────┴────────────────────────────────┴────────────┴─────╯ │
│ Inconjunctions │ [list 0 items] │
│ Images │ [list 0 items] │
│ Injuries │ [list 0 items] │
│ │ ╭───┬─────────────────┬───────────╮ │
│ Manufacturers │ │ # │ Name │ CompanyID │ │
│ │ ├───┼─────────────────┼───────────┤ │
│ │ │ 0 │ F.H. Lawson │ │ │
│ │ │ 1 │ Tappan │ │ │
│ │ │ 2 │ Montgomery Ward │ │ │
│ │ ╰───┴─────────────────┴───────────╯ │
│ Retailers │ [list 0 items] │
│ Importers │ [list 0 items] │
│ Distributors │ [list 0 items] │
│ SoldAtLabel │ │
│ ManufacturerCountries │ [list 0 items] │
│ ProductUPCs │ [list 0 items] │
│ Hazards │ [list 0 items] │
│ Remedies │ [list 0 items] │
│ RemedyOptions │ [list 0 items] │
╰───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
There are fields that contain dates, but they’re encoded as strings. I’ll convert those to a “datetime” which has convenient properties I’ll use later…
# In-place convert the RecallDate column with a `date`
$recalls = ( $recalls | upsert RecallDate {|r| ($r.RecallDate | into datetime)} )
Nice. If I skim all of them (not shown, it’s long) I get something that is still human readable. Looking at the last few, I can see that Nu is auto-converting the date to a more friendly format.
$recalls | last 3 | get RecallDate
╭───┬──────────────╮
│ 0 │ 50 years ago │
│ 1 │ 50 years ago │
│ 2 │ 50 years ago │
╰───┴──────────────╯
If I wanted I could change the date format by overriding the default:
~> $env.config.datetime_format = {normal: "%Y-%m-%d %H:%M:%S", table: "%Y-%m-%d"}
~> $recalls | last 3 | get RecallDate
╭───┬────────────╮
│ 0 │ 1973-08-21 │
│ 1 │ 1973-08-20 │
│ 2 │ 1973-06-08 │
╰───┴────────────╯
That’s good. Now I’ll do the same conversion on the “LastPublishDate”:
let maybe_ok = (
$recalls | upsert LastPublishDate {|r| ($r.LastPublishDate | into datetime)}
)
That seems to have worked but if I list them:
Error: nu::shell::only_supports_this_input_type
× Input type not supported.
╭─[entry #77:1:1]
1 │ mut recalls = (http get "http://www.saferproducts.gov/RestWebServices/Recall?format=json")
· ────────────────────────────────┬────────────────────────────────
· ╰── input type: nothing
╰────
╭─[entry #89:1:1]
1 │ let maybe_ok = ($recalls | upsert LastPublishDate {|r| ($r.LastPublishDate | into datetime)})
· ──────┬──────
· ╰── only string and integer input data is supported
╰────
Two things here, the error message and why the error message appeared when it did. The error indicates that the field was empty (neither string nor integer) and where that data came from. This is a really nice feature of Nu. I believe the reason the error was emitted on output is that Nu defers non-fatal evaluation errors in some cases (This seems inconsistent with Nu’s goal of having early and clear errors, so my understanding is likely wrong).
I’ll try that again, this time checking for null
and using the RecallDate if there is no LastPublishDate:
recalls = (
$recalls |
upsert LastPublishDate {|r|
let lpd = $r.LastPublishDate;
if ($lpd != null) { echo ($lpd | into datetime) } else { echo ($r.RecallDate) } } )
Now if I list all the records I don’t get an error (though I have a bit less confidence in the quality of the data).
The date type has, as I mentioned, nice properties for selecting records.
# How many recalls in the last year?
$recalls | where RecallDate >= ((date now) - 52wk) | length
315
# How many in 2022?
$recalls | where RecallDate >= ("2022-01-01" | into datetime) and RecallDate <= ("2022-12-31" | into datetime) | length
292
# In the last three years?
$recalls | where RecallDate >= ((date now) - (3 * 52wk)) | length
814
Notice that I used a date unit wk
and could perform math on it and on the date. Neat.
I’m not concerned with timezones and leap years and such here.
With a few manipulations, I can aggregate them by date. I’ll pull out
the year into a new record, group, then tally. Note I’m extracting a
string
Year from the record and losing the date
type. It’s okay
for this example.
$recalls |
each {|r| {RecallYear: ( $r.RecallDate | date to-record | get Year )}} |
group-by RecallYear |
items {|k,v| {Year: $k, Recalls: ($v | length)} }
╭────┬──────┬─────────╮
│ # │ Year │ Recalls │
├────┼──────┼─────────┤
│ 0 │ 2023 │ 237 │
│ 1 │ 2022 │ 292 │
│ 2 │ 2021 │ 219 │
│ 3 │ 2020 │ 257 │
│ 4 │ 2019 │ 241 │
│ 5 │ 2018 │ 258 │
│ 6 │ 2017 │ 280 │
│ 7 │ 2016 │ 330 │
│ 8 │ 2015 │ 305 │
│ 9 │ 2014 │ 296 │
│ 10 │ 2013 │ 290 │
│ 11 │ 2012 │ 310 │
│ 12 │ 2011 │ 313 │
│ 13 │ 2010 │ 362 │
│ 14 │ 2009 │ 383 │
│ 15 │ 2008 │ 392 │
│ 16 │ 2007 │ 452 │
│ 17 │ 2006 │ 320 │
│ 18 │ 2005 │ 323 │
│ 19 │ 2004 │ 278 │
│ 20 │ 2003 │ 216 │
│ 21 │ 2002 │ 255 │
│ 22 │ 2001 │ 246 │
│ 23 │ 2000 │ 193 │
│ 24 │ 1999 │ 156 │
│ 25 │ 1998 │ 133 │
│ 26 │ 1997 │ 151 │
│ 27 │ 1996 │ 138 │
│ 28 │ 1995 │ 128 │
│ 29 │ 1994 │ 109 │
│ 30 │ 1993 │ 75 │
│ 31 │ 1992 │ 82 │
│ 32 │ 1991 │ 84 │
│ 33 │ 1990 │ 78 │
│ 34 │ 1989 │ 80 │
│ 35 │ 1988 │ 75 │
│ 36 │ 1987 │ 49 │
│ 37 │ 1986 │ 46 │
│ 38 │ 1985 │ 30 │
│ 39 │ 1984 │ 44 │
│ 40 │ 1983 │ 23 │
│ 41 │ 1982 │ 23 │
│ 42 │ 1981 │ 12 │
│ 43 │ 1980 │ 22 │
│ 44 │ 1979 │ 33 │
│ 45 │ 1978 │ 43 │
│ 46 │ 1977 │ 36 │
│ 47 │ 1976 │ 27 │
│ 48 │ 1975 │ 13 │
│ 49 │ 1974 │ 14 │
│ 50 │ 1973 │ 8 │
├────┼──────┼─────────┤
│ # │ Year │ Recalls │
╰────┴──────┴─────────╯
Let’s search for some well known brand names:
$recalls | where Description =~ "John Deere" | length
51
$recalls | where Description =~ "Frigidaire" | length
24
$recalls | where Description =~ "General Electric" | length
24
$recalls | where Description =~ "Samsung" | length
9
The last one looks interesting, and short enough to show, let’s restructure it to extract just some of the fields:
$recalls |
where Description =~ "Samsung" |
each {|x| {Name: $x.Products.Name,
NumberOfUnits: $x.Products.NumberOfUnits}}
╭───┬───────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────╮
│ # │ Name │ NumberOfUnits │
├───┼───────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ 0 │ ╭───┬───────────────────────────────────╮ │ ╭───┬───────────────╮ │
│ │ │ 0 │ Samsung Top-Load Washing Machines │ │ │ 0 │ About 663,500 │ │
│ │ ╰───┴───────────────────────────────────╯ │ ╰───┴───────────────╯ │
│ 1 │ ╭───┬─────────────────────────────────────╮ │ ╭───┬────────────────────────╮ │
│ │ │ 0 │ Batteries in refurbished cellphones │ │ │ 0 │ About 10,200 batteries │ │
│ │ ╰───┴─────────────────────────────────────╯ │ ╰───┴────────────────────────╯ │
│ 2 │ ╭───┬───────────────────────────────────╮ │ ╭───┬───────────────────╮ │
│ │ │ 0 │ Samsung top-load washing machines │ │ │ 0 │ About 2.8 million │ │
│ │ ╰───┴───────────────────────────────────╯ │ ╰───┴───────────────────╯ │
│ 3 │ ╭───┬──────────────────────────────────╮ │ ╭───┬──────────────────────────────────────────────────────────────────────╮ │
│ │ │ 0 │ Samsung Galaxy Note7 smartphones │ │ │ 0 │ About 1.9 million (This includes the 1 million Galaxy Note7s │ │
│ │ ╰───┴──────────────────────────────────╯ │ │ │ recalled on September 15, 2016) │ │
│ │ │ ╰───┴──────────────────────────────────────────────────────────────────────╯ │
│ 4 │ ╭───┬───────────────────────────────────────────────────────────────────╮ │ ╭───┬─────────────────╮ │
│ │ │ 0 │ Samsung Galaxy Note7 smartphones sold prior to September 15, 2016 │ │ │ 0 │ About 1 million │ │
│ │ ╰───┴───────────────────────────────────────────────────────────────────╯ │ ╰───┴─────────────────╯ │
│ 5 │ ╭───┬────────────────────────────────────────╮ │ ╭───┬──────────────╮ │
│ │ │ 0 │ Samsung Over-the-Range Microwave Ovens │ │ │ 0 │ About 43,000 │ │
│ │ ╰───┴────────────────────────────────────────╯ │ ╰───┴──────────────╯ │
│ 6 │ ╭───┬─────────────────────────────────╮ │ ╭───┬───────────────╮ │
│ │ │ 0 │ Samsung "Jitterbug" Cell Phones │ │ │ 0 │ About 160,000 │ │
│ │ ╰───┴─────────────────────────────────╯ │ ╰───┴───────────────╯ │
│ 7 │ ╭───┬────────────────────────────────────────╮ │ ╭───┬──╮ │
│ │ │ 0 │ Maytag Front Loading Washing Machines │ │ │ 0 │ │ │
│ │ │ 1 │ Samsung Front Loading Washing Machines │ │ │ 1 │ │ │
│ │ ╰───┴────────────────────────────────────────╯ │ ╰───┴──╯ │
│ 8 │ ╭───┬───────────────────────────────────────╮ │ ╭───┬──╮ │
│ │ │ 0 │ Maytag Front-Loading Washing Machines │ │ │ 0 │ │ │
│ │ ╰───┴───────────────────────────────────────╯ │ ╰───┴──╯ │
╰───┴───────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────╯
That’s ugly, let’s clean that up…
$recalls |
where Description =~ "Samsung" |
each {|x| {Name: ($x.Products.Name | str join ", "),
NumberOfUnits: ($x.Products.NumberOfUnits | str join "")}}
╭───┬───────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────╮
│ # │ Name │ NumberOfUnits │
├───┼───────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────┤
│ 0 │ Samsung Top-Load Washing Machines │ About 663,500 │
│ 1 │ Batteries in refurbished cellphones │ About 10,200 batteries │
│ 2 │ Samsung top-load washing machines │ About 2.8 million │
│ 3 │ Samsung Galaxy Note7 smartphones │ About 1.9 million (This includes the 1 million Galaxy Note7s recalled on │
│ │ │ September 15, 2016) │
│ 4 │ Samsung Galaxy Note7 smartphones sold prior to September 15, 2016 │ About 1 million │
│ 5 │ Samsung Over-the-Range Microwave Ovens │ About 43,000 │
│ 6 │ Samsung "Jitterbug" Cell Phones │ About 160,000 │
│ 7 │ Maytag Front Loading Washing Machines, Samsung Front Loading Washing Machines │ │
│ 8 │ Maytag Front-Loading Washing Machines │ │
╰───┴───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────╯
How about baby strollers? There are two fields of interest, Description
and Products.Type
. This search is somewhat time consuming…
let $strollers = (
$recalls |
where ($it.Description =~ "(?i)stroller" or
($it.Products.Type | any {|x| $x =~ "(?i)stroller"})))
$strollers | length
166
What about trampolines? I don’t care about assembling a record, so I’ll just extract the columns I want as a list:
let $trampolines = ($recalls | where ($it.Description =~ "(?i)trampoline"))
$trampolines | length
16
$trampolines | sort-by RecallDate | first | get RecallID RecallDate Description
╭───┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 0 │ 2284 │
│ 1 │ 1999-12-07 │
│ 2 │ Trampoline Safety Net Enclosures - Inc., of Saratoga, Calif., is recalling about 2,300 trampoline safety net enclosures. The safety net enclosures are │
│ │ sold separately to be attached to trampolines that measure 10 to 14 feet in diameter. │
╰───┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
$trampolines | sort-by RecallDate | last | get RecallID RecallDate Description
╭───┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 0 │ 8630 │
│ 1 │ 2019-08-01 │
│ 2 │ This recall involves Super Jumper 14-foot trampolines, and 14-foot and 16-foot combo trampolines with enclosures sold without reinforcement clamps. The │
│ │ Super Jumper logo is printed on the center of the trampolines' mat. The trampolines have a black mat, gold springs and a blue pad covering the springs. │
│ │ The 14-foot models have six metal legs and the 16-foot combo has eight metal legs. │
╰───┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I wonder if the infamous lawn dart is there… We’ll trim the description and tidy up the list of manufacturers and distributors.
let $jarts = $recalls |
where Description =~ "lawn dart" |
each {|r| {RecallDate: $r.RecallDate,
Description: ($r.Description | str substring 0..50),
Manufacturers: ($r.Manufacturers | get Name | collect {|x| ($x | str join ",") }),
Distributors: ($r.Distributors | get Name | collect {|x| ($x | str join ",")})}}
$jarts
╭────┬─────────────┬─────────────────────────────────────────────────────┬────────────────────────────────────┬────────────────────────────────────────────────╮
│ # │ RecallDate │ Description │ Manufacturers │ Distributors │
├────┼─────────────┼─────────────────────────────────────────────────────┼────────────────────────────────────┼────────────────────────────────────────────────┤
│ 0 │ 2020-08-12 │ The recalled lawn dart sets were sold for 2, 4 and │ Crown Darts UK, of United Kingdom │ │
│ 1 │ 2008-10-02 │ The recalled lawn dart game includes four projecti │ │ John Jaques & Son Ltd., of Edenbridge, England │
│ 2 │ 1989-04-27 │ Franklin Sports Industries, Inc, Recalls All Of It │ Franklin Sports │ │
╰────┴─────────────┴─────────────────────────────────────────────────────┴────────────────────────────────────┴────────────────────────────────────────────────╯
I thought only fellow Gen-Xers would have seen these toys but someone brought them back. And again a few years ago. What could possibly go wrong?