A data exploration project using data from: https://www.craigoates.net

#+title: Data Exploration of Artwork Section
#+date: 2023-03-26 Sun
#+author: Craig Oates
#+email: craig@craigoates.net
#+options: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline author:t
#+options: broken-links:nil c:nil creator:nil d:(not "LOGBOOK") date:t e:t
#+options: email:nil f:t inline:t num:t p:nil pri:nil prop:nil stat:t tags:t
#+options: tasks:t tex:t timestamp:t title:t toc:t todo:t |:t
#+language: en
#+select_tags: export
#+exclude_tags: noexport
#+creator: Emacs 29.0.60 (Org mode 9.6.1)
#+cite_export:
#+export_file_name: ./exported/artwork.html
* Summary & Set-up
#+begin_quote
Make sure you have gone through the [[file:./README.org][README]] and set-up the environment on your
machine.
#+end_quote
The code in this file explores the [[https://www.craigoates.net/art][Artworks]] section of the site.
* Clean Data
This is the SQL used to strip out data I don't want in a public-facing
repository. The database itself is not included. I'm keeping the SQLite code for
future reference and for the sake of completeness.
#+header: :list
#+header: :separator \
#+header: :results raw
#+header: :dir data
#+header: :db co-production-2023-03-21.db
#+begin_src sqlite
.headers on
.mode csv
.output artwork-2023-03-21.csv
select
id,
title,
slug,
published,
category,
width,
height,
depth,
pixel_width,
pixel_height,
play_length,
medium,
created_at,
updated_at
from
artwork;
#+end_src
#+RESULTS:
#+begin_src shell :results code
# Use -l to check file permissions.
ls -h data/artwork*.csv
#+end_src
#+RESULTS:
#+begin_src shell
data/artwork-2023-03-21.csv
#+end_src
To view the data in =data/artwork-2023-03-21.csv=, you need ~csvlook~ installed.
#+begin_src shell :results silent
sudo apt update
sudo apt install csvkit
#+end_src
If ~csvlook~ isn't installed, skip the following code block. It prints a sample
of the data this file uses to explore the Artworks section of my site.
#+begin_src shell :results output
head -n 4 data/artwork-2023-03-21.csv | csvlook
#+end_src
#+RESULTS:
: | id | title | slug | published | category | width | height | depth | pixel_width | pixel_height | play_length | medium | created_at | updated_at |
: | -- | ----------------------------- | -------------------- | ------------------- | -------- | ----- | ------ | ----- | ----------- | ------------ | ----------- | ----------------- | --------------------------- | --------------------------- |
: | 1 | Drop and Run (Purple Squares) | drop-and-run | 2012-05-07 00:00:00 | Video | | | | | | 4 | Digital Animation | 2022-04-11 00:00:00.000000Z | 2022-05-09 14:43:28.379441Z |
: | 2 | Eje x, Exio y, Z-Achse | eje-x-exio-y-z-achse | 2016-11-11 00:00:00 | Prints | 15 | 21 | | | | | Digital Print | 2022-04-11 | |
: | 3 | Up This Way | up-this-way | 2016-01-24 00:00:00 | Prints | 21 | 30 | | | | | Digital Print | 2022-04-11 | |
* Set-Up SLIME
Run =M-x slime= before executing the following code.
#+begin_src lisp
(format nil "SLIME and Common Lisp is up and running!")
#+end_src
#+RESULTS:
: SLIME and Common Lisp is up and running!
#+begin_src lisp :results silent :session
(ql:quickload :plot/vega)
(ql:quickload :lisp-stat)
(ql:quickload :data-frame)
#+end_src
#+begin_src lisp :session
(defparameter *artworks* (lisp-stat:read-csv #P"data/artwork-2023-03-21.csv")
  "The data read in from data/artwork-2023-03-21.csv.")
#+end_src
#+RESULTS:
: *ARTWORKS*
* Explore Data
#+begin_src lisp :session :results output
(format t "Number of artworks: ~A" (lisp-stat:nrow *artworks*))
#+end_src
#+RESULTS:
: Number of artworks: 375
#+begin_src lisp :session
(lisp-stat:defdf *artworks-df* *artworks*)
#+end_src
#+RESULTS:
: #<DATA-FRAME:DATA-FRAME (375 observations of 14 variables)>
** Data Heuristics
#+begin_src lisp :session :results output
(lisp-stat:heuristicate-types *artworks-df*)
(lisp-stat:describe *artworks-df*)
#+end_src
#+RESULTS:
#+begin_example
,*ARTWORKS-DF*
A data-frame with 375 observations of 14 variables
Variable | Type | Unit | Label
-------- | ---- | ---- | -----------
ID | INTEGER | NIL | NIL
TITLE | INTEGER | NIL | NIL
SLUG | INTEGER | NIL | NIL
PUBLISHED | STRING | NIL | NIL
CATEGORY | STRING | NIL | NIL
WIDTH | DOUBLE-FLOAT | NIL | NIL
HEIGHT | DOUBLE-FLOAT | NIL | NIL
DEPTH | INTEGER | NIL | NIL
PIXEL-WIDTH | INTEGER | NIL | NIL
PIXEL-HEIGHT | INTEGER | NIL | NIL
PLAY-LENGTH | INTEGER | NIL | NIL
MEDIUM | STRING | NIL | NIL
CREATED-AT | STRING | NIL | NIL
UPDATED-AT | SYMBOL | NIL | NIL
#+end_example
** Create Sample Data-Frame
#+begin_src lisp :session
(defparameter *artworks-sm-list*
  (select:select *artworks-df* (select:range 0 10) t)
  "A small sample of artwork for quickly testing code.")
#+end_src
#+RESULTS:
: #<DATA-FRAME:DATA-FRAME (10 observations of 14 variables)>
** Summary: Width
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:width)
#+end_src
#+RESULTS:
:results:
WIDTH ()
n: 375
missing: 55
min=6.50
q25=25.02
q50=34.73
mean=34.62
q75=42.07
max=70
:end:
** Summary: Height
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:height)
#+end_src
#+RESULTS:
:results:
HEIGHT ()
n: 375
missing: 55
min=10
q25=26.84
q50=29.93
mean=37.04
q75=42.47
max=70
:end:
** Summary: Depth
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:depth)
#+end_src
#+RESULTS:
:results:
DEPTH ()
n: 375
missing: 374
min=7
q25=7
q50=7
mean=7
q75=7
max=7
:end:
** Summary: Pixel Width
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:pixel-width)
#+end_src
#+RESULTS:
:results:
PIXEL-WIDTH ()
n: 375
missing: 341
min=2480
q25=2550.93
q50=2952.86
mean=2927.82
q75=3298.00
max=3508
:end:
#+begin_src lisp
(format nil "Total (2D) Digital Artworks: ~A" (- 375 341))
#+end_src
#+RESULTS:
: Total (2D) Digital Artworks: 34
** Summary: Pixel Height
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:pixel-height)
#+end_src
#+RESULTS:
:results:
PIXEL-HEIGHT ()
n: 375
missing: 341
min=2480
q25=3326.59
q50=4467.47
mean=4012.12
q75=4700.63
max=4722
:end:
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:medium)
#+end_src
#+RESULTS:
:results:
119 (32%) x "Digital Photograph", 73 (19%) x "Digital Print", 70 (19%) x
"Watercolour and ink", 45 (12%) x "Pen and ink", 23 (6%) x "Felt-tip marker on
paper", 21 (6%) x "Digital Animation", 11 (3%) x "Watercolour on paper", 5 (1%)
x "Screen-print on paper", 2 (1%) x "Screen-print and felt-tip marker on paper",
1 (0%) x "Lino. print on paper", 1 (0%) x "Graphite on paper", 1 (0%) x
"Drawing", 1 (0%) x "Glass light bulb and jar", 1 (0%) x "Felt tip marker on
paper", 1 (0%) x "Staples on paper",
:end:
*NOTE:* ~created-at~ refers to the time I added the artwork to the website's
database. See [[published_summary][Published Summary]] below for the artwork creation date.
#+begin_src lisp :session :results drawer
(lisp-stat:summarize-column '*artworks-df*:created-at)
#+end_src
#+RESULTS:
:results:
338 (90%) x "2022-04-11", 2 (1%) x "2022-04-11 00:00:00.000000Z", 1 (0%) x
"2022-06-25 20:47:51.282515", 1 (0%) x "2022-07-19 02:21:20.722981Z", 1 (0%) x
"2022-07-19 04:38:06.685195Z", 1 (0%) x "2022-07-19 04:42:14.648641", 1 (0%) x
"2022-07-19 04:45:03.533171", 1 (0%) x "2022-07-19 04:46:57.297030", 1 (0%) x
"2022-07-19 04:49:38.193585", 1 (0%) x "2022-07-19 04:58:04.069055", 1 (0%) x
"2022-07-19 04:59:38.732074Z", 1 (0%) x "2022-07-19 05:00:55.259252", 1 (0%) x
"2022-07-19 05:02:04.145161", 1 (0%) x "2022-07-19 05:03:20.898681", 1 (0%) x
"2022-07-19 05:04:35.132294", 1 (0%) x "2022-07-19 05:05:38.856980", 1 (0%) x
"2022-07-19 05:06:48.692528", 1 (0%) x "2022-08-15 18:16:52.321678", 1 (0%) x
"2022-08-15 19:11:08.879204", 1 (0%) x "2022-08-15 19:14:40.060236", 1 (0%) x
"2022-08-15 19:17:03.134433", 1 (0%) x "2022-08-15 19:20:02.404717", 1 (0%) x
"2022-08-15 19:22:00.766659", 1 (0%) x "2022-08-15 19:24:06.150506", 1 (0%) x
"2022-08-15 19:27:47.224984", 1 (0%) x "2022-08-15 19:49:19.064553", 1 (0%) x
"2022-08-15 19:57:22.403963", 1 (0%) x "2022-08-15 20:00:46.926246", 1 (0%) x
"2022-08-15 20:04:02.172163", 1 (0%) x "2022-08-15 20:06:48.419529", 1 (0%) x
"2022-08-15 20:10:49.282631", 1 (0%) x "2022-08-15 20:13:10.251745", 1 (0%) x
"2022-08-15 20:15:20.199923", 1 (0%) x "2022-08-15 20:18:57.298303", 1 (0%) x
"2022-08-15 20:54:31.246681", 1 (0%) x "2022-08-15 21:10:15.367998", 1 (0%) x
"2022-08-15 21:15:46.119031",
:end:
*NOTE:* ~published~ refers to when I finished the artwork.
It looks like one of my most prolific days was 2020-03-13, with 14 artworks,
which is about 4% of my total /finished/ output. One day produced 4%. Of course
it was around the Covid pandemic.
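As a quick check on that percentage, here is a minimal sketch in plain Common
Lisp. It assumes the total of 375 artworks reported earlier in this file;
~share-of-total~ is a hypothetical helper written for this note and is not part
of Lisp-Stat.
#+begin_src lisp
(defun share-of-total (count total)
  "Return COUNT as a percentage of TOTAL, rounded to the nearest integer."
  ;; ROUND returns two values; VALUES keeps only the rounded quotient.
  (values (round (* 100 count) total)))

;; 14 artworks on a single day, out of 375 in the data set.
(format nil "~A%" (share-of-total 14 375))
;; => "4%"
#+end_src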
#+name: published_summary
#+begin_src lisp :session :results output drawer
(pprint (lisp-stat:summarize-column '*artworks-df*:published))
#+end_src
*I have removed dates with only 1 entry (0%).*
#+RESULTS: published_summary
:results:
20 (5%) x "2022-08-15", 14 (4%) x "2020-03-13", 2 (1%) x "2016-01-23
00:00:00.000", 2 (1%) x "2012-05-26 00:00:00.000", 2 (1%) x "2017-08-19
1 (0%)
x "2016-11-11 00:00:00.000", 1 (0%) x "2016-01-24 00:00:00.000",
:end:
I keep forgetting about ~output~. Leaving this ~(lisp-stat:head...~ example here to
help me remember to use it.
#+begin_src lisp :session :results output code
(lisp-stat:head *artworks-df*)
#+end_src
#+RESULTS:
#+begin_src lisp
;; ID TITLE SLUG PUBLISHED CATEGORY WIDTH HEIGHT DEPTH PIXEL-WIDTH PIXEL-HEIGHT PLAY-LENGTH MEDIUM CREATED-AT UPDATED-AT
;; 0 1 Drop and Run (Purple Squares) drop-and-run 2012-05-07 Video NA NA NA NA NA 4 Digital Animation 2022-04-11 00:00:00.000000Z 2022-05-09 14:43:28.379441Z
;; 1 2 Eje x, Exio y, Z-Achse eje-x-exio-y-z-achse 2016-11-11 00:00:00.000 Prints 15.0 21.0 NA NA NA NA Digital Print 2022-04-11 NA
;; 2 3 Up This Way up-this-way 2016-01-24 00:00:00.000 Prints 21.0 30.0 NA NA NA NA Digital Print 2022-04-11 NA
;; 3 4 Now Then now-then 2016-01-23 00:00:00.000 Prints 21.0 30.0 NA NA NA NA Digital Print 2022-04-11 NA
;; 4 5 Here Now There here-now-there 2016-01-23 21:31:24.000 Prints 21.0 30.0 NA NA NA NA Digital Print 2022-04-11 NA
;; 5 6 Everything In-between everything-in-between 2015-07-07 00:00:00.000 Prints 21.0 30.0 NA NA NA NA Digital Print 2022-04-11 NA
NIL
#+end_src
#+begin_src lisp :session :results drawer
(lisp-stat:select *artworks-df* t '(width height))
#+end_src
#+RESULTS:
:results:
#<DATA-FRAME:DATA-FRAME (375 observations of 2 variables)>
:end:
** Plot: Width Vs Height Scatter (Non-Digital)
#+begin_src lisp :session :results file
(vega:defplot width-height
  `(:title "Art: Width vs Height (Non-Digital)"
    :description "Comparison between the physical dimensions of artworks."
    :width 400
    :height 400
    :mark :circle
    :data ,*artworks-df*
    :selection (:grid (:type :interval :bind :scales))
    :encoding (:x (:field :width :title "Width (cm)" :type :quantitative)
               :y (:field :height :title "Height (cm)" :type :quantitative)
               :tooltip (:field :title :type :nominal)
               :color (:field :title :legend :null))))
(vega:write-html width-height "output/art-width-height-2023-03-21.html")
#+end_src
#+RESULTS:
[[file:output/art-width-height-2023-03-21.html]]
*** Note: A Line of Lines has wrong dimensions
They should be =21 x 14.8 cm= and not =210 x 148 cm=. *I have updated the dimensions
on the live site.* I did not notice it until I saw the chart. Basically, the
decimal point was shifted one place to the right.
[[file:output/art-width-height-2023-03-21.png]]
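A minimal sketch, in plain Common Lisp, of the kind of check that could flag a
slip like this before it reaches a chart. The 100 cm threshold and the helpers
~suspect-dimension-p~ and ~shift-decimal-left~ are assumptions made up for
illustration; they are not part of Lisp-Stat or the site's code.
#+begin_src lisp
(defparameter *max-plausible-cm* 100
  "Hypothetical upper bound (in cm) for a physical artwork dimension.")

(defun suspect-dimension-p (value)
  "True when VALUE is a number larger than *MAX-PLAUSIBLE-CM*.
Missing values (e.g. :NA) are never flagged."
  (and (realp value) (> value *max-plausible-cm*)))

(defun shift-decimal-left (value)
  "Undo a decimal point that was shifted one place to the right."
  (/ value 10))

;; The wrong dimensions (210 x 148) are flagged and then corrected.
(list (suspect-dimension-p 210)   ; T
      (suspect-dimension-p 148)   ; T
      (shift-decimal-left 210)    ; 21
      (shift-decimal-left 148))   ; 74/5, i.e. 14.8
#+end_src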
** Plot: Pixel-Width vs Pixel-Height (digital only)
#+begin_src lisp :session :results file
(defparameter *artworks-px-w-h-df*
  (lisp-stat:df-remove-duplicates
   (lisp-stat:drop-missing
    (lisp-stat:select *artworks-df* t '(pixel-width pixel-height))))
  "A data-frame containing all the `PIXEL-WIDTH' and `PIXEL-HEIGHT' values.
All the missing/null values have been removed from the list.")
(vega:defplot px-width-px-height
  `(:title "Art: Pixel-Width vs Pixel-Height (2D Digital)"
    :description
    "Comparison between the pixel width and height dimensions of digital artworks."
    :width 400
    :height 400
    :mark :circle
    :data ,*artworks-px-w-h-df* ; ,*artworks-df*
    :selection (:grid (:type :interval :bind :scales))
    :encoding (:x (:field :pixel-width :title "Pixel-Width (px)" :type :quantitative)
               :y (:field :pixel-height :title "Pixel-Height (px)" :type :quantitative)
               :tooltip (:field :title :type :nominal)
               :color (:field :pixel-width :legend :null))))
;; Write the pixel plot defined above, not the earlier width-height plot.
(vega:write-html px-width-px-height "output/art-px-width-px-height-2023-03-21.html")
#+end_src
#+RESULTS:
[[file:output/art-px-width-px-height-2023-03-21.html]]
#+begin_src lisp :session :results output raw
(lisp-stat:df-print
 (lisp-stat:df-remove-duplicates
  (lisp-stat:drop-missing
   (lisp-stat:select *artworks-df* t '(pixel-width pixel-height))
   (lambda (x) (eql :na x)))))
#+end_src
#+RESULTS:
| PIXEL-WIDTH | PIXEL-HEIGHT |
|-------------+--------------|
| 3142.0d0 | 4722 |
| 2480.0d0 | 3508 |
| 3508.0d0 | 2480 |
| 3456.0d0 | 4608 |
| 3402.0d0 | 4536 |
There are a lot of duplicated sizes in these columns. The chart *had* loads of
dots resting on top of each other, so you only ever saw five. I've removed all
the rows with missing values and the duplicates to help show how thirty-four
(2D) digital images only show up as five dots in the chart.
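The collapse from many rows to five dots can be illustrated in plain Common
Lisp, without Lisp-Stat. This is a sketch only: the five sizes are copied from
the table above, but how often each one repeats in ~*example-sizes*~ is invented
for the example.
#+begin_src lisp
(defparameter *example-sizes*
  '((3142 . 4722) (2480 . 3508) (2480 . 3508) (3508 . 2480)
    (3456 . 4608) (3456 . 4608) (3456 . 4608) (3402 . 4536))
  "Hypothetical (pixel-width . pixel-height) pairs; the repeats are made up.")

;; Conses are only EQL to themselves, so compare the pairs with EQUAL.
(let ((unique (remove-duplicates *example-sizes* :test #'equal)))
  (format nil "~A rows, ~A distinct sizes"
          (length *example-sizes*) (length unique)))
;; => "8 rows, 5 distinct sizes"
#+end_src
Every row that shares a size with another lands on the same point in the
scatter plot, which is why the number of visible dots matches the number of
distinct sizes and not the number of rows.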
** TODO Compare landscape to portrait
** Note: Corrected A Line of Lines in CSV file
I've made a note of the error in [[*Note: A Line of Lines has wrong dimensions][A Line of Lines has wrong dimensions]]. I made
the correction directly in =data/artwork-2023-03-21.csv= because I am lazy. I
didn't want to download the recently updated database from the live site and
run the scripts to clean it again. Editing the CSV file takes a few seconds (on
my machine). Downloading the database from the server, cleaning it, exporting
the data to a CSV file and adding said CSV file does not.
** Plot: Corrected Width Vs Height Scatter (Non-Digital)
#+begin_src lisp :session :results file
(vega:defplot width-height
  `(:title "Art: (Corrected) Width vs Height (Non-Digital)"
    :description "Comparison between the physical dimensions of artworks (corrected)."
    :width 400
    :height 400
    :mark :circle
    :data ,*artworks-df*
    :selection (:grid (:type :interval :bind :scales))
    :encoding (:x (:field :width :title "Width (cm)" :type :quantitative)
               :y (:field :height :title "Height (cm)" :type :quantitative)
               :tooltip (:field :title :type :nominal)
               :color (:field :title :legend :null))))
(vega:write-html width-height "output/art-width-height-2023-03-21-corrected.html")
#+end_src
#+RESULTS:
[[file:output/art-width-height-2023-03-21-corrected.html]]
*** Plot: Side-by-Side of Width Vs Height (Corrected and Original)
Included these images side-by-side just to see how the correction changes the
feel of the graph.
[[file:output/art-width-height-2023-03-21.png]]
[[file:output/art-width-height-2023-03-21-corrected.png]]
** Note: Added missing depth dimension to Touching but Not Connected
Depth is =7 cm=. *I have updated it on the live site* and
=data/artwork-2023-03-21.csv=.
Only one sculpture so no point plotting a graph.
** TODO Plot yearly totals
#+begin_src lisp :session
#+end_src