#+options: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline author:t
#+options: broken-links:nil c:nil creator:nil d:(not "LOGBOOK") date:t e:t
#+options: email:nil f:t inline:t num:t p:nil pri:nil prop:nil stat:t tags:t
#+options: tasks:t tex:t timestamp:t title:t toc:t todo:t |:t
#+title: Spare Room Manchester
#+date: today
#+author: Craig Oates
#+email: craig@craigoates.net
#+language: en
#+select_tags: export
#+exclude_tags: noexport
#+creator: Emacs 28.2 (Org mode 9.5.5)
#+cite_export:

* Setup Common Lisp Environment

You will not need to execute this code block if you've already set up SLIME in
another ORG file. This is just in case this is the only file you're working on
today, or it's your first file of the day.

*Run ~m-x slime~ before running the following code.* And, make note of the
~:session~ attribute. It allows you to use the code in the code block to be use
in other code blocks which also use the ~:session~ attribute.

#+begin_src lisp :session :results silent
  (ql:quickload :com.inuoe.jzon) ; JSON parser.
  (ql:quickload :dexador)        ; HTTP requests.
  (ql:quickload :plump)          ; HTML/XML parser.
  (ql:quickload :lquery)         ; HTML/DOM manipulation.
  (ql:quickload :lparallel)      ; Parallel programming.
  (ql:quickload :cl-ppcre)       ; RegEx. library.
  (ql:quickload :plot/vega)      ; Vega plotting library.
  (ql:quickload :lisp-stat)      ; Stat's library.
  (ql:quickload :data-frame)     ; Data frame library eqv. to Python's Numpy.
  (ql:quickload :str)            ; String library, expands on 'string' library.
#+end_src

* Gather Spare Room Data

Having done a quick manual search on [[https://www.spareroom.co.uk][Spare Room]], using =Manchester= as the
location/search term, the results page says there are =Showing 1-10 of 1000+
results=. So the total page count is,

#+begin_src calc :results raw
  1000 / 10
#+end_src

#+RESULTS:
100 (obvious, but kept for the sake of completeness)

Because I’m going to grabbing that many pages, it’ll be better if I do this once
and save the raw HTML on my computer. That way, I can parse and sift through the
data without putting needless stress on the Spare Room servers – by repeatedly
scraping and parsing the data as I work out the bugs.

#+begin_src shell :results silent
  # If you don't already have this directory. It includes files ignored by Git.
  mkdir "raw-data/external"
#+end_src

#+begin_src shell
  cd raw-data/external
  DIRECTORY="$(date '+%Y-%m-%d')_spare-room-manc"
  mkdir $DIRECTORY
  for OFFSET in {0..990..10}
  do
      curl -o "$DIRECTORY/spare-room-manc-$OFFSET.html" \
           "https://www.spareroom.co.uk/flatshare/index.cgi?offset=$OFFSET&search_id=1281027880&sort_by=price_low_to_high&mode=list"
      sleep 5
  done
  # Change back to the project's root directory, so I don't call code whilst still
  # in this directory.
  cd ../../
#+end_src

* Clean Up and Parse Data

** Failed Attempt (Parse Whole Page)

This is my initial attempt to parse the HTML files I gathered from Spare
Room's website.

#+begin_src lisp :results silent
  ;; This is here for future reference, I don't recommend you run it.
  (loop for file-path
          in (directory  #P"raw-data/external/2024-02-23_spare-room-manc/spare-room-manc-0.html")
        do
           (with-open-file (stream file-path)
             (let* ((doc (plump:parse stream))
                    (locations (lquery:$ doc ".listingLocation"))
                    (prices (lquery:$ doc ".listingPrice"))
                    (data (loop for loc across locations
                                for price across prices
                                collect (list (plump:text loc) (plump:text price)))))
               (with-open-file (stream
                                #P"working-data/2024-02-23-spare-room-manc.csv"
                                :direction :output
                                :if-exists :supersede)
                 (format stream "Location,Price~%")
                 (dolist (item data)
                   (format stream "~a,~a~%" (first item) (second item)))))))
#+end_src

Unfortunately, this approach produces inconsistent results. There are too many
lines of text with excessive spacing. There are newlines with no rhyme or reason
keep breaking the formatting, also. Having had a play around with the data, I
don't feel confident each entry has the rent price in the same place. So, I
think there is a chance the code which processes the files will produce
misaligned data. In other words, the rent for one location might be placed next
to the location which came before it in the processing pipeline (i.e. an
/off-by-one/ bug). Because of this, I think a more secure approach would be to
extract each listing into its own file and process one file at a time. This will,
hopefully, make it easier to identify if a listing has missing or ill-formed
pricing, location etc. data.

** Separate Each Listing into Their Own File

#+begin_src shell
  mkdir "raw-data/external/2024-02-23_spare-room-manc-listings/"
#+end_src

#+begin_src lisp :results silent
  (let ((counter 0))
    (loop for file-path
            in (directory  #P"raw-data/external/2024-02-23_spare-room-manc/*.html")
          do (with-open-file (in-stream file-path)
               (let* ((doc (plump:parse in-stream))
                      (listings (lquery:$ doc ".listing-result" (serialize))))
                 (loop for item across listings
                       do (let ((out-path
                                  (merge-pathnames "raw-data/external/2024-02-23_spare-room-manc-listings/"
                                                   (format nil "listing-~a.html" (write-to-string counter)))))
                            (with-open-file (out-stream
                                             out-path
                                             :direction :output
                                             :if-exists :supersede)
                              (format out-stream "~a" item))
                            (incf counter)))))))
#+end_src

* Create CSV of Listings

#+begin_src lisp :results silent
  (with-open-file (out-stream
                   #P"working-data/2024-02-23-spare-room-manc.csv"
                   :direction :output
                   :if-exists :supersede)
    (format out-stream "ROW-ID, LISTING-ID, LOCATION, POSTCODE, RAW-PRICE, PRICE, FREQUENCY, PRICE-DETAILS, URL~%")
    (let ((row-id 0))
      (loop for file-path
              in (directory #P"raw-data/external/2024-02-23_spare-room-manc-listings/*.html")
            do (with-open-file (in-stream file-path)
                 (let* ((doc (plump:parse in-stream))
                        (listing-id (lquery:$ doc ".listing-result" (attr "data-listing-id")))
                        (price-details (lquery:$ doc ".listingPriceDetails" (text)))
                        (location (lquery:$ doc ".listingLocation" (text)))
                        (url (lquery:$ doc "article" "a" (attr "href")))
                        (postcode (lquery:$ doc ".listing-result" (attr "data-listing-postcode")))
                        (price (lquery:$ doc ".listingPrice" (text)))
                        (cleaned-price (if (str:contains? "," (aref price 0))
                                           (str:replace-all "," "" (aref price 0))
                                           (aref price 0))))
                   (format out-stream "~a,~a,~a,~a,~a,~a,~a,~a,~a"
                           (write-to-string row-id)
                           (aref listing-id 0)
                           (aref location 0)
                           (aref postcode 0)
                           cleaned-price
                           (first (cl-ppcre:all-matches-as-strings "\\d+" cleaned-price))
                           (first (cl-ppcre:all-matches-as-strings "(pw|pcm)" cleaned-price))
                           (aref price-details 0)
                           (format nil "https//www.spareroom.co.uk~a~%" (aref url 0)))
                   (incf row-id))))))
#+end_src

* Explore CSV Data for Spare Room Manchester (2024-02-23)

I tried to create a data-frame with the CSV file but lisp-stat couldn't read
it. Having had a quick glance over it, there are rows with the following entry
in them, =this room is no longer available');=. The error, for lisp-stat, is
this bit of text has been separated from the =URL= column, leaving an odd number
or column entries on some rows.

 | URL                                                                              | (Extra Row, Not Needed)               |
 |----------------------------------------------------------------------------------+---------------------------------------|
 | =https//www.spareroom.co.uk/flatshare/flatshare_detail.pl?flatshare_id=17119950= |                                       |
 | =https//www.spareroom.co.ukjavascript:alert('Sorry=                              | =this room is no longer available');= |

Run the following script to remove these entries from the CSV file.

#+begin_src shell :results silent
# CSV file to process.
csv_file="working-data/2024-02-23-spare-room-manc.csv"
# Temporary file for intermediate processing.
temp_file="working-data/2024-02-23-spare-room-manc-temp.csv"
# String to search for...
search_string="this room is no longer available"
# Use awk to process the file and remove rows containing the search string.
awk -v s="$search_string" -F',' '{
    for(i=1; i<=NF; i++) {
        if($i ~ s) {
            next
        }
    }
    print $0
}' $csv_file > $temp_file
# Move the temporary file to the original file.
mv $temp_file $csv_file
#+end_src

I had to manually fix one row in the file in the =PRICE-DETAILS= column. I did
it manually because to save time.

#+begin_src lisp :session
  (lisp-stat:defdf *sr-manchester*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc.csv"))
#+end_src

#+RESULTS:
: #<DATA-FRAME:DATA-FRAME (1096 observations of 9 variables)>

The 1096 entries is a little to big to put into a bar chart. I'm going to skip
this and look at the data in just tables and use the data-frame outputs.

** Filter Results for Weekly and Monthly Rent (and Bills Inc.)

The data in the CSV file includes both weekly and monthly rent rates. I need to
separate them out so I can workout the average rent for the listings on Spare
Room.

*I'm just going to exclude all listing which don't include bills in the rent
rate.* There are too many missing variables to work out and the data will vary
quite a bit. The most obvious influence on differing rent rates is Council Tax
bands. I don't have enough time to work out on each listing and work out an
(badly formed) estimate.

#+begin_src lisp :session :results silent
  (lisp-stat:write-csv
   (lisp-stat:filter-rows *sr-manchester*
                          '(and (string= "pw" frequency)
                            (< 100 price)
                            (str:contains? "Bills inc." price-details :ignore-case t)))
   #P"working-data/2024-02-23-spare-room-manc-weekly.csv"
   :add-first-row t)
#+end_src

The ~(< 100 price)~ filter is to stop the average from getting skewed. I saw
some listings at £1 and use the description to explain the (most likely
convoluted) offer. This is a bit like the car park listings in the Right Move
listings data.

#+begin_src lisp :session :results silent
  (lisp-stat:write-csv
   (lisp-stat:filter-rows *sr-manchester*
                          '(and (string= "pcm" frequency)
                            (< 100 price)
                            (str:contains? "Bills inc." price-details :ignore-case t)))
   #P"working-data/2024-02-23-spare-room-manc-monthly.csv"
   :add-first-row t)
#+end_src

If you want to look at the data from within this file, run the code block
below. The file can clog up a fair bit of the file, and lead to excessive
scrolling. So, I'd prefer to delete the data after looking at it before I commit
any changes made in this file.

#+begin_src shell :results output drawer
  csvlook working-data/2024-02-23-spare-room-manc-weekly.csv
#+end_src

#+begin_src lisp :session
  (lisp-stat:defdf *sr-manc-wk*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-weekly.csv"))
#+end_src

#+RESULTS:
: #<DATA-FRAME:DATA-FRAME (63 observations of 9 variables)>

#+begin_src lisp :session
  (lisp-stat:defdf *sr-manc-mth*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-monthly.csv"))
#+end_src

#+RESULTS:
: #<DATA-FRAME:DATA-FRAME (584 observations of 9 variables)>

Total number of listings being used,

#+begin_src lisp :results output raw
  (let* ((wk (lisp-stat:mean *sr-manc-wk*:price))
         (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price)))
         (mth (lisp-stat:mean *sr-manc-mth*:price))
         (diff (- mth wk-to-mth)))
    (format t "|Avg. Weekly Listings (£)|Weekly to Monthly (£) |Avg. Monthly Listings (£)|Difference (£)|~%")
    (format t "|-|-|-|-|~%")
    (format t "|~f|~f|~f|~f|~%" wk wk-to-mth mth diff))
#+end_src

#+RESULTS:
| Avg. Weekly Listings (£) | Weekly to Monthly (£) | Avg. Monthly Listings (£) | Difference (£) |
|--------------------------+-----------------------+---------------------------+----------------|
|                145.44775 |               581.791 |                  596.6906 |      14.899553 |

I've hard coded the rent prices into the chart below because I was having
trouble getting the values from the data-frames into something either the
~vega:defplot~ or ~vega:write-html~ functions understand. I think it has something
to do with the values, usually, taking the form of a ~ratio~ (i.e. 234/345) when
you call ~(lisp-stat:mean *sr-manc-wk*:price)~ (same when using
~*sr-manc-mth*:price~). Getting these types of values into something resembling a
/normal/ number value is something I just can't figure out here. So, I'm just
going to take advantage of the tiny amount of data needed and input it manually.

#+begin_src lisp :results file
  (let* ((values (lisp-stat:plist-df `(:x-axis #("Weekly (Scaled up to Monthly)" "Monthly")
                                       :y-axis #(581.79 596.69)))))
    (vega:defplot weekly-monthly-mean
      `(:title "Difference in Rent Mean Between Monthly & Weekly Listings on Spare Room (23/02/2024)"
        :mark :bar
        :width 600
        :height 600
        :data ,values
        :encoding (:x (:field :x-axis :title "Payment Frequency" :type :nominal :axis ("labelAngle" 0))
                   :y (:field :y-axis :title "Rent Mean (£)" :type :quantitative)
                   :tooltip (:field :y-axis))))
    (vega:write-html weekly-monthly-mean "renders/2024-02-23-spare-room-average-rent-diffs.html"))
#+end_src

#+RESULTS:
[[file:renders/2024-02-23-spare-room-average-rent-diffs.html]]

[[file:renders/2024-02-23-spare-room-average-rent-diffs.png]]

* Summary of Spare Room Data (Not Inc. Income Tax)

#+begin_src lisp :results raw
  (lisp-stat:summarize-column '*sr-manc-mth*:price)
#+end_src

#+RESULTS:
584 reals, min=300, q25=547.6666, q50=599.8028, q75=657.125, max=1260

#+begin_src lisp :results raw
  (lisp-stat:summarize-column '*sr-manc-wk*:price)
#+end_src

#+RESULTS:
63 reals, min=110, q25=141.7857, q50=153, q75=162.75, max=167

#+begin_src lisp :results output raw
  (format t "- Total Listings: ~a~%" (+ 584 63))
  (format t "- Week Min. Adjusted to Month: ~a~%" (* 110 4))
  (format t "- Week Max. Adjusted to Month: ~a" (* 167 4))
#+end_src

#+RESULTS:
- Total Listings: 647
- Week Min. Adjusted to Month: 440
- Week Max. Adjusted to Month: 668

| Total Listing Used | Min. (£) | Max. (£) |
|--------------------+----------+----------|
|                647 |      300 | 1260     |

The average rent (including bills) between the weekly and monthly listings
(23/02/2024),

#+begin_src lisp :session :results output
    (let* ((wk (lisp-stat:mean *sr-manc-wk*:price))
           (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price)))
           (mth (lisp-stat:mean *sr-manc-mth*:price))
           (avg (/ (+ (float wk-to-mth) (float mth)) 2)))
      (format t "Monthly Average: £~a~%" avg)
      (format t "Annual Average: £~a" (* avg 12)))
#+end_src

#+RESULTS:
: Monthly Average: £589.24084
: Annual Average: £7070.89

So, I need to earn around £7,100/yr to cover my living costs. This does not
include things like food, travel, clothing and socialising. Realistically, I'm
probably looking at £13,000/yr to keep me going, at a basic level.

#+begin_src lisp :results output raw
  (let* ((ann (float (+ 7100 5000)))
         (mth (float (/ ann 12)))
         ;; Hardcoded rent price because it's quicker and easier to do that here.
         (wk  (float (/ (- mth 589.24) 4)))
         (day (float (/ wk 7))))
    (format t "|Time Span|Value (£)|~%")
    (format t "|-|-|~%")
    (format t "|Annual|£~a|~%" ann)
    (format t "|Monthly|£~a|~%" mth)
    (format t "|Weekly (After Rent)|£~a|~%" wk)
    (format t "|Daily (After Rent)|£~a|~%" day))
#+end_src

#+RESULTS:
| Time Span           | Value (£)  |
|---------------------+------------|
| Annual              | £12100.0   |
| Monthly             | £1008.3333 |
| Weekly (After Rent) | £104.77333 |
| Daily (After Rent)  | £14.967619 |

At £14.96/day, this is 'bottom of the barrel' type of stuff. Having said that, I
should emphasise the rent calculated here is *the average*. There are rent rates
which are lower. Unfortunately, I don't feel comfortable thinking I have a
buffer zone here. From here on out, I'm going to treat these values as the
lowest I can go.

I could do with going over the [[file:./uk-wage-tax.org][minimum wage and tax stuff]] for the UK, to see how
far off these numbers are.

* Recalulate Figures with Income Tax

Having gone through the number in [[file:./ideal-flatmate-manchester.org][Ideal Flatmate (Manc.)]], I realised I haven't
taken into account the Income Tax with the figures above. I realise the
difference won't be that big because the annual salary target, used above,
barely passes the Personal Allowance band (Income Tax).

#+begin_src lisp :results output raw
  (let* ((earning-target 12100)
         (p-allow 12570)
         (taxable-income (- p-allow earning-target))
         (income-after-tax (* taxable-income 0.2))
         (total (- earning-target income-after-tax)))
    (format t "- Annual Target Salary: £~a~%" earning-target)
    (format t "- Part of Salary which is Taxable: £~a~%" taxable-income)
    (format t "- After Tax Deducted: £~a~%" income-after-tax)
    (format t "- Salary After Tax: £~a~%" total))
#+end_src

#+RESULTS:
- Annual Target Salary: £12100
- Part of Salary which is Taxable: £470
- After Tax Deducted: £94.0
- Salary After Tax: £12006.0

| Time Span             | Value After Tax (£) | Mean Rent (£) |
|-----------------------+---------------------+---------------|
| Annually              |               12006 |        589.24 |
| Monthly (Before Rent) |              1000.5 |               |
| Monthly (After Rent)  |              411.26 |               |
| Weekly (After Rent)   |             102.815 |               |
| Daily (After Rent)    |           14.687857 |               |
#+TBLFM: @3$2=@-1/12::@4$2=@-1-@-2$+1::@5$2=@-1/4::@6$2=@-1/7

The recalculation, with Income Tax, changes my spend limit to £14.68/day. It's
not a massive drop from £14.96, but it's not helping the situation. I should
point out here, these figures use the bare minimum salary target (I've set for
myself) of £12,100. Realistically, I'm going have to go with the £13,000 I
mentioned in section [[Summary of Spare Room Data (Not Inc. Income Tax)][Summary of Spare Room Data (Not Inc. Income Tax)]] above.