Code to help with the re-arranging of my life in 2024.

19 KiB

Raw Permalink Blame History

Spare Room Manchester

Setup Common Lisp Environment
Gather Spare Room Data
Clean Up and Parse Data
- Failed Attempt (Parse Whole Page)
- Separate Each Listing into Their Own File
Create CSV of Listings
Explore CSV Data for Spare Room Manchester (2024-02-23)
- Filter Results for Weekly and Monthly Rent (and Bills Inc.)
Summary of Spare Room Data (Not Inc. Income Tax)
Recalulate Figures with Income Tax

Setup Common Lisp Environment

You will not need to execute this code block if you've already set up SLIME in another ORG file. This is just in case this is the only file you're working on today, or it's your first file of the day.

Run m-x slime before running the following code. And, make note of the :session attribute. It allows you to use the code in the code block to be use in other code blocks which also use the :session attribute.

  (ql:quickload :com.inuoe.jzon) ; JSON parser.
  (ql:quickload :dexador)        ; HTTP requests.
  (ql:quickload :plump)          ; HTML/XML parser.
  (ql:quickload :lquery)         ; HTML/DOM manipulation.
  (ql:quickload :lparallel)      ; Parallel programming.
  (ql:quickload :cl-ppcre)       ; RegEx. library.
  (ql:quickload :plot/vega)      ; Vega plotting library.
  (ql:quickload :lisp-stat)      ; Stat's library.
  (ql:quickload :data-frame)     ; Data frame library eqv. to Python's Numpy.
  (ql:quickload :str)            ; String library, expands on 'string' library.

Gather Spare Room Data

Having done a quick manual search on Spare Room, using Manchester as the location/search term, the results page says there are Showing 1-10 of 1000+ results. So the total page count is,

  1000 / 10

100 (obvious, but kept for the sake of completeness)

Because I’m going to grabbing that many pages, it’ll be better if I do this once and save the raw HTML on my computer. That way, I can parse and sift through the data without putting needless stress on the Spare Room servers – by repeatedly scraping and parsing the data as I work out the bugs.

  # If you don't already have this directory. It includes files ignored by Git.
  mkdir "raw-data/external"

  cd raw-data/external
  DIRECTORY="$(date '+%Y-%m-%d')_spare-room-manc"
  mkdir $DIRECTORY
  for OFFSET in {0..990..10}
  do
      curl -o "$DIRECTORY/spare-room-manc-$OFFSET.html" \
           "https://www.spareroom.co.uk/flatshare/index.cgi?offset=$OFFSET&search_id=1281027880&sort_by=price_low_to_high&mode=list"
      sleep 5
  done
  # Change back to the project's root directory, so I don't call code whilst still
  # in this directory.
  cd ../../

Clean Up and Parse Data

Failed Attempt (Parse Whole Page)

This is my initial attempt to parse the HTML files I gathered from Spare Room's website.

  ;; This is here for future reference, I don't recommend you run it.
  (loop for file-path
          in (directory  #P"raw-data/external/2024-02-23_spare-room-manc/spare-room-manc-0.html")
        do
           (with-open-file (stream file-path)
             (let* ((doc (plump:parse stream))
                    (locations (lquery:$ doc ".listingLocation"))
                    (prices (lquery:$ doc ".listingPrice"))
                    (data (loop for loc across locations
                                for price across prices
                                collect (list (plump:text loc) (plump:text price)))))
               (with-open-file (stream
                                #P"working-data/2024-02-23-spare-room-manc.csv"
                                :direction :output
                                :if-exists :supersede)
                 (format stream "Location,Price~%")
                 (dolist (item data)
                   (format stream "~a,~a~%" (first item) (second item)))))))

Unfortunately, this approach produces inconsistent results. There are too many lines of text with excessive spacing. There are newlines with no rhyme or reason keep breaking the formatting, also. Having had a play around with the data, I don't feel confident each entry has the rent price in the same place. So, I think there is a chance the code which processes the files will produce misaligned data. In other words, the rent for one location might be placed next to the location which came before it in the processing pipeline (i.e. an off-by-one bug). Because of this, I think a more secure approach would be to extract each listing into its own file and process one file at a time. This will, hopefully, make it easier to identify if a listing has missing or ill-formed pricing, location etc. data.

Separate Each Listing into Their Own File

  mkdir "raw-data/external/2024-02-23_spare-room-manc-listings/"

  (let ((counter 0))
    (loop for file-path
            in (directory  #P"raw-data/external/2024-02-23_spare-room-manc/*.html")
          do (with-open-file (in-stream file-path)
               (let* ((doc (plump:parse in-stream))
                      (listings (lquery:$ doc ".listing-result" (serialize))))
                 (loop for item across listings
                       do (let ((out-path
                                  (merge-pathnames "raw-data/external/2024-02-23_spare-room-manc-listings/"
                                                   (format nil "listing-~a.html" (write-to-string counter)))))
                            (with-open-file (out-stream
                                             out-path
                                             :direction :output
                                             :if-exists :supersede)
                              (format out-stream "~a" item))
                            (incf counter)))))))

Create CSV of Listings

  (with-open-file (out-stream
                   #P"working-data/2024-02-23-spare-room-manc.csv"
                   :direction :output
                   :if-exists :supersede)
    (format out-stream "ROW-ID, LISTING-ID, LOCATION, POSTCODE, RAW-PRICE, PRICE, FREQUENCY, PRICE-DETAILS, URL~%")
    (let ((row-id 0))
      (loop for file-path
              in (directory #P"raw-data/external/2024-02-23_spare-room-manc-listings/*.html")
            do (with-open-file (in-stream file-path)
                 (let* ((doc (plump:parse in-stream))
                        (listing-id (lquery:$ doc ".listing-result" (attr "data-listing-id")))
                        (price-details (lquery:$ doc ".listingPriceDetails" (text)))
                        (location (lquery:$ doc ".listingLocation" (text)))
                        (url (lquery:$ doc "article" "a" (attr "href")))
                        (postcode (lquery:$ doc ".listing-result" (attr "data-listing-postcode")))
                        (price (lquery:$ doc ".listingPrice" (text)))
                        (cleaned-price (if (str:contains? "," (aref price 0))
                                           (str:replace-all "," "" (aref price 0))
                                           (aref price 0))))
                   (format out-stream "~a,~a,~a,~a,~a,~a,~a,~a,~a"
                           (write-to-string row-id)
                           (aref listing-id 0)
                           (aref location 0)
                           (aref postcode 0)
                           cleaned-price
                           (first (cl-ppcre:all-matches-as-strings "\\d+" cleaned-price))
                           (first (cl-ppcre:all-matches-as-strings "(pw|pcm)" cleaned-price))
                           (aref price-details 0)
                           (format nil "https//www.spareroom.co.uk~a~%" (aref url 0)))
                   (incf row-id))))))

Explore CSV Data for Spare Room Manchester (2024-02-23)

I tried to create a data-frame with the CSV file but lisp-stat couldn't read it. Having had a quick glance over it, there are rows with the following entry in them, this room is no longer available');. The error, for lisp-stat, is this bit of text has been separated from the URL column, leaving an odd number or column entries on some rows.

URL	(Extra Row, Not Needed)
`https//www.spareroom.co.uk/flatshare/flatshare_detail.pl?flatshare_id=17119950`
`https//www.spareroom.co.ukjavascript:alert('Sorry`	`this room is no longer available');`

Run the following script to remove these entries from the CSV file.

# CSV file to process.
csv_file="working-data/2024-02-23-spare-room-manc.csv"
# Temporary file for intermediate processing.
temp_file="working-data/2024-02-23-spare-room-manc-temp.csv"
# String to search for...
search_string="this room is no longer available"
# Use awk to process the file and remove rows containing the search string.
awk -v s="$search_string" -F',' '{
    for(i=1; i<=NF; i++) {
        if($i ~ s) {
            next
        }
    }
    print $0
}' $csv_file > $temp_file
# Move the temporary file to the original file.
mv $temp_file $csv_file

I had to manually fix one row in the file in the PRICE-DETAILS column. I did it manually because to save time.

  (lisp-stat:defdf *sr-manchester*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc.csv"))

#<DATA-FRAME:DATA-FRAME (1096 observations of 9 variables)>

The 1096 entries is a little to big to put into a bar chart. I'm going to skip this and look at the data in just tables and use the data-frame outputs.

Filter Results for Weekly and Monthly Rent (and Bills Inc.)

The data in the CSV file includes both weekly and monthly rent rates. I need to separate them out so I can workout the average rent for the listings on Spare Room.

I'm just going to exclude all listing which don't include bills in the rent rate. There are too many missing variables to work out and the data will vary quite a bit. The most obvious influence on differing rent rates is Council Tax bands. I don't have enough time to work out on each listing and work out an (badly formed) estimate.

  (lisp-stat:write-csv
   (lisp-stat:filter-rows *sr-manchester*
                          '(and (string= "pw" frequency)
                            (< 100 price)
                            (str:contains? "Bills inc." price-details :ignore-case t)))
   #P"working-data/2024-02-23-spare-room-manc-weekly.csv"
   :add-first-row t)

The (< 100 price) filter is to stop the average from getting skewed. I saw some listings at £1 and use the description to explain the (most likely convoluted) offer. This is a bit like the car park listings in the Right Move listings data.

  (lisp-stat:write-csv
   (lisp-stat:filter-rows *sr-manchester*
                          '(and (string= "pcm" frequency)
                            (< 100 price)
                            (str:contains? "Bills inc." price-details :ignore-case t)))
   #P"working-data/2024-02-23-spare-room-manc-monthly.csv"
   :add-first-row t)

If you want to look at the data from within this file, run the code block below. The file can clog up a fair bit of the file, and lead to excessive scrolling. So, I'd prefer to delete the data after looking at it before I commit any changes made in this file.

  csvlook working-data/2024-02-23-spare-room-manc-weekly.csv

  (lisp-stat:defdf *sr-manc-wk*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-weekly.csv"))

#<DATA-FRAME:DATA-FRAME (63 observations of 9 variables)>

  (lisp-stat:defdf *sr-manc-mth*
      (lisp-stat:read-csv #P"working-data/2024-02-23-spare-room-manc-monthly.csv"))

#<DATA-FRAME:DATA-FRAME (584 observations of 9 variables)>

Total number of listings being used,

  (let* ((wk (lisp-stat:mean *sr-manc-wk*:price))
         (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price)))
         (mth (lisp-stat:mean *sr-manc-mth*:price))
         (diff (- mth wk-to-mth)))
    (format t "|Avg. Weekly Listings (£)|Weekly to Monthly (£) |Avg. Monthly Listings (£)|Difference (£)|~%")
    (format t "|-|-|-|-|~%")
    (format t "|~f|~f|~f|~f|~%" wk wk-to-mth mth diff))

Avg. Weekly Listings (£)	Weekly to Monthly (£)	Avg. Monthly Listings (£)	Difference (£)
145.44775	581.791	596.6906	14.899553

I've hard coded the rent prices into the chart below because I was having trouble getting the values from the data-frames into something either the vega:defplot or vega:write-html functions understand. I think it has something to do with the values, usually, taking the form of a ratio (i.e. 234/345) when you call (lisp-stat:mean *sr-manc-wk*:price) (same when using *sr-manc-mth*:price). Getting these types of values into something resembling a normal number value is something I just can't figure out here. So, I'm just going to take advantage of the tiny amount of data needed and input it manually.

  (let* ((values (lisp-stat:plist-df `(:x-axis #("Weekly (Scaled up to Monthly)" "Monthly")
                                       :y-axis #(581.79 596.69)))))
    (vega:defplot weekly-monthly-mean
      `(:title "Difference in Rent Mean Between Monthly & Weekly Listings on Spare Room (23/02/2024)"
        :mark :bar
        :width 600
        :height 600
        :data ,values
        :encoding (:x (:field :x-axis :title "Payment Frequency" :type :nominal :axis ("labelAngle" 0))
                   :y (:field :y-axis :title "Rent Mean (£)" :type :quantitative)
                   :tooltip (:field :y-axis))))
    (vega:write-html weekly-monthly-mean "renders/2024-02-23-spare-room-average-rent-diffs.html"))

/craig.oates/overhaul2024/src/branch/master/renders/2024-02-23-spare-room-average-rent-diffs.html

/craig.oates/overhaul2024/src/branch/master/renders/2024-02-23-spare-room-average-rent-diffs.png

Summary of Spare Room Data (Not Inc. Income Tax)

  (lisp-stat:summarize-column '*sr-manc-mth*:price)

584 reals, min=300, q25=547.6666, q50=599.8028, q75=657.125, max=1260

  (lisp-stat:summarize-column '*sr-manc-wk*:price)

63 reals, min=110, q25=141.7857, q50=153, q75=162.75, max=167

  (format t "- Total Listings: ~a~%" (+ 584 63))
  (format t "- Week Min. Adjusted to Month: ~a~%" (* 110 4))
  (format t "- Week Max. Adjusted to Month: ~a" (* 167 4))

Total Listings: 647
Week Min. Adjusted to Month: 440
Week Max. Adjusted to Month: 668

Total Listing Used	Min. (£)	Max. (£)
647	300	1260

The average rent (including bills) between the weekly and monthly listings (23/02/2024),

    (let* ((wk (lisp-stat:mean *sr-manc-wk*:price))
           (wk-to-mth (* 4 (lisp-stat:mean *sr-manc-wk*:price)))
           (mth (lisp-stat:mean *sr-manc-mth*:price))
           (avg (/ (+ (float wk-to-mth) (float mth)) 2)))
      (format t "Monthly Average: £~a~%" avg)
      (format t "Annual Average: £~a" (* avg 12)))

Monthly Average: £589.24084
Annual Average: £7070.89

So, I need to earn around £7,100/yr to cover my living costs. This does not include things like food, travel, clothing and socialising. Realistically, I'm probably looking at £13,000/yr to keep me going, at a basic level.

  (let* ((ann (float (+ 7100 5000)))
         (mth (float (/ ann 12)))
         ;; Hardcoded rent price because it's quicker and easier to do that here.
         (wk  (float (/ (- mth 589.24) 4)))
         (day (float (/ wk 7))))
    (format t "|Time Span|Value (£)|~%")
    (format t "|-|-|~%")
    (format t "|Annual|£~a|~%" ann)
    (format t "|Monthly|£~a|~%" mth)
    (format t "|Weekly (After Rent)|£~a|~%" wk)
    (format t "|Daily (After Rent)|£~a|~%" day))

Time Span	Value (£)
Annual	£12100.0
Monthly	£1008.3333
Weekly (After Rent)	£104.77333
Daily (After Rent)	£14.967619

At £14.96/day, this is 'bottom of the barrel' type of stuff. Having said that, I should emphasise the rent calculated here is the average. There are rent rates which are lower. Unfortunately, I don't feel comfortable thinking I have a buffer zone here. From here on out, I'm going to treat these values as the lowest I can go.

I could do with going over the minimum wage and tax stuff for the UK, to see how far off these numbers are.

Recalulate Figures with Income Tax

Having gone through the number in Ideal Flatmate (Manc.), I realised I haven't taken into account the Income Tax with the figures above. I realise the difference won't be that big because the annual salary target, used above, barely passes the Personal Allowance band (Income Tax).

  (let* ((earning-target 12100)
         (p-allow 12570)
         (taxable-income (- p-allow earning-target))
         (income-after-tax (* taxable-income 0.2))
         (total (- earning-target income-after-tax)))
    (format t "- Annual Target Salary: £~a~%" earning-target)
    (format t "- Part of Salary which is Taxable: £~a~%" taxable-income)
    (format t "- After Tax Deducted: £~a~%" income-after-tax)
    (format t "- Salary After Tax: £~a~%" total))

Annual Target Salary: £12100
Part of Salary which is Taxable: £470
After Tax Deducted: £94.0
Salary After Tax: £12006.0

Time Span	Value After Tax (£)	Mean Rent (£)
Annually	12006	589.24
Monthly (Before Rent)	1000.5
Monthly (After Rent)	411.26
Weekly (After Rent)	102.815
Daily (After Rent)	14.687857

The recalculation, with Income Tax, changes my spend limit to £14.68/day. It's not a massive drop from £14.96, but it's not helping the situation. I should point out here, these figures use the bare minimum salary target (I've set for myself) of £12,100. Realistically, I'm going have to go with the £13,000 I mentioned in section Summary of Spare Room Data (Not Inc. Income Tax) above.

19 KiB Raw Permalink Blame History