<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://oneturkmen.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://oneturkmen.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2025-04-12T15:56:06+00:00</updated><id>https://oneturkmen.github.io/feed.xml</id><title type="html">blank</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design.
</subtitle><entry><title type="html">A Simple Method for Smarter Decision Making</title><link href="https://oneturkmen.github.io/decision-theory/2025/02/05/multicriteria-decision-matrix.html" rel="alternate" type="text/html" title="A Simple Method for Smarter Decision Making" /><published>2025-02-05T16:31:12+00:00</published><updated>2025-02-05T16:31:12+00:00</updated><id>https://oneturkmen.github.io/decision-theory/2025/02/05/multicriteria-decision-matrix</id><content type="html" xml:base="https://oneturkmen.github.io/decision-theory/2025/02/05/multicriteria-decision-matrix.html"><![CDATA[<p>Humans, like you and me, make decisions every day, however small or large. In
today’s world with numerous options and factors, it becomes more important to
make decisions in a methodical way rather than relying on our “gut feeling”. In
fact, some decisions we make require deep thinking due to potential
consequences and irreversibility (“you can’t unring a bell” once you make
certain decisions). Beyond these two factors (consequences and irreversibility), we may also have limited
and dynamic contextual information that can easily overwhelm us.</p>

<p>To make structured, analytical decisions, we can leverage a very simple yet
powerful tool known as Multi-Criteria Decision Matrix, as described in the
book, <a href="https://www.cia.gov/resources/csi/books-monographs/psychology-of-intelligence-analysis-2/">“Psychology of Intelligence
Analysis”</a>,
by a former CIA veteran <a href="https://en.wikipedia.org/wiki/Richards_Heuer">Richards J. Heuer,
Jr.</a>. Before I introduce it,
let’s understand the problem context to which it is most relevant.</p>

<h2 id="problem-buying-a-house-is-complicated">Problem: Buying a house is complicated</h2>

<p>Let us imagine the following scenario: we would like to buy a house.
There are many factors involved in buying the house: obviously its price,
square footage, where it is located (e.g., close to or far from work?), the
crime levels in the surrounding neighborhood, quality of the building, future maintenance
fees, and so on.</p>

<p>We often buy houses with our partners, and that can lead to conflicting
preferences. For example, what is more important: having a backyard but paying
more, or perhaps vice versa? There is no “right” or “wrong” decision, just
different sets of consequences (higher vs lower price; there is a backyard vs
there is none).</p>

<p>Buying a house, in general, is a high-stakes situation due to the amount of money
that we need to spend. We also often consider how long we are going to live at
that place, along with “investment” opportunities, as in how much it is going
to appreciate or depreciate over the next decade.</p>

<p>In short, it is a complicated and stressful process that takes weeks, if not
months. However, there is a way to simplify it by approaching the process with
an analytical mindset.</p>

<h2 id="solution-multi-criteria-decision-matrix">Solution: Multi-criteria Decision Matrix</h2>

<p>The essence of making a good decision is to gather and structure available
information at hand (the known knowns). Humans are not good at holding a lot of
information in their heads at the same time. Thus, we need to <em>externalize</em> the
information in a visible form, such as a sheet of A4 paper. In addition to the
externalization, we <em>break down</em> complex information into its simpler
constituents.</p>

<p>One way to externalize and break down the information in a structured way is
known as Multi-Criteria Decision Matrix (MCDM). It is a hands-on, paper-based 
method for making a decision that considers multiple constraints (criteria).
Let’s see how it applies to the housing purchase situation above.</p>

<p>First, we come up with a list of attributes, such as:</p>

<ul>
  <li>Location</li>
  <li>Price</li>
  <li>Square footage</li>
  <li>Estimated maintenance costs</li>
  <li>Whether there is a garden or not</li>
  <li>etc.</li>
</ul>

<p>Next, we both assign relative importance to each of the attributes:</p>

<table>
  <thead>
    <tr>
      <th><strong>Attribute</strong></th>
      <th><strong>% important</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Location</td>
      <td>30%</td>
    </tr>
    <tr>
      <td>Price</td>
      <td>25%</td>
    </tr>
    <tr>
      <td>Sq Ft</td>
      <td>20%</td>
    </tr>
    <tr>
      <td>Maintenance</td>
      <td>15%</td>
    </tr>
    <tr>
      <td>Garden</td>
      <td>10%</td>
    </tr>
  </tbody>
</table>

<p>The ranked list of attributes helps us focus on what matters most and leave out
things that do not matter. For example, is a garden really important to us, or can
we live without it? Are we okay with paying more to get a better location, or,
vice versa, getting a house farther away for a cheaper price? Should we include
<em>Maintenance</em> in the <em>Price</em> attribute rather than keeping it separate? With these
questions answered, we can make a more structured decision on what matters to us before
searching for a house. On top of that, we will immediately notice the
difference between our preferences and those of our partner, be able to quantify that
difference, and discuss it in detail.</p>

<p>Now that we have visited a few houses, we are ready to apply the structured
analytical method to make a decision. We now need to quantify each house’s
“score”: distribute (imaginary) 100 points among the options we have. For
example, in the table below, House A gets 50, House B gets 10, and House C gets
40 points for the same <em>Location</em> attribute. Simply put, we liked House A the
most, and House C is close enough to it, but we did not like House B as much;
all in terms of the “goodness” of its location.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th> </th>
      <th> </th>
      <th><strong>Houses</strong></th>
      <th> </th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Attributes</strong></td>
      <td><strong>% importance</strong></td>
      <td> </td>
      <td>House A</td>
      <td>House B</td>
      <td>House C</td>
    </tr>
    <tr>
      <td>Location</td>
      <td>30%</td>
      <td> </td>
      <td>50</td>
      <td>10</td>
      <td>40</td>
    </tr>
    <tr>
      <td>Price</td>
      <td>25%</td>
      <td> </td>
      <td>20</td>
      <td>50</td>
      <td>30</td>
    </tr>
    <tr>
      <td>Sq Ft</td>
      <td>20%</td>
      <td> </td>
      <td>30</td>
      <td>60</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Maintenance</td>
      <td>15%</td>
      <td> </td>
      <td>20</td>
      <td>30</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Garden</td>
      <td>10%</td>
      <td> </td>
      <td>70</td>
      <td>0</td>
      <td>30</td>
    </tr>
  </tbody>
</table>

<p>Having split 100 points across our options for each attribute, we can now
calculate each house’s total score by multiplying the percentage importance of each attribute by the house’s
score for that attribute and summing up the products:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th> </th>
      <th> </th>
      <th><strong>Houses</strong></th>
      <th> </th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Attributes</strong></td>
      <td><strong>% importance</strong></td>
      <td> </td>
      <td>House A</td>
      <td>House B</td>
      <td>House C</td>
    </tr>
    <tr>
      <td>Location</td>
      <td>30%</td>
      <td> </td>
      <td>50</td>
      <td>10</td>
      <td>40</td>
    </tr>
    <tr>
      <td>Price</td>
      <td>25%</td>
      <td> </td>
      <td>20</td>
      <td>50</td>
      <td>30</td>
    </tr>
    <tr>
      <td>Sq Ft</td>
      <td>20%</td>
      <td> </td>
      <td>30</td>
      <td>60</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Maintenance</td>
      <td>15%</td>
      <td> </td>
      <td>20</td>
      <td>30</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Garden</td>
      <td>10%</td>
      <td> </td>
      <td>70</td>
      <td>0</td>
      <td>30</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td><strong>Total Score</strong></td>
      <td><strong>100%</strong></td>
      <td> </td>
      <td><strong>36</strong></td>
      <td><strong>32</strong></td>
      <td><strong>32</strong></td>
    </tr>
  </tbody>
</table>


<p>Voila! In the table above, <strong>House A</strong> is the winner with a total of <strong>36</strong>
points. Houses B and C have the same total score despite scoring wildly
differently on some attributes, such as <em>Location</em> and <em>Price</em>. Note that the
points themselves only matter in <em>relative</em> terms, i.e., which house is better
than another and by how much.</p>
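
<p>To double-check the arithmetic, the weighted totals can also be computed in a few lines of Python. Below is a minimal sketch that simply reuses the weights and scores from the tables above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Weights and per-attribute scores copied from the tables above.
weights = {"Location": 0.30, "Price": 0.25, "Sq Ft": 0.20, "Maintenance": 0.15, "Garden": 0.10}

scores = {
    "House A": {"Location": 50, "Price": 20, "Sq Ft": 30, "Maintenance": 20, "Garden": 70},
    "House B": {"Location": 10, "Price": 50, "Sq Ft": 60, "Maintenance": 30, "Garden": 0},
    "House C": {"Location": 40, "Price": 30, "Sq Ft": 10, "Maintenance": 50, "Garden": 30},
}

# Weighted sum per house: multiply each attribute score by its weight and add up.
totals = {
    house: sum(weights[attr] * score for attr, score in attrs.items())
    for house, attrs in scores.items()
}

for house, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{house}: {total:.0f}")
# House A: 36, House B: 32, House C: 32
</code></pre></div></div>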

<p>We can also compare our results with our partner’s (if the exercise is done
separately). The separate results will show the difference in our respective
preferences and help us re-evaluate houses in case there is a vast difference
between our results.</p>

<p>We can also extend the matrix and do <em>sensitivity</em> analysis to determine how much of a
change in one attribute’s score can swing our decision from one house to
another. For example, we can calculate how much House B’s price has
to go down for us to make it our primary choice rather than House A (see the
sketch below). This kind
of add-on can help us understand how sensitive our choices are (whether we are firm on one
option or open to considering alternatives).</p>
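
<p>As a rough illustration of such a sensitivity check, the sketch below computes how many extra <em>Price</em> points House B would need just to catch up with House A. It holds every other score fixed, which is a simplification, since in practice the 100 points per attribute would be redistributed among the houses:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>weights = {"Location": 0.30, "Price": 0.25, "Sq Ft": 0.20, "Maintenance": 0.15, "Garden": 0.10}
house_a = {"Location": 50, "Price": 20, "Sq Ft": 30, "Maintenance": 20, "Garden": 70}
house_b = {"Location": 10, "Price": 50, "Sq Ft": 60, "Maintenance": 30, "Garden": 0}

def total(house):
    return sum(weights[attr] * score for attr, score in house.items())

gap = total(house_a) - total(house_b)        # 36 - 32 = 4 points overall
extra_price_points = gap / weights["Price"]  # 4 / 0.25 = 16

print(f"House B needs {extra_price_points:.0f} more Price points just to tie House A.")
</code></pre></div></div>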

<h2 id="conclusion">Conclusion</h2>

<p>In summary, we broke down a problem (“buying a house”) into smaller
sub-problems (attributes &amp; their relative importance for each option) and
externalized it onto paper. These steps helped us evaluate and rank the options,
and ultimately make a decision.</p>

<p>Multi-Criteria Decision Matrix is one among many approaches to structured,
analytical decision making. It can be applied anywhere the stakes are
high and a good enough decision has to be made, whether at home, in our
career, in relationships, or elsewhere. I highly recommend reading Heuer’s
book, as it is packed with valuable and insightful information on how to
effectively make sense of a vast amount of information in today’s world.</p>

<h2 id="references">References</h2>

<ul>
  <li>Heuer, Richards J. Psychology of intelligence analysis. Center for the Study
of Intelligence, 1999.</li>
</ul>]]></content><author><name></name></author><category term="decision-theory" /><category term="decision-theory" /><summary type="html"><![CDATA[Simple framework for making complex decisions.]]></summary></entry><entry><title type="html">(In)efficient Insertions in Postgres</title><link href="https://oneturkmen.github.io/tech/2024/09/20/slow-fast-faster-postgres.html" rel="alternate" type="text/html" title="(In)efficient Insertions in Postgres" /><published>2024-09-20T22:12:54+00:00</published><updated>2024-09-20T22:12:54+00:00</updated><id>https://oneturkmen.github.io/tech/2024/09/20/slow-fast-faster-postgres</id><content type="html" xml:base="https://oneturkmen.github.io/tech/2024/09/20/slow-fast-faster-postgres.html"><![CDATA[<p>PostgreSQL is a popular relational database that is used for variety of applications.
With the recent surge in popularity of Generative AI (Large Language Models in particular), PostgreSQL is often used for storing document “embeddings” (numerical vector representations of documents) to enhance search functionality.
In addition to the ability to “search” for similar documents in a query, it is used primarily due to its reliability, speed, and community support (I know, some folks might say relational databases are slow, but there are plenty of ways to tune the database to all kinds of use cases and access patterns).</p>

<h2 id="writing-to-postgresql">Writing to PostgreSQL</h2>

<p>The SQL in PostgreSQL is already a spoiler for what language is used to retrieve or store the data.
One of the common ways, which is also part of the SQL standard, is to use the <code class="language-plaintext highlighter-rouge">INSERT</code> command:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">customer</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">)</span>
<span class="k">VALUES</span> <span class="p">(</span><span class="mi">123</span><span class="p">,</span> <span class="nv">"john doe"</span><span class="p">,</span> <span class="mi">29</span><span class="p">);</span>
</code></pre></div></div>

<p>In the example above, we insert (store) a single row containing 3 columns.
This classic usage of <code class="language-plaintext highlighter-rouge">INSERT</code> is quite common when you need to store a few
rows (e.g. when you need to insert new information in real time).</p>

<p>But does <code class="language-plaintext highlighter-rouge">INSERT</code> still work with larger amounts of data? What if we need to write hundreds or thousands of rows?
One way is to loop over your data records and insert them one by one:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">psycopg</span>

<span class="n">customers</span> <span class="o">=</span> <span class="p">[(</span> <span class="mi">123</span><span class="p">,</span> <span class="s">"john doe"</span><span class="p">,</span> <span class="mi">29</span> <span class="p">),</span> <span class="p">(</span> <span class="mi">125</span><span class="p">,</span> <span class="s">"jane doe"</span><span class="p">,</span> <span class="mi">31</span> <span class="p">)]</span>

<span class="c1"># Connect to an existing database
</span><span class="k">with</span> <span class="n">psycopg</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">"dbname=marketplace user=mytestusername"</span><span class="p">)</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>

    <span class="c1"># Open a cursor to perform database operations
</span>    <span class="k">with</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">cur</span><span class="p">:</span>

        <span class="c1"># Pass data to fill a query placeholders and let Psycopg perform
</span>        <span class="c1"># the correct conversion (no SQL injections!)
</span>	<span class="k">for</span> <span class="nb">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span> <span class="ow">in</span> <span class="n">customers</span><span class="p">:</span>
		<span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span>
		    <span class="s">"INSERT INTO customer (id, name, age) VALUES (%s, %s, %s)"</span><span class="p">,</span> <span class="p">(</span><span class="nb">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">)</span>
		<span class="p">)</span>

        <span class="c1"># Make the changes to the database persistent
</span>        <span class="n">conn</span><span class="p">.</span><span class="n">commit</span><span class="p">()</span>
</code></pre></div></div>

<p>This works fine for a handful of queries. Once you get into thousands or more records,
the code above will get progressively slower. It is not scalable.</p>

<p>Another option is to use <code class="language-plaintext highlighter-rouge">cur.executemany()</code>, where the same query is executed for a list of tuples
instead of a single tuple. It is better than serially executing <code class="language-plaintext highlighter-rouge">INSERT</code>s because the statements are sent to the server
in batches rather than one round trip at a time. However, <code class="language-plaintext highlighter-rouge">executemany()</code> will still be very slow because it is not
optimized for very large amounts of data (&gt;=1 million rows).</p>
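
<p>As a minimal sketch (reusing the <code class="language-plaintext highlighter-rouge">customers</code> list and placeholder connection string from the previous snippet), the <code class="language-plaintext highlighter-rouge">executemany()</code> variant looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import psycopg

customers = [(123, "john doe", 29), (125, "jane doe", 31)]

with psycopg.connect("dbname=marketplace user=mytestusername") as conn:
    with conn.cursor() as cur:
        # One call, many parameter tuples.
        cur.executemany(
            "INSERT INTO customer (id, name, age) VALUES (%s, %s, %s)",
            customers,
        )

    # Make the changes to the database persistent
    conn.commit()
</code></pre></div></div>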

<p>The best option for “batch” loading (when we load a large amount of data at once) is to use <code class="language-plaintext highlighter-rouge">COPY</code>.
<code class="language-plaintext highlighter-rouge">COPY</code> is done only within a single transaction and treats the entire file as an input stream.
It is <a href="https://github.com/postgres/postgres/blob/c4d5cb71d229095a39fda1121a75ee40e6069a2a/src/backend/commands/copyfrom.c#L640">specifically optimized</a> for batch loading.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">COPY</span> <span class="n">customer</span> <span class="k">FROM</span> <span class="s1">'customer_data.csv'</span> <span class="k">DELIMITER</span> <span class="s1">','</span><span class="p">;</span>
</code></pre></div></div>
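
<p>The same batch load can also be driven from Python. Below is a minimal sketch using psycopg’s <code class="language-plaintext highlighter-rouge">COPY ... FROM STDIN</code> support, again reusing the placeholder connection string and <code class="language-plaintext highlighter-rouge">customers</code> list from the earlier snippets (the benchmark repository may do this differently):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import psycopg

customers = [(123, "john doe", 29), (125, "jane doe", 31)]

with psycopg.connect("dbname=marketplace user=mytestusername") as conn:
    with conn.cursor() as cur:
        # Stream all rows through a single COPY operation.
        with cur.copy("COPY customer (id, name, age) FROM STDIN") as copy:
            for row in customers:
                copy.write_row(row)

    conn.commit()
</code></pre></div></div>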

<h2 id="benchmarking">Benchmarking</h2>

<p>I created a <a href="https://github.com/oneturkmen/slow-fast-postgres-insertions">repository with benchmarking code</a> that shows the performance
difference between the three methods (i.e. <code class="language-plaintext highlighter-rouge">execute</code>, <code class="language-plaintext highlighter-rouge">executemany</code>, and <code class="language-plaintext highlighter-rouge">copy</code>).
Feel free to fork it and reproduce the results locally. Your numbers might differ, but only slightly; the speed-up
ratios between the approaches should stay about the same.</p>

<p>Here are the results from a single run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function 'very_slow_insert' [execute one by one] executed in 4309.07 secs / 71.82 mins
Function 'slow_insert' [executemany] executed in 168.56 secs / 2.81 mins
Function 'fast_copy' [copy] executed in 61.26 secs / 1.02 mins
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">COPY</code> command was about <strong>~71x</strong> faster than running <code class="language-plaintext highlighter-rouge">INSERT</code> in a loop and <strong>~2.8x</strong> faster than
running a single-batch <code class="language-plaintext highlighter-rouge">INSERT</code> command.</p>

<h2 id="conclusion">Conclusion</h2>

<p>It is clear that <code class="language-plaintext highlighter-rouge">COPY</code> is the best option for loading large amounts of data in Postgres.
The single-batch <code class="language-plaintext highlighter-rouge">INSERT</code> (with many values) works too, but still lags behind <code class="language-plaintext highlighter-rouge">COPY</code>.
The individual <code class="language-plaintext highlighter-rouge">INSERT</code> commands are the slowest, and I do not recommend running them
unless you are only inserting a handful of rows and do not generally have a large load.</p>]]></content><author><name></name></author><category term="tech" /><category term="databases" /><category term="performance" /><category term="storage" /><summary type="html"><![CDATA[Multiple ways to insert large amount of data in the database.]]></summary></entry><entry><title type="html">Biting Off More Than We Can Chew with OLAP Libraries</title><link href="https://oneturkmen.github.io/tech/2024/03/17/data-analysis-with-less-memory.html" rel="alternate" type="text/html" title="Biting Off More Than We Can Chew with OLAP Libraries" /><published>2024-03-17T05:21:31+00:00</published><updated>2024-03-17T05:21:31+00:00</updated><id>https://oneturkmen.github.io/tech/2024/03/17/data-analysis-with-less-memory</id><content type="html" xml:base="https://oneturkmen.github.io/tech/2024/03/17/data-analysis-with-less-memory.html"><![CDATA[<h3 id="or-how-to-do-data-analysis-with-little-memory-using-three-popular-libraries-polars-duckdb-and-dask">Or how to do data analysis with little memory using three popular libraries: Polars, DuckDB, and Dask!</h3>

<p>With the <a href="https://www.statista.com/statistics/871513/worldwide-data-created/">exponentially increasing volume of data</a>, it would be nice to have the ability to read data larger than the available memory. 
Inspired by Daniel Beach’s post on <a href="https://dataengineeringcentral.substack.com/p/duckdb-vs-polars-thunderdome">DuckDB vs Polars</a>, I would like to do a similar analysis
of data processing libraries that focus on high performance. The only difference is that I will not be
reading data from cloud storage like S3. Instead, I will have the data downloaded locally on my computer.</p>

<h2 id="setup">Setup</h2>

<p><strong>Data.</strong> I will use <a href="https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data#w-tabs-2-data-w-pane-1">the same source of raw HD test dataset</a> that Daniel used (Backblaze hard drive data stats such as models, capacities, and failures). I could not determine what files exactly he downloaded, but I will use 2022 Q1 and Q2, and all of 2023
data for our setup. So we have a total of <code class="language-plaintext highlighter-rouge">~45 GB</code> of CSV files.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">du</span> <span class="nt">-sh</span> data_<span class="k">*</span> <span class="nt">--total</span>

6.2G	data_Q1_2022
7.1G	data_Q1_2023
6.4G	data_Q2_2022
7.6G	data_Q2_2023
8.6G	data_Q3_2023
9.0G	data_Q4_2023
45G	total
</code></pre></div></div>

<p><strong>Compute.</strong> For the compute power, I have <strong>11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz</strong> with 4 cores (2 threads per core; so total <strong>8 threads</strong>). I have 16 GB RAM total, but I will limit it to <strong>4 GB only</strong> using <code class="language-plaintext highlighter-rouge">ulimit -v 4000000</code> (4 million kilobytes, i.e. ~4 GB, since <code class="language-plaintext highlighter-rouge">ulimit -v</code> is expressed in KB) for each test.</p>

<p><strong>Task.</strong> Our task will be to compute the number of <code class="language-plaintext highlighter-rouge">failures</code> grouped by <code class="language-plaintext highlighter-rouge">date</code>s in the set of <code class="language-plaintext highlighter-rouge">~45 GB</code> of CSV files. In SQL, it looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="nb">date</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">failure</span><span class="p">)</span> <span class="k">as</span> <span class="n">failures</span>
<span class="k">FROM</span> <span class="n">table_with_all_failures</span> 
<span class="k">GROUP</span> <span class="k">BY</span> <span class="nb">date</span>
</code></pre></div></div>

<h2 id="setting-up-the-tools">Setting up the tools</h2>

<p><strong>Polars.</strong> Polars is a high-performance data processing library. It
can be used to manipulate structured data in a very fast way. While the core of the library is written in Rust, the library has APIs in Python, R, and NodeJS. Basically, think of it as a very fast alternative to <a href="https://pandas.pydata.org/">Pandas</a> (but remember, it’s not quite a drop-in replacement due to <a href="https://docs.pola.rs/user-guide/migration/pandas/">some major differences</a> between the two).</p>

<p>Our test code using Polars looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="c1"># Uncomment line below for more logs.
# pl.Config.set_verbose(True)
</span>
<span class="k">def</span> <span class="nf">polars_test</span><span class="p">():</span>
    <span class="n">lazy_df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="s">"*/*.csv"</span><span class="p">)</span>

    <span class="n">sql</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">()</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">register</span><span class="p">(</span><span class="s">"harddrives"</span><span class="p">,</span> <span class="n">lazy_df</span><span class="p">)</span>   
    <span class="n">results</span> <span class="o">=</span> <span class="n">sql</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"""
        SELECT date, SUM(failure) as failures
        FROM harddrives 
        GROUP BY date
    """</span><span class="p">)</span>

    <span class="n">results_filename</span> <span class="o">=</span> <span class="s">"results_polars.csv"</span>
    <span class="n">results</span><span class="p">.</span><span class="n">collect</span><span class="p">(</span><span class="n">streaming</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">write_csv</span><span class="p">(</span><span class="n">results_filename</span><span class="p">)</span>

<span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">polars_test</span><span class="p">()</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"It took </span><span class="si">{</span><span class="n">end_time</span> <span class="o">-</span> <span class="n">start_time</span><span class="si">}</span><span class="s"> seconds to run Polars test."</span><span class="p">)</span>
</code></pre></div></div>

<p>In the code snippet above, we lazily scan the set of CSV files into a Polars <code class="language-plaintext highlighter-rouge">LazyFrame</code> and register the dataframe
as a (quasi) SQL table so that we can run the aforementioned SQL query on it. Note the <code class="language-plaintext highlighter-rouge">.collect(streaming=True)</code> part
with the <code class="language-plaintext highlighter-rouge">streaming</code> parameter: it will process the data <a href="https://docs.pola.rs/user-guide/concepts/streaming/">in chunks</a> because our dataset is larger than available memory. Once we get the results of the grouping
operation, we write them to a CSV file <code class="language-plaintext highlighter-rouge">results_polars.csv</code>.</p>

<p><strong>DuckDB.</strong> DuckDB is a “fast in-process analytical database”. Think of it as an in-memory database that allows you
to perform very fast computations on columns (a.k.a. <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>).
Similar to Polars, it supports a SQL dialect that can be used to query and manipulate data.</p>

<p>Our test code using DuckDB looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">duckdb</span>
<span class="kn">import</span> <span class="nn">time</span>


<span class="k">def</span> <span class="nf">duckdb_test</span><span class="p">():</span>
    <span class="n">duckdb</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""
        SET preserve_insertion_order = false;
        SET temp_directory = './temp';

        CREATE VIEW metrics AS 
        SELECT date, SUM(failure) as failures
        FROM read_csv('*/*.csv', union_by_name = true)
        GROUP BY date;
    """</span><span class="p">)</span>

    <span class="n">duckdb</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""
        COPY metrics TO 'results_duckdb.csv';
    """</span><span class="p">)</span>

<span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">duckdb_test</span><span class="p">()</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"It took </span><span class="si">{</span><span class="n">end_time</span> <span class="o">-</span> <span class="n">start_time</span><span class="si">}</span><span class="s"> seconds to run DuckDB test."</span><span class="p">)</span>
</code></pre></div></div>

<p>In the snippet above, we read the CSV files within the SQL statement using the <code class="language-plaintext highlighter-rouge">read_csv()</code> function. I had to set the <code class="language-plaintext highlighter-rouge">union_by_name</code>
parameter to circumvent the <code class="language-plaintext highlighter-rouge">duckdb.duckdb.InvalidInputException: Invalid Input Error: Mismatch between the schema of different files</code> exception. The parameter combines schemas of different files by column name.</p>

<p>Also note a couple of configuration parameters, namely <code class="language-plaintext highlighter-rouge">preserve_insertion_order = false</code> and <code class="language-plaintext highlighter-rouge">temp_directory = './temp'</code>.
The former tells DuckDB that it does not have to preserve the order of the rows it reads; disabling
insertion-order preservation reduces memory usage. For the latter, setting <code class="language-plaintext highlighter-rouge">temp_directory</code> should have enabled us to process
data larger than memory. According to DuckDB,</p>

<blockquote>
  <p>If DuckDB is running in in-memory mode, it cannot use disk to offload data if it does not fit into main memory. To enable offloading in the absence of a persistent database file, use the <code class="language-plaintext highlighter-rouge">SET temp_directory</code> statement”.</p>
</blockquote>

<p>Despite many tries with different parameters, I could not make it work. Some folks
needed to set the <a href="https://github.com/duckdb/duckdb/issues/11054">number of threads to 1</a> to make it work. Others
recommended using a <a href="https://github.com/duckdb/duckdb/issues/11054#issuecomment-1985758719">nightly build</a> that fixes the issue, but it looks like the issue is still there.</p>

<p>That’s quite unfortunate given that DuckDB claims that the <a href="https://duckdb.org/docs/guides/performance/how_to_tune_workloads.html#larger-than-memory-workloads-out-of-core-processing">larger-than-memory workloads</a> are its “key strength”:</p>

<blockquote>
  <p>A key strength of DuckDB is support for larger-than-memory workloads, i.e., it is able to process data sets that are larger than the available system memory (also known as out-of-core processing). It can also run queries where the intermediate results cannot fit into memory.</p>
</blockquote>

<p>Welp, did not work for me :/</p>

<p><strong>Dask.</strong> Dask is a library for parallel computing in Python. It is a feature-rich library that lets you scale Python code from a single computer to large distributed clusters.</p>

<p>Here is our setup with Dask:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dask.dataframe</span> <span class="k">as</span> <span class="n">dd</span> 
<span class="kn">import</span> <span class="nn">time</span>


<span class="k">def</span> <span class="nf">dask_test</span><span class="p">():</span>
    <span class="n">dfs</span> <span class="o">=</span> <span class="n">dd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"*/*.csv"</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="p">{</span><span class="s">'failure'</span><span class="p">:</span> <span class="s">'float64'</span><span class="p">})</span>
    <span class="n">result_df</span> <span class="o">=</span> <span class="n">dfs</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"date"</span><span class="p">).</span><span class="n">failure</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
    <span class="n">result_df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"results_dask.csv"</span><span class="p">,</span> <span class="n">single_file</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">dask_test</span><span class="p">()</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"It took </span><span class="si">{</span><span class="n">end_time</span> <span class="o">-</span> <span class="n">start_time</span><span class="si">}</span><span class="s"> seconds to run DuckDB test."</span><span class="p">)</span>
</code></pre></div></div>

<p>The code above reads the CSV files into a Dask DataFrame, groups the data by <code class="language-plaintext highlighter-rouge">date</code>, and computes the <code class="language-plaintext highlighter-rouge">sum</code> of <code class="language-plaintext highlighter-rouge">failure</code>s, saving
the results in another CSV file. Note that I had to cast the <code class="language-plaintext highlighter-rouge">failure</code> column to <code class="language-plaintext highlighter-rouge">float64</code> because it would otherwise throw a <code class="language-plaintext highlighter-rouge">ValueError</code>
recommending that I change the type from <code class="language-plaintext highlighter-rouge">int64</code> to <code class="language-plaintext highlighter-rouge">float64</code>, even though I never specified <code class="language-plaintext highlighter-rouge">int64</code>. It is most likely
that most of the entries in the column are indeed of type <code class="language-plaintext highlighter-rouge">int64</code>; however, Dask recommends <code class="language-plaintext highlighter-rouge">float64</code> due to the presence of 
<code class="language-plaintext highlighter-rouge">NaN</code>s.</p>

<h2 id="results">Results</h2>

<p>We are finally in the results section! So here they are for each of the tools set up above.</p>

<table>
  <thead>
    <tr>
      <th>Polars</th>
      <th>DuckDB</th>
      <th>Dask</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>8-15 seconds</td>
      <td>OOM</td>
      <td>170-185 seconds</td>
    </tr>
  </tbody>
</table>

<p><strong>Polars.</strong> Polars is the winner, with its script taking between <strong>8 and 15 seconds</strong> to produce the CSV file with the results.
The setup, as you have seen, is simple enough and readable, with no surprises.</p>

<p><strong>DuckDB.</strong> Unfortunately, DuckDB kept throwing “Out Of Memory” exceptions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>duckdb.duckdb.OutOfMemoryException: Out of Memory Error: Failed to allocate block of 32002048 bytes
</code></pre></div></div>

<p>As mentioned in the Setup section, I tried playing with several configuration parameters like <code class="language-plaintext highlighter-rouge">temp_directory</code> and <code class="language-plaintext highlighter-rouge">preserve_insertion_order</code>. Neither of them fixed the issue.</p>

<p><strong>Dask.</strong> Our Dask test took between <strong>170 and 185 seconds</strong> (~3 minutes), much slower than Polars.
There were no surprises with Dask, but I also think it was not built for single-computer data processing. Even though I have not tested it yet, it would most likely shine in distributed computing environments where one can run data-intensive programs across many computers.</p>

<h2 id="conclusion">Conclusion</h2>

<p>So here we have the results for the three tools. Polars is the clear winner here, but bear in mind that I only provided
results for one specific test, which is about performing data analysis on data that is larger than memory. I suspect
DuckDB can be as performant as Polars, although I could not get the out-of-memory issue resolved. Dask is in between, and likely a better choice for distributed computing environments rather
than for a single computer.</p>

<p>The source code is open source and available <a href="https://github.com/oneturkmen/experiments-with-olap-triad">here on GitHub</a>.</p>]]></content><author><name></name></author><category term="tech" /><category term="olap" /><category term="data-analysis" /><category term="python" /><category term="performance" /><summary type="html"><![CDATA[Comparing three popular libraries for analyzing data that is larger than available memory.]]></summary></entry><entry><title type="html">The Art of Controlled Evolution: How Migration Tools Shape Database Schemas</title><link href="https://oneturkmen.github.io/tech/2023/04/03/database-changes.html" rel="alternate" type="text/html" title="The Art of Controlled Evolution: How Migration Tools Shape Database Schemas" /><published>2023-04-03T04:00:00+00:00</published><updated>2023-04-03T04:00:00+00:00</updated><id>https://oneturkmen.github.io/tech/2023/04/03/database-changes</id><content type="html" xml:base="https://oneturkmen.github.io/tech/2023/04/03/database-changes.html"><![CDATA[<h2 id="evolving-requirements-evolving-data">Evolving requirements, evolving data</h2>

<p>Today’s world is dynamic and chaotic, making it nearly impossible to predict
what will be expected in the future. Similar behavior can be observed in software development,
where requirements may drastically change within days even with careful, long-term planning.
That is perhaps why the Agile methodology has become a popular, “standard” approach
to software development in today’s world.</p>

<p>Imagine a scenario in which you talk to a client about some cool mobile app
idea, and they ask you to store, process, and display some data in the app.
However, next week, they ask you for something else, so you need to address the
change in the client’s data requirements. As a consequence, you need to make
changes to the application and the way it stores, processes, and renders the data.</p>

<p>If we had something hard-coded in the code that needed to be changed, it would
simply be a matter of changing the value of the associated structures (e.g., variables or
lists of strings). However, software developers detach code from data, and typically use storage systems for storing the data.</p>

<p>There is a variety of database systems, such as relational and
non-relational (e.g., NoSQL). The relational systems are great at
establishing “relations” between different data entities, which
provide safeguards against data anomalies and often make it efficient to retrieve
related data together (e.g., through table joins and column indexes).</p>

<p>We know how developers commonly “version” source code changes to keep track of
where the source code was and where it is at the moment. Versioning is also important
for effective collaboration with other developers.
However, changing a database schema is not as clear cut as changing code.
Accidentally modifying (or, even worse, deleting) data can result in a large
negative impact, including loss of monetary value and/or reputation.</p>

<p>Luckily, we are not the first ones to encounter such a problem.
In this post, I would like to focus on changing the schema in the
relational SQL databases, such as PostgreSQL, in effective ways
that keep the database state consistent.
By the end of this blog post, we will learn how to effectively manage changes in SQL
databases.</p>

<h2 id="schema-changes-as-migrations">Schema changes as migrations</h2>

<p>Software requirements change frequently, and so do the database schemas associated
with them. Changing the schema is a tricky task that can easily result in data
inconsistency and even unexpected downtimes. Not surprisingly, making changes
to the schema can be a nerve-wracking experience.</p>

<p>If schema changes are important, then we should track them. We should
record who made the change, when it was made, and why it was made. We should have
something similar to
<a href="https://www.atlassian.com/git/tutorials/comparing-workflows">git workflow</a>, where
incremental changes to code are noted in history.</p>

<p>Akin to <code class="language-plaintext highlighter-rouge">git</code> commits, which keep track of changes in the application code,
schema migrations are changes to the file(s) that represent the schema.
Those changes can be applied incrementally (on top of each other) and can be
easily reverted, if something goes wrong.</p>

<p>Take a look at the example code snippet below where we want to split one column into two.
The initial version contained a <code class="language-plaintext highlighter-rouge">TEXT</code> column named <code class="language-plaintext highlighter-rouge">full_name</code> that stored a person’s full name.
A change to the schema was made to split <code class="language-plaintext highlighter-rouge">full_name</code> into <code class="language-plaintext highlighter-rouge">first_name</code> and <code class="language-plaintext highlighter-rouge">last_name</code>
columns.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">CREATE TABLE contact (
</span>	id INT GENERATED ALWAYS AS IDENTITY,
<span class="gd">-	TEXT full_name
</span><span class="gi">+	TEXT first_name
+	TEXT last_name
</span>	TEXT email
<span class="err">);</span>
</code></pre></div></div>

<p>The changes made to the above SQL file can be tracked using <code class="language-plaintext highlighter-rouge">git</code>. However, tracking alone is an incomplete
way of applying changes to the schema because they are not immediately
reflected in the database. Yes, the file changed in the <code class="language-plaintext highlighter-rouge">git</code> repository, but
that file cannot simply be loaded into the database to apply the changes.
So, how do we apply such a change in the database to actually split the column
into two?</p>

<h2 id="migration-tools">Migration tools</h2>

<p>To apply the changes in the database, we
can use a <em>SQL schema migration tool</em>, also known as
<a href="https://www.bytebase.com/blog/top-database-schema-change-tool-evolution/#gitops-database-as-code">database-as-code migration tool</a>.
There are plenty of amazing tools, both open source and free, as well as proprietary ones.</p>

<p>I am going to use <a href="https://alembic.sqlalchemy.org/en/latest/tutorial.html#the-migration-environment">Alembic</a>, a lightweight schema migration tool that uses <a href="https://www.sqlalchemy.org/">SQLAlchemy</a> as its engine.</p>

<blockquote>
  <p>If you have never worked with SQLAlchemy, it is a cool library that provides both object-relational mapping (ORM) and database toolkit for Python applications. It helps map SQL schema onto Python data structures for easier, in-code data management.</p>
</blockquote>
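
<p>If you have never seen a SQLAlchemy model, here is a rough, standalone sketch (not part of the Alembic migration itself) of what mapping a <code class="language-plaintext highlighter-rouge">contact</code> table onto a Python class can look like with the 2.0-style declarative API:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sqlalchemy import Integer, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Contact(Base):
    __tablename__ = "contact"

    # Columns mirror the CREATE TABLE example from the previous section.
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    first_name: Mapped[str] = mapped_column(Text)
    last_name: Mapped[str] = mapped_column(Text)
    email: Mapped[str] = mapped_column(Text)
</code></pre></div></div>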

<p>Alembic keeps track of changes to the database schema through <em>revision scripts</em>.
The revision scripts contain the “delta”, or the change that is applied to the schema.
Let’s create an example script to split a single column <code class="language-plaintext highlighter-rouge">full_name</code>
into two columns <code class="language-plaintext highlighter-rouge">first_name</code> and <code class="language-plaintext highlighter-rouge">last_name</code>.
To create a revision script, we can run the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alembic revision <span class="nt">-m</span> <span class="s2">"split full_name into first_name and last_name"</span>
</code></pre></div></div>

<p>The command will generate a file named something like <code class="language-plaintext highlighter-rouge">a1829f4e7900_split_full_name.py</code>.
Note the prefix of the file name - that’s a revision hash used to mark a
schema change, similar to a git commit hash. The contents of the file may look
like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""Split full_name into first_name and last_name

Revision ID: a1829f4e7900
Revises:
Create Date: 2023-02-02 11:40:27.089406
"""</span>

<span class="c1"># revision identifiers, used by Alembic.
</span><span class="n">revision</span> <span class="o">=</span> <span class="s">'a1829f4e7900'</span>
<span class="n">down_revision</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">branch_labels</span> <span class="o">=</span> <span class="bp">None</span>

<span class="kn">from</span> <span class="nn">alembic</span> <span class="kn">import</span> <span class="n">op</span>
<span class="kn">import</span> <span class="nn">sqlalchemy</span> <span class="k">as</span> <span class="n">sa</span>

<span class="k">def</span> <span class="nf">upgrade</span><span class="p">():</span>
    <span class="k">pass</span>

<span class="k">def</span> <span class="nf">downgrade</span><span class="p">():</span>
    <span class="k">pass</span>
</code></pre></div></div>

<p>The file comes with a docstring containing a short description of the change,
the revision ID, and the creation date. Note that the docstring also mentions <code class="language-plaintext highlighter-rouge">"Revises: "</code>,
which indicates the previous revision ID. It is empty in our file
because we just created our first revision. If we created another revision in addition
to the one we just generated, the new revision script would have <code class="language-plaintext highlighter-rouge">"Revises: a1829f4e7900"</code>.
The variable <code class="language-plaintext highlighter-rouge">down_revision</code> indicates the same thing.</p>
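
<p>For illustration, the header of that hypothetical second revision script would look something like this (the new revision ID below is made up for the example):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"""Another change that builds on the previous revision

Revision ID: b7c01d2e8a11
Revises: a1829f4e7900
Create Date: 2023-02-03 09:15:00.000000
"""

# revision identifiers, used by Alembic.
revision = 'b7c01d2e8a11'       # new (made-up) revision ID
down_revision = 'a1829f4e7900'  # points back to the previous revision
branch_labels = None
</code></pre></div></div>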

<p>Note that we also have two empty functions generated for us, namely <code class="language-plaintext highlighter-rouge">upgrade()</code> and <code class="language-plaintext highlighter-rouge">downgrade()</code>. The former allows us to add logic for the new schema change, while
the latter lets us add the logic to <em>revert</em> that new change in case of potential
problems down the line (e.g., during deployment to QA).</p>

<p>Let’s fill those functions with some concrete logic:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">upgrade</span><span class="p">():</span>
	<span class="c1"># Add new columns 'first_name' and 'last_name' to the table 'contact'
</span>	<span class="n">op</span><span class="p">.</span><span class="n">add_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Column</span><span class="p">(</span><span class="s">'first_name'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Text</span><span class="p">))</span>
	<span class="n">op</span><span class="p">.</span><span class="n">add_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Column</span><span class="p">(</span><span class="s">'last_name'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Text</span><span class="p">))</span>

	<span class="c1"># Split 'full_name' and move into 'first_name' and 'last_name'
</span>	<span class="n">results</span> <span class="o">=</span> <span class="n">op</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT id, full_name FROM contact"</span><span class="p">);</span>
	<span class="k">for</span> <span class="nb">id</span><span class="p">,</span> <span class="n">full_name</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
		<span class="c1"># Logic to split the name and insert into table
</span>		<span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span> <span class="o">=</span> <span class="n">full_name</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)</span>
		<span class="n">op</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="sa">f</span><span class="s">"UPDATE contact SET first_name = </span><span class="si">{</span><span class="n">first_name</span><span class="si">}</span><span class="s"> WHERE id = </span><span class="si">{</span><span class="nb">id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
		<span class="n">op</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="sa">f</span><span class="s">"UPDATE contact SET last_name = </span><span class="si">{</span><span class="n">last_name</span><span class="si">}</span><span class="s"> WHERE id = </span><span class="si">{</span><span class="nb">id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
	
	<span class="c1"># Finally, drop the column
</span>	<span class="n">op</span><span class="p">.</span><span class="n">drop_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="s">'full_name'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">downgrade</span><span class="p">():</span>
	<span class="c1"># Add 'full_name' column back.
</span>	<span class="n">op</span><span class="p">.</span><span class="n">add_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Column</span><span class="p">(</span><span class="s">'full_name'</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">Text</span><span class="p">))</span>

	<span class="c1"># Join 'first_name' and 'last_name' into 'full_name'
</span>	<span class="n">results</span> <span class="o">=</span> <span class="n">op</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT id, first_name, last_name FROM contact"</span><span class="p">);</span>
	<span class="k">for</span> <span class="nb">id</span><span class="p">,</span> <span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
		<span class="c1"># Logic to split the name and insert into table
</span>		<span class="n">full_name</span> <span class="o">=</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span><span class="p">)</span>
		<span class="n">op</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="sa">f</span><span class="s">"UPDATE contact SET full_name = </span><span class="si">{</span><span class="n">full_name</span><span class="si">}</span><span class="s"> WHERE id = </span><span class="si">{</span><span class="nb">id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
	
	<span class="c1"># Finally, drop 'first_name' and 'last_name' columns
</span>	<span class="n">op</span><span class="p">.</span><span class="n">drop_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="s">'first_name'</span><span class="p">)</span>
	<span class="n">op</span><span class="p">.</span><span class="n">drop_column</span><span class="p">(</span><span class="s">'contact'</span><span class="p">,</span> <span class="s">'last_name'</span><span class="p">)</span>
</code></pre></div></div>

<blockquote>
  <p>The snippet above only serves as a simplified example to showcase a schema revision script. It is not optimal logic. Be careful when actually splitting a full name into first and last names. In some cultures, there are no last names, or the last names may consist of multiple space-separated words. Lastly, make sure to <code class="language-plaintext highlighter-rouge">RESTART</code> your identity columns. We don’t want to accidentally cause an integer overflow in transaction ids.</p>
</blockquote>

<p>The <code class="language-plaintext highlighter-rouge">upgrade()</code> function above performs three important steps: (1) creates two columns,
(2) populates the two columns from an existing, older column, and (3) removes the older
column that is no longer needed. Not surprisingly, the <code class="language-plaintext highlighter-rouge">downgrade()</code> function is
the inverse operation of <code class="language-plaintext highlighter-rouge">upgrade()</code>: we add one column back, re-populate it from the two columns, and remove those two columns.</p>

<blockquote>
  <p>It’s best to keep migration scripts as small as possible. Even the snippet above could be split into multiple migration scripts. For example, in one revision, we can just add two columns. In the next one, we split the name into two parts and populate these two new columns. In the third and final revision, we get rid of the older column. Having smaller migration scripts allows users of the database to adjust their code without immediately causing breaking changes. More on <a href="https://www.martinfowler.com/articles/evodb.html">evolutionary database design</a> later.</p>
</blockquote>

<p>After creating the script, we can now run <code class="language-plaintext highlighter-rouge">alembic upgrade head</code> to apply the change in the database.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>alembic upgrade <span class="nb">head
</span>INFO  <span class="o">[</span>alembic.context] Context class PostgresqlContext.
INFO  <span class="o">[</span>alembic.context] Will assume transactional DDL.
INFO  <span class="o">[</span>alembic.context] Running upgrade None -&gt; a1829f4e7900
</code></pre></div></div>

<p>In case something goes terribly wrong, we can also revert that revision:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>alembic downgrade <span class="nt">-1</span>
INFO  <span class="o">[</span>alembic.context] Context class PostgresqlContext.
INFO  <span class="o">[</span>alembic.context] Will assume transactional DDL.
INFO  <span class="o">[</span>alembic.context] Running downgrade a1829f4e7900 -&gt; None
</code></pre></div></div>

<p>Interestingly (and actually quite importantly), Alembic runs the above revision script 
within a transaction. Transactions are atomic in SQL databases (e.g., PostgreSQL), so if
something fails within the transaction, it will be automatically rolled back to the previous
state to keep the database consistent. That’s amazing!</p>

<h2 id="benefits-of-migration-tools">Benefits of migration tools</h2>

<p>With the help of tools such as Alembic, teams can seamlessly work together
on schema changes without worrying about corrupting the database state.
Such tools provide developers with a nice view of schema changes as scripts,
which makes it easier for them to keep track of changes, as well as perform code reviews 
together. Migration tools are also a great addition to continuous integration and 
delivery (CI/CD) pipelines, which can run automated migrations in different 
environments such as Development and Production.</p>

<p>If you would like to learn more about applying Agile methodologies to databases,
check out the following articles:</p>

<ul>
  <li><a href="https://www.martinfowler.com/articles/evodb.html">Evolutionary Database Design by Martin Fowler</a> describes a novel (at the time of publication) approach to database change management.</li>
  <li>If you are a PostgreSQL user, their <a href="https://www.postgresql.org/docs">documentation</a> is great, along with the <a href="https://wiki.postgresql.org/wiki/Don't_Do_This">“Don’t Do This” best practices</a> wiki.</li>
</ul>

<h2 id="disadvantages">Disadvantages</h2>

<p>There is no silver bullet in software engineering. The same applies to using migration tools. Some of the biggest disadvantages are that:</p>

<ul>
  <li><strong>Tools can be expensive.</strong> Some tools come with great benefits … at a $-value 
cost. I have personally not used such paid tools, but perhaps they provide extra benefits
such as a separate UI for managing migrations.</li>
  <li><strong>It is unclear what the schema looks like at a given point.</strong> If the schema evolves
quickly, there can be a steady stream of new revision scripts being created and applied to the database. These add complexity and make it harder for developers to determine what the schema looks like at a given point in time.</li>
  <li><strong>If something does not work, we need a new revision script</strong>. If a migration script is found to have logical issues (a.k.a. bugs) after it has already been applied, another migration script is usually required to fix the issue, since modifying existing scripts could lead to a corrupt database state.</li>
</ul>

<h2 id="final-thoughts">Final thoughts</h2>

<p>Tools such as Alembic are great for managing changes in a database schema. Like everything
else in life, they come with pros and cons that software development teams need to
consider before adopting them in their development lifecycle. In general, 
changing a database schema can be tricky and nerve-wracking,
but it becomes much simpler and smoother with the right migration tools 
and processes (CI/CD) at hand.</p>

<h2 id="ps-wanna-give-alembic-a-try">P.S. Wanna give Alembic a try?</h2>

<p>I created a simple Dockerfile that will let you play with Alembic.
This assumes that you have <a href="https://www.docker.com/">Docker</a> installed locally.</p>

<div class="language-Dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use an official PostgreSQL image as the base image</span>
<span class="k">FROM</span><span class="s"> postgres:latest</span>

<span class="k">ENV</span><span class="s"> PYENV="/app/alembic_env"</span>

<span class="c"># Install necessary packages for Python and Alembic</span>
<span class="k">RUN </span>apt-get update <span class="o">&amp;&amp;</span> <span class="se">\
</span>    apt-get <span class="nb">install</span> <span class="nt">-y</span> python3 python3-pip python3-venv <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">mkdir</span> <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$PYENV</span><span class="s2">"</span> <span class="o">&amp;&amp;</span> <span class="se">\
</span>    python3 <span class="nt">-m</span> venv <span class="s2">"</span><span class="nv">$PYENV</span><span class="s2">"</span> <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">.</span> <span class="nv">$PYENV</span>/bin/activate <span class="o">&amp;&amp;</span> <span class="se">\
</span>    pip3 <span class="nb">install </span>alembic

<span class="c"># Set environment variables for PostgreSQL</span>
<span class="k">ENV</span><span class="s"> POSTGRES_USER myuser</span>
<span class="k">ENV</span><span class="s"> POSTGRES_PASSWORD mypassword</span>
<span class="k">ENV</span><span class="s"> POSTGRES_DB mydb</span>

<span class="c"># Expose the PostgreSQL port</span>
<span class="k">EXPOSE</span><span class="s"> 5432</span>
</code></pre></div></div>

<p>Copy and paste the contents above into a <code class="language-plaintext highlighter-rouge">Dockerfile</code> on your local computer. Then, build and run the image.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker build <span class="nt">-t</span> postgres_alembic:latest <span class="nb">.</span> <span class="o">&amp;&amp;</span> <span class="se">\</span>
	docker run <span class="nt">-it</span> postgres_alembic:latest /bin/bash
</code></pre></div></div>]]></content><author><name></name></author><category term="tech" /><category term="databases" /><category term="postgresql" /><category term="git" /><summary type="html"><![CDATA[Guide to managing database changes with confidence.]]></summary></entry><entry><title type="html">Deriving a good starting word for Wordle</title><link href="https://oneturkmen.github.io/math/tech/2022/03/03/analysis-wordle.html" rel="alternate" type="text/html" title="Deriving a good starting word for Wordle" /><published>2022-03-03T04:00:00+00:00</published><updated>2022-03-03T04:00:00+00:00</updated><id>https://oneturkmen.github.io/math/tech/2022/03/03/analysis-wordle</id><content type="html" xml:base="https://oneturkmen.github.io/math/tech/2022/03/03/analysis-wordle.html"><![CDATA[<p>Back in December of 2021, I got enthusiastic about
<a href="https://www.nytimes.com/games/wordle/index.html">Wordle</a>, a game of guessing a
word. In a bit more detail, it is a game where you have 6 attempts to guess a
5-letter target word. What got me hooked on the game is probability. I wanted to
find answers to probability questions, such as “what would be the probability of
guessing the word on the 1st attempt?”.</p>

<p>To answer the question above, we first need to get as many 5-letter words as
possible. In other words, we need to build a <em>population</em> of 5-letter words to
be able to calculate precise probabilities of word occurrences, whether complete
(guessing the entire target word) or partial (guessing one or more characters of a
target word). While it may be difficult to get a complete data set of all 5-letter
words in English, we can use some approximate datasets, such as the one with
5757 5-letter words from <a href="https://www-cs-faculty.stanford.edu/~knuth/sgb.html">the Stanford
GraphBase</a>.</p>

<p>The probability of guessing a target word on the 1st try is <code class="language-plaintext highlighter-rouge">1 / (number of all
5-letter words)</code>. Evidently, it is a very low probability of success.
What we can look for instead is <em>good starting word(s)</em>. Those are the words
that will maximize our chance of guessing a target word in 6 (or hopefully fewer) attempts.</p>
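<p>As a quick sanity check, here is a tiny sketch of that first-try probability. It assumes (hypothetically) that the word list is saved locally as <code class="language-plaintext highlighter-rouge">sgb-words.txt</code>, one word per line:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumes the Stanford GraphBase list is saved as "sgb-words.txt", one word per line.
with open("sgb-words.txt") as f:
    words = [line.strip() for line in f if len(line.strip()) == 5]

p_first_try = 1 / len(words)
print(f"{len(words)} words, P(first-try guess) = {p_first_try:.6f}")
# With 5757 words this is about 0.000174, i.e. roughly a 0.017% chance.
</code></pre></div></div>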

<h2 id="good-starter-words">Good starter words</h2>

<p>We can use the following logic to derive a good starter word:</p>

<ol>
  <li>Calculate the frequency of each letter (i.e., the number of words it occurs in; do not double count repeats within a word). Sum all frequency counts.</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Given a list of words,
</span><span class="s">"aaa"</span>
<span class="s">"abb"</span>
<span class="s">"abc"</span>
<span class="s">"cba"</span>

<span class="c1"># we get the corresponding dictionary of frequencies:
</span><span class="p">{</span>
    <span class="s">"a"</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>  <span class="c1"># 'a' occurs in 4 words, and so on.
</span>    <span class="s">"b"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
    <span class="s">"c"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="p">}</span>
<span class="c1"># and sum of all frequencies is:
</span><span class="n">total</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">=</span> <span class="mi">9</span>
</code></pre></div></div>
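<p>Here is a small sketch of that counting step (my own illustration of the idea, using the toy words above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

words = ["aaa", "abb", "abc", "cba"]

# Count each letter at most once per word (no double counting within a word).
freq = Counter()
for word in words:
    freq.update(set(word))

total = sum(freq.values())
print(dict(freq), total)  # {'a': 4, 'b': 3, 'c': 2} and 9 (key order may vary)
</code></pre></div></div>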

<ol>
  <li>Calculate <em>weighted</em> letter “coverage”, which shows the weighted percentage of letters (from the alphabet of all words) covered by a word. The idea here is that we want to cover as many <em>unique</em> letters as possible for maximum diversity. The diversity helps us maximize the likelihood of hitting at least one letter. We want a good starting word, not the “one-and-done” kind.</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"aaa"</span>
<span class="c1"># ==&gt; take unique letter 'a'
# ==&gt; freq['a'] / n
# ==&gt; 4/9 = 44%
# Coverage: 44%
</span>
<span class="s">"abb"</span>
<span class="c1"># ==&gt; take unique letters 'a', 'b'
# ==&gt; (freq['a'] / n) + (freq['b'] / n)
# ==&gt; (4/9) + (3/9) = 77%
# Coverage: 77%
</span>
<span class="s">"abc"</span>
<span class="c1"># ==&gt; take unique letters 'a', 'b', 'c'
# ==&gt; (freq['a'] / n) + ... + (freq['c'] / n)
# ==&gt; (4/9) + (3/9) + (2/9) = 100%
# Coverage: 100%
</span>
<span class="s">"cba"</span>
<span class="c1"># ==&gt; take unique letters 'a', 'b', 'c'
# ==&gt; (freq['c'] / n) + ... + (freq['a'] / n)
# ==&gt; (4/9) + (3/9) + (2/9) = 100%
# Coverage: 100%
</span></code></pre></div></div>
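<p>A minimal sketch of that coverage computation (self-contained, repeating the toy counting from above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

words = ["aaa", "abb", "abc", "cba"]
freq = Counter()
for word in words:
    freq.update(set(word))        # unique letters per word
total = sum(freq.values())        # 9 for this toy list


def coverage(word):
    # Weighted share of total letter frequency covered by the word's unique letters.
    return sum(freq[letter] for letter in set(word)) / total


for word in words:
    print(word, f"{coverage(word):.1%}")
# aaa 44.4%, abb 77.8%, abc 100.0%, cba 100.0%
</code></pre></div></div>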

<ol>
  <li>Sort by coverage in descending order.</li>
  <li>Drink some coffee and enjoy a chocolatine.</li>
</ol>

<p>Hold on for a sec. <code class="language-plaintext highlighter-rouge">"abc"</code> and <code class="language-plaintext highlighter-rouge">"cba"</code> have the same coverages (of 100%). Which one is better? <em>Can</em> one be better than another?</p>

<h5 id="taking-positions-into-consideration">Taking positions into consideration</h5>

<p>When we have two equally “good” words, we should consider the proportion of each letter <em>in each position</em> (indices 0 through 4)
of each of the two words.</p>

<ol>
  <li>We look at each position “vertically” (over all <em>k</em>-letter words in a very large vocabulary) and 
calculate the proportion (or “coverage”) of each letter in a given position. 
For example, with the four 3-letter words above, we will have the following positioned coverages:</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Position 1 ('aaac'):
</span><span class="p">{</span>
    <span class="s">"a"</span><span class="p">:</span> <span class="s">"75%"</span><span class="p">,</span>
    <span class="c1"># "b": "0%"  (not shown explicitly)
</span>    <span class="s">"c"</span><span class="p">:</span> <span class="s">"25%"</span>
<span class="p">}</span>

<span class="c1"># Position 2 ('abbb'):
</span><span class="p">{</span>
    <span class="s">"a"</span><span class="p">:</span> <span class="s">"25%"</span><span class="p">,</span>
    <span class="s">"b"</span><span class="p">:</span> <span class="s">"75%"</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1"># Position 3 ('abca'):
</span><span class="p">{</span>
    <span class="s">"a"</span><span class="p">:</span> <span class="s">"50%"</span><span class="p">,</span>
    <span class="s">"b"</span><span class="p">:</span> <span class="s">"25%"</span><span class="p">,</span>
    <span class="s">"c"</span><span class="p">:</span> <span class="s">"25%"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note also that the relative frequencies at each position sum up to 100%.</p>
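<p>A sketch of how those per-position proportions can be computed (again using the toy words, purely for illustration):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

words = ["aaa", "abb", "abc", "cba"]
k = 3  # number of positions

# For each position, count how often each letter appears there,
# then divide by the number of words to get a proportion.
pos_prop = [
    {letter: count / len(words) for letter, count in Counter(w[i] for w in words).items()}
    for i in range(k)
]
print(pos_prop)
# [{'a': 0.75, 'c': 0.25}, {'a': 0.25, 'b': 0.75}, {'a': 0.5, 'b': 0.25, 'c': 0.25}]
</code></pre></div></div>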

<ol>
  <li>Given that <em>k</em> is the fixed number of letters/positions (it is <code class="language-plaintext highlighter-rouge">3</code> in this case), for each word and each of its letters, look up the positional coverage, e.g.:</li>
</ol>

<table>
  <thead>
    <tr>
      <th>Word</th>
      <th>Position 1</th>
      <th>Position 2</th>
      <th>Position 3</th>
      <th>Total</th>
      <th>Total (relative = total / k)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“abc”</td>
      <td>75%</td>
      <td>75%</td>
      <td>25%</td>
      <td>175%</td>
      <td>58.33%</td>
    </tr>
    <tr>
      <td>“cba”</td>
      <td>25%</td>
      <td>75%</td>
      <td>50%</td>
      <td>150%</td>
      <td>50%</td>
    </tr>
  </tbody>
</table>

<p>Voila. We found the best word of the two: <code class="language-plaintext highlighter-rouge">"abc"</code>. Congratulations, <code class="language-plaintext highlighter-rouge">"abc"</code>! :tada:</p>

<p>Actually, not quite. If you are diligent enough, you will see that <code class="language-plaintext highlighter-rouge">"abb"</code> has a total (relative) coverage of 58.33%, i.e., the same as <code class="language-plaintext highlighter-rouge">"abc"</code>.
The problem stems from repeating characters.</p>

<table>
  <thead>
    <tr>
      <th>Word</th>
      <th>Position 1</th>
      <th>Position 2</th>
      <th>Position 3</th>
      <th>Total</th>
      <th>Total (relative = total / k)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“abc”</td>
      <td>75%</td>
      <td>75%</td>
      <td>25%</td>
      <td>175%</td>
      <td>58.33%</td>
    </tr>
    <tr>
      <td>“abb”</td>
      <td>75%</td>
      <td>75%</td>
      <td>25%</td>
      <td>175%</td>
      <td>58.33%</td>
    </tr>
  </tbody>
</table>

<p>Why? Because, as letters start to repeat, the probability of hitting an existing letter from the target word becomes lower.
For example, <code class="language-plaintext highlighter-rouge">"abb"</code> has only two distinct letters, namely <code class="language-plaintext highlighter-rouge">"a"</code> and <code class="language-plaintext highlighter-rouge">"b"</code>, whereas <code class="language-plaintext highlighter-rouge">"abc"</code> has three. So, statistically speaking, <code class="language-plaintext highlighter-rouge">"abc"</code> has a higher
probability of getting some or all letters right.</p>

<p>You may then wonder why the two words <code class="language-plaintext highlighter-rouge">"abb"</code> and <code class="language-plaintext highlighter-rouge">"abc"</code> have the same relative coverage. The answer is that we should <em>not</em> count
the positional coverage of a letter we have already seen. The following example shows the correct calculations:</p>

<table>
  <thead>
    <tr>
      <th>Word</th>
      <th>Position 1</th>
      <th>Position 2</th>
      <th>Position 3</th>
      <th>Total</th>
      <th>Total (relative = total / k)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“abc”</td>
      <td>75%</td>
      <td>75%</td>
      <td>25%</td>
      <td>175%</td>
      <td>58.33%</td>
    </tr>
    <tr>
      <td>“abb”</td>
      <td>75%</td>
      <td>75%</td>
      <td>(skip)</td>
      <td>150%</td>
      <td>50%</td>
    </tr>
  </tbody>
</table>

<p>Notice how we skip adding the proportion of the letter <code class="language-plaintext highlighter-rouge">"b"</code> from position 3 because we already processed it in position 2.
This begs the question of which position we should choose when a letter occurs multiple times (i.e., in different positions).
We just have to be <em>greedy</em>: for a given letter, we choose the one (and only one) position with the maximal proportion.
By choosing the maximal proportion, we maximize the end value of the total (relative) coverage.</p>
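<p>Putting it together, here is a small sketch of this greedy scoring (self-contained, with the same toy words; my own illustration of the idea):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

words = ["aaa", "abb", "abc", "cba"]
k = 3
pos_prop = [
    {letter: count / len(words) for letter, count in Counter(w[i] for w in words).items()}
    for i in range(k)
]


def score(word):
    # Each unique letter contributes only its best positional proportion,
    # so repeated letters are not double counted.
    best = {}
    for i, letter in enumerate(word):
        best[letter] = max(best.get(letter, 0.0), pos_prop[i].get(letter, 0.0))
    return sum(best.values()) / k


print(f"abc: {score('abc'):.2%}")  # 58.33%
print(f"abb: {score('abb'):.2%}")  # 50.00%
</code></pre></div></div>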

<p>Alternatively, we can just ignore the words with repeating characters before returning a set of good starter words
(ordered by total relative coverage). Note that we still use those words to calculate (relative) positional proportion (coverage)
of each letter.</p>

<p>Here is a list of top-10 starter words,
along with their total (relative) coverage (based on the dataset of ~5K 5-letter words):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">'cares'</span><span class="p">:</span> <span class="mf">0.16803890915407332</span><span class="p">,</span>
    <span class="s">'bares'</span><span class="p">:</span> <span class="mf">0.1677609866249783</span><span class="p">,</span>
    <span class="s">'cores'</span><span class="p">:</span> <span class="mf">0.16737884314747262</span><span class="p">,</span>
    <span class="s">'bores'</span><span class="p">:</span> <span class="mf">0.1671009206183776</span><span class="p">,</span>
    <span class="s">'pares'</span><span class="p">:</span> <span class="mf">0.16616293208268193</span><span class="p">,</span>
    <span class="s">'tares'</span><span class="p">:</span> <span class="mf">0.16581552892131318</span><span class="p">,</span>
    <span class="s">'canes'</span><span class="p">:</span> <span class="mf">0.1657807886051763</span><span class="p">,</span>
    <span class="s">'pores'</span><span class="p">:</span> <span class="mf">0.16550286607608128</span><span class="p">,</span>
    <span class="s">'banes'</span><span class="p">:</span> <span class="mf">0.16550286607608128</span><span class="p">,</span>
    <span class="s">'cones'</span><span class="p">:</span> <span class="mf">0.16512072259857566</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Aaand, we are done. Or so I hope.</p>

<h2 id="general-analysis">General Analysis</h2>

<p>Here I was just playing with letter frequencies. Check it out.</p>

<p>It’s interesting that <code class="language-plaintext highlighter-rouge">"s"</code> is the most frequent letter
in positions 1 and 5 (0 and 4, resp., if you are a techie).</p>

<h4 id="general-letter-frequency">General letter frequency</h4>

<p>Frequency of letters in all 5-letter words (counting duplicate letters within each word):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
    'a': 2348, 'b': 715, 'c': 964, 'd': 1181,
    'e': 3009, 'f': 561, 'g': 679, 'h': 814,
    'i': 1592, 'j': 89,  'k': 596, 'l': 1586,
    'm': 843,  'n': 1285,'o': 1915,'p': 955,
    'q': 53,   'r': 1910,'s': 3033,'t': 1585,
    'u': 1089, 'v': 318, 'w': 505, 'x': 139,
    'y': 886,  'z': 135
}
</code></pre></div></div>

<h4 id="letter-frequency-at-each-position">Letter frequency at each position</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
    0: {
        'a': 296, 'b': 432, 'c': 440, 'd': 311,
        'e': 129, 'f': 318, 'g': 279, 'h': 239,
        'i': 74,  'j': 73,  'k': 91,  'l': 271,
        'm': 298, 'n': 118, 'o': 108, 'p': 386,
        'q': 39,  'r': 268, 's': 724, 't': 376,
        'u': 75,  'v': 109, 'w': 228, 'x': 4,
        'y': 47,  'z': 24
    },
    1: {
        'a': 930, 'b': 32,  'c': 82,  'd': 43,
        'e': 660, 'f': 12,  'g': 24,  'h': 271,
        'i': 673, 'j': 4,   'k': 29,  'l': 360,
        'm': 71,  'n': 168, 'o': 911, 'p': 113,
        'q': 10,  'r': 456, 's': 40,  't': 122,
        'u': 534, 'v': 27,  'w': 81,  'x': 33,
        'y': 65,  'z': 6
    },
    2: {
        'a': 605, 'b': 128, 'c': 184, 'd': 178,
        'e': 397, 'f': 87,  'g': 139, 'h': 39,
        'i': 516, 'j': 8,   'k': 90,  'l': 388,
        'm': 209, 'n': 410, 'o': 484, 'p': 169,
        'q': 4,   'r': 475, 's': 248, 't': 280,
        'u': 313, 'v': 121, 'w': 98,  'x': 67,
        'y': 68,  'z': 52
    },
    3: {
        'a': 339, 'b': 99,  'c': 210, 'd': 218,
        'e': 1228,'f': 100, 'g': 176, 'h': 73,
        'i': 284, 'j': 4,   'k': 243, 'l': 365,
        'm': 188, 'n': 386, 'o': 262, 'p': 196,
        'q': 0,   'r': 310, 's': 257, 't': 447,
        'u': 154, 'v': 61,  'w': 70,  'x': 5,
        'y': 41,  'z': 41
    },
    4: {
        'a': 178, 'b': 24,  'c': 48,  'd': 431,
        'e': 595, 'f': 44,  'g': 61,  'h': 192,
        'i': 45,  'j': 0,   'k': 143, 'l': 202,
        'm': 77,  'n': 203, 'o': 150, 'p': 91,
        'q': 0,   'r': 401, 's': 1764,'t': 360,
        'u': 13,  'v': 0,   'w': 28,  'x': 30,
        'y': 665, 'z': 12
    }
}
</code></pre></div></div>]]></content><author><name></name></author><category term="math" /><category term="tech" /><category term="math" /><category term="probability" /><category term="wordle" /><summary type="html"><![CDATA[Basic analysis of 5-letter words. Inspired by Wordle.]]></summary></entry><entry><title type="html">Reading *paid* Financial Times articles for free</title><link href="https://oneturkmen.github.io/tech/2020/02/23/ft-subscription-for-free.html" rel="alternate" type="text/html" title="Reading *paid* Financial Times articles for free" /><published>2020-02-23T00:34:14+00:00</published><updated>2020-02-23T00:34:14+00:00</updated><id>https://oneturkmen.github.io/tech/2020/02/23/ft-subscription-for-free</id><content type="html" xml:base="https://oneturkmen.github.io/tech/2020/02/23/ft-subscription-for-free.html"><![CDATA[<p><strong>Note:</strong> FT disabled the app so you cannot read the news for free anymore.
You can still access the archived versions though, with the link provided at
the end of this article. This incident happened back in August 2018.</p>

<figure>

  <picture>
    <img class="img-fluid rounded z-depth-1" src="/assets/img/ft_medium.png" />

  </picture>

</figure>

<h4 id="paid-financial-times--for-free">Paid Financial Times … for free</h4>

<p>I once stumbled upon an article from the Financial Times, an online newspaper
with an emphasis on business and economic news.</p>

<p><a href="https://medium.com/ft-product-technology/making-a-request-to-the-financial-times-b2119a2f422d">The article</a>,
named <em>What happens when you visit FT.com?</em>, describes what happens in the background
when one makes a request to <a href="https://www.ft.com">ft.com</a>. Besides many interesting
technical details, I noted that FT used <a href="https://www.heroku.com">Heroku</a> for hosting its services.</p>

<p>If you have ever used Heroku, you know that apps get public <code class="language-plaintext highlighter-rouge">&lt;NAME&gt;.herokuapp.com</code> domain by default.
For example, say I register a Heroku app and name it <strong>apple</strong>. When I deploy my <strong>apple</strong> app (e.g.,
a RESTful web server), I will be able to access it on <code class="language-plaintext highlighter-rouge">https://apple.herokuapp.com</code>.</p>

<p>The moment I saw the word <em>Heroku</em> in their article, I wondered if I could randomly
guess any of their services’ URLs. Out of curiosity, I tried a few. Surprisingly, one of the URLs (namely, <code class="language-plaintext highlighter-rouge">financialtimes.herokuapp.com</code>)
worked; I managed to access the same contents as on <code class="language-plaintext highlighter-rouge">ft.com</code>.
What was even more exciting was that I could access paid articles <strong>for free</strong>.
In other words, one could read all articles, paid or not, without any subscription whatsoever.</p>

<figure>

  <picture>
    <img class="img-fluid rounded z-depth-1" src="/assets/img/ft_bypass.png" />

  </picture>

</figure>

<p>Not being sure how serious this issue was, I nevertheless hoped to get some
bounty for finding a “backdoor” that let anyone read paid articles for free
(in other words, authorization could be bypassed). Even though I got nothing back
after reporting the issue (*sigh*), the URL no longer renders any FT
content and the issue seems to be resolved.</p>

<p>It may have happened that one or more engineering staff members forgot to remove that app. Who knows,
maybe it was used for testing (e.g., black-box system testing) or something else.</p>

<p><strong>P.S.</strong> You can access archived versions of the website from 2017 <a href="https://web.archive.org/web/20171101000000*/http://financialtimes.herokuapp.com/">at
archive.org</a>.</p>]]></content><author><name></name></author><category term="tech" /><category term="ft" /><category term="auth-bypass" /><summary type="html"><![CDATA[Learn how you can (or rather, could) read articles on Financial Times for free!]]></summary></entry><entry><title type="html">Debugging Node.js apps with Docker and VS Code</title><link href="https://oneturkmen.github.io/tech/2018/08/22/debugging-nodejs.html" rel="alternate" type="text/html" title="Debugging Node.js apps with Docker and VS Code" /><published>2018-08-22T17:00:00+00:00</published><updated>2018-08-22T17:00:00+00:00</updated><id>https://oneturkmen.github.io/tech/2018/08/22/debugging-nodejs</id><content type="html" xml:base="https://oneturkmen.github.io/tech/2018/08/22/debugging-nodejs.html"><![CDATA[<h4 id="prerequisites">Prerequisites</h4>

<p>There are some things that should be installed before we get started:</p>

<ul>
  <li>Docker</li>
  <li>Node.js</li>
  <li>Visual Studio Code (a.k.a VS Code)</li>
</ul>

<h4 id="keywords">Keywords</h4>

<ul>
  <li><strong>Host</strong> - this is the computer you are working on. In computer networking terms (roughly speaking), it is a computer that communicates with other computers.</li>
  <li><strong>Docker image</strong> - the ordered set of layers/instructions that describes how to build a container (it is a sequence, so the order of commands matters).</li>
  <li><strong>Docker container</strong> - an instance of your <strong>image</strong>. Roughly speaking, it is like an instance of some “class” (OOP).</li>
</ul>

<h3 id="getting-started">Getting Started</h3>

<p>Clone this project and let’s get started!</p>

<h3 id="it-all-starts-with-npm-init">It all starts with <em>npm init</em></h3>

<p>Imagine that you are working on a computer where Node.js is not installed. One way to proceed with the initialization of your project is to install Node.js locally using your package manager (e.g. apt) and then proceed with <code class="language-plaintext highlighter-rouge">npm init</code>. However, there is a cooler, more portable, and more cross-platform way of doing this, where no version conflicts occur and no manual configuration is needed when the OS environment changes. In other words, <del>heaven</del> Docker!</p>

<p>Docker is a container management service. The keywords of Docker are <strong>develop</strong>, <strong>ship</strong> and <strong>run</strong> anywhere. The whole idea of Docker is for developers to easily develop applications, ship them into containers which can then be deployed anywhere. What a brilliant and lovely idea for the DevOps workflow.</p>

<p>I assume you have the latest Docker installed. Docker uses images (check the definition above) to run containers which are, roughly speaking, isolated processes that share the same OS kernel. Note that Docker containers are <strong>NOT</strong> magical, lightweight VMs! If you are interested in how Docker containers work behind the scenes, take a look at the following <a href="https://www.youtube.com/watch?v=sK5i-N34im8">talk</a> given by Jérôme Petazzoni at DockerCon EU.</p>

<p>Let’s initialize our project by using the latest Node.js image. The following command runs an interactive bash terminal, which lets us access the container with Node.js installed in it, and binds the current directory of the host to the <code class="language-plaintext highlighter-rouge">/app</code> directory in the container, which lets us persist our files (e.g. package.json, etc.).</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker run <span class="nt">-it</span> <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/app node /bin/bash</code></pre></figure>

<p>You will immediately notice that your terminal’s hostname has changed to something like <code class="language-plaintext highlighter-rouge">root@b95028b5a79c:/#</code>. Congrats! Now, you are in a container with Node.js present inside! How cool is that, huh? :)</p>

<p>Now, in order to access our files and initialize the project, open the <code class="language-plaintext highlighter-rouge">/app</code> folder and run <code class="language-plaintext highlighter-rouge">npm init</code>:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">cd</span> /app     <span class="c"># opens app folder</span>
npm init    <span class="c"># initialize the Node.js project (creates a package.json)</span></code></pre></figure>

<p>You can also run <code class="language-plaintext highlighter-rouge">ls</code> command to see what’s in the current directory. You will see that you have everything there from the current directory of the host.</p>

<p>Now run <code class="language-plaintext highlighter-rouge">exit</code> to exit from the container. Since you attached a volume using the <code class="language-plaintext highlighter-rouge">-v</code> option when running the container, your package.json can now be seen in the current directory on the host.</p>

<p>Congratulations! So far, you have managed to initialize a Node.js project using Docker (images and running containers) without actually installing Node.js in your computer. It is quite lovely, isn’t it?</p>

<h3 id="dockerfile-and-docker-compose">Dockerfile and docker-compose</h3>

<p>If your app uses MongoDB as a database service, you already have 2 services interacting (i.e. API &lt;-&gt; DB). If you want to dockerize such a multi-service application, you need an image for each service, typically described with Dockerfiles (one per service).</p>

<p>Docker Compose is another Docker tool that lets us manage multi-container applications. With Compose, it is simpler to manage and scale your services. Compose works almost the same way as the <code class="language-plaintext highlighter-rouge">docker</code> command; instead of providing a Dockerfile, you configure your services in a <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file. Although the syntax is a bit different, you can configure the same things as in a Dockerfile: mounting volumes, running commands, pulling images, etc.</p>

<figure>

  <picture>
    <img class="img-fluid rounded z-depth-1" src="/assets/img/docker_compose_example.png" />

  </picture>

</figure>

<p>For instance, take a look at the image above. We can define a single Dockerfile for the API, and define the configs for Mongo just inside the Compose file. Why don’t we create a Dockerfile for it? Indeed, you could do so and configure your DB (e.g. create users, roles, privileges, etc. upon initialization). However, in this case, we don’t need to configure anything; we just need the MongoDB service running in the respective container.</p>

<p><strong>NOTE:</strong> We would use a Dockerfile for the Node.js API service because we have to <strong>build</strong> the app first, i.e. install its dependencies, transpile (if we use TypeScript, etc.). This is what a Dockerfile for the Node.js API would look like:</p>

<figure class="highlight"><pre><code class="language-dockerfile" data-lang="dockerfile"><span class="c"># use Node.js version latest</span>
<span class="k">FROM</span><span class="s"> node</span>

<span class="c"># create app folder in the container (not the host)</span>
<span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> /app

<span class="c"># sets the working directory inside the container (where RUN/CMD commands will be executed)</span>
<span class="k">WORKDIR</span><span class="s"> /app</span>

<span class="c"># copies package.json from the current directory into the /app folder inside the container</span>
<span class="k">COPY</span><span class="s"> package.json /app</span>

<span class="c"># runs "npm install" command inside the container</span>
<span class="k">RUN </span><span class="o">[</span><span class="s2">"npm"</span>, <span class="s2">"install"</span><span class="o">]</span>

<span class="c"># copy the node_modules and the rest of the files into /app</span>
<span class="k">COPY</span><span class="s"> . /app</span></code></pre></figure>

<p>So, what have we done? We wrote a sequence of instructions that defines our image. We pulled the latest Node image from the <a href="https://docs.docker.com/docker-hub/repos/">Docker Hub</a>, created a directory <code class="language-plaintext highlighter-rouge">/app</code> inside the container, “told” Docker to work with the <code class="language-plaintext highlighter-rouge">/app</code> directory, and copied everything (i.e. package.json, src folders, readme, etc.) from the <strong>current folder of the host machine</strong> into the <strong><code class="language-plaintext highlighter-rouge">/app</code> folder in the container</strong>. And, ultimately, we ran the <code class="language-plaintext highlighter-rouge">npm install</code> command inside the container, so the dependencies from package.json get installed.</p>

<p>Now, let’s create the <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file where we will define the configuration (env variables, volumes, networks, etc.) for the services in our architecture. Here is an example of how to define these:</p>

<figure class="highlight"><pre><code class="language-yml" data-lang="yml"><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3"</span>          <span class="c1"># use Compose file format version 3</span>
<span class="na">services</span><span class="pi">:</span>             <span class="c1"># our services</span>
  <span class="na">api</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span> <span class="s">.</span>          <span class="c1"># use Dockerfile from current directory at build time</span>
    <span class="na">volumes</span><span class="pi">:</span>          <span class="c1"># volumes are there to let us persist data when containers are exited</span>
      <span class="pi">-</span> <span class="s">.:/app</span>        <span class="c1"># bind a current directory of the host to the /app directory in the container</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">database</span>      <span class="c1"># not started until the "database" service is started</span>
    <span class="na">networks</span><span class="pi">:</span>         <span class="c1"># lets us be discoverable and reachable by other services in the same network</span>
      <span class="pi">-</span> <span class="s">api-net</span>       <span class="c1"># join "api-net" network</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">3000:3000</span>     <span class="c1"># bind port 3000 on host to port 3000 on container</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s">npm run start</span>  <span class="c1"># command executed when the container starts, i.e. run the Node server</span>

  <span class="na">database</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">mongo</span>      <span class="c1"># if tag is not specified, gets latest image (e.g. MongoDB image)</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">MONGO_INITDB_ROOT_USERNAME=admin</span>      <span class="c1"># set the root username to "admin"</span>
      <span class="pi">-</span> <span class="s">MONGO_INITDB_ROOT_PASSWORD=admin123</span>   <span class="c1"># set the root password to "admin123"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./data:/data/db</span>   <span class="c1"># persist data from mongodb</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">api-net</span>       <span class="c1"># makes "database" reachable (via hostname) by other services in the same network</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">27017:27017"</span> <span class="c1"># bind port</span></code></pre></figure>

<p>Note that you could leverage <em>links</em> as well, though I personally love <em>networks</em> because of their simplicity: you can reach other services in the same network just by using their service name (in our case, <code class="language-plaintext highlighter-rouge">api</code> or <code class="language-plaintext highlighter-rouge">database</code>). One could also define the <code class="language-plaintext highlighter-rouge">working directory</code> inside the Compose file (by using the <code class="language-plaintext highlighter-rouge">working_dir</code> field) instead of specifying it in the Dockerfile.</p>

<p>For more details on Docker Compose, I would recommend the <a href="https://docs.docker.com/compose/compose-file/">Compose file reference</a>.</p>

<h3 id="typescript-and-nodejs">Typescript and Node.js</h3>

<p>TypeScript is a programming language that brings us optional static type-checking and the latest ECMAScript features. By using TypeScript, you can leverage the power of OOP, i.e. interfaces, classes, inheritance, polymorphism, etc. I would personally recommend it to everyone, especially to those who come from the Java/C# side and are just starting out with JavaScript. <code class="language-plaintext highlighter-rouge">.ts</code> files are compiled to <code class="language-plaintext highlighter-rouge">.js</code> files, meaning that TypeScript is compiled to JavaScript. So, in the end, you end up with JavaScript anyway :)</p>

<p>In order to get started with TypeScript, we have to install the <code class="language-plaintext highlighter-rouge">typescript</code> module via npm. We can do this by running our Node container again. Do not forget to attach a volume so the change in package.json is actually saved on our host:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker run <span class="nt">-it</span> <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/app node /bin/bash    <span class="c"># access the Node container</span>
<span class="nb">cd</span> /app                                         <span class="c"># get into /app folder</span>
npm <span class="nb">install </span>typescript <span class="nt">--save-dev</span>               <span class="c"># install and save as development dependency</span></code></pre></figure>

<p>At the same time, let’s install <code class="language-plaintext highlighter-rouge">express</code> and the type definitions for it so we can run an Express server! <strong>Note</strong> that I will not be using MongoDB in this case (though you should experiment and try it yourself!).</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">npm <span class="nb">install </span>express <span class="nt">--save</span>
npm <span class="nb">install</span> @types/express <span class="nt">--save-dev</span>
<span class="nb">exit</span>        <span class="c"># exit the container</span></code></pre></figure>

<p>As there is already some boilerplate code defined in <code class="language-plaintext highlighter-rouge">src/server.ts</code> file, let’s change the package.json so that we first compile the code from TS to JS, and then run it! <strong>Make sure you are out of the container.</strong></p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="nl">"build"</span><span class="p">:</span><span class="w"> </span><span class="s2">"tsc"</span><span class="err">,</span><span class="w">   </span><span class="err">/*</span><span class="w"> </span><span class="err">Transpile</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">JS</span><span class="w">  </span><span class="err">*/</span><span class="w">
</span><span class="nl">"start"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npm run build &amp;&amp; node ./dist/server.js"</span><span class="w">  </span><span class="err">/*</span><span class="w"> </span><span class="err">Build</span><span class="w"> </span><span class="err">and</span><span class="w"> </span><span class="err">run</span><span class="w"> </span><span class="err">the</span><span class="w"> </span><span class="err">server</span><span class="w"> </span><span class="err">*/</span></code></pre></figure>

<p>We also need to create a <code class="language-plaintext highlighter-rouge">tsconfig.json</code> file that configures the TypeScript compiler. For more details on TS compiler configuration, check out <a href="https://github.com/Microsoft/TypeScript-Node-Starter#typescript-node-starter">this link</a>. Here is our example:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
    </span><span class="nl">"compileOnSave"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
    </span><span class="nl">"compilerOptions"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"outDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dist"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"module"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commonjs"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"target"</span><span class="p">:</span><span class="w"> </span><span class="s2">"es6"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"noImplicitAny"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"moduleResolution"</span><span class="p">:</span><span class="w"> </span><span class="s2">"node"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"sourceMap"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"baseUrl"</span><span class="p">:</span><span class="w"> </span><span class="s2">"."</span><span class="p">,</span><span class="w">
        </span><span class="nl">"skipLibCheck"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"paths"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"*"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                </span><span class="s2">"node_modules/*"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"/src/types/*"</span><span class="w">
            </span><span class="p">]</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="err">//</span><span class="w"> </span><span class="err">includes</span><span class="w"> </span><span class="err">all</span><span class="w"> </span><span class="err">typescript</span><span class="w"> </span><span class="err">files</span><span class="w">
    </span><span class="nl">"include"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"./src/**/*.ts"</span><span class="p">,</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="err">//</span><span class="w"> </span><span class="err">excludes</span><span class="w"> </span><span class="err">the</span><span class="w"> </span><span class="err">folder</span><span class="w"> </span><span class="err">containing</span><span class="w"> </span><span class="err">the</span><span class="w"> </span><span class="err">compiled</span><span class="w"> </span><span class="err">JS</span><span class="w"> </span><span class="err">files</span><span class="w"> 
    </span><span class="nl">"exclude"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="s2">"./dist"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>Now, let’s create a <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file with which we will run a Node.js server. Here, we will not use a Dockerfile; we will instead configure the API service straight in the Compose yaml file.</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3"</span>          <span class="c1"># use Compose file format version 3</span>
<span class="na">services</span><span class="pi">:</span>             <span class="c1"># our services </span>
  <span class="na">api</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">node</span>       <span class="c1"># use latest node image</span>
    <span class="na">working_dir</span><span class="pi">:</span> <span class="s">/app</span> <span class="c1"># set the working directory to /app</span>
    <span class="na">volumes</span><span class="pi">:</span>          <span class="c1"># volumes are there to let us persist data when containers are exited</span>
      <span class="pi">-</span> <span class="s">.:/app</span>        <span class="c1"># bind a current directory of the host to the /app directory in the container</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">3000:3000</span>     <span class="c1"># bind port 3000 on host to port 3000 on container</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">npm</span><span class="nv"> </span><span class="s">run</span><span class="nv"> </span><span class="s">start"</span></code></pre></figure>

<p>Before running the container, we have to install our dependencies. You may not have Node.js on your computer, or you may have a different version, so let’s run a Node container (of the latest version) and install our dependencies (do not forget to attach a volume):</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker run <span class="nt">-it</span> <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/app node /bin/bash
<span class="c"># we are inside the container</span>
npm <span class="nb">install
exit</span> </code></pre></figure>

<p>To run the actual service, type the following command (the <code class="language-plaintext highlighter-rouge">-f</code> option specifies the Compose file to use):</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker-compose <span class="nt">-f</span> docker-compose.yml up</code></pre></figure>

<p><strong>Note if you are getting ERROR:</strong> If you are getting the <code class="language-plaintext highlighter-rouge">ERROR: Error processing tar file(exit status 1): unexpected EOF</code>, run the following commands:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">cd</span> ..       <span class="c"># get out to the parent directory </span>
<span class="nb">sudo chown</span> <span class="nt">-R</span> <span class="si">$(</span><span class="nb">whoami</span><span class="si">)</span> nodejs-debugging/  <span class="c"># this gives you and Docker the rights to the nodejs-debugging folder</span></code></pre></figure>

<p>If you see <code class="language-plaintext highlighter-rouge">Listening on port 3000!</code> message, yay! Open a browser and type <code class="language-plaintext highlighter-rouge">localhost:3000</code>, and you will see the server is up. You can also try opening the routes, i.e. <code class="language-plaintext highlighter-rouge">/greet</code> and <code class="language-plaintext highlighter-rouge">/time</code>.</p>

<h3 id="debugging-with-visual-studio-code">Debugging with Visual Studio Code</h3>

<p>Visual Studio Code (VS Code) has built-in debugging support for the Node.js runtime and can debug any language that is transpiled to JavaScript.</p>

<p>Since the VS Code Node.js debugger communicates with the Node.js runtime through a wire protocol, the set of supported runtimes is determined by which runtimes support these protocols:</p>

<ul>
  <li><strong>legacy:</strong> the original <a href="https://github.com/buggerjs/bugger-v8-client/blob/master/PROTOCOL.md">V8 Debugger Protocol</a> which is currently supported by older runtimes.</li>
  <li><strong>inspector:</strong> the new <a href="https://chromedevtools.github.io/debugger-protocol-viewer/v8/">V8 Inspector Protocol</a> is exposed via the <code class="language-plaintext highlighter-rouge">--inspect</code> flag in Node.js versions &gt;= 6.3. It addresses most of the limitations and scalability issues of the legacy protocol.</li>
</ul>

<p>As we are running a server from a Docker container, we have to attach a <em>remote</em> debugger. We need to add a <strong>launch configuration</strong> to the <code class="language-plaintext highlighter-rouge">.vscode</code> folder, i.e. <code class="language-plaintext highlighter-rouge">launch.json</code>. Here is an example of the launch configuration file:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
    </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.2.0"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"configurations"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"node"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"request"</span><span class="p">:</span><span class="w"> </span><span class="s2">"attach"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remote Debugging"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"address"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"port"</span><span class="p">:</span><span class="w"> </span><span class="mi">9229</span><span class="p">,</span><span class="w">
            </span><span class="nl">"localRoot"</span><span class="p">:</span><span class="w"> </span><span class="s2">"${workspaceFolder}/dist"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"remoteRoot"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/app/dist"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"protocol"</span><span class="p">:</span><span class="s2">"inspector"</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>We set the <em>remote root</em> to be the path in the container where our program lives.</p>

<p>Now, in order to add remote debugging, we have to add another script to package.json:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="nl">"debug"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npm run build &amp;&amp; node --inspect-brk=0.0.0.0:9229 ./dist/server.js"</span></code></pre></figure>

<p>The <code class="language-plaintext highlighter-rouge">debug</code> script will build the project (i.e. transpile the TS code) and start the Node runtime in debugging mode, accessible remotely on port <code class="language-plaintext highlighter-rouge">9229</code> (remember the port we specified above?).</p>

<p>Let’s create another Compose file, which we will use for running the server in the debug mode:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3"</span>          <span class="c1"># use Compose file format version 3</span>
<span class="na">services</span><span class="pi">:</span>             <span class="c1"># our services </span>
  <span class="na">api</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">node</span>       <span class="c1"># use latest node image</span>
    <span class="na">working_dir</span><span class="pi">:</span> <span class="s">/app</span> <span class="c1"># set the working directory to /app</span>
    <span class="na">volumes</span><span class="pi">:</span>          <span class="c1"># volumes are there to let us persist data when containers are exited</span>
      <span class="pi">-</span> <span class="s">.:/app</span>        <span class="c1"># bind a current directory of the host to the /app directory in the container</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">3000:3000</span>     <span class="c1"># bind port 3000 on host to port 3000 on container</span>
      <span class="pi">-</span> <span class="s">9229:9229</span>     <span class="c1"># bind port 9229 for debugging</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">npm</span><span class="nv"> </span><span class="s">run</span><span class="nv"> </span><span class="s">debug"</span></code></pre></figure>

<p>The only differences are that we are now running in <strong>debug mode</strong> and we attached extra <strong>port for debugging (9229)</strong>.</p>

<p>Type the following command in the terminal so you can run the debug server:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker-compose <span class="nt">-f</span> docker-compose.debug.yml up</code></pre></figure>

<p>And you will see the following message:</p>

<blockquote>
  <p>api_1  | Debugger listening on ws://0.0.0.0:9229/f3cdf4f2-4685-21e6-8c31</p>
</blockquote>

<p>Yay! Now, as the debugger is listening on <code class="language-plaintext highlighter-rouge">0.0.0.0:9229</code>, we can start debugging via VS Code. If you press <code class="language-plaintext highlighter-rouge">CTRL + SHIFT + D</code>, the Debug view will open. You will see the “Remote Debugging” configuration in the upper-left corner, next to a button that looks like a green triangle. E.g.:</p>

<figure>

  <picture>
    <img class="img-fluid rounded z-depth-1" src="/assets/img/debugger_view.png" />

  </picture>

</figure>

<p>Be courageous and click on the green triangle button. Congratulations! You have just started debugging your app using VS Code and Docker containers!</p>

<h3 id="bonus-round-simplifying-debugging-further-">Bonus round! Simplifying debugging further …</h3>

<p>Launching the container manually and then proceeding to the <em>Debug</em> view in VS Code may be a bit of a daunting and annoying task. However, we can <strong>automate</strong> that and make our lives a bit easier: let’s make it so that we can start debugging by just pressing the <em>F5</em> key! :squirrel:</p>

<p>Let’s first create the <strong>tasks</strong> file inside the <code class="language-plaintext highlighter-rouge">.vscode</code> directory, where our configurations for the VS Code reside. Copy-paste the following into the <code class="language-plaintext highlighter-rouge">tasks.json</code>:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
    </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.0.0"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"label"</span><span class="p">:</span><span class="w"> </span><span class="s2">"launch-debug-container"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"docker-compose -f docker-compose.debug.yml up"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"shell"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"group"</span><span class="p">:</span><span class="w"> </span><span class="s2">"build"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"isBackground"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>Tasks help you automate building and testing your app. For example, a task can build your app before every debugging session. A launch configuration can reference multiple tasks, so you can automate any step you would otherwise repeat by hand every time you start debugging.</p>

<p>In the task above, we label it as “launch-debug-container” and make it execute a command to start the containers specified in the <code class="language-plaintext highlighter-rouge">docker-compose.debug.yml</code> file.</p>

<p>Now, how do we run a task when we actually <em>launch</em> the debugger? We have to adjust <code class="language-plaintext highlighter-rouge">launch.json</code> by adding another field to our “Remote Debugging” configuration:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
  </span><span class="err">/*</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="err">*/</span><span class="w">
  </span><span class="nl">"preLaunchTask"</span><span class="p">:</span><span class="w"> </span><span class="s2">"launch-debug-container"</span><span class="w">
  </span><span class="err">/*</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="err">*/</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>By giving the <em>preLaunchTask</em> property a <em>label</em> from <code class="language-plaintext highlighter-rouge">tasks.json</code>, our task will be executed before the debugger launches. <strong>Note</strong> that I also increase the timeout to 60 seconds (the default is 10), since the containers from <code class="language-plaintext highlighter-rouge">docker-compose.debug.yml</code> take some time to start.</p>
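<p>For reference, here is roughly what the whole “Remote Debugging” configuration could look like at this point. Keep in mind this is just a sketch: the <code class="language-plaintext highlighter-rouge">timeout</code> value is in milliseconds, <code class="language-plaintext highlighter-rouge">/app</code> matches the <code class="language-plaintext highlighter-rouge">working_dir</code> from our compose file, and the remaining names and paths depend on your own setup:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json">{
  "type": "node",
  "request": "attach",
  "name": "Remote Debugging",
  "address": "localhost",
  "port": 9229,
  "localRoot": "${workspaceFolder}",
  "remoteRoot": "/app",
  "timeout": 60000,
  "preLaunchTask": "launch-debug-container"
}</code></pre></figure>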

<p>In addition, we may want our containers to be stopped and removed after debugging. If you don’t mind them staying up, then you are done! Otherwise, let’s add another task that stops and removes the containers once the debugging session ends.</p>

<p>Add the following to the list of tasks in your <code class="language-plaintext highlighter-rouge">tasks.json</code>:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
    </span><span class="nl">"label"</span><span class="p">:</span><span class="w"> </span><span class="s2">"end-debug-container"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"docker-compose -f docker-compose.debug.yml down"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"shell"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"group"</span><span class="p">:</span><span class="w"> </span><span class="s2">"build"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"isBackground"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>This will bring the containers down whenever you finish debugging. Let’s also adjust the launch configuration in <code class="language-plaintext highlighter-rouge">launch.json</code> by adding a <code class="language-plaintext highlighter-rouge">postDebugTask</code> property:</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
  </span><span class="err">/*</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="err">*/</span><span class="w">
  </span><span class="nl">"postDebugTask"</span><span class="p">:</span><span class="w"> </span><span class="s2">"end-debug-container"</span><span class="w">
  </span><span class="err">/*</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="err">*/</span><span class="w">
</span><span class="p">}</span></code></pre></figure>

<p>Now is the moment… Just press <em>F5</em> and your debugging session starts auto<em>magically</em>. When you stop debugging, you will see the containers shut down. Good job!</p>

<p><strong>Note:</strong> if you are getting <code class="language-plaintext highlighter-rouge">The specified task cannot be tracked</code> error, click the <em>Debug anyway</em> button and your debugger will start.</p>
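<p>This happens because VS Code cannot tell when a background task is actually “ready”. If the prompt bothers you, one optional way to avoid it is to give the launch task a <code class="language-plaintext highlighter-rouge">problemMatcher</code> with a background section, so VS Code can detect the “Debugger listening on” line in the container logs. A rough sketch (the dummy pattern exists only to satisfy the schema):</p>

<figure class="highlight"><pre><code class="language-json" data-lang="json">{
    "label": "launch-debug-container",
    "command": "docker-compose -f docker-compose.debug.yml up",
    "type": "shell",
    "group": "build",
    "isBackground": true,
    "problemMatcher": {
        "pattern": [{ "regexp": ".", "file": 1, "location": 2, "message": 3 }],
        "background": {
            "activeOnStart": true,
            "beginsPattern": "Attaching to",
            "endsPattern": "Debugger listening on"
        }
    }
}</code></pre></figure>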

<p><img src="https://media.makeameme.org/created/phew-thank-goodness.jpg" alt="squirrel" /></p>

<h3 id="thank-you--feedback">Thank you + Feedback</h3>

<p>Thank you, and good job for reading the entire tutorial! It’s been a long read, but you have just learned something new that will help you a lot when debugging your Node.js apps and make you a better developer!</p>

<p>Constructive feedback is always welcome via <em>issues</em> on this repo.</p>

<p><strong>PS.</strong> If something does not work, or if you have any problems, please open an issue in this repo and I will do my best to help you asap.</p>

<h3 id="references">References</h3>

<ul>
  <li><a href="https://code.visualstudio.com/docs/nodejs/nodejs-debugging">Node.js Debugging in VS Code</a></li>
  <li><a href="https://github.com/Microsoft/vscode-recipes/tree/master/Docker-TypeScript">Debugging TypeScript in a Docker Container</a></li>
  <li><a href="https://www.tutorialspoint.com/docker/docker_compose.htm">Docker - Compose</a></li>
</ul>]]></content><author><name></name></author><category term="tech" /><category term="node-js" /><category term="docker" /><category term="vs-code" /><summary type="html"><![CDATA[Learning to debug Node.js apps with Docker and VS Code.]]></summary></entry></feed>