|||

Coding for non coding departments.

EDA process

This is going to be the first of a series of blogpost where I showcase why you should consider learning some coding language if you are an intermediate / advanced Excel user. I will try to explain where I am coming from, why Excel works so well, but most importantly where it breaks. In the next instalments I will present small chunks of working code that you can replicate on your machines and slowly build a knowledge library.

Excel is gold

In my subjective opinion, Microsoft Excel is the best program ever written for personal computers.

You can automate calculations with formulas, create summaries with Pivot tables, illustrate results with quick graphs, and even more advanced tasks by leveraging array formulas and VBA scripts. It’s a program for all users and from the simplest cases, up to the more complex.

This is exactly its main attraction. In 2022, it’s understood by almost everyone. This is what your colleagues, or upper management expects from you when you have to present a report with numbers. This is the format of data reports that the business users can download from the data warehouse.

Excel meme

But can we do better?

However, I have one rule to decide wether Excel is the right tool for the task.

If I can grab a coffee on the loading screen of my Excel file, Excel is the wrong tool for the job.

I have worked in the past with files upward of 120 MB with a fairly good machine for the time, and every time I had to recalculate the formulas (press F9) I was sitting on the edge of my chair. This is not fun, nor productive. There is a great chance that the file is going to break mid way and not only you loose all progress but you might corrupt the file as well, rendering it useless. This means that every time before opening the file I had to create a copy of it to ensure that I won’t lose the state of the previous calculations. I was saving all these backups” at our cloud storage and I received a message from IT two weeks into the project stating that, storing 2 GB every day on the cloud is not feasible” - go figure.

The limitations

The main limitation of Excel is that it can hold only 1 million rows of data. While this might sound like an absurdly large number for most users, the truth is that this is more like a theoretical upper limit, than a reality. In day to day tasks I would say that you can store up to 30,000 rows with 10-15 columns of formulas - much less impressive. Any more that that and we start to fall under the category grab a coffee while it’s loading”.

However, there is a greater, much more important limitation of Excel that an average user can overlook. It’s written for Windows OS. While it can run on MacOS as well, the two operating systems are fundamentally different, so two working linked Excel files on Windows won’t work on any other operating system. Why is this important?

The vast majority of servers worldwide, run Linux or Linux-based operating systems.

This means that you cannot put your Excel routine on a server! The computer engineers, almost always use servers to run their code scripts because it can run there 24-7 without human supervision. Imagine how easy it would be to have your excel files running at a prescheduled rate (once per day - week - month) without any input from you and without using the resource of your personal laptop. Well this is exactly what programmers do, but replace excel files” with coding scripts”. You might need a whole business day to:

  1. Feed the new data to your model,
  2. Run the formulas,
  3. Double-check the output and,
  4. Report the results

The correct mindset

If you run the calculations on the server they will run overnight and your working day starts with this report done. Hence you have more time to worry about more elegant tasks such as interpret the numbers, suggest new strategies or create more advanced reports. This is the power of coding for pragmatic users. Not speed, efficiency, or ease of use. You can effectively triple your work by writing scripts that run on your business laptop after your working hours. I used to run a very big routine overnight that resulted in having the final report in the morning (link here). At this point the company has hired two employees for the price of my salary and I can easily negotiate an increase.

Funnily enough, when computer engineers run scripts that do calculations on the server they often call them jobs”, so they say sentences like, I run this job in the morning”, and they mean that they pressed play” on this specific script. Very productive!


Pre-requirements for this series

I’ve seen countless blogs, articles and youtube videos that are showing you very simple examples such as writing Hello World!” on the cell A1 of an Excel file using Python. Then, they call it a day and tell you to go explore on your own. Frankly I find this a bit ridiculous and not helping.

In this series I will try to illustrate with mini examples how you could perform everyday tasks with data using code (no big introduction on the next post). This has the name data-wrangling in the data-science community. Your vlookup, sum, average, if, sumifs, your replace, substitute, left, mid, right, and your today, time, dateif, all fall under this definition. There are many more data operations that don’t have a specific Excel function that are considered data-wrangling. I would hope that after following and understanding a few examples, you would be able to connect them like lego bricks to create your own complicated reports. It’s about building the intuition of why coding is powerful one example at a time.

Last thing to clarify

In this series I won’t be using VBA and instead I am opting to use the R programming language.

  1. Why not VBA?

VBA is a bit too old and a bit too difficult to read and write. In addition to that, you can find almost 200,000 question per month on stack-overflow about VBA and 2,000,000 questions about Python. That means that you are 10 times more likely to find the solution to your problem there if you are using Python.

  1. So why not Python then?

Python is a fantastic choice. However I am making an extra assumption that you are not very familiar with writing and executing code, googling error messages, setting up your environments etc. Essentially I am choosing R because it has a more gradual learning curve. Python has a great community, however it also has some gatekeepers that can make it difficult to progress. In my opinion the community surrounding R is the best of any programming language and the resources are extra helpful for beginners.

Tools
  1. Download R - here. This is the language itself.
  2. Download R-studio (free version) - here. This is the environment where you will develop and run your coding scripts.
  3. Download the package tidiverse, dplyr and tidyr from R-studio.

Helpfull links:

TLDR

  1. Excel is good, but not always.
  2. Coding is better because it frees your time.
  3. I will show you small examples in R that you can recreate on your machines.
  4. R is easier for beginners that Python because of its gradual learning curve.
  5. After some examples you will be confident to build coding reports on your own, as easily as building excel reports.
Up next My attempt at demystifying causality From wide to long formatted data Suppose that you have the following dataset. We have the columns City and Product and then a bunch of columns of
Latest posts Basic setup and libraries From wide to long formatted data Coding for non coding departments. My attempt at demystifying causality How to explore data meaningfully Giovanni Pierluigi da Palestrina Excel and python