Using old-style CLI to handle large files

One shortcoming of R is that sometimes memory runs out. I was unfortunate enough to have several really large text files with data, and no database to put them in. Instead of stopping dead, I resorted to the Windows CLI. With the plain COPY command, it takes just a few seconds to concatenate fairly large files. With FINDSTR /V, the unnecessary rows I had accidentally included were quickly dispatched. Using GAWK feels like reciting medieval incantations, but once you have a script written, it is quite fast on text. SED is also quite a nifty tool (though I haven’t used it much). With GAWK it is even possible to do sort-merge joins. There are plenty of R packages for handling large files, like ‘feather’ and ‘RSQLite’, but sometimes it’s simply quicker to go medieval.
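
For reference, roughly the kind of one-liners I mean (file names and patterns are made up):

REM concatenate two large extracts into one file
copy /b part1.txt + part2.txt combined.txt

REM drop the rows that match an unwanted pattern
findstr /V "UNWANTED_PATTERN" combined.txt > cleaned.txt

REM join two files on their first column (a hash-style lookup in gawk; a true sort-merge join needs pre-sorted input)
gawk "NR==FNR { a[$1] = $2; next } ($1 in a) { print $0, a[$1] }" lookup.txt cleaned.txt > joined.txt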

So, there is a significant speed increase when resorting to basic console tools for large matrix-like text files. My database guru Kimball (does anyone read him anymore?) actually has a few chapters on handling ETL flows with console tools, but I NEVER thought I’d need that. So in conclusion, I guess, Data Science is like a box of chocolates…

Using Redshift regular expressions (in Sweden)

How to search and replace wrongly encoded strings in Redshift, with R

To replace parts of a string in the Redshift database, the function REGEXP_REPLACE is the tool (familiar with regexps?). However, sometimes we end up with wrongly encoded strings that need mending:

"TillÃ¤gg" => "Tillägg"

In order to find and replace these, the quickest and most trouble-free way for me has been the following.

Find and replace problem strings using R and Redshift

To search for weird characters to replace, we need their numerical values. Use utf8ToInt().

my_string %>% str_sub(5, 6) %>% utf8ToInt()   #   [1] 195 182

To replace with åäö in Redshift SQL, use the CHR function (see a table of Unicode numbers).

Unicode for ä :       00E4
ä as numerical :      utf8ToInt("\u00e4") # 228
Double check... :     intToUtf8(228)      # "ä"
Redshift SQL check:   SELECT CHR(228)     # "ä".

To simplify the conversions, I made a small function (still a work in progress):

# requires the 'glue' package; sql_query_sender() is my own helper that runs a statement on Redshift
replace_in_table <- function(targettable = "mytable",
                             targetcol = "mycol") {

  cat("1..")
  sql_query_sender(
    glue("
      UPDATE {targettable}
      SET {targetcol} =
        REGEXP_REPLACE ( {targetcol} , CHR(195)||CHR(164), CHR(228) )
      WHERE {targetcol} LIKE '%'||CHR(195)||CHR(164)||'%'")
  )

  cat("2..")
  sql_query_sender(
    glue("
      UPDATE {targettable}
      SET {targetcol} =
        REGEXP_REPLACE ( {targetcol} , '<f6>', 'ö' )
      WHERE {targetcol} LIKE '%<%'"))
  cat("done.\n")
}
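
sql_query_sender() above is my own little wrapper. A minimal sketch of it, assuming an already established DBI connection object (here called redshift_con, a made-up name), could look like this:

# hypothetical helper: runs a single SQL statement over an existing DBI connection
sql_query_sender <- function(sql, con = redshift_con) {
  DBI::dbExecute(con, sql)
}

# example call; the table and column names are made up
replace_in_table(targettable = "stage.customers", targetcol = "customer_name")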

For (1) we look for the two-character sequence CHR(195)||CHR(164) (the UTF-8 bytes of “ä” read as two separate characters) and put the single code point CHR(228), i.e. “ä”, back into the string.

For (2) it seems that “glue” prevents this specific error:

ERROR:  String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence

So, the encoding and behind-the-scenes handling of strings in R is beyond me, most of the time.

Thus, the best fix is to prevent the weird characters from entering the database in the first place. Happy replacing!

Word and Excel alliance

As a side task, I have automated a few reports using Excel and Word. Nothing special, just a pro memoria for next time.

Here is how to set a formula in a cell, using R1C1 (row/column) notation:
Range("D4").FormulaR1C1 = "=R3C2*10"
Range("D4").FormulaR1C1 = "=R[-1]C[-2]*10"
wkbDest.Sheets(destination_sheet).Cells(i, 13).Formula = "=R[0]C[-1]+R[0]C[-2]+R[0]C[-3]+R[0]C[-4]+R[0]C[-5]+R[0]C[-6]+R[0]C[-7]+R[0]C[-8]+R[0]C[-9]+R[0]C[-10]+R[0]C[-11]"

Copy and paste Charts and Tables into Word
The Word document is just a placeholder with links to Excel tables and diagrams. Avoid manual copy-paste! The operational risk is just gigantic.

Use Alt+F9 to toggle a text representation of the links in Word, so you can see which objects are dynamic. Break those links programmatically in VBA by iterating over the collection of shapes in the document, as in the sketch below. After that, re-save the document, which is now updated from Excel.
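
A minimal sketch of the idea (assuming the macro runs inside Word and the linked objects are inline shapes; links pasted as fields live in ActiveDocument.Fields instead):

Dim ils As InlineShape
For Each ils In ActiveDocument.InlineShapes
    If Not ils.LinkFormat Is Nothing Then
        ils.LinkFormat.Update      ' refresh the object from the Excel source first
        ils.LinkFormat.BreakLink   ' then freeze it, so it no longer points at Excel
    End If
Next ils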

A picture link works well if the source Excel document IS OPEN. Caution: updating can get stuck in what looks like an infinite loop if the Excel file is large.

Find (pesky) formulas in a sheet
Press F5, select Special…, choose Formulas and then OK. This selects all cells containing formulas.
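
The VBA equivalent is a one-liner (note that SpecialCells throws a run-time error if the sheet contains no formulas at all):

ActiveSheet.Cells.SpecialCells(xlCellTypeFormulas).Select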

The curse of large Excel files
Once you reach around 10 000 rows of data, VLOOKUPs, other formulas and references to external files generally start to take considerable time. With the organic growth of the business, this problem creeps up on you gradually.

If recalculating the sheet is heavy, copy only the data you need. By transferring values only, no formulas or other expressions need to be recalculated:
wkbDest.Sheets(destination_sheet).Range("A1:T134").Value = _
wkbSource.Sheets(source_sheet).Range("A1:T134").Value

Memo: Do not even try growing with Excel. Go for better software, and accept the increase in complexity, even if the tasks themselves are not complicated (i.e. reporting).
Memo: Never keep adding to existing files that are already large. Keep separate data-storage files that only update data, and a separate file that reads/copies from the data files into a reporting and calculation file. This practice prevents the slow, breaking Excel files that hold both calculations and data together (a growing problem as the files get bigger).

Filling formulas by dragging – in VBA!
Range("C2:C2").Formula = "=B2*$I$2"
Range("C2:C2").Select
Call Selection.AutoFill(Destination:=Range("C2:C" & endrownr), Type:=xlFillDefault)

Finding the rightmost cell in a row
The Range object alone does not seem reliable enough here; go via Cells and End(xlToLeft):

Dim lColumn As Long
lColumn = ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column

This finds the last row in a set of data, for any given column. Range can be used here:
endrownr = wkbSource.Sheets(source_sheet).Range("A1").End(xlDown).Row

LastRow = wkbSource.Sheets(source_sheet).Cells(wkbSource.Sheets(source_sheet).Rows.Count, col_loanid).End(xlUp).Row

This will find a specific value in a column:
rownr = wkbSource.Sheets(source_sheet).Columns("A:A").Find(what:=last_date, LookAt:=xlWhole).Row

Working with Arrays
Dim an array as Variant and then ReDim it to the correct size.
Note that “Option Base 1” makes arrays start at index 1 instead of 0 (it sets the default lower bound).

Dim dest_loans As Variant
ReDim dest_loans(1 To lastrow, 5)   ' the second dimension runs 0 To 5 unless Option Base 1 is set

For j = LBound(temp_array, 1) To i - 1
...
Next j

Date arithmetic to find the last day of last month
Dim last_month As Date
last_month = Date - Day(Date)   ' subtracting today's day-of-month lands on the last day of the previous month

Copy the cell formatting only, to other cells
With wkbDest.Sheets(destination_sheet)
    .Range(.Cells(1, lColumn), .Cells(100, lColumn)).Copy
    .Range(.Cells(1, lColumn + 1), .Cells(100, lColumn + 1)).PasteSpecial xlPasteFormats
End With

Appendix : Some general code to handle files

'------------------------------------------------------------------------
' Auxiliary function
' Check whether a workbook is open
'------------------------------------------------------------------------

Function IsWorkBookOpen(sWB As String) As Boolean
    On Error GoTo NotOpen
    Workbooks(sWB).Activate    ' fails and jumps to NotOpen if the workbook is not open
    IsWorkBookOpen = True
    Exit Function
NotOpen:
    IsWorkBookOpen = False
End Function

Dim wkbDest As Workbook
Dim wkbSource As Workbook
Dim destination_sheet As String
Dim source_sheet As String

Dim source_file As String
Dim source_path As String

'--------------------------------------------------------------
' setup data (external file)
'--------------------------------------------------------------

' Path to source data
source_file = "LR Consolidated Deposit Flows.xls"
source_path = "L:\Reports\LR to BM 2 days\"

' Save the active workbook reference
Set wkbDest = ActiveWorkbook

' Check whether the source file is already open
Dim ret As Boolean
ret = IsWorkBookOpen(source_file)
If ret = False Then
    Set wkbSource = Workbooks.Open(source_path & source_file)
Else
    Set wkbSource = Workbooks(source_file)
End If

' Get data from sheets based on last day in month backwards one year
destination_sheet = "LR 2 dep flow"
source_sheet = "Daily Deposits"

wkbDest.Sheets(destination_sheet).Range("A1:T134").Value = _
wkbSource.Sheets(source_sheet).Range("A1:T134").Value

Feeling respected is nice, but not productive?

Everyone wants their opinion to matter. Especially if they feel ignored despite being knowledgeable. But how can strong opinions hinder an organisation in achieving goals and ambitions (as well as frustrate any analyst on the way)?

Some clients want their own solution strategy implemented, because then they know it will work. They specify the “what” and the “how” simultaneously. However, if the client is not in a position to solve the problem, the “how” might be a wild stab in the dark, perhaps so wrong that the analytical effort fails. Or it might be an attempt to control the outcome of the analysis, even purposefully dooming it to failure.

The real problem with opinion is that when it is not grounded in theory, and is applied too frequently or too stubbornly, it creates friction as the decision makers’ world changes in the light of hard data. This kind of dogmatic opinion, when contradicted by facts, tempts clients to plausibly explain “away” the contradiction rather than learn from it. Embarrassment will prevent the subsequent use of the solution (and maybe future collaboration with the client). This is, of course, unproductive.

The way around it? Ensure that the solution meets the client’s purpose: answer the “why” question. The “why” ensures that the solution is indeed a piece of the larger puzzle, and will fit into that larger problem/solution. The “why” steers away from the “how to solve” discussion, and definitely overshadows the “what to solve”. Sometimes digging into the “why” will even change the “what”.

This situation is closely connected to the organisational “mentality” and the psychology of the individual clients. As an analyst, consider trying to change one or both, but remember that neither changes overnight! The learning experience is our foremost asset, so don’t waste time where circumstances are not right for learning.

Personal cluster analysis : Analyzing yourself by analyzing your environment

Great minds think alike. Or, like-minded thinking is great.

Knowing the real me
Analyzing yourself is just as important as analyzing data, simply because we analysts never work in a vacuum: there is always an organisation, co-workers, managers and clients that are vital for making work happen. To see ourselves from a 360-degree perspective, maybe we can try a cluster analysis!

Use a Cluster-based approach
A cluster analysis groups similar objects. Chances are that your working team is a cluster, based on similarities. That means that _you_ are similar to your colleagues, and probably also similar to your manager/s, who hired you. So, look around to spot yourself! Any similarities between your colleagues probably apply to you, too.

Use some popular list as a similarity metric
A list of career warning signs can be found in this post, and can be summarized as “mental attitude to work, life and colleagues”. If your co-workers and managers fit the list of shortcomings, then chances are you’re right there along with them =)

How clusters behave has implications for the individual
If your own ethics do not rhyme with those of your teammates, chances are that you will be passed over for a lot of career development. Maybe, just maybe, similarities actually matter the most for career, respect and salary. (Some angry commenters on the post above fume over “managing up”, “kissing *ss” and “playing politics”.)

This only means that some people share ethics, are “in the game”, playing along, etc. Others are outsiders and end up on the receiving end of the blame game, being passed over, or outright fired for mysterious reasons (Some angry commentators in the above post have examples of all these, perhaps sentiment-labeled “bitter”).

Summary of the usefulness of this cluster-based approach
The lack of similarities between you and your environment then becomes a leading indicator of your own future. Take it as a measurement to help you find clusters that belong to you, and vice versa. Most of the time only the end result matters, so being part of a successful organisation that matches your ethics is a prerequisite for your own successful career!

Quick connection from Hive to S3

The quickest way for me to move something out of Hadoop to S3 is to create a table located in an S3 bucket and then copy the data over.

First create the table you want:

CREATE TEMPORARY TABLE stage.temporary1057 AS
SELECT ...
FROM database.table;

Create the S3-table, located on your bucket:

CREATE TABLE stage.temporary1057_s3
LIKE stage.temporary1057
LOCATION 's3a://bucket/folder';
ALTER TABLE stage.temporary1057_s3 SET TBLPROPERTIES('EXTERNAL'='TRUE');
ALTER TABLE stage.temporary1057_s3 SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim' = '\t');
ALTER TABLE stage.temporary1057_s3 SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.format' = '\t');
INSERT OVERWRITE TABLE stage.temporary1057_s3
SELECT * FROM stage.temporary1057;
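
A quick, optional sanity check is to compare row counts between the staging table and the S3 table:

SELECT COUNT(*) FROM stage.temporary1057;
SELECT COUNT(*) FROM stage.temporary1057_s3;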

For me, only the s3a prefix works now; s3n worked previously. Unclear why, at the moment.

It would be nice to have a CREATE straight into S3, for temporary tables. Maybe too accident-prone.

For me, the main reason is to get stuff into Excel, to inspect and look at it, in a condensed form. If anyone has a way to print table-excerpts in Hive with fixed-width display, or some other way of looking at data clearly, let me know!

Reinventing the Data Scientist wheel

What is causality? And how do I get it? These, and many other questions, are answered beautifully here.

In short
The jump from data to decision means aiming to answer the (general) question of “what causes what”. This question of “causality” is answered best by a well-rounded team including analysts and researchers.

What happens when statistical skills are missing?
So, what happens if you forget the “statistics” part of the Data Science balance? The blog linked above cuts through the hype to restore the balance between data technology and statistics. At statistics school, they teach design of experiments as the golden road to causality, and as a counter-measure against false learning from overly “data-centric” approaches. The skills needed range from basic ones such as statistical tests and Simpson’s paradox to “trickier” concepts like power.

Interpreters of findings can fill the gap
Proficiency in both IT and statistics, as well as business skills, is needed to be successful, and learning all of this takes a lot of time. If the only analysts available are computer science majors (who are usually most interested in clever programming), who else is on the scene to help decision makers/managers/CxOs understand what causes what? “Interpreters” are needed; they need not be expert statisticians, programmers or business experts, but somewhere in between, able to communicate findings.

Some final remarks
While Data Scientists might be catching up on causality, there is a whole host of professional veterans available, most notably in the life sciences, which have been decades ahead in finding fact-based correlations that hold over time, over geographies, over ethnicities, and in many other contexts. We can dig into their treasure chest at our own leisure to learn their methods.

I remain fascinated (and slightly grumpy-fied) by how the constant reinventing of the “hottest profession in the 21st century” (Data Science) ends up discovering/reinventing age-old questions (and solutions) of basic research. Let’s begin from “standing on the shoulders of giants” before charging into the hottest future on the market!

Learning by Doing, part 2

The field of Data Science is a young one. So, with no paved road to future success, where are we going? As analysts, we should use facts, analysis and domain knowledge to make a viable forecast for ourselves, if we are to be serious about forecasting reality for others! With the history of other once-nascent industries as our data, let’s find connections to Data Science (and hope that history will repeat itself!).

Here I will connect Data Science with the conclusions of James Bessen (author of “Learning by Doing” [ISBN 978-0-300-19566-8]). He argues convincingly about tech fields in general. Using his conclusions, I will try to predict possible scenarios for the young tech field of Data Science.

What will happen to the analytics/Data Science community in a few years?

The 3 observations below connect directly to Bessen’s observations about where a technology is in its “life cycle”. Specifically, they suggest that Data Science is in its starting phase (not so surprisingly…). Below them are some educated guesstimates about the coming decades.

What are some signs of an immature market?

1: [Education] Analytics are mostly done by “educated professionals”, as opposed to vocationally trained workers, which is a sign of an immature technology. “Educated professionals” adapt constantly to new technology since their education is “learning how to learn”.

2: [Narrow markets] Skills are more-or-less domain specific. It is still very hard to move from the insurance business to online retail. Bessen notes how hard it was to move from one type of weavers’ mill to another in the first decades of the industrial revolution.

3: [Sharing] Competition is non-existent, as immature markets are still expanding. Therefore, knowledge sharing and open source enjoy a large community. This non-competition benefits workers AND companies. The early home computer clubs helped the Microsoft and Apple founders develop, as an example.

Forecasts of where the market will be in the next decade/s

So, when Data Science/analytics technology matures, where is it headed over the next decades? (NB: the word “technology” here covers both software/hardware and the associated skills.)

Forecast 1: [More vocational training] More analysts with vocational training and specific skills. Today, a teenager can run a big weather simulation on Amazon’s elastic cloud from her laptop, so that extra college year won’t be valuable to the employer. Universities will adapt with narrow learning-by-doing programs, OR become more theoretical/research oriented (probably both will happen, and universities/colleges/high schools will complement each other).

Forecast 2: [Wages might develop] Wages will go up more. Or not. It depends on whether the analytical problems (and solutions) are transferable between companies. Say you develop a method/model to improve profits for shoe retail stores; airplane companies won’t pay you for that, but other shoe retailers will, and at an increased salary. I’m a bit undecided as to what will happen here.

Forecast 3: [Less sharing] Cooperation online will diminish. After all, with a saturated market, my specific skills are worth more if fewer people know about them. Even to the point that it might be the only way for me to get a job: sharing only means stiffer competition.

Forecast 4: [More patents] Patented technology will be even more of a big deal, as companies do not want to share, either. More non-disclosure agreements and non-compete clauses. (This is seen today, but not as severe, at least here in Sweden).

Forecast 5: [Licensing] Professional licensing will create entry barriers to protect analysts already on the market. The statisticians already have their own system of “chartered statistician”. Why not a “chartered Data Scientist”, or whatever the title will be in the future. Sadly, according to Bessen, a license gives a minimum quality level but does not improve overall service quality.

Forecast 6: [Prices down, salaries up] The overall price of analytics will go down. As the market stops growing, the crunch will create cost-efficiency as a way to make profit. The (vocational) analysts will get a bigger share of profits, as said above, but will also be able to produce so much more. With mature technology, the output and consumption rate is higher as we produce more and buy more at a lower price.

Confounding factors that can prove Bessen wrong
Each of the 3 Data Science competencies (programming/statistics/domain skills) has journeyed from immature to mature technology. Domain skills is admittedly a bit vague, but any technology-dependent area goes through the same stages. What is interesting/confounding is that even though all 3 competencies have matured individually, Data Science has not started out as a mature application. The field of analytics is more than the sum of its parts, and has its own development cycle.

General conclusion (please revisit this around 2041)
The mature market for analysts will see consumers and service providers do more analytics at a higher rate. Specially trained roles will streamline the analytical process. More lateral movement of workers between industries, and higher salaries. Competition increases, and entry barriers to the profession will go up.

Now, to be fair, some of the forecasts above are, or might already be, in progress. I have not seen them fully materialise yet, though. So I am, just like you, eagerly anticipating the future, interested to see where we all will end up professionally!

Antonovsky re-formulated for work-life

This post has nothing to do with analytics per se, but it can still make you a better analyst.

What do motivation and concentration camps have in common?

Cruited dot com is a Swedish start-up that helps clients sharpen CVs and cover letters and, in general, improve their job searching. They were recently quoted in a monthly magazine on the factors behind “motivation at work”:

  • Autonomy (feeling of control, I can do it)
  • Relations (working with people I like)
  • Performance (I know how to do it, and it’s important to the employer)
  • Meaning (this is worthwhile for society)

I recognized this list, but from somewhere else: the sociologist Antonovsky was impressed by how Israeli women coped with growing old. Some were concentration camp survivors and showed a remarkable resilience to life’s hardships. These survivors had less stress and better health, thanks to what Antonovsky formulated as a “sense of coherence”:

  • Comprehensibility (I can understand what is going on, and somewhat predict it, too)
  • Manageability (I have the skills, support and resources to control things)
  • Meaningfulness (things in my life are interesting, meaningful and worthwhile to care about)

Look at the correspondence between “motivation” and “coherence”:

Cruited (“Motivation”)    Antonovsky (“Sense of coherence”)
Autonomy                  Comprehensibility + Manageability
Relations                 [no correspondence]**
Performance               Manageability
Meaning                   Meaningfulness

**The lack of relations in Antonovsky’s work might be an artefact of how the study was conducted, or it may have been folded into “Manageability”, where relations are thought of as “support”. I don’t know, yet.

How come the same factors that help people survive concentration camps also help us experience motivation at work? Any question marks?

The answer to motivation and a healthy life (even when not in a labour camp)
My answer: we humans might have a relativistic view of “hardship”, and react and respond no differently to different scales of “stress”/suffering/hardship. We use the same mechanisms to stay healthy and keep on living.

At work, hardship affects our “motivation” to keep on working. In the concentration camp, hardship affects our “coherence” to keep on living. Thus, the two lists describe the same underlying human response mechanisms to stress/hardship.

This realisation should make us a little more empathetic. The actual danger of a threat is relative; it is perceived in relation to everything else in your life. Since we live sheltered lives (at least here in Sweden), small threats are perceived as large.

Conclusion for the analyst
So, do you want to be more motivated? And live longer when in a labour camp? Improve the 3 or 4 factors above, and understand what they mean for you, personally! There are benefits to be had from improving our approach to suffering, both in your work motivation and in your quality of life.

Move data from one Excel file to another with precision

There is more than one way to command Excel to obey. A recent problem was to automate some basic cut-and-paste sequences between Excel files and run a few VBA snippets. After a few miserable hours, I found a sure way to move data with precision between a source and a destination.

Excel seems to care about which application/window is active when moving data around. Transporting data within a file is simpler than moving it between files; transfers across open Excel files can cause weird system errors. Also, share global variables between functions so that the right window gets activated.

The code to use for copying between 2 workbooks (macro code is in destination file):

Windows(wkbSource.Name).Activate
wkbSource.Sheets("For Front Page").Select
Set shttocopy = wkbSource.ActiveSheet
' Table 1
range_from = "A1:E3"
range_dest = "A1:E3"
wkbDest.Worksheets("Front Page Data").Range(range_dest).Value = shttocopy.Range(range_from).Value

Maybe this can be simplified even more?