Tuesday, July 29, 2025

line end Windows versus others

Windows users who move their files to Unix (Linux, Mac) systems sometimes run into lots of unexpected problems caused by end-of-line encodings. Think of a file as just one long stream of characters, in which some hidden characters mark the end of each line. The end of a line is represented with different conventions in the Windows (DOS) and Unix (Linux, Mac) worlds, as can be seen here:

https://stackoverflow.com/questions/1552749/difference-between-cr-lf-lf-and-cr-line-break-types

Windows uses CR LF and Unix uses LF. Old Mac systems used CR, but modern versions use LF.


How can I see these differences?

- the Windows editor Notepad++ shows them

- the Mac editor Visual Studio Code shows it in the status bar at the bottom of the window ("CRLF" or "LF")

- vim shows the type when you open the file: for a CRLF file it prints [dos] next to the file name


- the file command shows it too

$ file data.txt

data.txt: ASCII text, with CRLF line terminators

$ file extractDescendants.py 

extractDescendants.py: Python script, ASCII text executable

Because the last one doesn't mention CRLF, it uses LF, so it's in Unix format.
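If none of these tools is at hand, the hidden characters can also be counted directly. The following is a minimal sketch in Python, not a replacement for the file command; the function name count_line_endings and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF and lone CR terminators in a block of bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}

# made-up examples: a Windows-style and a Unix-style text
print(count_line_endings(b"a b\r\nc d\r\n"))   # {'CRLF': 2, 'LF': 0, 'CR': 0}
print(count_line_endings(b"a b\nc d\n"))       # {'CRLF': 0, 'LF': 2, 'CR': 0}

# for a real file, read it in binary mode so Python doesn't translate newlines:
# with open("data.txt", "rb") as f:
#     print(count_line_endings(f.read()))
```

Reading in binary mode matters here, because in text mode Python translates CRLF to LF by default and the difference disappears.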


How can I convert among formats?

Most often you want to convert from Windows (CRLF) to Linux (LF).

From memory, Notepad++ allows you to save in Linux format.

In Visual Studio Code, click on the "CRLF" indicator at the bottom of the window and a menu appears that lets you change it.


In the command line:

many servers have the programs flip and dos2unix, which do exactly that:

$ flip -h       


Usage: flip [-t|-u|-d|-m] filename[s]

   Converts ASCII files between Unix, MS-DOS/Windows, or Macintosh newline formats


   Options: 

      -u  =  convert file(s) to Unix newline format (newline)

      -d  =  convert file(s) to MS-DOS/Windows newline format (linefeed + newline)

      -m  =  convert file(s) to Macintosh newline format (linefeed)

      -t  =  display current file type, no file modifications


$ dos2unix -h

dos2unix 6.0.3 (2013-01-25)

Usage: dos2unix [options] [file ...] [-n infile outfile ...]

If you don't have either of these, you can install flip from here:

https://ccrma.stanford.edu/~craig/utility/flip/ 

This is how we compiled it for macOS:

g++ -ansi -O3 flip.cpp -o flip

You only need to run flip -u myfile and it will convert the file to Unix format in place. (You DON'T need flip -m, because that old Mac format is obsolete now.)

For instance:

$ file mydata.txt

mydata.txt: ASCII text, with CRLF line terminators

$ flip -u mydata.txt

$ file mydata.txt

mydata.txt: ASCII text
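If neither flip nor dos2unix can be installed, a few lines of Python do roughly the same job. This is only a sketch under some assumptions: the function name to_unix is made up, the whole file is read into memory, and lone CR characters are also turned into LF. The demo runs on a temporary file with invented content.

```python
import os
import tempfile

def to_unix(path):
    """Rewrite the file at `path` in place with LF line endings."""
    with open(path, "rb") as f:          # binary mode: no newline translation
        data = f.read()
    data = data.replace(b"\r\n", b"\n")  # Windows CRLF -> LF
    data = data.replace(b"\r", b"\n")    # any remaining lone CR (old Mac) -> LF
    with open(path, "wb") as f:
        f.write(data)

# small demo on a temporary file with made-up content
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "mydata.txt")
    with open(demo, "wb") as f:
        f.write(b"1 2 3\r\n4 5 6\r\n")
    to_unix(demo)
    with open(demo, "rb") as f:
        converted = f.read()
    print(converted)   # b'1 2 3\n4 5 6\n'
```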





  

Tuesday, March 25, 2025

handling python files in parallel

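# one output file per key: file0 ... file4
ar=[str(x) for x in range(0,5)]
print(*ar)
handles = [open('file'+x, 'w') for x in ar] # a list, not a generator, so all files are opened right away
#https://stackoverflow.com/questions/1747817/create-a-dictionary-with-comprehension
myHandles=dict(zip(ar,handles))
# send each record to the file chosen by i%5
for i in range(1,100):
    myf=str(i%5)
    print("a",i,file=myHandles[myf])
# close all files so that buffered output is flushed to disk
for h in handles:
    h.close()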
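One way to make sure every handle gets closed, even if the loop fails halfway, is contextlib.ExitStack from the standard library. The sketch below is a variant of the same idea, not the original script: it writes into a temporary directory (an assumption, so that the demo leaves nothing behind) and the names keys, outdir and first_lines are made up.

```python
import os
import tempfile
from contextlib import ExitStack

keys = [str(x) for x in range(5)]

with tempfile.TemporaryDirectory() as outdir:   # hypothetical output directory
    with ExitStack() as stack:
        # every file registered in the stack is closed when this block ends
        handles = {k: stack.enter_context(open(os.path.join(outdir, "file" + k), "w"))
                   for k in keys}
        for i in range(1, 100):
            print("a", i, file=handles[str(i % 5)])
    # the files are closed here, so their contents are complete on disk
    with open(os.path.join(outdir, "file0")) as f:
        first_lines = f.read().splitlines()[:3]
    print(first_lines)   # ['a 5', 'a 10', 'a 15']
```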

Wednesday, November 27, 2024

quick filter in awk

Make a list of the animals in column 2 of the 1st file, then print the lines of the 2nd file whose animal (also in column 2) is NOT in the list:


awk 'FNR==NR{inn[$2]; next}  !($2 in inn)' test5 relGenApprox
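For comparison, here is a rough Python equivalent of the awk one-liner, as a sketch only. The function names field and not_in_list and the two small lists standing in for test5 and relGenApprox are invented for the example.

```python
def field(line, col):
    """Return whitespace-separated field `col` (0-based), or "" if missing, as awk does."""
    fields = line.split()
    return fields[col] if len(fields) > col else ""

def not_in_list(list_lines, data_lines, col=1):
    """Keep the lines of data_lines whose field `col` does not appear in list_lines."""
    seen = {field(line, col) for line in list_lines}   # like inn[$2]
    return [line for line in data_lines if field(line, col) not in seen]

# made-up stand-ins for the two files
test5 = ["x A1", "x A3"]
relGenApprox = ["1 A1 0.5", "2 A2 0.7", "3 A3 0.9"]
kept = not_in_list(test5, relGenApprox)
print(kept)   # ['2 A2 0.7']
```

The set plays the role of the awk array inn: membership tests are fast, so the second file is read only once.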

Friday, October 18, 2024

python like awk

So I want to put Python in a pipe, where it reads from stdin, does something, and writes to stdout:


$ cat test.py

#!/usr/bin/env python3

import sys

# https://stackoverflow.com/questions/1450393/how-do-i-read-from-stdin

for n,line in enumerate(sys.stdin):

    a=line.split()

    if n%5==0:

        sys.stdout.write(line)

Friday, March 1, 2024

extract UPG allocation from renumf90 log

So you want to know, from the output of renumf90 (say ren.log), how many animals were allocated to each UPG?

The output looks like this:

 Unknown parent group allocation

 Equation   Group       #Animals

 59815785       1   21349

...

 Max group = 424; Max UPG ID = 59816208

...

Use this in the command line:

$ sed -n '/allocation/,/Max\ group/p' ren.log | awk '$1 ~ /^[0-9]+$/' 

e.g.

 59815785       1   21349

 59815786       2    4615

explanations:

find between two patterns:

https://askubuntu.com/a/849016

check if the 1st column is a (positive) integer: 

https://stackoverflow.com/questions/28878995/check-if-a-field-is-an-integer-in-awk
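The same two steps (keep the range between two patterns, then keep the lines whose 1st column is an integer) can be sketched in Python. This is an illustration, not something shipped with renumf90: the function name upg_allocation is made up, and the sample log is an abridged version of the lines shown above.

```python
import re

def upg_allocation(lines):
    """Return the fields of the table rows between 'allocation' and 'Max group'."""
    rows = []
    inside = False
    for line in lines:
        if not inside:
            if "allocation" in line:      # start pattern, as in the sed command
                inside = True
            else:
                continue
        elif "Max group" in line:         # end pattern: last line of the range
            inside = False
        fields = line.split()
        if fields and re.fullmatch(r"[0-9]+", fields[0]):   # 1st column is an integer
            rows.append(fields)
    return rows

# abridged sample built from the log lines shown above
log = """ Unknown parent group allocation
 Equation   Group       #Animals
 59815785       1   21349
 59815786       2    4615
 Max group = 424; Max UPG ID = 59816208
""".splitlines()

rows = upg_allocation(log)
for equation, group, n_animals in rows:
    print(group, n_animals)
```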



Tuesday, January 23, 2024

Dictionary of arrays in Julia and breed fractions

 So in Julia I want to create a dictionary (hash table or associative array) of arrays to store breed composition in dairy cattle. It took me a while but I found out how to declare it:

julia> a=Dict{String,Array{Float64,1}}()

Imagine the following pedigree

A 0 0 Holstein
B 0 0 Jersey
C A B
D A C

Then this Julia script computes the breed fractions:

#=
A 0 0 Holstein
B 0 0 Jersey
C A B
D A C
=#

breedcomp=Dict{String,Array{Float64,1}}()

#purebred founders
breedcomp["A"]=[1.0,0.0]
breedcomp["B"]=[0.0,1.0]
#rest of pedigree
breedcomp["C"]=0.5*(breedcomp["A"]+breedcomp["B"])
breedcomp["D"]=0.5*(breedcomp["A"]+breedcomp["C"])

display(breedcomp)


Dict{String, Vector{Float64}} with 4 entries:

  "B" => [0.0, 1.0]

  "A" => [1.0, 0.0]

  "C" => [0.5, 0.5]

  "D" => [0.75, 0.25]

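The script above writes one line per animal by hand. For a longer pedigree the same rule (a founder gets 1.0 for its own breed, everybody else gets the average of its parents) can be applied in a loop. The sketch below uses Python rather than Julia, and assumes that parents appear before their progeny and that either both parents are known or both are unknown; the variable names are made up.

```python
breeds = ["Holstein", "Jersey"]

# pedigree as (animal, sire, dam, breed); "0" is an unknown parent
pedigree = [
    ("A", "0", "0", "Holstein"),
    ("B", "0", "0", "Jersey"),
    ("C", "A", "B", None),
    ("D", "A", "C", None),
]

breedcomp = {}
for animal, sire, dam, breed in pedigree:   # parents must come before progeny
    if sire == "0" and dam == "0":
        # purebred founder: 1.0 for its own breed, 0.0 elsewhere
        breedcomp[animal] = [1.0 if b == breed else 0.0 for b in breeds]
    else:
        # average of the breed fractions of the two parents
        breedcomp[animal] = [0.5 * (s + d)
                             for s, d in zip(breedcomp[sire], breedcomp[dam])]

for animal, comp in breedcomp.items():
    print(animal, comp)
```

With this pedigree the loop gives the same fractions as the Julia output, e.g. [0.75, 0.25] for D.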
Monday, October 23, 2023

SNP effects from Single Step GBLUP with APY

 APY is a technique that allows representing the inverse of a genomic relationship matrix in a sparse format by choosing a "core" of animals (here and here). The authoritative guide to APY is Bermann et al. 2022: this paper

One of the key aspects of genomic models is the ability to estimate SNP effects and then apply them to newly genotyped animals, which is known as Indirect Predictions. Matias and I found out that it's easier than we thought, as shown here.

Among many other things Bermann et al. show that one can write indirect predictions of "non-core" animals as (I use the original equation numbering)

eq. (10) : $latex \mathbf{{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c  \mathbf{{a}} + \boldsymbol{\xi} $latex

where $latex \mathbf{{a}} $latex are SNP effects and $latex \boldsymbol{\xi} $latex is an error term that does not depend on $latex \mathbf{{u}_c} $latex. 

we obtain SNP effects from eq. 21 in Bermann et al:

$latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}  $latex

we plug that into (10) and expand:

$latex \mathbf{\hat{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} $latex

we substitute for $latex \mathbf{\hat{a}} $latex:

$latex \mathbf{\hat{u}_n} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}=k \mathbf{Z}_n \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c} =  \mathbf{Z}_n \mathbf{\hat{a}} $latex

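which is the original eq. 21. This is because $latex \mathbf{\hat{a}} $latex lies in the column space of $latex \mathbf{Z}'_c $latex, so the projection matrix onto that space, $latex \mathcal{P} = \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c $latex, leaves it unchanged: $latex \mathcal{P}\mathbf{\hat{a}} = \mathbf{\hat{a}} $latex.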

Finally, obtaining Indirect Predictions from APY is dead simple and intuitive:

  • $latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}  $latex
  • $latex \mathbf{\hat{u}_n} =  \mathbf{Z}_n \mathbf{\hat{a}} $latex
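The equality between eq. (10) without its error term and the direct product of genotypes by SNP effects can be checked numerically on a toy case. Everything in the sketch below is invented for illustration (2 core animals, 1 non-core animal, 3 SNPs, the value of k, the small value added to the diagonal of G_cc), and plain Python lists are used instead of a linear algebra library.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

k = 0.5                      # made-up scaling constant
Zc = [[1.0, 0.0, -1.0],      # 2 core animals x 3 SNPs (made-up genotypes)
      [0.0, 1.0, 1.0]]
Zn = [[1.0, -1.0, 0.0]]      # 1 non-core animal x 3 SNPs
uc = [[0.8], [-0.3]]         # made-up GEBV of the core animals

ZcZct = matmul(Zc, transpose(Zc))
# assumption: G_cc = k Zc Zc' plus a small value on the diagonal
Gcc = [[k * ZcZct[i][j] + (0.05 if i == j else 0.0) for j in range(2)]
       for i in range(2)]

# SNP effects: a_hat = k Zc' Gcc^{-1} u_c
a_hat = [[k * x for x in row]
         for row in matmul(transpose(Zc), matmul(inv2(Gcc), uc))]

# projection matrix onto the column space of Zc': P = Zc' (Zc Zc')^{-1} Zc
P = matmul(transpose(Zc), matmul(inv2(ZcZct), Zc))

u_n_eq10 = matmul(Zn, matmul(P, a_hat))[0][0]   # eq. (10) without the error term
u_n_direct = matmul(Zn, a_hat)[0][0]            # simply Zn a_hat
print(u_n_eq10, u_n_direct)                     # equal up to rounding error
```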