csv_tools

csv_columns_summary.py

usage: csv_columns_summary.py [-h] [-v] [--columns COLUMN[,COLUMN...]]
                              [--function {all,count,sum,avg,min,max,std,median}]
                              [--count STRING] [--output FILE]
                              CSV_FILE

statistics summary for each column in CSV format

positional arguments:
  CSV_FILE              file to read; if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --columns COLUMN[,COLUMN...]
                        columns to apply the function to, default=all columns
                        that have numeric values
  --function {all,count,sum,avg,min,max,std,median}
                        function to apply
  --count STRING        count occurrences of the given value; if given,
                        '--function' is ignored.
  --output FILE         path of output file, default=stdout

example:
  csv_columns_summary.py --function=sum --columns=A,B test1.csv
  csv_columns_summary.py --count=1 test1.csv
  csv_columns_summary.py --function=all --columns=A,B test1.csv| csvlook -I
| columns | count | sum | avg                | min | max | std                | median |
| ------- | ----- | --- | ------------------ | --- | --- | ------------------ | ------ |
| A       | 2.0   | 5.0 | 1.6666666666666667 | 0.0 | 4.0 | 2.0816659994661326 | 1.0    |
| B       | 2.0   | 7.0 | 2.3333333333333335 | 0.0 | 5.0 | 2.5166114784235836 | 2.0    |
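
  For reference, the same kind of per-column summary can be sketched with plain pandas (an illustration only, not this script's code; the file name and aggregation list are assumptions):

    import pandas as pd

    df = pd.read_csv("test1.csv")
    numeric = df.select_dtypes("number")   # default: every column with numeric values
    summary = numeric.agg(["count", "sum", "mean", "min", "max", "std", "median"]).T
    print(summary.to_csv())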


csv_combine.py

usage: csv_combine.py [-h] [-v]
                      [--mode {first,bigger,smaller,add,sub,mul,div,mod,pow,lt,le,gt,ge,ne,eq,function}]
                      [--function EXP] [--prologe CODE;[CODE;CODE;...]]
                      [--boolean_by_number] [--output_file FILE]
                      CSV_FILE CSV_FILE_or_VALUE

complement missing elements of csv data with other csv data, element-wise

positional arguments:
  CSV_FILE              first csv file to complement
  CSV_FILE_or_VALUE     second csv file or scalar float value

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --mode {first,bigger,smaller,add,sub,mul,div,mod,pow,lt,le,gt,ge,ne,eq,function}
                        combine mode
  --function EXP        lambda function for function mode
  --prologe CODE;[CODE;CODE;...]
                        pieces of Python code to pre-load, for use in the
                        expression of '--function'.
  --boolean_by_number   for logical results, use 1/0 instead of True/False
  --output_file FILE    path of output file,default=stdout

remark:
  This processes columns that have only numeric values.

  The bigger, smaller and other numeric modes are available only for columns that have numeric values; other columns are handled by combine_first.
  In the comparison modes (lt, gt, and so on), NaN always leads to False in the result.

  The function given by '--function' takes two arguments of pandas.Series; see the documentation of pandas.DataFrame.combine:
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine.html#pandas.DataFrame.combine
  When you use '--function', you may use '--prologe' to load external modules:
    ex. '--prologe="import numpy as np;"' or '--prologe="import your_module;"'
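
  As a rough sketch of the mechanism (an assumption, not the script's actual code), the function mode behaves like a pandas.DataFrame.combine call in which the function receives one column at a time as two pandas.Series:

    import pandas as pd

    df1 = pd.read_csv("t1.csv").select_dtypes("number")
    df2 = pd.read_csv("t2.csv").select_dtypes("number")
    # pick, per column, whichever series has the larger sum (same lambda as the example below)
    func = lambda s1, s2: s1 if s1.sum() > s2.sum() else s2
    print(df1.combine(df2, func).to_csv(index=False))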

    | mode     | numeric only | description                                       |
    | -------- | ------------ | ------------------------------------------------- |
    | first    |              | combine_first                                     |
    | bigger   | O            | select bigger                                     |
    | smaller  | O            | select smaller                                    |
    | function | O            | apply the given function                          |
    | add      | O            | adding                                            |
    | sub      | O            | subtracting                                       |
    | mul      | O            | multiplying                                       |
    | div      | O            | dividing                                          |
    | mod      | O            | modulo                                            |
    | pow      | O            | power                                             |
    | lt       | O            | results of 'less than' are True/False             |
    | le       | O            | results of 'less than or equal' are True/False    |
    | gt       | O            | results of 'greater than' are True/False          |
    | ge       | O            | results of 'greater than or equal' are True/False |
    | ne       | O            | results of 'not equal' are True/False             |
    | eq       | O            | results of 'equal' are True/False                 |

  The second argument is a path of a second csv file or a scalar float value.
  When a scalar value is given, the second csv is created with the same shape as the first csv, with the given value in all elements.
  Because this is inefficient, you should not use a scalar value as the second argument for large csv data.
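
  The scalar case can be pictured with a small pandas sketch (only an assumed illustration of the idea, not necessarily how the script builds it):

    import pandas as pd

    df1 = pd.read_csv("t1.csv")
    value = 2.0
    # a frame with df1's shape, index and columns, filled with the scalar value
    df2 = pd.DataFrame(value, index=df1.index, columns=df1.columns)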

example:
  csv_combine.py --mode bigger t1.csv t2.csv| csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 1.0 | 1.0 | 1 | a |
  | 3.0 | 4.0 | 5 | b |
  | 5.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csv_combine.py --mode smaller t1.csv t2.csv| csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 0.0 | 0.0 | 0 | a |
  | 2.0 | 2.0 | 0 | b |
  | 3.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csv_combine.py --mode function --function "lambda s1,s2: s1 if s1.sum() > s2.sum() else s2" t1.csv t2.csv |csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 0.0 | 1.0 | 1 | a |
  | 3.0 | 4.0 | 0 | b |
  | 5.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csvlook -I t1.csv
  | A | B | C | D |
  | - | - | - | - |
  | 1 | 0 | 1 | a |
  | 2 | 2 | 0 | b |
  | 3 |   | 3 |   |
  |   |   | 4 | d |

  csvlook -I t2.csv
  | A | B | C | D |
  | - | - | - | - |
  | 0 | 1 | 0 | a |
  | 3 | 4 | 5 | b |
  | 5 | 5 |   | c |
  | 6 |   |   |   |


csv_correlation.py

usage: csv_correlation.py [-h] [-v] [--mode {auto,cross,partial}]
                          [--nlags INT] [--sampling INT] [--na_value FLOAT]
                          [--output FILE]
                          CSV_FILE COLUMN[,COLUMN...]

evaluate cross/auto correlation between columns

positional arguments:
  CSV_FILE              file to read; if empty, stdin is used
  COLUMN[,COLUMN...]    columns to process

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --mode {auto,cross,partial}
                        correlation mode
  --nlags INT           maximum number of lags, default=100
  --sampling INT        step size for sampling rows
  --na_value FLOAT      value to fill for NaN, default=0
  --output FILE         path of output file, default=stdout

remark:
  In the results, prefix 'ac_' means auto-correlation and prefix 'cc_' means cross-correlation.
  Prefixes 'ac_ci_l' and 'ac_ci_u' mean the lower and upper bounds of the 95% confidence interval for no correlation.

  For '--mode=cross', nlags is the number of records in the output. If you want all records of the cross-correlation, set nlags=0.
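
  Comparable 'ac_*' values can be obtained with statsmodels, for example (an illustrative assumption; the script's own implementation may differ):

    import pandas as pd
    from statsmodels.tsa.stattools import acf

    s = pd.read_csv("test_correlation.csv")["COL_0000"].fillna(0)   # 0 matches the '--na_value' default
    values, confint = acf(s, nlags=100, alpha=0.05)   # alpha=0.05 -> 95% confidence interval
    # values[k] is the lag-k autocorrelation; confint[k] holds its (lower, upper) bounds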

example:
  csv_correlation.py test_correlation.csv COL_0000,COL_0001
  csv_correlation.py test_correlation.csv COL_0000|head|csvlook -I
| index | ac_COL_0000            | ac_ci_l_COL_0000     | ac_ci_u_COL_0000    |
| ----- | ---------------------- | -------------------- | ------------------- |
| 0     | 1.0                    | 1.0                  | 1.0                 |
| 1     | 0.07018334701195321    | -0.06840703542301456 | 0.20877372944692096 |
| 2     | -0.031686282848955645  | -0.17095764718861142 | 0.10758508149070015 |
| 3     | 0.06474599533914761    | -0.07466376744626992 | 0.20415575812456516 |
| 4     | -0.07187457627030945   | -0.21186070987120875 | 0.06811155733058984 |
| 5     | -0.0032344642148376297 | -0.14392762546855406 | 0.13745869703887878 |
| 6     | -0.02065593286982406   | -0.16135052234546393 | 0.12003865660581581 |
| 7     | 0.03586195035334148    | -0.10489087472112046 | 0.1766147754278034  |
| 8     | 0.05144023922871804    | -0.08948797647058224 | 0.19236845492801832 |

  csv_correlation.py --mode=cross test_correlation.csv COL_0000,COL_0001,COL_0002
  csv_correlation.py --mode=cross --sampling=2 test_correlation.csv COL_0000,COL_0001,COL_0002|head|csvlook -I
| index | cc_COL_0000_COL_0001  | cc_COL_0000_COL_0002  | cc_COL_0001_COL_0002  |
| ----- | --------------------- | --------------------- | --------------------- |
| 0     | -0.07832200979116527  | -0.0361744688777645   | -0.0383920692904824   |
| 1     | -0.0584437912103411   | 0.02924305243353182   | 0.0014283173939956392 |
| 2     | -0.004634262357976521 | 0.009863911035045745  | 0.0700311412765593    |
| 3     | 0.08549028836897214   | 0.145849764523322     | -0.07607057576313002  |
| 4     | 0.06411213630824009   | -0.009025766566690439 | -0.043516364265988865 |
| 5     | 0.053392718054984536  | -0.04056558651200204  | 0.09684720026396708   |
| 6     | -0.10900425075345083  | -0.022320478554022246 | 0.0459223360399405    |
| 7     | 0.014787564430673562  | -0.02152087172311092  | -0.00444335370431942  |
| 8     | -0.02779304465519147  | -0.027346286491568755 | 0.12710493528359032   |


csv_crosstable.py

usage: csv_crosstable.py [-h] [-v] [--values COLUMN]
                         [--aggregator {sum,min,max,mean,median,prod,count_nonzero}]
                         [--row_names NAME[,NAME...]]
                         [--column_names NAME[,NAME...]]
                         [--normalize {all,index,column}]
                         [--suppress_all_zero] [--output_file FILE]
                         CSV_FILE ROW_COLUMN[,ROW_COLUMN...]
                         COLUMN[,COLUMN...]

make cross-matching table from csv file

positional arguments:
  CSV_FILE              file to read; if empty, stdin is used
  ROW_COLUMN[,ROW_COLUMN...]
  COLUMN[,COLUMN...]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --values COLUMN       name of column for value
  --aggregator {sum,min,max,mean,median,prod,count_nonzero}
                        aggregator function for '--values', default='sum'
  --row_names NAME[,NAME...]
                        names of rows in output,default=names of given rows
  --column_names NAME[,NAME...]
                        names of columns in output,default=names of given
                        columns
  --normalize {all,index,column}
                        normalize mode, see 'remark'
  --suppress_all_zero   suppress outputting columns whose elements are all NaN
  --output_file FILE    path of output file,default=stdout

remark:

  For '--normalize', the following values are available:
   'all'    : normalize over all values.
   'index'  : normalize over each row.
   'column' : normalize over each column.
  see 'pandas.crosstab https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html'
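
  For illustration, the underlying pandas.crosstab call looks roughly like this (an assumed sketch, not the script itself):

    import pandas as pd

    df = pd.read_csv("test_crosstable.csv")
    table = pd.crosstab(df["COL001"], df["COL002"],
                        values=df["COL004"], aggfunc="sum",
                        normalize="index")   # 'index': each row is normalized to sum to 1
    print(table.to_csv())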

example:

# input data for example
cat test_crosstable.csv|csvlook -I
| COL001 | COL002 | COL003 | COL004 | COL005 |
| ------ | ------ | ------ | ------ | ------ |
| A      | a      | 1      | 2      | Z      |
| B      | b      | 1      | 3      | X      |
| C      | c      | 1      | 4      | Y      |
| D      | d      | 1      | 5      | Z      |
| A      | b      | 1      | 6      | V      |
| B      | c      | 1      | 7      | W      |
| C      | d      | 1      | 8      | S      |
| D      | a      | 1      | 9      | T      |

csv_crosstable.py test_crosstable.csv COL001 COL002|csvlook -I
| COL001 | a | b | c | d |
| ------ | - | - | - | - |
| A      | 1 | 1 | 0 | 0 |
| B      | 0 | 1 | 1 | 0 |
| C      | 0 | 0 | 1 | 1 |
| D      | 1 | 0 | 0 | 1 |

csv_crosstable.py test_crosstable.csv --values=COL004 COL001 COL002|csvlook -I
| COL001 | a   | b   | c   | d   |
| ------ | --- | --- | --- | --- |
| A      | 2.0 | 6.0 |     |     |
| B      |     | 3.0 | 7.0 |     |
| C      |     |     | 4.0 | 8.0 |
| D      | 9.0 |     |     | 5.0 |

csv_crosstable.py test_crosstable.csv --values=COL004 --aggregator=prod COL001 COL002,COL005
COL002,a,a,a,a,a,a,a,b,b,b,b,b,b,b,c,c,c,c,c,c,c,d,d,d,d,d,d,d
COL005,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z
COL001,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A,,,,,,,2.0,,,6.0,,,,,,,,,,,,,,,,,,
B,,,,,,,,,,,,3.0,,,,,,7.0,,,,,,,,,,
C,,,,,,,,,,,,,,,,,,,,4.0,,8.0,,,,,,
D,,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,5.0

# with '--suppress_all_zero'
csv_crosstable.py test_crosstable.csv --values=COL004 --aggregator=prod --suppress_all_zero COL001 COL002,COL005
COL002,a,a,b,b,c,c,d,d
COL005,T,Z,V,X,W,Y,S,Z
COL001,,,,,,,,
A,,2.0,6.0,,,,,
B,,,,3.0,7.0,,,
C,,,,,,4.0,8.0,
D,9.0,,,,,,,5.0


csv_dummy.py

usage: csv_dummy.py [-h] [-v] [--output FILE] [--quote]
                    [--mode {rand,int,hex,header,arbitrarily}]
                    [--headers FILE]
                    ROWS [COLS]

generate dummy csv data

positional arguments:
  ROWS                  number of data rows
  COLS                  number of columns

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --output FILE         output file
  --quote               quote the value of each cell
  --mode {rand,int,hex,header,arbitrarily}
                        value mode of each cell: hex={ir:04x}{ic:04x},
                        rand=random, int=consecutive integers,
                        header=definition in headers file
  --headers FILE        list file that has names of columns: csv format as one
                        record, or one name per line

remark:
  when '--mode=header' is given, you can use 'np.random.*'.
  see "Random Generator - NumPy v1.19 Manual https://numpy.org/doc/stable/reference/random/generator.html"

  In header mode,
  'index' means a 0-based serial number, 'int' means ic+ncols+ir,
  'rand' means a uniform random number in [0,1], 'random.*' means using a function in np.random.*,
  'datetime' means a time string of now()+(ir+ic) seconds, or a frequency time string with 'datetime: