Comments (4)

arlugones commented on September 21, 2024

I also tested the diagnose_category function on Linux with R 3.6 and got the same issue.

from dlookr.

choonghyunryu commented on September 21, 2024

Thank you, Alain.

The diagnose_category function returns a tbl_df object. Unlike a data.frame, a tbl_df with many rows prints only its first 10 rows on the screen by default.

You can query the results of all categorical variables in several ways:

> library(dlookr)
> library(nycflights13)
>
> str(flights)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 336776 obs. of 19 variables:
$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
$ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
$ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
$ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
$ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
$ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
$ carrier : chr "UA" "UA" "AA" "B6" ...
$ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
$ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
$ origin : chr "EWR" "LGA" "JFK" "JFK" ...
$ dest : chr "IAH" "IAH" "MIA" "BQN" ...
$ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
$ distance : num 1400 1416 1089 1576 762 ...
$ hour : num 5 5 5 5 6 5 6 6 6 6 ...
$ minute : num 15 29 40 45 0 58 0 0 0 0 ...
$ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
>
> # default print: only the first 10 rows (here, just the first variable)
> diagnose_category(flights)
# A tibble: 33 x 6
variables levels N freq ratio rank

1 carrier UA 336776 58665 17.4 1
2 carrier B6 336776 54635 16.2 2
3 carrier EV 336776 54173 16.1 3
4 carrier DL 336776 48110 14.3 4
5 carrier AA 336776 32729 9.72 5
6 carrier MQ 336776 26397 7.84 6
7 carrier US 336776 20536 6.10 7
8 carrier 9E 336776 18460 5.48 8
9 carrier WN 336776 12275 3.64 9
10 carrier VX 336776 5162 1.53 10
# … with 23 more rows
>
> # all rows, all variables, still as a tbl_df
> diagnose_category(flights) %>%
+ print(n = 40)
# A tibble: 33 x 6
variables levels N freq ratio rank

1 carrier UA 336776 58665 17.4 1
2 carrier B6 336776 54635 16.2 2
3 carrier EV 336776 54173 16.1 3
4 carrier DL 336776 48110 14.3 4
5 carrier AA 336776 32729 9.72 5
6 carrier MQ 336776 26397 7.84 6
7 carrier US 336776 20536 6.10 7
8 carrier 9E 336776 18460 5.48 8
9 carrier WN 336776 12275 3.64 9
10 carrier VX 336776 5162 1.53 10
11 tailnum NA 336776 2512 0.746 1
12 tailnum N725MQ 336776 575 0.171 2
13 tailnum N722MQ 336776 513 0.152 3
14 tailnum N723MQ 336776 507 0.151 4
15 tailnum N711MQ 336776 486 0.144 5
16 tailnum N713MQ 336776 483 0.143 6
17 tailnum N258JB 336776 427 0.127 7
18 tailnum N298JB 336776 407 0.121 8
19 tailnum N353JB 336776 404 0.120 9
20 tailnum N351JB 336776 402 0.119 10
21 origin EWR 336776 120835 35.9 1
22 origin JFK 336776 111279 33.0 2
23 origin LGA 336776 104662 31.1 3
24 dest ORD 336776 17283 5.13 1
25 dest ATL 336776 17215 5.11 2
26 dest LAX 336776 16174 4.80 3
27 dest BOS 336776 15508 4.60 4
28 dest MCO 336776 14082 4.18 5
29 dest CLT 336776 14064 4.18 6
30 dest SFO 336776 13331 3.96 7
31 dest FLL 336776 12055 3.58 8
32 dest MIA 336776 11728 3.48 9
33 dest DCA 336776 9705 2.88 10
>
> # all rows, all variables, converted to a data.frame
> diagnose_category(flights) %>%
+ data.frame()
variables levels N freq ratio rank
1 carrier UA 336776 58665 17.4195905 1
2 carrier B6 336776 54635 16.2229494 2
3 carrier EV 336776 54173 16.0857662 3
4 carrier DL 336776 48110 14.2854598 4
5 carrier AA 336776 32729 9.7183291 5
6 carrier MQ 336776 26397 7.8381476 6
7 carrier US 336776 20536 6.0978217 7
8 carrier 9E 336776 18460 5.4813882 8
9 carrier WN 336776 12275 3.6448559 9
10 carrier VX 336776 5162 1.5327696 10
11 tailnum <NA> 336776 2512 0.7458964 1
12 tailnum N725MQ 336776 575 0.1707366 2
13 tailnum N722MQ 336776 513 0.1523268 3
14 tailnum N723MQ 336776 507 0.1505452 4
15 tailnum N711MQ 336776 486 0.1443096 5
16 tailnum N713MQ 336776 483 0.1434188 6
17 tailnum N258JB 336776 427 0.1267905 7
18 tailnum N298JB 336776 407 0.1208518 8
19 tailnum N353JB 336776 404 0.1199610 9
20 tailnum N351JB 336776 402 0.1193672 10
21 origin EWR 336776 120835 35.8799321 1
22 origin JFK 336776 111279 33.0424377 2
23 origin LGA 336776 104662 31.0776302 3
24 dest ORD 336776 17283 5.1318978 1
25 dest ATL 336776 17215 5.1117063 2
26 dest LAX 336776 16174 4.8025988 3
27 dest BOS 336776 15508 4.6048412 4
28 dest MCO 336776 14082 4.1814144 5
29 dest CLT 336776 14064 4.1760696 6
30 dest SFO 336776 13331 3.9584175 7
31 dest FLL 336776 12055 3.5795306 8
32 dest MIA 336776 11728 3.4824334 9
33 dest DCA 336776 9705 2.8817374 10
>
> # top 3 levels for each categorical variable
> diagnose_category(flights, top = 3)
# A tibble: 12 x 6
variables levels N freq ratio rank

1 carrier UA 336776 58665 17.4 1
2 carrier B6 336776 54635 16.2 2
3 carrier EV 336776 54173 16.1 3
4 tailnum NA 336776 2512 0.746 1
5 tailnum N725MQ 336776 575 0.171 2
6 tailnum N722MQ 336776 513 0.152 3
7 origin EWR 336776 120835 35.9 1
8 origin JFK 336776 111279 33.0 2
9 origin LGA 336776 104662 31.1 3
10 dest ORD 336776 17283 5.13 1
11 dest ATL 336776 17215 5.11 2
12 dest LAX 336776 16174 4.80 3
>
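
The printing behaviour described above can be reproduced without dlookr or nycflights13. The following is a minimal sketch that assumes only the tibble package is installed; `res` is a hypothetical stand-in for a diagnose_category() result, not real output.

```r
library(tibble)

# Hypothetical stand-in for a diagnose_category() result with more than 20 rows.
res <- tibble(
  variables = rep(c("carrier", "dest", "origin"), each = 10),
  levels    = paste0("level_", seq_len(30)),
  freq      = 30:1
)

# Default print: a tibble this long shows only its first 10 rows.
print(res)

# n = Inf asks the tibble print method to show every row.
print(res, n = Inf)

# Converting to a plain data.frame also prints every row.
all_rows <- as.data.frame(res)
print(all_rows)

# Subsetting to one variable keeps the output short enough to print in full.
carrier_only <- res[res$variables == "carrier", ]
print(carrier_only)
```

Using `n = Inf` avoids having to guess a row count such as 40 in advance.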

arlugones commented on September 21, 2024

Perfectly understood! I had expected the result of diagnose_category to be summarized in a similar way to that of diagnose_numeric, but the way it works now obviously makes more sense.

choonghyunryu commented on September 21, 2024

Categorical and numeric data are aggregated differently.
I will consider what additional information I should provide.
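
To illustrate the difference, here is a rough base R sketch. It is not dlookr's implementation. The toy data and the object names `numeric_summary` and `category_summary` are assumptions for illustration, and the column names only mirror the output shown earlier in this thread. A numeric variable reduces to a fixed set of statistics (one row per variable), while a categorical variable needs one row per level.

```r
# Toy data standing in for a real data set (hypothetical values).
df <- data.frame(
  delay   = c(2, 4, 2, -1, -6, -4),
  carrier = c("UA", "UA", "AA", "B6", "DL", "UA"),
  stringsAsFactors = FALSE
)

# Numeric column: a fixed set of statistics, so one row per variable.
numeric_summary <- data.frame(
  variables = "delay",
  min       = min(df$delay),
  mean      = mean(df$delay),
  median    = median(df$delay),
  max       = max(df$delay)
)

# Categorical column: a frequency table, so one row per level.
freq <- sort(table(df$carrier), decreasing = TRUE)
category_summary <- data.frame(
  variables = "carrier",
  levels    = names(freq),
  N         = nrow(df),
  freq      = as.integer(freq),
  ratio     = as.integer(freq) / nrow(df) * 100,
  rank      = rank(-as.integer(freq), ties.method = "min")
)

print(numeric_summary)
print(category_summary)
```

Because the number of levels varies from one variable to the next, the categorical result is naturally a long table rather than a single summary row.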
