# DataFrame
**Repository Path**: dr.wei/DataFrame
## Basic Information
- **Project Name**: DataFrame
- **Description**: C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types, continuous memory storage, and no pointers are involved
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 0
- **Created**: 2021-09-26
- **Last Updated**: 2022-03-23
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## [*DataFrame Documentation / Code Samples*](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html)
This is a C++ analytical library that provides an interface and functionality similar to packages/libraries in Python and R. For example, you could compare this to Pandas or R data.frame.
You could slice the data in many different ways. You could join, merge, and group-by the data. You could run various statistical, summarization, financial, and ML algorithms on the data. You could add your custom algorithms easily. You could multi-column sort, custom pick, and delete the data. And more …
DataFrame also includes a large collection of analytical algorithms in the form of visitors. These range from basic stats such as Mean, Std Deviation, Return, … to more involved analysis such as Affinity Propagation, Polynomial Fit, and Fast Fourier Transform of arbitrary length …, including a good collection of trading indicators. You could also easily add your own algorithms.
For basic operations to start you off, see [Hello World](test/hello_world.cc). For a complete list of features with code samples, see [documentation](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html).
I have followed a few principles in this library:
1. Support any type, either built-in or user-defined, without needing new code
2. Never chase pointers à la `linked lists`, `std::any`, `pointer to base`, ..., including `virtual functions`
3. Have all column data in continuous memory space. Also, be mindful of cache-line aliasing misses between multiple columns
4. Never use more space than you need à la `unions`, `std::variant`, ...
5. Avoid copying data as much as possible
6. Use multi-threading but only when it makes sense
7. Do not attempt to protect the user against `garbage in`, `garbage out`
[DateTime](docs/DateTimeDoc.pdf)
The DateTime class included in this library is a very cool and handy object for manipulating date/time with nanosecond precision and multi-timezone capability.
---
### Performance
There is a test program [_dataframe_performance_](test/dataframe_performance.cc) that should give you a sense of how this library performs. As a comparison, there is also a Pandas [_pandas_performance_](test/pandas_performance.py) script that does exactly the same thing.
dataframe_performance.cc uses the DataFrame async interface and was compiled with the gcc (10.3.0) compiler using the -O3 flag. pandas_performance.py was run with Pandas 1.3.2, Numpy 1.21.2, and Python 3.7. Both ran on a Xeon E5-2667 v2. What the test program roughly does:
1. Generate ~1.6 billion timestamps (second resolution) and load them into the DataFrame/Pandas as index.
2. Generate ~1.6 billion random numbers for 3 columns with normal, log normal, and exponential distributions and load them into the DataFrame/Pandas.
3. Calculate the mean of each of the 3 columns.
Result:
```bash
$ python3 test/pandas_performance.py
Starting ... 1629817655
All memory allocations are done. Calculating means ... 1629817883
6.166675403767268e-05, 1.6487168460770107, 0.9999539627671375
1629817894 ... Done
real 5m51.598s
user 3m3.485s
sys 1m26.292s
$ Release/bin/dataframe_performance
Starting ... 1629818332
All memory allocations are done. Calculating means ... 1629818535
1, 1.64873, 1
1629818536 ... Done
real 3m34.241s
user 3m14.250s
sys 0m25.983s
```
The Interesting Part:
1. The Pandas script, I believe, runs entirely in Numpy code, which is implemented in C.
2. In the case of Pandas, allocating memory + random number generation takes almost the same amount of time as calculating means.
3. In the case of DataFrame, ~90% of the time is spent in allocating memory + random number generation.
4. You load data once, but calculate statistics many times. Going by the timestamps above, calculating the means took about 11 seconds in Pandas versus about 1 second in DataFrame. So DataFrame, in general, is ~11x faster than the parts of Pandas that are implemented in Numpy. I leave the parts of Pandas that are purely in Python to the imagination.
5. The Pandas process image at its peak is ~105GB. The C++ DataFrame process image at its peak is ~56GB.
---
[DataFrame Test File](test/dataframe_tester.cc)
[DataFrame Test File 2](test/dataframe_tester_2.cc)
[Heterogeneous Vectors Test File](test/vectors_tester.cc)
[Date/Time Test File](test/date_time_tester.cc)
---
[Contributions](docs/CONTRIBUTING.md)
[License](License)
---
### Installing using CMake
```bash
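# Build and install (pick one build type: Debug or Release)
mkdir [Debug | Release]
cd [Debug | Release]
cmake -DCMAKE_BUILD_TYPE=[Debug | Release] ..
make
make install

# To uninstall later, run this from the same build directory
cd [Debug | Release]
make uninstall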
```
### Package managers
If you are using _Conan_ to manage your dependencies, add `dataframe/x.y.z@` to your requires, where `x.y.z` is the release version you want to use. _Conan_ will acquire DataFrame, build it from source on your computer, and provide CMake integration support for your projects. See the [_Conan_ docs](https://docs.conan.io/en/latest/) for more information.
Sample `conanfile.txt`:
```text
[requires]
dataframe/1.18.0@
[generators]
cmake
```