Commit b8ceb786 authored by Мария Григорьева
Add new file

parent b3273048
joint_affiliations.py
---------------------
This test module counts the papers published jointly by each pair of
institutes.
1) Fetch data from Elasticsearch: the query returns a list of joint
affiliation records, one per document.
For example (affiliations are represented by their record IDs):
[{'903034', '903194', '904471'},
{'903032', '904045'},
{'912417', '906878', '903031', '902955'},
{'903034', '910554', '908795', '912109', '903297', '903844', '902780'}]
Each line lists all affiliations of one document.
2) Get the set of all unique affiliation IDs.
3) Generate a new square zero matrix of size N x N, where N is the number
of unique affiliations in the dataset.
4) For each set of joint affiliations, update the matrix: add 1 to each
cell that corresponds to a pair of affiliations in that set (including
each affiliation paired with itself).
5) Take the upper triangle of this matrix (including the diagonal values)
and convert it to a flat table with one row per pair of affiliations:
id | level_0           | level_1           | count
---------------------------------------------------------------
1  | affiliation ID X  | affiliation ID Y  | number of papers
2  | affiliation ID X1 | affiliation ID Y1 | number of papers
...
Then remove all records where count == 0.0.
6) Take the array with the geolocation of each affiliation and add the
latitude and longitude of both affiliations to each row of the table.
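Steps 2 to 5 can be sketched in plain Python as follows. This is a minimal
illustration, not the module's actual code: the function name
`count_joint_papers` is hypothetical, nested lists stand in for the matrix,
and it assumes each cell is incremented symmetrically so that the upper
triangle holds every pair exactly once.

```python
from itertools import combinations

# Example input from step 1: one set of affiliation IDs per document.
docs = [
    {"903034", "903194", "904471"},
    {"903032", "904045"},
    {"912417", "906878", "903031", "902955"},
    {"903034", "910554", "908795", "912109", "903297", "903844", "902780"},
]


def count_joint_papers(docs):
    """Return (level_0, level_1, count) rows for all co-occurring pairs."""
    # Step 2: all unique affiliation IDs, in a fixed (sorted) order.
    ids = sorted(set().union(*docs))
    index = {aff: i for i, aff in enumerate(ids)}
    n = len(ids)

    # Step 3: square zero matrix, one row and one column per affiliation.
    matrix = [[0] * n for _ in range(n)]

    # Step 4: add 1 for every pair of affiliations in each document.
    for affs in docs:
        for aff in affs:
            # Diagonal cell: the affiliation paired with itself.
            matrix[index[aff]][index[aff]] += 1
        for a, b in combinations(affs, 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric

    # Step 5: upper triangle (diagonal included), zero cells dropped.
    return [
        (ids[i], ids[j], matrix[i][j])
        for i in range(n)
        for j in range(i, n)
        if matrix[i][j] != 0
    ]


rows = count_joint_papers(docs)
```

For the four example documents above, `rows` holds 46 records: 15 diagonal
records (one per unique affiliation) and 31 records for distinct pairs.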
The resulting table, in CSV form:
,level_0,level_1,count,lat_0,lon_0,lat_1,lon_1
0,903034,903034,2.0,55.7,37.5166667,55.7,37.5166667
1,904471,903194,1.0,54.0747574,61.567051,54.86594010000001,37.2165716
2,903194,903194,1.0,54.86594010000001,37.2165716,54.86594010000001,37.2165716
3,903297,902780,1.0,45.052094,7.681456,56.75519179999999,37.139720700000005
4,903297,903844,1.0,45.052094,7.681456,55.574380700000006,42.0201812
5,903297,903297,1.0,45.052094,7.681456,45.052094,7.681456
6,912417,902955,1.0,55.574380700000006,42.0201812,55.755786,37.617633000000005
7,912109,902780,1.0,61.52401,105.31875600000001,56.75519179999999,37.139720700000005
8,912109,903844,1.0,61.52401,105.31875600000001,55.574380700000006,42.0201812
9,912109,903297,1.0,61.52401,105.31875600000001,45.052094,7.681456
10,912109,912109,1.0,61.52401,105.31875600000001,61.52401,105.31875600000001
11,908795,902780,1.0,55.709235,37.542545600000004,56.75519179999999,37.139720700000005
12,908795,903844,1.0,55.709235,37.542545600000004,55.574380700000006,42.0201812
13,908795,903297,1.0,55.709235,37.542545600000004,45.052094,7.681456
14,908795,912109,1.0,55.709235,37.542545600000004,61.52401,105.31875600000001
15,908795,908795,1.0,55.709235,37.542545600000004,55.709235,37.542545600000004
16,904471,904471,1.0,54.0747574,61.567051,54.0747574,61.567051
17,903844,902780,1.0,55.574380700000006,42.0201812,56.75519179999999,37.139720700000005
18,910554,902780,1.0,56.73202020000001,37.166897399999996,56.75519179999999,37.139720700000005
19,910554,903844,1.0,56.73202020000001,37.166897399999996,55.574380700000006,42.0201812
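Step 6 amounts to a lookup of both affiliation IDs in the geolocation data.
The sketch below assumes that data is available as a dictionary mapping an
affiliation ID to a (latitude, longitude) tuple; the names `geo` and
`add_coordinates` are hypothetical, and the coordinates are copied from the
first rows of the table above. Filling missing coordinates with None is also
an assumption, not documented behaviour.

```python
# Assumed geolocation lookup: affiliation ID -> (latitude, longitude).
geo = {
    "903034": (55.7, 37.5166667),
    "903194": (54.86594010000001, 37.2165716),
    "904471": (54.0747574, 61.567051),
}

# Pair records as produced by step 5: (level_0, level_1, count).
pairs = [
    ("903034", "903034", 2.0),
    ("904471", "903194", 1.0),
    ("903194", "903194", 1.0),
]


def add_coordinates(pairs, geo):
    """Attach the coordinates of both affiliations to every pair record."""
    out = []
    for level_0, level_1, count in pairs:
        # Unknown IDs get None coordinates instead of raising KeyError.
        lat_0, lon_0 = geo.get(level_0, (None, None))
        lat_1, lon_1 = geo.get(level_1, (None, None))
        out.append({
            "level_0": level_0, "level_1": level_1, "count": count,
            "lat_0": lat_0, "lon_0": lon_0,
            "lat_1": lat_1, "lon_1": lon_1,
        })
    return out


located = add_coordinates(pairs, geo)
```

Each dictionary in `located` has the same columns as the CSV header above.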
Note:
If "level_0" and "level_1" are the same affiliation ID, the "count"
value is the total number of papers published by that affiliation
(institute).
So, besides the connections between institutes, the table also holds a
total for each institute.
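A consumer of the table may want to separate the two kinds of records. The
following is a small sketch under that assumption; the helper name
`split_totals_and_links` is hypothetical and the sample records are taken
from the result above.

```python
# A few records from the result table: (level_0, level_1, count).
records = [
    ("903034", "903034", 2.0),
    ("904471", "903194", 1.0),
    ("903194", "903194", 1.0),
    ("904471", "904471", 1.0),
]


def split_totals_and_links(records):
    """Split records into per-institute totals and institute-to-institute links."""
    # Diagonal records (same ID twice) carry the total for one institute.
    totals = {a: count for a, b, count in records if a == b}
    # All other records describe a connection between two institutes.
    links = [(a, b, count) for a, b, count in records if a != b]
    return totals, links


totals, links = split_totals_and_links(records)
```

Here `totals` maps each institute to its paper count, and `links` keeps only
the joint-publication records.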